UVa 12191 - File Recover

contents

  1. 1. Problem
  2. 2. Input
  3. 3. Output
  4. 4. Sample Input
  5. 5. Sample Output
  6. 6. Solution

Problem

Your school has a computer that is used as a web server for hosting its institutional web site, personal pages of the staff, sites for research groups, subjects, and many others.

Recently, the hard disk table was corrupted, so the organization of all the files was lost. Sadly enough, there are no backups of that information. The only hope is to look through the entire disk data and try to find out which parts correspond to each file. Fortunately, the disk was using a file system that kept each individual file contiguous, so only contiguous pieces of data need to be inspected.

The disk data is a sequence of bytes. Each byte in this particular disk can store an English alphabet letter (distinguishing lowercase and uppercase), a decimal digit, a point or a comma, making a total of 64 different characters.

While you were thinking about how to solve the problem, you suddenly remembered that the file system also maintained multiple copies of each file, so only the pieces of contiguous bytes that are repeated had a chance of being a file. Moreover, for each repetition of the same contiguous bytes, only one copy needs to be checked. For instance, if the data is ababcabb', the contiguous subsequencesa’, b' andab’ are repeated, but nothing containing c', norba’ or `bb’ is. Therefore, we have 3 pieces of contiguous bytes that need checking in this case.

You decide to write a program that computes exactly how many sequences need checking, that is, the number of different sequences of contiguous bytes that appear at least twice in the data.

Input

There are several test cases. The input of each test case is given in exactly one line, containing a non-empty string of at most 105 characters that represents the disk data. Each character in the string is either a lowercase letter, an uppercase letter, a digit, a point or a comma.

The last test case is followed by a line containing a single asterisk.

Output

For each test case output a single line with an integer, representing the number of different contiguous subsequences that appear at least twice in the input string.

Sample Input

1
2
3
4
5
6
ababcabb
mississippi
aaaaaaaaaaaaaaaaaaaaaaaaaa
012345678,abcdefg.STUVWXYZ
say.twice,say.twice
*

Sample Output

1
2
3
4
5
3
9
25
0
45

Solution

題目描述:

有多少不同的子字串符合出現至少兩次的條件。

題目解法:

建造後綴數組,得到高度數組後,根據遞增的情況進行計算不同字符串的個數。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct SuffixArray {
int sa[100005], h[100005], n;
int w[100005], ta[100005], tb[100005];
char str[100005];
void sort(int *x, int *y, int m) {
static int i;
for(i = 0; i < m; i++)
w[i] = 0;
for(i = 0; i < n; i++)
w[x[y[i]]]++;
for(i = 1; i < m; i++)
w[i] += w[i-1];
for(i = n-1; i >= 0; i--)
sa[--w[x[y[i]]]] = y[i];
}
bool cmp(int *r, int a, int b, int l) {
if(r[a] == r[b]) {
if(a+l >= n || b+l >= n)
return false;
return r[a+l] == r[b+l];
}
return false;
}
void build_h() {
int i, j, k;
for(i = 0; i < n; i++) ta[sa[i]] = i;
for(i = 0; i < n; i++) {
if(ta[i] == 0) {
h[ta[i]] = 0;
continue;
}
if(i == 0 || h[ta[i-1]] <= 1)
k = 0;
else
k = h[ta[i-1]]-1;
while(str[sa[ta[i]-1]+k] == str[sa[ta[i]]+k])
k++;
h[ta[i]] = k;
}
}
void build() {// x: rank, y: second key
int i, k, m = 128, p;
int *x = ta, *y = tb, *z;
n = strlen(str);
x[n] = 0;
for(i = 0; i < n; i++)
x[i] = str[i], y[i] = i;
sort(x, y, m);
for(k = 1, p = 1; p < n; k *= 2, m = p) {
for(p = 0, i = n-k; i < n; i++)
y[p++] = i;
for(i = 0; i < n; i++) {
if(sa[i] >= k) {
y[p++] = sa[i]-k;
}
}
sort(x, y, m);
z = x, x = y, y = z;
for(i = 1, p = 1, x[sa[0]] = 0; i < n; i++)
x[sa[i]] = cmp(y, sa[i-1], sa[i], k) ? p-1 : p++;
}
}
};
SuffixArray in;
int main() {
while(scanf("%s", in.str)) {
if(!strcmp("*", in.str))
break;
in.build();
in.build_h();
// for(int i = 0; i < in.n; i++)
// printf("%s %d\n", in.str + in.sa[i], in.h[i]);
long long ret = 0, base = 0;
in.h[in.n] = -1;
for(int i = 1; i <= in.n; i++) {
if(in.h[i] < in.h[i-1])
ret += in.h[i-1] - base, base = in.h[i];
}
printf("%lld\n", ret);
}
return 0;
}