2014-07-23

UVa 1592 - Database

1. Problem
2. Input
3. Output
4. Sample Input
5. Sample Output
6. Solution

Problem

Peter studies the theory of relational databases. A table in a relational database consists of values that are arranged in rows and columns.

There are different normal forms that a database may adhere to. Normal forms are designed to minimize the redundancy of data in the database. For example, a database table for a library might have a row for each book and columns for book name, book author, and author’s email.

If the same author wrote several books, then this representation is clearly redundant. To formally define this kind of redundancy Peter has introduced his own normal form. A table is in Peter’s Normal Form (PNF) if and only if there is no pair of rows and a pair of columns such that the values in the corresponding columns are the same for both rows.

How to compete in ACM ICPC     Peter     peter@neerc.ifmo.ru
How to win ACM ICPC     Michael     michael@neerc.ifmo.ru
Notes from ACM ICPC champion     Michael     michael@neerc.ifmo.ru

The above table is clearly not in PNF, since the values in the 2nd and 3rd columns repeat in the 2nd and 3rd rows. However, if we introduce a unique author identifier and split this table into two tables, one containing book name and author id, and the other containing author id, author name, and author email, then both resulting tables will be in PNF.

How to compete in ACM ICPC     1
How to win ACM ICPC     2
Notes from ACM ICPC champion     2

1     Peter     peter@neerc.ifmo.ru
2     Michael     michael@neerc.ifmo.ru

Given a table your task is to figure out whether it is in PNF or not.

Input

Input contains several datasets. The first line of each dataset contains two integer numbers n and m ($1 \le n \le 10000$, $1 \le m \le 10$), the number of rows and columns in the table. The following n lines contain table rows. Each row has m column values separated by commas. Column values consist of ASCII characters from space (ASCII code 32) to tilde (ASCII code 126) with the exception of comma (ASCII code 44). Values are not empty and have no leading and trailing spaces. Each row has at most 80 characters (including separating commas).

Output

For each dataset, if the table is in PNF write to the output file a single word "YES" (without quotes). If the table is not in PNF, then write three lines. On the first line write a single word "NO" (without quotes). On the second line write two integer row numbers r1 and r2 ($1 \le r1, r2 \le n$, $r1 \ne r2$), on the third line write two integer column numbers c1 and c2 ($1 \le c1, c2 \le m$, $c1 \ne c2$), so that values in columns c1 and c2 are the same in rows r1 and r2.

Sample Input

3 3
How to compete in ACM ICPC,Peter,peter@neerc.ifmo.ru
How to win ACM ICPC,Michael,michael@neerc.ifmo.ru
Notes from ACM ICPC champion,Michael,michael@neerc.ifmo.ru
2 3
1,Peter,peter@neerc.ifmo.ru
2,Michael,michael@neerc.ifmo.ru

Sample Output

NO
2 3
2 3
YES

Solution

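Problem description:

This problem is about database normalization: entries in a table that repeat the same relationship should be factored out.
Concretely, we must check whether some two rows hold the same values in some two columns, in other words whether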

(data(r1, c1), data(r1, c2)) === (data(r2, c1), data(r2, c2))

If such a combination exists, output any one set of r1, r2, c1, c2.
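The condition can be checked literally with four nested loops. The following is only a minimal sketch for small tables or for cross-checking a faster solution; the names `Table` and `findViolationBrute` are illustrative and do not appear in the original code, and indices are 0-based.

```cpp
#include <array>
#include <string>
#include <vector>

using Table = std::vector<std::vector<std::string>>;

// Returns {r1, r2, c1, c2} (0-based) for the first violation found,
// or {-1, -1, -1, -1} if the table is in PNF.
// Assumes every row has the same number of columns as row 0.
std::array<int, 4> findViolationBrute(const Table& t) {
    const int n = static_cast<int>(t.size());
    const int m = n ? static_cast<int>(t[0].size()) : 0;
    for (int r1 = 0; r1 < n; r1++)
        for (int r2 = r1 + 1; r2 < n; r2++)
            for (int c1 = 0; c1 < m; c1++)
                for (int c2 = c1 + 1; c2 < m; c2++)
                    if (t[r1][c1] == t[r2][c1] && t[r1][c2] == t[r2][c2])
                        return {r1, r2, c1, c2};
    return {-1, -1, -1, -1};
}
```

On the first sample table this returns rows 1 and 2 with columns 1 and 2 (0-based), which corresponds to the "2 3 / 2 3" in the sample output. It tests every row pair against every column pair, so it is far too slow at the full input size.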

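Approach:

Whether we enumerate rows or columns determines the complexity. Enumerating pairs of rows costs O(n ^ 2 m log m), while enumerating pairs of columns costs O(m ^ 2 n log n).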
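As a rough illustration at the input limits n = 10000 and m = 10, taking the two bounds at face value and ignoring constant factors and the cost of comparing strings:

$$
\begin{aligned}
\text{row pairs: } & \binom{n}{2}\, m \log_2 m = 49\,995\,000 \times 10 \times \log_2 10 \approx 1.7 \times 10^{9},\\
\text{column pairs: } & \binom{m}{2}\, n \log_2 n = 45 \times 10^{4} \times \log_2 10^{4} \approx 6.0 \times 10^{6}.
\end{aligned}
$$

These are estimates, not measurements, but the gap is close to a factor of 300.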

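Since the given n is far larger than m, enumerating pairs of columns is the better choice. To speed up string comparison we can use a hash; if the hash range is large enough, there is a good chance that no actual string matching is needed (no collisions occur), which makes this very fast.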
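A related variation, not the method used in the code below, removes collisions entirely by giving each distinct string its own integer id and then looking for a repeated id pair in each pair of columns. This is a sketch under assumptions: `Table` and `findViolation` are hypothetical names, indices are 0-based, and an `std::unordered_map` lookup replaces the sort.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Table = std::vector<std::vector<std::string>>;

// Returns {r1, r2, c1, c2} (0-based) for a violation, or all -1 if none exists.
// Assumes every row has the same number of columns as row 0.
std::array<int, 4> findViolation(const Table& t) {
    const int n = static_cast<int>(t.size());
    const int m = n ? static_cast<int>(t[0].size()) : 0;

    // Intern every cell: equal strings share an id, distinct strings never do.
    std::unordered_map<std::string, std::uint32_t> idOf;
    std::vector<std::vector<std::uint32_t>> id(n, std::vector<std::uint32_t>(m));
    for (int r = 0; r < n; r++)
        for (int c = 0; c < m; c++) {
            auto it = idOf.emplace(t[r][c],
                                   static_cast<std::uint32_t>(idOf.size())).first;
            id[r][c] = it->second;
        }

    // For each column pair, remember the first row that produced each id pair.
    for (int c1 = 0; c1 < m; c1++)
        for (int c2 = c1 + 1; c2 < m; c2++) {
            std::unordered_map<std::uint64_t, int> firstRow;
            firstRow.reserve(n);
            for (int r = 0; r < n; r++) {
                // Pack the two 32-bit ids into one 64-bit key.
                std::uint64_t key =
                    (static_cast<std::uint64_t>(id[r][c1]) << 32) | id[r][c2];
                auto res = firstRow.emplace(key, r);
                if (!res.second)  // key already present: same pair seen before
                    return {res.first->second, r, c1, c2};
            }
        }
    return {-1, -1, -1, -1};
}
```

Because ids are exact, no fallback string comparison is required; the trade-off is the extra memory for the id table and the hash maps.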

I collected quite a few WAs because I forgot to break; out once one solution had been found. Orz

#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
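char data[10010][128];   // raw rows; commas are overwritten with '\0' while parsing
int colPos[10010][16];   // colPos[i][j]: offset of column j inside data[i]
int hashCode[10010][16]; // hashCode[i][j]: 15-bit hash of the value at row i, column j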
// Orders entries by their hash pair first, then by the actual strings in
// columns C1 and C2, so two entries tie only if both column values match.
struct cmp {
    static int C1, C2;
    bool operator() (const pair< pair<int, int>, int>& a,
                     const pair< pair<int, int>, int>& b) const {
        if(a.first != b.first) return a.first < b.first;
        int t;
        t = strcmp(data[a.second] + colPos[a.second][C1],
                   data[b.second] + colPos[b.second][C1]);
        if(t) return t < 0;
        t = strcmp(data[a.second] + colPos[a.second][C2],
                   data[b.second] + colPos[b.second][C2]);
        if(t) return t < 0;
        return false;
    }
};
int cmp::C1 = 0;
int cmp::C2 = 0;
int main() {
    int n, m;
//	freopen("in.txt", "r+t", stdin);
//	freopen("out.txt", "w+t", stdout); 
    while(scanf("%d %d", &n, &m) == 2) {
        while(getchar() != '\n');
        for(int i = 0; i < n; i++) {
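            // gets() is no longer part of standard C++; read with fgets()
            // and strip the trailing line ending ourselves
            if(fgets(data[i], sizeof(data[i]), stdin) == NULL)
                data[i][0] = '\0';
            data[i][strcspn(data[i], "\r\n")] = '\0';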
            for(int j = 0, pos = 0; j < m; j++) {
                colPos[i][j] = pos;
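                int hash = 0;
                // polynomial hash over all characters, truncated to 15 bits
                // (a shift by 15 followed by &32767 would keep only the last character)
                while(data[i][pos] != ',' && data[i][pos] != '\0')
                    hash = (hash * 31 + data[i][pos])&32767, pos++;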
                hashCode[i][j] = hash;
                data[i][pos] = '\0', pos++;
            }
        }
        
        pair< pair<int, int>, int> D[10010];
        int flag = 0, r1, r2, c1, c2;
        for(int i = 0; i < m; i++) {
            for(int j = i + 1; j < m; j++) {
                for(int k = 0; k < n; k++) { // (hashCode[k][i], hashCode[k][j])
                    D[k] = make_pair(make_pair(hashCode[k][i], hashCode[k][j]), k);
                }
                cmp::C1 = i, cmp::C2 = j;
                sort(D, D + n, cmp());
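                // after sorting, rows with equal values in both columns are adjacent
                for(int k = 1; k < n; k++) {
                    if(!cmp()(D[k], D[k-1]) && !cmp()(D[k-1], D[k])) {
                        flag = 1;
                        r1 = D[k-1].second, r2 = D[k].second;
                        c1 = i, c2 = j;
                        i = j = m; // also ends both column loops
                        break;
                    }
                }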
            }
        }
        puts(flag ? "NO" : "YES");
        if(flag)
            printf("%d %d\n%d %d\n", r1 + 1, r2 + 1, c1 + 1, c2 + 1);
    }
    return 0;
}

Morris' Blog