[/author][cite]IEEE International Conference on Data Mining (ICDM), 2006[/cite].
Operational databases evolve over time to contain a great deal of heterogeneity in database column values, as the database integrates an ever-increasing number of applications. For example, two independently developed inventory applications may use machine host names (e.g., abc.def.edu) and IPv4 addresses (e.g., 22.214.171.124) for the equivalent task of identifying machines connected to the network; when their databases are integrated into a common database, the corresponding column may well contain both host names and IPv4 addresses. Such heterogeneity of values in a database column is symptomatic of data quality issues, often resulting in business practice degradation and monetary losses.
In this work, we focus on the task of rapid automated identification of heterogeneous columns in complex operational databases. We view the string values in a column as samples from an underlying mixture of unknown homogeneous types of values (e.g., host names, IPv4 addresses), and say that the heterogeneity of the column is high if (i) it contains many types of values, (ii) each type of value occurs significantly often in the column, and (iii) the types of values are well-separated. We formalize this intuition using a novel measure of column heterogeneity, based on the notion of cluster entropy in the presence of background context.
To compute this measure, we make use of the information bottleneck, an algorithmic tool for soft clustering that draws on rate-distortion theory. We perform extensive experiments on real-world datasets to demonstrate the robustness of our heterogeneity measure and the efficiency of our approach in identifying column heterogeneity.
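
As a rough illustration of the entropy intuition, the sketch below computes the entropy of the cluster marginal from a soft clustering of a column's values. It is not the paper's measure: it assumes the soft assignments p(t|x) are already given (for example, as the output of some soft clustering step), it weights all values uniformly, and it ignores the background context. The function name `cluster_entropy` and the toy membership matrices are hypothetical. The sketch reflects criteria (i) and (ii) only; it does not account for how well separated the types are.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the cluster marginal p(t).

    `memberships` is a list of rows, one per column value x; row[t] is the
    soft assignment p(t|x). Assumes a uniform distribution over the values.
    """
    n = len(memberships)
    k = len(memberships[0])
    # Marginal p(t) = average of p(t|x) over all values x.
    marginal = [sum(row[t] for row in memberships) / n for t in range(k)]
    # Skip empty clusters, since 0 * log(0) is taken to be 0.
    return -sum(p * math.log2(p) for p in marginal if p > 0)


# Hypothetical toy columns with two candidate types
# (say, host names and IPv4 addresses).
homogeneous = [[1.0, 0.0]] * 4                      # one type only
balanced = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2      # two types, equally frequent
skewed = [[1.0, 0.0]] * 9 + [[0.0, 1.0]]            # second type is rare

for name, column in [("homogeneous", homogeneous),
                     ("balanced", balanced),
                     ("skewed", skewed)]:
    print(name, cluster_entropy(column))
```

Under these assumptions, the homogeneous column scores 0 bits, the balanced two-type column scores 1 bit, and the skewed column falls in between, matching the idea that a type must occur significantly often to contribute to heterogeneity.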