By Thomas N. Herzog

This book helps practitioners gain a deeper understanding, at an applied level, of the issues involved in improving data quality through editing, imputation, and record linkage. The first part of the book deals with methods and models. Here, we focus on the Fellegi-Holt edit-imputation model, the Little-Rubin multiple-imputation scheme, and the Fellegi-Sunter record linkage model. Brief examples are included to show how these techniques work.

In the second part of the book, the authors present real-world case studies in which one or more of these techniques are used. They cover a wide variety of application areas. These include mortgage guarantee insurance, medical, biomedical, highway safety, and social insurance as well as the construction of list frames and administrative lists.

Readers will find this book a mixture of practical advice, mathematical rigor, management insight and philosophy. The long list of references at the end of the book enables readers to delve more deeply into the subjects discussed here. The authors also discuss the software that has been developed to apply the techniques described in our text.

2. The Metrics

In the formulas below, $M$ and $U$ denote the sets of actual matches and actual non-matches, and a hat ($\hat{M}$, $\hat{U}$) marks the pairs designated as matches or non-matches.

The false match rate is the proportion of actual non-matches designated as matches:

$$P[(a,b) \in \hat{M} \mid (a,b) \in U]$$

The false non-match rate is the proportion of actual matches that are designated as non-matches:

$$P[(a,b) \in \hat{U} \mid (a,b) \in M]$$

The precision is the proportion of designated matches that are actual matches:

$$P[(a,b) \in M \mid (a,b) \in \hat{M}]$$

We note that

$$P[(a,b) \in M \mid (a,b) \in \hat{M}] + P[(a,b) \in U \mid (a,b) \in \hat{M}] = 1$$

where by Bayes' Theorem we can obtain

$$P[(a,b) \in U \mid (a,b) \in \hat{M}] = \frac{P[(a,b) \in \hat{M} \mid (a,b) \in U] \cdot P[(a,b) \in U]}{P[(a,b) \in \hat{M}]}$$

The recall rate is the proportion of actual matches that are designated matches:

$$P[(a,b) \in \hat{M} \mid (a,b) \in M]$$

We note that the sum of the false non-match rate and the recall rate is one:

$$P[(a,b) \in \hat{U} \mid (a,b) \in M] + P[(a,b) \in \hat{M} \mid (a,b) \in M] = 1$$

The probability function, $P[\cdot]$, that we use here is a relative frequency function in which all events are assumed to be equally likely to occur.
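Because the probability function is a relative frequency, each of the four rates can be computed directly from counts of record pairs. The following is a minimal sketch, not the authors' software: the function name `linkage_metrics`, its argument names, and the example counts are all hypothetical, and it assumes every denominator is non-zero.

```python
def linkage_metrics(tp, fn, fp, tn):
    """Compute the four rates as relative frequencies of record pairs.

    Hypothetical argument names (not from the text):
      tp: actual matches designated as matches
      fn: actual matches designated as non-matches
      fp: actual non-matches designated as matches
      tn: actual non-matches designated as non-matches
    Assumes each denominator below is non-zero.
    """
    return {
        # actual non-matches designated as matches / all actual non-matches
        "false_match_rate": fp / (fp + tn),
        # actual matches designated as non-matches / all actual matches
        "false_non_match_rate": fn / (tp + fn),
        # designated matches that are actual matches / all designated matches
        "precision": tp / (tp + fp),
        # actual matches designated as matches / all actual matches
        "recall": tp / (tp + fn),
    }


# Made-up counts for illustration only.
rates = linkage_metrics(tp=90, fn=10, fp=5, tn=895)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With any counts, the false non-match rate and the recall rate share the same denominator (the number of actual matches), which is why they sum to one.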

2. If-Then Test

The next test we consider is of the following type: If data element X assumes a value of x, then data element Y must assume one of the values in the set $\{y_1, y_2, \ldots, y_n\}$. For example, if the “type of construction” of a house is “new,” then the age of the house cannot be a value that is greater than “1” year. If the age of the house takes on a value that is greater than “1” year, then we must reject the pair of data element values. This is an example of an if-then test. A few other examples of this type of test, typical of data encountered in a census of population or other demographic survey, are as follows:

If the relationship of one member of the household to the head of the household is given as “daughter”, then the gender of that individual must, of course, be “female”.
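The if-then tests described so far can be sketched in a few lines of Python. This is an illustrative sketch only: the field names (`type_of_construction`, `age_years`, `relationship`, `gender`), the helper names, and the assumption that house age is recorded in whole years (so the allowed set is 0 or 1) are hypothetical and not taken from the text.

```python
def passes_if_then(record, if_field, if_value, then_field, allowed_values):
    """Return True when the record satisfies the if-then edit.

    The edit only applies when record[if_field] == if_value; in that case
    record[then_field] must be one of allowed_values.
    """
    if record.get(if_field) != if_value:
        return True  # the "if" condition does not hold, so nothing to check
    return record.get(then_field) in allowed_values


# Hypothetical edits mirroring the two examples in the text.
EDITS = {
    "new_house_age": ("type_of_construction", "new", "age_years", {0, 1}),
    "daughter_gender": ("relationship", "daughter", "gender", {"female"}),
}


def failed_edits(record):
    """Return the names of all edits that the record fails, in sorted order."""
    return sorted(
        name for name, rule in EDITS.items()
        if not passes_if_then(record, *rule)
    )


house_ok = {"type_of_construction": "new", "age_years": 1}
house_bad = {"type_of_construction": "new", "age_years": 7}
person_bad = {"relationship": "daughter", "gender": "male"}

print(failed_edits(house_ok))
print(failed_edits(house_bad))
print(failed_edits(person_bad))
```

Keeping each edit as a data tuple rather than hard-coded logic makes it straightforward to add, remove, or count failures per edit.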

If the age of the wife is more than twenty years greater than the age of the husband, then check both ages.

In repeated applications, the performance of the edits themselves should be measured and evaluated. In many situations, if we have extensive experience analyzing similar data sources, we might decide to exclude certain edits that target errors occurring only rarely, for example one time in 100,000. The edit above comparing the age of the wife to the age of the husband might be one such case.

3. Ratio Control Test

We next describe a class of procedures that considers combinations of quantitative data elements.