Data Quality and Record Linkage Techniques
There are two main types of linkage algorithms: deterministic and probabilistic. Both have been implemented successfully in previous research studies. It is therefore important that researchers be equipped with linkage algorithms suited to varying scenarios. The key is to develop algorithms that extract and use enough meaningful information to support sound match decisions.
In this section, we will review the main algorithm types and discuss the strengths and weaknesses of each, in an effort to derive a set of guidelines describing which algorithms are best in varying scenarios of data availability, data quality, and investigator goals. Match status can be assessed in a single step or in multiple steps. In a single-step strategy, records are compared all at once on the full set of identifiers. A record pair is classified as a match if the two records agree, character for character, on all identifiers and the record pair is uniquely identified (no other record pair matched on the same set of values).
A record pair is classified as a nonmatch if the two records disagree on any of the identifiers or if the record pair is not uniquely identified. In a multiple-step strategy (also referred to as an iterative or stepwise strategy), records are matched in a series of progressively less restrictive steps in which record pairs that do not meet a first round of match criteria are passed to a second round of match criteria for further comparison. If a record pair meets the criteria in any step, it is classified as a match.
Otherwise, it is classified as a nonmatch. While the existence of a gold standard in registry-to-claims linkages is a matter of debate, the iterative deterministic approach employed by the National Cancer Institute to create the SEER (Surveillance, Epidemiology, and End Results)-Medicare linked dataset81,82 has demonstrated high validity and reliability and has been employed successfully in multiple updates of the SEER-Medicare linked dataset. If SSN is missing or does not match, or if two records otherwise fail to meet the initial match criteria, they may still be declared a match if they agree on the criteria in a second round of deterministic linkage, in which two records must match on last name, first name, month of birth, and sex, plus at least one additional identifier from a specified list.
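To make the iterative logic concrete, here is a minimal Python sketch of a two-round deterministic strategy. The field names and round criteria are illustrative assumptions, not the actual SEER-Medicare specification; real implementations add further rounds and data-cleaning steps.

```python
# Sketch of a two-round iterative deterministic linkage.
# Field names and round criteria are illustrative assumptions.

def exact_agree(rec_a, rec_b, fields):
    """True if both records agree, character for character, on every field."""
    return all(rec_a.get(f) == rec_b.get(f) for f in fields)

ROUND_1 = ["ssn"]                                            # strictest first
ROUND_2 = ["last_name", "first_name", "birth_month", "sex"]  # fallback round

def link_deterministic(file_a, file_b):
    matches, nonmatches = [], []
    for rec_a in file_a:
        # Round 1: exact agreement on SSN (skipped when SSN is missing).
        cands = [b for b in file_b
                 if rec_a.get("ssn") and exact_agree(rec_a, b, ROUND_1)]
        # Round 2: pairs failing round 1 pass to less restrictive criteria.
        if not cands:
            cands = [b for b in file_b if exact_agree(rec_a, b, ROUND_2)]
        # A pair counts as a match only if it is uniquely identified.
        if len(cands) == 1:
            matches.append((rec_a, cands[0]))
        else:
            nonmatches.append(rec_a)
    return matches, nonmatches
```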
In situations in which full identifiers or partial identifiers are available but may not be released or transmitted, a deterministic linkage on encrypted identifiers may be employed. Quantin and colleagues33,46-47 have developed procedures for encrypting identifiers using cryptographic hash functions so that identifiers needed for linkage can be released directly to researchers without compromising patient confidentiality.
A cryptographic hash function, such as the Secure Hash Algorithm version 2 (SHA-2), designed by the National Security Agency and published by the National Institute of Standards and Technology, is a deterministic procedure that takes an input and returns a fixed-size output (the hash value) from which the original input cannot feasibly be recovered. Because it is deterministic and hashed values cannot be reverse-engineered, it has been widely adopted for security applications and procedures.
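The sketch below shows the basic idea in Python, using SHA-256 (a SHA-2 variant) from the standard library. The normalization rules and the shared secret key are assumptions: identifiers are normalized so trivial formatting differences do not change the digest, and a keyed hash (HMAC) is used because bare hashes of low-entropy identifiers such as names are vulnerable to dictionary attacks. Quantin's actual protocol is more elaborate.

```python
import hashlib
import hmac

# Hypothetical shared key agreed on by both data holders out of band.
SECRET_KEY = b"shared-key-agreed-by-both-data-holders"

def normalize(value: str) -> str:
    """Collapse case and whitespace so formatting noise does not change the hash."""
    return " ".join(value.strip().upper().split())

def hash_identifier(value: str) -> str:
    """Keyed SHA-256 digest of a normalized identifier."""
    msg = normalize(value).encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

# The function is deterministic: both parties hashing the same identifier
# get the same digest, so records can still be linked on exact equality
# without the underlying identifier ever being exchanged.
assert hash_identifier("smith ") == hash_identifier("SMITH")
```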
The deterministic approach ignores the fact that certain identifiers, or certain values, have more discriminatory power than others. Probabilistic strategies have been developed to assess (1) the discriminatory power of each identifier and (2) the likelihood that two records are a true match based on whether they agree or disagree on the various identifiers. According to the model developed by Fellegi and Sunter,33 record pairs can be designated as matches, possible matches, or nonmatches based on the calculation of linkage scores and the application of decision rules.
Each pair in the comparison space is either a true match or a true nonmatch. When dealing with large files, the number of possible record pairs grows as the product of the two file sizes and can quickly become computationally unmanageable. In these situations, it is advisable to reduce the comparison space to only those pairs that meet certain basic criteria. For instance, the pairs to be considered may be limited to only those that agree on clinical diagnosis or on both month of birth and county of residence. Record pairs that do not meet the matching criteria specified in this blocking phase are automatically classified as nonmatches and removed from consideration.
To account for true matches that fail to block together because of data issues, multiple passes are typically used: records that were not blocked together in one pass can still be blocked and compared in another pass, which prevents such pairs from being automatically misclassified.
Since two records cannot be matched on missing information, the variables chosen for the blocking phase should be relatively complete, having few missing values. Blocking strategies such as this reduce the set of potential matches to a more manageable number. Because blocking strategies can influence linkage success, Christen and Goiser recommend that researchers report the specific steps of their blocking strategy.
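A minimal multi-pass blocking sketch in Python follows. The blocking keys (birth month plus county in one pass, diagnosis in another) mirror the examples above; the field names are assumptions.

```python
from collections import defaultdict

# Multi-pass blocking: each pass indexes file B on a blocking key and
# emits candidate pairs that share that key. Field names are illustrative.
PASSES = [
    ("birth_month", "county"),  # pass 1
    ("diagnosis",),             # pass 2 rescues pairs missed by pass 1
]

def candidate_pairs(file_a, file_b, passes=PASSES):
    seen = set()
    for fields in passes:
        index = defaultdict(list)
        for j, rec in enumerate(file_b):
            key = tuple(rec.get(f) for f in fields)
            # Skip records with missing blocking values: two records
            # cannot be matched on missing information.
            if all(key):
                index[key].append(j)
        for i, rec in enumerate(file_a):
            key = tuple(rec.get(f) for f in fields)
            if not all(key):
                continue
            for j in index[key]:
                if (i, j) not in seen:  # compare each pair at most once
                    seen.add((i, j))
                    yield file_a[i], file_b[j]
```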
The two records in every pair identified in the blocking phase are compared on each linkage identifier, producing an agreement pattern. For each identifier, two conditional probabilities are needed: the m-probability, the probability that two records agree on the identifier given that they are a true match, and the u-probability, the probability that they agree given that they are not a match (i.e., that they agree by chance). The m-probability can be estimated from values reported in the published linkage literature or by taking a random sample of pairs from the comparison space, assigning match status via manual review, and calculating the probability that two records agree on a particular identifier when they are true matches.
Calculating value-specific u-probabilities for an identifier, based on the frequency of each value and hence the likelihood that two records would agree on that value simply by chance, yields additional information. For instance, a match on a rare surname such as Lebowski is far less likely to occur by chance, and is thereby assigned greater weight, than a match on a common surname such as Smith.
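Here is a sketch of the frequency-based idea, under the common simplifying assumption that the chance two unrelated records agree on a value is roughly that value's relative frequency in the file; the toy counts are invented.

```python
from collections import Counter

# Value-specific u-probabilities estimated from value frequencies.
surnames = ["SMITH"] * 900 + ["JONES"] * 95 + ["LEBOWSKI"] * 5  # toy file
freq = Counter(surnames)
n = len(surnames)

# Probability that two unrelated records agree on value v by chance.
u = {name: count / n for name, count in freq.items()}
print(u["SMITH"])     # 0.9   -> chance agreement is common, weak evidence
print(u["LEBOWSKI"])  # 0.005 -> chance agreement is rare, strong evidence
```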
This principle can be applied to any linkage identifier for which values are differentially distributed. When two records agree on an identifier, an agreement weight is calculated by dividing the m-probability by the u-probability and taking the log2 of the quotient. For example, if the probability that true matches agree on month of birth is 97 percent (m = 0.97) and the probability that false matches randomly agree on month of birth is 8.3 percent (u = 1/12), the agreement weight for month of birth is log2(0.97/0.083), or about 3.5.
When two records disagree on an identifier, a disagreement weight is calculated by dividing 1 minus the m-probability by 1 minus the u-probability and again taking the log2 of the quotient. Continuing the example, the disagreement weight for month of birth would be log2[(1 - 0.97)/(1 - 0.083)], or about -4.9. While the method above accounts for the discriminatory power of the identifier, it does not yet take into account the degree to which records agree on a given identifier.
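These two formulas translate directly into code; the sketch below reproduces the month-of-birth arithmetic.

```python
from math import log2

def agreement_weight(m: float, u: float) -> float:
    """Evidence for a match when an identifier agrees: log2(m/u)."""
    return log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """Evidence against a match when it disagrees: log2((1-m)/(1-u))."""
    return log2((1 - m) / (1 - u))

# Month-of-birth example: m = 0.97, u = 1/12 (about 0.083).
m_mob, u_mob = 0.97, 1 / 12
print(round(agreement_weight(m_mob, u_mob), 2))     # 3.54
print(round(disagreement_weight(m_mob, u_mob), 2))  # -4.93
```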
This is important for fields in which typographical errors are likely to occur, such as name fields. Assigning partial agreement weights in situations where two strings do not match character for character can account for minor typographical errors, including spelling errors in names or transposed digits in dates or SSNs. If all of the characters in a string match, character by character, across the two files, then the agreement weight is maximized (set at the full agreement weight); otherwise, a string comparator produces a value between 0 and 1 reflecting the degree of similarity. For example, the comparator yields lower values, and thus lower weights, for short names or when the first characters of the string do not match between the two files.
The full agreement weight for the identifier can then be multiplied by the string comparator value to generate a partial agreement weight. For example, if the full agreement weight for first name is 12 and the string comparator value is, say, 0.9, the partial agreement weight is 12 × 0.9 = 10.8. Once the weights, full and partial, have been calculated for each identifier, the linkage score for a matched pair is the sum of the weights across all linkage identifiers.
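In the linkage literature the comparator is typically Jaro-Winkler; the sketch below substitutes difflib's ratio() from the Python standard library as a stand-in similarity measure, which is an assumption rather than the method any particular study used.

```python
from difflib import SequenceMatcher

def comparator(s1: str, s2: str) -> float:
    """Similarity in [0, 1]; 1.0 means an exact character-for-character match."""
    return SequenceMatcher(None, s1, s2).ratio()

def partial_weight(full_weight: float, s1: str, s2: str) -> float:
    """Scale the full agreement weight by the string comparator value."""
    return full_weight * comparator(s1, s2)

def linkage_score(weights):
    """Pair score = sum of (full or partial) weights over all identifiers."""
    return sum(weights)

# A near-miss spelling keeps most of the full weight of 12 instead of
# being penalized with a full disagreement weight.
print(round(partial_weight(12.0, "JONATHAN", "JONATHON"), 2))  # 10.5
```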
Use of string comparator methods may significantly improve match rates if a large number of typographical errors are expected. An initial assessment of linkage quality can be gained by plotting the match scores in a histogram.
If the linkage algorithm is working properly, the plot should show a bimodal distribution of scores, with one large peak among the lower scores for the large proportion of likely nonmatches and a second, smaller peak among the higher scores for the smaller set of likely matches. Depending on the research question and the nature of the study, the initial threshold can be adjusted to be more conservative (higher score) or more liberal (lower score).
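A quick way to produce this diagnostic plot, assuming `scores` holds the linkage scores computed above (matplotlib is an assumed dependency):

```python
import matplotlib.pyplot as plt

def plot_scores(scores):
    """Histogram of pair scores; look for two peaks separated by a valley."""
    plt.hist(scores, bins=50)
    plt.xlabel("linkage score")
    plt.ylabel("number of record pairs")
    plt.title("Distribution of linkage scores")
    plt.show()
```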
A more conservative threshold will maximize the specificity of the linkage decision, as only those record pairs with a high score will be counted as matches. Conversely, a more liberal threshold will maximize the sensitivity of the linkage decision to possible matches. Cook and colleagues44 define the cutoff threshold as the difference between the desired weight and the starting weight. Given two files, A and B, the starting weight for each record pair is equal to the log2 of the odds of picking a true match by chance, that is, log2[m/(N - m)], where m is the expected number of true matches and N is the total number of record pairs in the comparison space (the product of the two file sizes). If P is the desired probability that two records were not matched together by chance, then the desired weight is log2[P/(1 - P)]; if the desired value of P is 0.95, the desired weight is log2(0.95/0.05), or about 4.25. If the computed linkage score is greater than or equal to the cutoff threshold, then the record pair is classified as a match.
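A sketch of this threshold calculation follows; the file sizes, the expected number of matches, and P = 0.95 are all example inputs the analyst would supply.

```python
from math import log2

def starting_weight(n_a: int, n_b: int, expected_matches: int) -> float:
    """log2 odds of drawing a true match from the comparison space by chance."""
    total_pairs = n_a * n_b
    return log2(expected_matches / (total_pairs - expected_matches))

def desired_weight(p: float) -> float:
    """log2 odds corresponding to desired probability p; 4.25 when p = 0.95."""
    return log2(p / (1 - p))

def cutoff_threshold(n_a, n_b, expected_matches, p=0.95):
    """Cutoff = desired weight minus starting weight (Cook et al.)."""
    return desired_weight(p) - starting_weight(n_a, n_b, expected_matches)

# Example: two files of 10,000 records, about 5,000 expected true matches.
print(round(cutoff_threshold(10_000, 10_000, 5_000), 2))
# Pairs scoring at or above this cutoff are classified as matches.
```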
In the second part of the book, the authors present real-world case studies in which one or more of these techniques are used. They cover a wide variety of application areas.
These include mortgage guarantee insurance, medical, biomedical, highway safety, and social insurance applications, as well as the construction of list frames and administrative lists. The long list of references at the end of the book enables readers to delve more deeply into the subjects discussed. The authors also discuss the software that has been developed to apply the techniques described in the text.
The authors are Thomas N. Herzog, Ph.D., of the U.S. Department of Housing and Urban Development; Fritz J. Scheuren, Ph.D., who has published many papers and monographs; and William E. Winkler, Ph.D., of the U.S. Census Bureau.
The second part describes record linkage. These chapters are superbly written.
Data quality is multidimensional, and involves data cleaning, modelling and analysis, quality control and assurance, and storage and presentation.
Data quality is related to use and cannot be assessed independently of the user. In a data set, the data have no intrinsic quality; they have only potential value, which is realized when someone uses the data to do something useful. Data Quality and Record Linkage Techniques is one of the few books on data quality and record linkage that try to cover and discuss the possible errors arising in different types of data in practical situations.
The intended audience consists of actuaries, economists, statisticians, and computer scientists. Quantitative psychologists are not explicitly mentioned amongst the target readership, but they will benefit from the first part of the book, because data quality and editing are important issues for psychometric data too.
This book is not an exhaustive text but rather an overview, and it will serve as an excellent reference source to guide and improve data quality and record linkage techniques.
The authors unify and formalize various problems in data quality and record linkage. The book has a good balance of theory and practical examples related to data errors in different fields. It includes a detailed review of background material, many practical examples, and a variety of methods for dealing with problems. The book is divided into four parts, covering data quality (chapters 2-5), specialized tools for database improvement (chapters 6-13), record linkage studies (chapters 14-17), and confidentiality and a review of record linkage software (chapters 18 through the end).