Searching Duplicates Concept
Searching for duplicate records is an important part of data quality control. Clearing the system of duplicates reduces the stored data volume, keeps the contents of the system organized, and reduces the number of errors.
Duplicate records are displayed to the user as clusters, which are formed according to matching rules. The rules contain the criteria by which duplicates are identified. For example, records can be compared by two or three attributes while their other attributes differ. Duplicates can also be semantic, that is, they can describe the same thing in different ways. In that case it is important to correctly identify the characteristics of the duplicates so that they can be used in the matching rules.
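The idea of comparing records by only a few attributes can be illustrated with a minimal sketch. It is not the product's rule format: the `MatchingRule` class, the `normalize` helper, and the attribute names are all hypothetical, and exact comparison after simple normalization stands in for whatever algorithm a real rule uses.

```python
from dataclasses import dataclass


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace
    so that differently formatted spellings compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in value)
    return " ".join(cleaned.lower().split())


@dataclass(frozen=True)
class MatchingRule:
    """Hypothetical rule: records match when all listed attributes agree."""
    name: str
    attributes: tuple

    def key(self, record: dict) -> tuple:
        # Only the rule's attributes contribute; all others are ignored.
        return tuple(normalize(str(record.get(attr, ""))) for attr in self.attributes)

    def matches(self, left: dict, right: dict) -> bool:
        return self.key(left) == self.key(right)


rule = MatchingRule("name_and_address", ("name", "address"))
a = {"name": "ACME Ltd.", "address": "1 Main St", "contact": "J. Smith"}
b = {"name": "acme ltd", "address": "1  Main St.", "contact": "Jane Smith"}
c = {"name": "Globex", "address": "1 Main St", "contact": "J. Smith"}

print(rule.matches(a, b))  # True: name and address agree after normalization
print(rule.matches(a, c))  # False: the names differ
```

Records `a` and `b` differ in the contact attribute and in formatting, yet the rule treats them as duplicates because only the name and address are compared.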
Note
Only simple and code attributes are matched. To match relations and attributes of other types, you need to create a custom pipeline.
A cluster contains all entity or lookup entity records whose attributes satisfy the matching criteria. The user can compare these records and process them according to internal business rules.
Matching rules are combined into rule sets, which are assigned to a particular entity or lookup entity. A set can be used in multiple entities, so the same rules do not have to be defined again for each of them.
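As a rough illustration of how a cluster can be formed, the sketch below groups records by the values of the matched attributes. The `build_clusters` function and the sample data are assumptions made for this example; the actual clustering is performed by the system.

```python
from collections import defaultdict


def build_clusters(records, attributes):
    """Group record ids by the values of the matched attributes.

    `records` maps a record id to a dict of attribute values. Only groups
    with two or more records are returned, since a single record is not
    a duplicate of anything.
    """
    groups = defaultdict(list)
    for record_id, record in records.items():
        key = tuple(str(record.get(attr, "")).strip().lower() for attr in attributes)
        if all(key):  # skip records with an empty matched attribute
            groups[key].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


records = {
    1: {"name": "Acme", "address": "1 Main St", "contact": "J. Smith"},
    2: {"name": "ACME", "address": "1 main st", "contact": "Jane Smith"},
    3: {"name": "Globex", "address": "5 Oak Ave", "contact": "H. Scorpio"},
    4: {"name": "acme ", "address": "1 Main St", "contact": ""},
}

clusters = build_clusters(records, ("name", "address"))
print(clusters)  # [[1, 2, 4]]
```

Record 3 has no counterpart and therefore does not appear in any cluster.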
The matching mechanism is triggered and forms clusters of duplicates in the following cases:
When a new record with duplicate characteristics is added in real time;
When a reindexing operation is launched with the “Update matching tables data” flag.
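The difference between the two triggers can be sketched as an incremental path and a batch path over the same lookup structure. The `MatchingIndex` class below is a hypothetical in-memory stand-in for the matching tables, not the system's implementation.

```python
from collections import defaultdict


class MatchingIndex:
    """Hypothetical in-memory stand-in for matching tables."""

    def __init__(self, attributes):
        self.attributes = tuple(attributes)
        self.table = defaultdict(set)  # match key -> record ids

    def _key(self, record):
        return tuple(str(record.get(a, "")).strip().lower() for a in self.attributes)

    def add(self, record_id, record):
        """Real-time path: index one record and return its cluster, if any."""
        ids = self.table[self._key(record)]
        ids.add(record_id)
        return sorted(ids) if len(ids) > 1 else []

    def rebuild(self, records):
        """Batch path: discard the table and re-index every stored record."""
        self.table.clear()
        for record_id, record in records.items():
            self.add(record_id, record)
        return sorted(sorted(ids) for ids in self.table.values() if len(ids) > 1)


index = MatchingIndex(["name", "address"])
first = index.add(1, {"name": "Acme", "address": "1 Main St"})
second = index.add(2, {"name": "ACME", "address": "1 main st"})
print(first)   # []
print(second)  # [1, 2]

stored = {
    1: {"name": "Acme", "address": "1 Main St"},
    2: {"name": "ACME", "address": "1 main st"},
    3: {"name": "Globex", "address": "5 Oak Ave"},
}
rebuilt = index.rebuild(stored)
print(rebuilt)  # [[1, 2]]
```

In this sketch the real-time path touches only the key of the incoming record, while the batch path recomputes every key, which is why a full rebuild is tied to a separate reindexing operation.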
Example of Usage
Take as an example the “Clients” entity, which contains information about the partners that a particular company works with.
The problem is that different departments of the company may have entered the same data according to their own regulations. Customer records are known to contain attributes that were used in different databases. These attributes may be:
Name.
Contact person.
Address.
The duplicate records should be combined into clusters for further processing, that is, for transforming each cluster of records into a single etalon record.
To accomplish this task, the data administrator should perform the following steps:
Prepare a matching table.
Create a matching rule to specify the required algorithm for duplicates detection.
Create a rule set in which the rule and matching table will be used.
Assign the set to the “Clients” entity.
Run the data check for duplicates by one of the available methods.
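The steps above can be modeled in a few lines to show how the pieces relate to each other. This is only an illustrative model: the class names, their fields, and the exact-match algorithm are assumptions, and what a matching table or rule set actually contains in the product is not specified here. In the system itself these steps are carried out by the data administrator.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MatchingTable:
    """Step 1 (assumed shape): the columns whose values are kept for matching."""
    name: str
    columns: tuple


@dataclass
class MatchingRule:
    """Step 2: the detection algorithm; here, exact match on listed columns."""
    name: str
    columns: tuple

    def key(self, record):
        return tuple(str(record.get(c, "")).strip().lower() for c in self.columns)


@dataclass
class RuleSet:
    """Step 3: a set that ties rules to the table they read from."""
    name: str
    table: MatchingTable
    rules: list = field(default_factory=list)


def find_duplicates(rule_set, records):
    """Step 5: run every rule in the set and collect clusters per rule."""
    result = {}
    for rule in rule_set.rules:
        groups = defaultdict(list)
        for record_id, record in records.items():
            groups[rule.key(record)].append(record_id)
        result[rule.name] = [ids for ids in groups.values() if len(ids) > 1]
    return result


table = MatchingTable("clients_table", ("name", "contact_person", "address"))
rule = MatchingRule("same_name_and_address", ("name", "address"))
rule_set = RuleSet("clients_rules", table, [rule])

# Step 4: assign the set to an entity (a plain mapping in this sketch).
assignments = {"Clients": rule_set}

clients = {
    101: {"name": "Acme", "contact_person": "J. Smith", "address": "1 Main St"},
    102: {"name": "acme", "contact_person": "Jane Smith", "address": "1 main st"},
    103: {"name": "Globex", "contact_person": "H. Scorpio", "address": "5 Oak Ave"},
}
found = find_duplicates(assignments["Clients"], clients)
print(found)  # {'same_name_and_address': [[101, 102]]}
```

Because the rule set is a separate object from the entity it is assigned to, the same set could be added to the mapping under another entity name without redefining its rules.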
The data steward can then search for and view clusters of duplicates, compare the records in them by their differing attributes, and process the records.