How to use Match Transform in BODS 4.1 for Dedupliation of Source data
The document aims to provide an insight on how to use Match transform in BODS 4.1.
The document also helps to understand various options available in Match transform of BODS and their use.
Match transform is responsible for performing matching based on the business rules we define.
The transform then sends matching and unique records on to the next transform in the data flow.
For using Match transforms,following steps should be followed .
Steps:
- Place the Input table in the Designer window.
- Select NameAddress_Match transform under Data Quality transforms
- Map ALL the input fields, passing valid values to Person_given_name1,Address Primary Name and Postal code especially .
For the fileds,which does not have any valid input,pass Name1 as the input.
- Drag important columns from Input table i.e. Source table to Output.Such as Customer/Vendor Id,
Customer Name, Customer Country to identify the record.
· Click on Options Tab and Click on Editing options to edit the options.
· After selecting Edit option, Match editor will open.
· Follow the below steps while populating individual fields.
· Option Description
Option | Description |
Perform matching | Select to add the ability to compare records and other match-related operations.
This option is not selected only in the Base_Match transform configuration. For all other configurations, this option is selected. In our case, we will select the option. |
Match set name | Enter a name for this match set. Each Match transform in your data flow
represents a match set. This option is already populated with name you chose if you used the Match wizard to generate this transform. Be sure that this name is unique within the dataflow (it does not match the name of another match set). This name is used in the match reports to differentiate data processed by one match set versus another. |
Match engine | Specifies the match engine to use, based on the type of data you will be processing. If you use the Multinational strategy in the Match wizard, this option is set to Latin1 for all match sets.
Chinese: Specifies that the Match transform will be processing Chinese data in Chinese script. Japanese: Specifies that the Match transform will be processing Japanese data in Japanese script. Korean: Specifies that the Match transform will be processing Korean data in Korean script. Latin1: Specifies that the Match transform will be processing Latin1 data. In general, this is the data used throughout the Americas, Western Europe, Oceania, and much of Africa. Other_Non_Latin1: Specifies that the Match transform will be processing non-Latin1 data, other than Chinese, Japanese, Korean, and Taiwanese, such as Russian, Greek, Hebrew, Arabic, and others. Taiwanese: Specifies that the Match transform will be processing Taiwanese data in Taiwanese script. For optimum accuracy and performance, be sure that you have filtered your multinational data to separate match transforms with the appropriate match engine selected. The Match wizard can do this for you easily, if you select the Multinational strategy. By default, the Latin1 match engine is used. If you attempt to process nonlatin1 data with the Latin1 engine, the results are unpredictable. |
Generate report data | Specifies whether to generate report data for this transform. This option is
available in every transform that generates report data. Yes: Generates report data for this transform. No: Turns off report data generation. If you do not need to generate reports (during testing of dataflows, for example), you should set this option to No. This will improve match performance. In our case,it will be kept as NO |
Logical source field | Specifies the field that contains the ID for the logical source.It will be kept Blank in our case. |
Physical source field | Specifies the field that contains the ID for the physical source (Reader). It will be kept Blank in our case. |
Run as a separate process | Yes: Splits the transform into a separate process.
No: Keeps the transform in same process as the rest of the dataflow. In our case ,it should be NO |
· Follow the screenshot.
· Select to perform data salvaging.
If two records match, data salvaging temporarily copies data from a passenger record
to the driver record after comparing the two records. The data that’s copied is data
that is found in the passenger record but is missing or incomplete in the driver record.
Data salvaging prevents blank matching or initials matching from matching records
that you may not want to match.
In our case, we will not select Data Salvaging
· Group forming allows you to group and prioritize records for better match accuracy and efficiency.
Break groups
Break groups allow you to group records based on common field values (for example, postal code).
Only records that share the same break group value will be compared with one another.
Use break groups to lower the number of comparisons needed and to increase the speed of the matching process.
In our case ,we will not change any values.
· In Matching Option, Click on Match Criteria, to add the fields that are taken as the base fields for Matching ,by assigning a weightage to each field, summing up to 100.
Match score
Specifies the minimum similarity score needed for the records to be considered
a match based on this criteria. Type a value from 0 to 101.
A value of 101 ensures that this criteria alone is not enough to consider two
records a match and that you want to consider other criteria in the comparison process.
For example, a value of 90 means that you consider this data to be important
enough that if the data in two records is 90% similar or higher, the records are
considered a match.
No match score
Specifies the maximum similarity score needed for the records to be considered
a no-match based on this criteria. Type a value from -1 to 100.
A value of -1 ensures that this criteria is not enough to consider two records
a no-match and that you want to consider other criteria in the comparison process.
For example, a value of 49 means that if the similarity between the data in two
records is less than 50%, the records do not match.
Also put the score at which the record , should be classified as a duplicate and unique.
In our case, we have taken Person name 1 and Street as the fields to be matched,with match score as 80.
Click on apply, after making changes.
· Click on Output tab, to select the output fields.
· Select Person_given_name1, Person_given_name2, Address_Primary_name1,Postcode1.
· For rest of output fields, please find below explanation ,to guide yourself.
Field name | Description |
Group_Num | Specifies the records that belong to the same match group, which share
the same group number. The group numbers start with the number 1. Unique records have a blank group number. If you are using association matching in your job, you need to map this on output, because the Associate transform uses it. |
Match_Criterion | Specifies the name of the criteria that made the decision ( if the Match_Type
None is R). Otherwise, the field is blank. |
Match_Level | Specifies the name of the match level used. |
Match_Status | The values for the Match_Status field that appear in your output are:
D: This record is a driver in a match group. P: This record is a passenger in a match group. U: This is a unique record. |
Match_Type | Describes how each record is identified as a match. Possible values are:
: The record did not match any other record. It is a unique record. D: The record was the driver record in the comparison process. R: The record was identified as matching the driver record because one of the criteria met the Match_Score. W: The record was identified as matching the driver because the total weighted score met the Weighted match score. See Matching methods in the Data Services Designer Guide for more information on weighted scores. |
Match_Score | The Match_Score field outputs the following values:
• The criteria similarity score when the Match_Type is R. • The total weighted score when the Match_Type is W. • Blank if the record is a driver record (Match_Type of D) or if the records are unique. |
Group_Count | Provides the total number of records in the match group.
Unique records have a value of 1. |
Group_Order | The master record receives a value of 1. Subordinate records receive a value of 2
through the number of records in the match group. You may control the order by including a Group Prioritization in the Post Match Operations. Unique records have a value of 0. |
Group_Rank | Specifies whether the record is a master (M) or a subordinate (S). Unique records have
an empty value. |
Conclusion:
In case of Duplication in Source data, Match transform provides an approach to classify the data in Unique and Duplicate records.
We need to know about the duplication criteria, as well a match score should be set to nominate a record as a duplicate.
For best results, the data in which you are attempting to find matches should be cleansed. Therefore, we may need to include other Data Quality transforms before the Match transform.
Hope this document makes life easier to use Match transform or to perform deduplication across system.
Regards,
Mayank Mehta
New NetWeaver Information at SAP.com
Very Helpfull