Talend tFuzzyMatch component
Problem: Let's say we have a table 'Distributor_Info' with two columns, 'distributor_id' and 'address1'.
We want to find the distributor_id values that have matching addresses.
I used the tFuzzyMatch component for this task.
Sample Job
Below are the component settings
Unique matching check box:
Select this check box if you want to get the best match possible, in case several matches are available.
Matching type drop-down list:
Select the relevant matching algorithm among:
Levenshtein: Based on edit distance theory. It calculates the number of insertions, deletions, or substitutions required for an entry to match the reference entry.
Metaphone: Based on a phonetic algorithm that indexes entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference, then checks all entries of the main flow against the entries of the reference flow.
Double Metaphone: A newer version of the Metaphone phonetic algorithm that produces more accurate results than the original. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.
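To make the Levenshtein option more concrete, here is a minimal sketch in plain Java of how an edit distance and a min/max distance window could work together. This is not Talend's internal code. The class name `FuzzyAddressSketch`, the methods, and the sample addresses are all hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the tFuzzyMatch implementation.
class FuzzyAddressSketch {

    // Edit distance: the number of insertions, deletions, or substitutions
    // needed to turn string a into string b (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the ids whose reference address lies within [minDist, maxDist]
    // of the given address. With minDist = maxDist = 0 only exact matches pass.
    static List<String> matches(String address, Map<String, String> reference,
                                int minDist, int maxDist) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : reference.entrySet()) {
            int distance = levenshtein(address, entry.getValue());
            if (distance >= minDist && distance <= maxDist) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    // Made-up sample rows standing in for distributor_id -> address1.
    static Map<String, String> sampleReference() {
        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("D1", "12 Main Street");
        reference.put("D2", "12 Main Streat");
        reference.put("D3", "45 Oak Avenue");
        return reference;
    }

    public static void main(String[] args) {
        Map<String, String> reference = sampleReference();
        // Distance window 0..0: exact matches only.
        System.out.println(matches("12 Main Street", reference, 0, 0));
        // Distance window 0..1: also tolerates a single-character typo.
        System.out.println(matches("12 Main Street", reference, 0, 1));
    }
}
```

Widening the maximum distance lets more near-duplicates through, at the cost of more false positives, so it is worth trying a few values against your own address data.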
Output Excel File (Levenshtein with Min. & Max. Distance 0):
Output Excel File (Double Metaphone):
Reference: https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFuzzyMatch