Wednesday, March 18, 2015

Talend tFuzzyMatch component example

Figure: Talend tFuzzyMatch component (https://help.talend.com/images/54/bk-components-rg-542/tFuzzyMatch.png)

Problem: Let's say we have a table 'Distributor_Info' with two columns, 'distributor_id' and 'address1'.



We want to find the distributor_id values that have matching addresses.
I used the tFuzzyMatch component for this task.

Sample Job


Below are the component settings


Unique matching check box:
Select this check box if you want to get the best match possible, in case several matches are available.
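
To make the effect of this check box concrete, here is a minimal Python sketch, not Talend's actual implementation. It assumes the candidate matches for one input row have already been found and scored with a distance. The function name `select_matches`, the sample addresses, and the tie-breaking rule (first candidate wins) are all assumptions made for illustration.

```python
def select_matches(candidates, unique_matching):
    """Pick matches from (reference_value, distance) pairs.

    candidates      -- pairs that already passed the distance check
    unique_matching -- mimics the 'Unique matching' check box
    """
    if not candidates:
        return []
    if unique_matching:
        # Keep only the closest candidate. min() returns the first pair
        # with the smallest distance; this tie-break is an assumption.
        return [min(candidates, key=lambda pair: pair[1])]
    # Otherwise keep every candidate.
    return list(candidates)


# Hypothetical candidates for one input address
candidates = [("12 Main Str.", 3), ("12 Main Streat", 1)]

best = select_matches(candidates, unique_matching=True)
all_matches = select_matches(candidates, unique_matching=False)
print(best)
print(all_matches)
```

With the check box selected, only the closest candidate is kept; without it, every candidate is returned.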

Matching Type drop-down list:

Select the relevant matching algorithm among:

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions or substitutions required for an entry to match the reference entry.

Metaphone:
Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow.

Double Metaphone:
A newer version of the Metaphone phonetic algorithm that produces more accurate results than the original algorithm. It can return both a primary and a secondary code for a string. This accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.
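
The Levenshtein option can be sketched in a few lines of Python. This is only an illustration of the edit-distance idea, not the code Talend runs. The helper `fuzzy_matches`, its `min_distance` and `max_distance` parameters (standing in for the component's Min. and Max. Distance settings), and the sample addresses are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(value, lookup, min_distance=0, max_distance=0):
    """Return (reference, distance) pairs whose distance lies
    between min_distance and max_distance, inclusive."""
    result = []
    for reference in lookup:
        distance = levenshtein(value, reference)
        if min_distance <= distance <= max_distance:
            result.append((reference, distance))
    return result


# Hypothetical lookup addresses
lookup = ["12 Main Street", "12 Main Str.", "45 Oak Avenue"]

# Min. and Max. Distance 0: only identical addresses qualify
exact = fuzzy_matches("12 Main Street", lookup, 0, 0)

# Min. Distance 1 and Max. Distance 100: everything except identical addresses
near = fuzzy_matches("12 Main Street", lookup, 1, 100)
print(exact)
print(near)
```

Under these assumptions, a distance window of 0 to 0 behaves like an exact comparison, while 1 to 100 returns almost any address that is not identical, so a tighter maximum is usually more useful for address matching.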


Output Excel File (Levenshtein with Min. & Max. Distance 0):


 
Output Excel File (Levenshtein with Min. 1 & Max. Distance 100):


Output Excel File (Metaphone):

Output Excel File (Double Metaphone):



 Reference: https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFuzzyMatch
