Thursday, March 19, 2015

Pentaho DI arguments, parameters, and variables

Arguments
A PDI argument is a named, user-supplied, single-value input given as a command line argument (running a transformation or job manually from Pan or Kitchen, or as part of a script). Each transformation or job can have a maximum of 10 arguments.
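For example, when running from the command line, arguments are passed positionally after the options (the paths and values below are hypothetical):

sh pan.sh -file=/path/to/my_transform.ktr value1 value2
sh kitchen.sh -file=/path/to/my_job.kjb value1 value2

Inside the transformation, value1 and value2 are then available as argument 1 and argument 2 (for example, via the Get System Info step).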

Parameters

Parameters are like local variables; they are reusable inputs that apply only to the specific transformation that they are defined in. When defining a parameter, you can assign it a default value to use in the event that one is not fetched for it. This feature makes it unique among dynamic input types in PDI.

Note: If there is a name collision between a parameter and a variable, the parameter will take precedence.

To define a parameter, right-click on the transformation workspace and select Transformation settings from
the context menu (or just press Ctrl-T), then click on the Parameters tab.
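Parameters can also be supplied on the command line with the -param option (the parameter name and value below are hypothetical):

sh pan.sh -file=/path/to/my_transform.ktr "-param:START_DATE=2015-01-01"

If -param:START_DATE is omitted, the default value defined on the Parameters tab is used.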

Variables
A variable in PDI is a piece of user-supplied information that can be used dynamically and programmatically in a variety of different scopes. A variable can be local to a single step, or be available to the entire JVM that PDI is running in.

PDI variables can be used in steps in both jobs and transformations. You define variables with the Set Variables step in a transformation, by hand through the kettle.properties file, or through the Set Environment Variables dialog in the Edit menu.

The Get Variables step can explicitly retrieve a value from a variable, or you can use a variable in any PDI text field that has the diamond dollar sign icon next to it by using a metadata string in either the Unix or Windows format:

${VARIABLE}
%%VARIABLE%%
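For instance, a variable defined by hand in $HOME/.kettle/kettle.properties (the name and path below are hypothetical):

INPUT_DIR=/data/incoming

can then be referenced in any variable-aware field, such as a Text file input step's file name:

${INPUT_DIR}/sales.csv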

Source: Pentaho user guide

Copy and Distribute row in Pentaho DI (Spoon)

Hi All,
When we connect a hop from one step to more than one target step, Spoon shows the dialog box below.



Today I'm going to explain what this means using a simple example transformation.

We have the sample data below in a Data Grid step.



Now, what happens if we connect the Data Grid step to two Dummy steps (see below) using Copy rows?
It supplies all rows to both Dummy steps (see the Step Metrics tab in the bottom area).



Let's figure out what Distribute rows does in the same scenario.


In the above screenshot we can see that each Dummy step gets just 5 rows. So with the Distribute rows option, the rows are distributed round-robin among all the connected steps.

Following are sample data from both Dummy steps using Distribute rows.

Dummy 1




Dummy 2



I hope this post helps you understand the Copy rows and Distribute rows options in Spoon.

Wednesday, March 18, 2015

Slowly changing dimension

What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users to assign the proper dimension attribute value for a given date. Examples of such dimensions are: customer, geography, employee.

There are many approaches to dealing with SCD. The most popular, illustrated with a small worked example after the list, are:



  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column

  • Credit: http://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.html
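For example (names and keys made up for illustration), suppose customer Mary moves from Delhi to Mumbai, and the dimension row before the change is:

customer_key=101, customer_id=C-17, name=Mary, city=Delhi

Type 0 keeps the row unchanged. Type 1 overwrites city with Mumbai, losing the history. Type 2 expires row 101 (e.g. by setting an end date or a current-row flag) and inserts a new row, customer_key=102, city=Mumbai, so the full history is kept and facts join to the version valid on their date. Type 3 adds a previous_city column, keeping exactly one prior value: city=Mumbai, previous_city=Delhi.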


    In order to manage Slowly Changing Dimensions properly and easily it is highly recommended to use Surrogate Keys in the Data Warehouse tables.
    A Surrogate Key is a technical key added to a fact table or a dimension table which is used instead of a business key (like product ID or customer ID).
    Surrogate keys are always numeric and unique on a table level which makes it easy to distinguish and track values changed over time.

    In practice, in big production Data Warehouse environments, mostly the Slowly Changing Dimensions Type 1, Type 2 and Type 3 are considered and used. It is a common practice to apply different SCD models to different dimension tables (or even columns in the same table) depending on the business reporting needs of a given type of data.

    Talend tFuzzyMatch component example

[Image: Talend tFuzzyMatch component - https://help.talend.com/images/54/bk-components-rg-542/tFuzzyMatch.png]

Problem: Let's say we have a table 'Distributor_Info' with two columns, 'distributor_id' and 'address1'.



We want to find the distributor_ids that have matching addresses.
I used tFuzzyMatch for this task.

    Sample Job


Below are the component settings.


    Unique matching check box:
    Select this check box if you want to get the best match possible, in case several matches are available.

Matching type drop-down list:

    Select the relevant matching algorithm among:

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions, or substitutions required for an entry to match the reference entry.
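To make the edit-distance idea concrete, below is a minimal plain-Java sketch of the Levenshtein computation (an illustration only, not Talend's internal code):

public class LevenshteinDemo {
    // Classic dynamic-programming edit distance: the minimum number of
    // insertions, deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}

In tFuzzyMatch, setting min. and max. distance both to 0 returns only exact matches, while widening the range (e.g. 1 to 100) also returns near matches.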

    Metaphone:
    Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow.

Double Metaphone:
A newer version of the Metaphone phonetic algorithm that produces more accurate results than the original. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.


Output Excel File (Levenshtein with min. & max. distance 0):


Output Excel File (Levenshtein with min. distance 1 & max. distance 100):


Output Excel File (Metaphone):


Output Excel File (Double Metaphone):



     Reference: https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFuzzyMatch

    Monday, March 16, 2015

    XML Components in Talend

    tAdvancedFileOutputXML:

    tAdvancedFileOutputXML outputs data to an XML type of file and offers an interface to deal with loop and group by elements if needed.




    tDTDValidator:
    Validates the XML input file against a DTD file and sends the validation log to the defined output.



    tXSDValidator:

    Validates an input XML file or an input XML flow against an XSD file and sends the validation log to the defined output.


tXSLT:

Transforms an input XML file or flow by applying an XSL stylesheet, and writes the result (XML, HTML, or text) to the defined output. Note that, unlike the two validators above, this component transforms rather than validates.
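Outside Talend, the same kind of transformation can be done with the standard javax.xml.transform API; a minimal sketch (file names are placeholders):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet, then apply it to the input document
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("style.xsl"));
        t.transform(new StreamSource("input.xml"),
                    new StreamResult("output.xml"));
    }
}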

    tFileInputMSXML:

tFileInputMSXML reads and outputs multiple schemas within an XML structured file.



    Thursday, March 12, 2015

    Downloading files from FTP in Talend

    Hi,
I'm going to show you how to download files from an FTP server in Talend.

    Sample Job:


    Put 2 components into your workspace:
    • tFTPConnection
    • tFTPGet






Note: We can list all the files on the server by using the tFTPFileList component.
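For readers curious what these components do under the hood, here is a minimal sketch of an FTP download in plain Java using Apache Commons Net (host, credentials, and file names are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTPClient;

public class FtpGetDemo {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");    // roughly what tFTPConnection does
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        try (OutputStream out = new FileOutputStream("local_copy.txt")) {
            ftp.retrieveFile("/remote/file.txt", out); // roughly what tFTPGet does
        }
        ftp.logout();
        ftp.disconnect();
    }
}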

    Thursday, March 5, 2015

    Loading context values from ini files in Talend

    Hi,
Sometimes we need to load context values from an external .ini or .properties file.


    Sample ini file content:
    database=mydb
    db_user=root
    db_password=brij
    db_host=localhost


    Sample Job:




For this we have the tFileInputProperties component.
This component reads the key=value pairs from the file.


For loading this data into the context we can use the tContextLoad component.
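Once tContextLoad has run, the values are available everywhere in the job as context variables. For example, in a tJava (assuming context variables named db_host and database are defined in the job, matching the keys in the file):

System.out.println("Connecting to " + context.db_host + "/" + context.database);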


    Fetching SugarCRM data using Talend REST component

    Hi All,
Following are the steps to get SugarCRM data using Talend...

    Sample Job:



In order to connect with SugarCRM you need an OAuth token.
In the above settings, you have to configure the client ID and secret from your SugarCRM setup.
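For reference, on SugarCRM 7 the token request typically goes to the oauth2/token REST endpoint with a body like the one below (host, credentials, and API version depend on your installation; all values here are placeholders):

POST http://<your-sugar-host>/rest/v10/oauth2/token
{
  "grant_type": "password",
  "client_id": "<your_client_id>",
  "client_secret": "<your_client_secret>",
  "username": "<user>",
  "password": "<password>",
  "platform": "base"
}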



Here we are extracting the token from the JSON response.




We are setting the token here and will use it later in the job.




We are retrieving Accounts module data here; we may also get the data of other modules.
Select the Accept type drop-down to choose a JSON or XML response.

We need to pass this token in the Advanced settings.


    Download files using tFileFetch in Talend

    Hi folks,

We can download files from any URL by using the tFileFetch component.

    Let's see an example




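Conceptually, tFileFetch does the equivalent of this plain-Java snippet (the URL and file name are placeholders):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileFetchDemo {
    public static void main(String[] args) throws Exception {
        // Open the remote stream and copy it to a local file
        try (InputStream in = new URL("http://example.com/data.csv").openStream()) {
            Files.copy(in, Paths.get("data.csv"));
        }
    }
}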

    Tuesday, March 3, 2015

    Handling Escaped Characters in Talend

    Hi,
We save our string data into the database after handling escaped characters,
e.g. \n, \t, \b, etc.

Traditionally we use tFilterRows or tMap expressions for this.

But today I'll show you how to use the tScriptRules component for the same task.
Download this component from Talend Exchange at the URL below:
https://www.talendforge.org/exchange/?eid=360&product=tos&action=view&nav=1,0,0


Source data:



In the data above, note row 5:
'Rajul', '3240 Mahindra Park\n behind hdfc bank'

The objective is to find all addresses containing the \n escaped character.



Use the expression below in tScriptRules:

    "input_row.address_street =~ '.*(\\n).*'"

Note: Rules -- expressions that evaluate to true or false -- are created using a JavaScript-like language called JEXL.

We can also handle email validation, such as finding strings ending with yahoo.com:

    "input_row.field_name =~ '.*yahoo.com$'"

We can also use the && and || operators in tScriptRules expressions, as shown below.
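For example, a combined rule (with a hypothetical second field name) might look like:

"input_row.address_street =~ '.*(\\n).*' && input_row.name != null"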


The Output:







    Importing Routine from Talend Exchange

We can download many useful demo jobs and routines from Talend Exchange.
    Following are the steps...

1. Go to the repository area.






2. Click on the From Exchange button.



3. Select Routine (we can import demo jobs as well) and then select your Talend version (very important).
After selection you will see the list of available items.





Select any one and click the OK button.
You are done. Enjoy!

    Monday, March 2, 2015

    tForeach example in Talend

    Sample Job:


Click on the plus icon to add values, and then use the code below to get the current value:


    ((String)globalMap.get("tForeach_1_CURRENT_VALUE"))

    I've used this code in tJava.
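For example, a tJava body that prints each value as the Iterate link fires (assuming the component is named tForeach_1) could be:

System.out.println("Current value: " + (String) globalMap.get("tForeach_1_CURRENT_VALUE"));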




    Pass data from one job to another using tBufferOutput in Talend

Sample Data:


    Job 1:



Schema for the Fixed Flow Input:



Now we want to pass the above data into another job. tBufferOutput buffers the rows of the job it runs in; when that job is called through tRunJob, the buffered rows become the output flow of the tRunJob component. So create a new job and add tRunJob and tLogRow.

    Job 2:



In tRunJob select the Job 1 name and go to its Component settings,
then click on Edit schema and copy/paste the same schema as the tBufferOutput (see above image).


    Final Output:




    Pentaho and Talend Data Integration Comparison





#  | Pentaho                                                    | Talend
1  | Passing data from one job to another is simple             | Comparatively complex
2  | Direct JavaScript support                                  | Only Java support
3  | Job scheduling components/steps available                  | Not available
4  | Import/export of schemas at component level not supported  | Very good support
5  | Table lookup has limited functionality                     | tMap has multiple join, filtering, reject, etc. functionality
6  | Automatic job documentation not supported                  | Supported
7  | Multiple logging levels available for job execution        | Single type of log detail available
8  | Limited options for job deployment                         | Multiple deployment options available (sh, jar, web service, ESB job, etc.)
9  | Supports step-by-step, component-wise preview of data      | Not available
10 | Poor support for multiple database connections             | Very good support while working with many database connections
11 | Only the Main data flow is used between steps              | Both Main and Iterate flows supported
12 | Main flow is easy to understand                            | Main & Iterate flows in jobs are comparatively difficult to understand
13 | Does not hang (responds well)                              | Hangs often







    Will be adding some more points...come back later