Thursday, March 19, 2015

Pentaho DI arguments, parameters, and variables

Arguments
A PDI argument is a named, user-supplied, single-value input given as a command line argument (running a transformation or job manually from Pan or Kitchen, or as part of a script). Each transformation or job can have a maximum of 10 arguments.
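For example, when running from the command line, arguments are passed positionally after the options (the paths and values below are hypothetical):

sh pan.sh -file=/path/to/my_transform.ktr value1 value2
sh kitchen.sh -file=/path/to/my_job.kjb value1 value2

Inside the transformation, value1 and value2 are then available as argument 1 and argument 2 (for example, via the Get System Info step).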

Parameters

Parameters are like local variables; they are reusable inputs that apply only to the specific transformation that they are defined in. When defining a parameter, you can assign it a default value to use in the event that one is not fetched for it. This feature makes it unique among dynamic input types in PDI.

Note: If there is a name collision between a parameter and a variable, the parameter will take precedence.

To define a parameter, right-click on the transformation workspace and select Transformation settings from
the context menu (or just press Ctrl-T), then click on the Parameters tab.
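Parameters can also be supplied on the command line with the -param option (the parameter name and value below are hypothetical):

sh pan.sh -file=/path/to/my_transform.ktr "-param:START_DATE=2015-01-01"

If -param:START_DATE is omitted, the default value defined on the Parameters tab is used.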

Variables
A variable in PDI is a piece of user-supplied information that can be used dynamically and programmatically in a variety of different scopes. A variable can be local to a single step, or be available to the entire JVM that PDI is running in.

PDI variables can be used in steps in both jobs and transformations. You define variables with the Set Variables step in a transformation, by hand through the kettle.properties file, or through the Set Environment Variables dialog in the Edit menu.

The Get Variables step can explicitly retrieve a value from a variable, or you can use a variable in any PDI text field that has the diamond dollar sign icon next to it by using a metadata string in either the Unix or Windows format:

${VARIABLE}
%%VARIABLE%%
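For instance, a variable defined by hand in $HOME/.kettle/kettle.properties (the name and path below are hypothetical):

INPUT_DIR=/data/incoming

can then be referenced in any variable-aware field, such as a Text file input step's file name:

${INPUT_DIR}/sales.csv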

Source: Pentaho user guide

Copy and Distribute row in Pentaho DI (Spoon)

Hi All,
When we connect a hop from one step to more than one target step, Spoon shows the dialog box below.



Today I'm going to explain what this means using a simple example transformation.

We have the sample data below in a Data Grid step.



Now, what happens if we connect the Data Grid step to two Dummy steps (see below) using Copy rows?
It supplies all rows to both Dummy steps (see the Step Metrics tab in the bottom area).



Let's figure out what Distribute rows does in the same scenario.


In the above screenshot we can see that each Dummy step gets just 5 rows. So with the Distribute rows option, the rows are distributed round-robin among all the connected steps.

Following are sample data from both Dummy steps using Distribute rows.

Dummy 1




Dummy 2



I hope this post helps you understand the Copy rows and Distribute rows options in Spoon.

Wednesday, March 18, 2015

Slowly changing dimension

What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users to assign the proper dimension attribute value for a given date. Examples of such dimensions are: customer, geography, employee.

There are many approaches to dealing with SCD. The most popular, illustrated with a small worked example after the list, are:



  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column

  • Credit: http://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.html
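For example (names and keys made up for illustration), suppose customer Mary moves from Delhi to Mumbai, and the dimension row before the change is:

customer_key=101, customer_id=C-17, name=Mary, city=Delhi

Type 0 keeps the row unchanged. Type 1 overwrites city with Mumbai, losing the history. Type 2 expires row 101 (e.g. by setting an end date or a current-row flag) and inserts a new row, customer_key=102, city=Mumbai, so the full history is kept and facts join to the version valid on their date. Type 3 adds a previous_city column, keeping exactly one prior value: city=Mumbai, previous_city=Delhi.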


    In order to manage Slowly Changing Dimensions properly and easily it is highly recommended to use Surrogate Keys in the Data Warehouse tables.
    A Surrogate Key is a technical key added to a fact table or a dimension table which is used instead of a business key (like product ID or customer ID).
    Surrogate keys are always numeric and unique on a table level which makes it easy to distinguish and track values changed over time.

    In practice, in big production Data Warehouse environments, mostly the Slowly Changing Dimensions Type 1, Type 2 and Type 3 are considered and used. It is a common practice to apply different SCD models to different dimension tables (or even columns in the same table) depending on the business reporting needs of a given type of data.

    Talend tFuzzyMatch component example

[Image: Talend tFuzzyMatch component - https://help.talend.com/images/54/bk-components-rg-542/tFuzzyMatch.png]

Problem: Let's say we have a table 'Distributor_Info' with two columns, 'distributor_id' and 'address1'.



We want to find the distributor_ids that have matching addresses.
I used tFuzzyMatch for this task.

    Sample Job


Below are the component settings.


    Unique matching check box:
    Select this check box if you want to get the best match possible, in case several matches are available.

Matching type drop-down list:

    Select the relevant matching algorithm among:

Levenshtein: Based on the edit distance theory. It calculates the number of insertions, deletions, or substitutions required for an entry to match the reference entry.
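To make the edit-distance idea concrete, below is a minimal plain-Java sketch of the Levenshtein computation (an illustration only, not Talend's internal code):

public class LevenshteinDemo {
    // Classic dynamic-programming edit distance: the minimum number of
    // insertions, deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}

In tFuzzyMatch, setting min. and max. distance both to 0 returns only exact matches, while widening the range (e.g. 1 to 100) also returns near matches.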

    Metaphone:
    Based on a phonetic algorithm for indexing entries by their pronunciation. It first loads the phonetics of all entries of the lookup reference and checks all entries of the main flow against the entries of the reference flow.

Double Metaphone:
A newer version of the Metaphone phonetic algorithm that produces more accurate results than the original. It can return both a primary and a secondary code for a string, which accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry.


Output Excel File (Levenshtein with min. & max. distance 0):


Output Excel File (Levenshtein with min. distance 1 & max. distance 100):


Output Excel File (Metaphone):


Output Excel File (Double Metaphone):



     Reference: https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFuzzyMatch

    Monday, March 16, 2015

    XML Components in Talend

    tAdvancedFileOutputXML:

    tAdvancedFileOutputXML outputs data to an XML type of file and offers an interface to deal with loop and group by elements if needed.




    tDTDValidator:
    Validates the XML input file against a DTD file and sends the validation log to the defined output.



    tXSDValidator:

    Validates an input XML file or an input XML flow against an XSD file and sends the validation log to the defined output.


tXSLT:

Transforms an input XML file or flow by applying an XSL stylesheet, and writes the result (XML, HTML, or text) to the defined output. Note that, unlike the two validators above, this component transforms rather than validates.
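Outside Talend, the same kind of transformation can be done with the standard javax.xml.transform API; a minimal sketch (file names are placeholders):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet, then apply it to the input document
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("style.xsl"));
        t.transform(new StreamSource("input.xml"),
                    new StreamResult("output.xml"));
    }
}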

    tFileInputMSXML:

tFileInputMSXML reads and outputs multiple schemas within an XML structured file.



    Thursday, March 12, 2015

    Downloading files from FTP in Talend

    Hi,
I'm going to show you how to download files from an FTP server in Talend.

    Sample Job:


    Put 2 components into your workspace:
    • tFTPConnection
    • tFTPGet






Note: We can list all the files on the server by using the tFTPFileList component.
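For readers curious what these components do under the hood, here is a minimal sketch of an FTP download in plain Java using Apache Commons Net (host, credentials, and file names are placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTPClient;

public class FtpGetDemo {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");    // roughly what tFTPConnection does
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        try (OutputStream out = new FileOutputStream("local_copy.txt")) {
            ftp.retrieveFile("/remote/file.txt", out); // roughly what tFTPGet does
        }
        ftp.logout();
        ftp.disconnect();
    }
}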

    Thursday, March 5, 2015

    Loading context values from ini files in Talend

    Hi,
Sometimes we need to load context values from an external .ini or .properties file.


    Sample ini file content:
    database=mydb
    db_user=root
    db_password=brij
    db_host=localhost


    Sample Job:




For this we have the tFileInputProperties component.
This component reads the key=value pairs from the file.


For loading this data into the context we can use the tContextLoad component.
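Once tContextLoad has run, the values are available everywhere in the job as context variables. For example, in a tJava (assuming context variables named db_host and database are defined in the job, matching the keys in the file):

System.out.println("Connecting to " + context.db_host + "/" + context.database);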


    Fetching SugarCRM data using Talend REST component

    Hi All,
Following are the steps to get SugarCRM data using Talend...

    Sample Job:



In order to connect with SugarCRM you need an OAuth token.
In the above settings, you have to configure the client ID and secret from your SugarCRM setup.
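For reference, on SugarCRM 7 the token request typically goes to the oauth2/token REST endpoint with a body like the one below (host, credentials, and API version depend on your installation; all values here are placeholders):

POST http://<your-sugar-host>/rest/v10/oauth2/token
{
  "grant_type": "password",
  "client_id": "<your_client_id>",
  "client_secret": "<your_client_secret>",
  "username": "<user>",
  "password": "<password>",
  "platform": "base"
}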



Here we are extracting the token from the JSON response.




We are setting the token here and will use it later in the job.




We are retrieving Accounts module data here; we may also get the data of other modules.
Select the Accept type drop-down to choose a JSON or XML response.

We need to pass this token in the Advanced settings.


    Download files using tFileFetch in Talend

    Hi folks,

We can download files from any URL by using the tFileFetch component.

    Let's see an example




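Conceptually, tFileFetch does the equivalent of this plain-Java snippet (the URL and file name are placeholders):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileFetchDemo {
    public static void main(String[] args) throws Exception {
        // Open the remote stream and copy it to a local file
        try (InputStream in = new URL("http://example.com/data.csv").openStream()) {
            Files.copy(in, Paths.get("data.csv"));
        }
    }
}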

    Tuesday, March 3, 2015

    Handling Escaped Characters in Talend

    Hi,
We save our string data into the database after handling escaped characters,
e.g. \n, \t, \b, etc.

Traditionally we use tFilterRows or tMap expressions for this.

But today I'll show you how to use the tScriptRules component for the same task.
Download this component from Talend Exchange at the URL below:
https://www.talendforge.org/exchange/?eid=360&product=tos&action=view&nav=1,0,0


Source data:



In the data above, note row 5:
'Rajul', '3240 Mahindra Park\n behind hdfc bank'

The objective is to find all addresses containing the \n escaped character.



Use the expression below in tScriptRules:

    "input_row.address_street =~ '.*(\\n).*'"

Note: Rules -- expressions that evaluate to true or false -- are created using a JavaScript-like language called JEXL.

We can also handle email validation, such as finding strings ending with yahoo.com:

    "input_row.field_name =~ '.*yahoo.com$'"

We can also use the && and || operators in tScriptRules expressions, as shown below.
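For example, a combined rule (with a hypothetical second field name) might look like:

"input_row.address_street =~ '.*(\\n).*' && input_row.name != null"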


The Output:







    Importing Routine from Talend Exchange

We can download many useful demo jobs and routines from Talend Exchange.
    Following are the steps...

1. Go to the repository area.






2. Click on the From Exchange button.



3. Select Routine (we can import demo jobs as well) and then select your Talend version (very important).
After selection you will see the list of available items.





Select any one and click the OK button.
You are done. Enjoy!

    Monday, March 2, 2015

    tForeach example in Talend

    Sample Job:


Click on the plus icon to add values, and then use the code below to get the current value:


    ((String)globalMap.get("tForeach_1_CURRENT_VALUE"))

    I've used this code in tJava.
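For example, a tJava body that prints each value as the Iterate link fires (assuming the component is named tForeach_1) could be:

System.out.println("Current value: " + (String) globalMap.get("tForeach_1_CURRENT_VALUE"));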




    Pass data from one job to another using tBufferOutput in Talend

Sample Data:


    Job 1:



Schema for the Fixed Flow Input:



Now we want to pass the above data into another job. tBufferOutput buffers the rows of the job it runs in; when that job is called through tRunJob, the buffered rows become the output flow of the tRunJob component. So create a new job and add tRunJob and tLogRow.

    Job 2:



In tRunJob select the Job 1 name and go to its Component settings,
then click on Edit schema and copy/paste the same schema as the tBufferOutput (see above image).


    Final Output:




    Pentaho and Talend Data Integration Comparison





#  | Pentaho                                                    | Talend
1  | Passing data from one job to another is simple             | Comparatively complex
2  | Direct JavaScript support                                  | Only Java support
3  | Job scheduling components/steps available                  | Not available
4  | Import/export of schemas at component level not supported  | Very good support
5  | Table lookup has limited functionality                     | tMap has multiple join, filtering, reject, etc. functionality
6  | Automatic job documentation not supported                  | Supported
7  | Multiple logging levels available for job execution        | Single type of log detail available
8  | Limited options for job deployment                         | Multiple deployment options available (sh, jar, web service, ESB job, etc.)
9  | Supports step-by-step, component-wise preview of data      | Not available
10 | Poor support for multiple database connections             | Very good support while working with many database connections
11 | Only the Main data flow is used between steps              | Both Main and Iterate flows supported
12 | Main flow is easy to understand                            | Main & Iterate flows in jobs are comparatively difficult to understand
13 | Does not hang (responds well)                              | Hangs often







    Will be adding some more points...come back later