Thursday, January 29, 2015

Talend Data Quality

Today I'm going to demonstrate the sample results produced by Talend Data Quality (DQ) Tool. This gives us an idea how is DQ useful for you data migration projects.
To save our time I added many screen-shots of resultant analysis, you will find more functional ideas of Talend DQ.
Overview:

Talend Data Quality is a data profiler tool that can help us to know our data more deeply. It supports database, csv, excel (file based) and MDM data connections for
profiling and cleansing.
Simple Use Case:
Let say you have an excel file and you have to do a data migration task with that file. One way to start the task is open that file and check row by row the data of that file, select column and apply filters to check the redundant data, that is very time consuming work.
Another way to do that, use Talend Data Quality tool and do the same task in some clicks with many other useful details.

In short DQ can do the following tasks:-


By using this Talend Data Quality we can get following details of our data:-
(a) Catalog Analysis shows total number of rows, columns, indexes, view, keys and table listing with rows, keys and index.
(b) Table Analysis show various table data details.
​Column Analysis: 


Database Compare:


Wednesday, January 28, 2015

MySQL Storage Engines



Many times we just thinking what should be the perfect storage engine for my application? So I come up with this useful topic, which I believe will certainly help you.

So question is what is storage engine:
A storage engine is what stores, handles, and retrieves information from a table. There is no "perfect" or "recommended" storage engine to use, but for most applications the MyISAM or InnoDB is fine.

It is essential to understand the features of each table type in MySQL so that you can use them effectively to maximize the performance of your databases.

For example, if you store a lot of log data you might want to use the ARCHIVE storage engine which only supports INSERT and SELECT.


Here I am listing the MySQL storage engines, you can use as per your application requirements-

MyISAM:
* Provides high-speed storage and retrieval
* Supports full text searching.
* It does table level locking (when insert,update and delete)
* Non-transactional.
* MySQL checks MyISAM tables for corruption and even repair them in case of errors.
* Manages non transactional tables.


InnoDB:

* Its design follows the ACID model, with transactions featuring commit, rollback, and  crash-recovery capabilities to protect user data.
* Supports FOREIGN KEY referential-integrity constraints.
* It also support row-level locking
* It stores data in clustered indexes which reduces I/O for queries based on primary keys.

MEMORY:
* Provides in-memory tables, formerly known as HEAP.
* It sores all data in RAM for faster access than storing data on disks.
* Useful for quick looks up of reference and other identical data.

MERGE:
* It enables users to have a collection of identical MyISAM tables to be handeled by a single table. "Identical" means that all tables have identical column and index information.

e.g. If you store sales transactions on your site in a daily table, and want to pull a report for the month, this would allow you to execute a query against a single table to pull all sales for the month.

ARCHIVE:
It is used for storing large amounts of data without indexes in a very small footprint.
This engine will only support INSERT and SELECT, and all information is compresses.
This makes it the perfect storage engine for logs, point of sale transactions, accounting, etc.

CSV:
It stores data in text files using comma-separated values.
As such this is not an ideal engine for large data storage, tables requiring indexing, etc.
The best use-case for this is transferring data to a spreadsheet for later use.

FEDERATED:
This storage engine enables data to be accessed from a remote MySQL database on a local server without using replication or cluster technology. When using a FEDERATED table, queries on the local server are automatically executed on the remote (federated) tables. No data is stored on the local tables.

BLACKHOLE:
It does not allow for any data to be stored or retrieved, is good for testing database structures, indexes, and queries. Retrievals always return an empty result.

BDB (BerkeleyDB):

The BDB handles transaction-safe tables and uses a hash based storage system. This allows for some of the quickest reading of data, especially when paired with unique keys. There are, however, many downfalls to the BDB system, including the speed on un-index rows, and this makes the BDB engine a less than perfect engine choice. Because of this, many people tend to overlook the BDB engine.

Common Linux Commands

1. Remove broken applications.
apt-get -f install
apt-get autoremove

2. Install binary package with Package manager (Ubuntu)
install .deb file in Ubuntu
dpkg -i teamviewer_amd64.deb

Without dependency
dpkg --force-depends -i teamviewer_amd64.deb

3. Print Architecture
dpkg --print-architecture

4. Download file from console
wget http://software.dakotawallender.com/Teamviewer/teamviewer_10.0.36281_amd64.deb

5. Search a file
locate teamviewer_10.0.36281_amd64.deb


6. Show disc partition
fdisk -l

7. Show who/when is logged on and what they are doing
w

8. List files
ll
ll -h (human readable)
ls -l

9. Install a program
sudo apt-get install mysql

10. Find Application Depends
cd <to the directory where your .deb file is placed)
dpkg-deb -I teamviewer_10.0.36281_amd64.deb | grep Depends

11. Shutdown
poweroff

12. Create file
touch myfile.txt

13. Service start stop and status
service apache2 start
service apache2 stop
service apache2 restart
service apache2 status

note: apache2 is a service you can use any one like mysql

14. Display Linux processes
top

15. Show running services
service --status-all

16. Show last part of a file
tail myfile.txt

17. Tell how long the system has been running
uptime -p

18. kill - send a signal to a process
kill PID (process ID)

19. report a snapshot of the current processes
ps

20. Compress file using TAR
tar -cvf /home/brij/mydir.tar mydir

21. Change permission
chmod -R 777 /home/brij/mydir.tar

22. Restart system
reboot

23. OpenSSH SSH client (remote login program)

login using password
ssh user@10.10.55.56

login using .pem (key) file
ssh -i/home/brij/Downloads/my_key.pem user@10.10.55.56

24. Un-compress file using TAR
tar -xzf /home/brij/brij.tar.gz

25. secure copy (remote file copy program)
scp source destination

26. Show file content and concatenate
cat myfile.txt

27. Display amount of free and used memory in the system
free -h



Feel free to comment, if you like this post :)

Friday, January 23, 2015

Optimize talend speed

We can speed up our talend by configuring the memory allocation.
See how...

Go to your talend directory:
cd /opt/TOS_DI-r118616-V5.5.1

and edit the INI config file:
TOS_DI-linux-gtk-x86.ini           - for 32 bit
TOS_DI-linux-gtk-x86_64.ini     - for 64 bit

For Windows use below instead:

TOS_DI-win32-x86.ini          - for 32 bit
TOS_DI-win-x86_64.ini        - for 64 bit

we have to edit the values of following variables according to our machine:


-Xms256m : is the initial memory size (change it to 512m or bigger)
-Xmx768m : is the max allocation memory size (change it to 2048m or bigger)



after changing the ini file content, restart you talend,
you will feel that your talend is running much faster than earlier.

Scroll bar is not showing in Ubuntu

Hi All,

the problem was with our OS,i.e. Ubuntu 12.04 not with Talend DI. to overcome that problem just disable overlay scroll-bar of Ubuntu. by using below command:
gsettings set org.gnome.desktop.interface ubuntu-overlay-scrollbars false
and restart your taland DI. that's it!!

Date conversion in Talend

Use Case-
Let say you want to convert string to date data type for a date value.

You can use tConvertType for this work. See below sample job...


our source sting date is "2014-11-04T10:43:26.627+0530" so for conversion we should use the date format "yyyy-MM-dd'T'HH:mm:ss.SSSZ".

here +0530 and "Z" represents the time zone.

Note: For the same date in MySQL we can use format "yyyy-MM-dd HH:mm:ss".


 
Following are the some valid general date patterns...

e.g. we have the date 2001-07-04 12:08:56, see the many variants...
Date and Time Pattern Result
"yyyy.MM.dd G 'at' HH:mm:ss z" 2001.07.04 AD at 12:08:56 PDT
"EEE, MMM d, ''yy" Wed, Jul 4, '01
"h:mm a" 12:08 PM
"hh 'o''clock' a, zzzz" 12 o'clock PM, Pacific Daylight Time
"K:mm a, z" 0:08 PM, PDT
"yyyyy.MMMMM.dd GGG hh:mm aaa" 02001.July.04 AD 12:08 PM
"EEE, d MMM yyyy HH:mm:ss Z" Wed, 4 Jul 2001 12:08:56 -0700
"yyMMddHHmmssZ" 010704120856-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSZ" 2001-07-04T12:08:56.235-0700
"yyyy-MM-dd'T'HH:mm:ss.SSSXXX" 2001-07-04T12:08:56.235-07:00
"YYYY-'W'ww-u" 2001-W27-3

Thursday, January 22, 2015

Print variable values in talend

You can print value of variable using tJava, tjavaRow or tMsgbox componets.
For testing I think the tMsgbox will be better because you don't need to write
System.out.println("hello");

sample job how to use  tMsgbox-


Talend Business Model

A Business Model is a non technical pictorial implementation of a business logic.

How to create:
Go to repository click on Business Model



The main components with their name are as below-
Note- To add to work-area, double click on component from pallete.

Formatting Items: Select the item by clicking on it, then right click. See the screen-shot below.




Features:

Auto Arrange items: Right click on design area and click on Arrange All option, it will arrange all your items nicely according to their size and relationship.

Save Business Model As Image: You can save your model as an image by using below steps:
Right click on design area >> File >> Save as image file.

Assignment: This is very good feature in which you can link any source e.g. csv file, custom routine, meta-data etc. to any item.
Example: In the image below I am reading data from source excel file. To do that select the input item (read customer data), go to repository tree then drag and drop the excel file. Now you can see that file detail in Business Model window's Assignment Tab.

Wednesday, January 21, 2015

How to get SQL from tMysqlInput component

Hi,
We can get SQL by using the below code-
((String)globalMap.get("tMysqlInput_1_QUERY"))


Find the sample job below to understand how to use the code-


Below is the job output with the SQL query of MySQL input component-



Enjoy!!


How to get Current Iteration No. of FlowToIterate

We can get the Current Iteration number by using below code-

("tFlowToIterate_1_CURRENT_ITERATION"))
use the code with tJava.

The sample job is given below-

Tuesday, January 20, 2015

Data validation with tSchemaComplianceCheck component

By tSchemaComplianceCheck we can validate following data-

  • number validation - maximum length
  • string validation - maximum length
  • date pattern  validation 
To configure click on Edit Schema button of  "tSchemaComplianceCheck"



you will get the errorMessage and errorCode for the bad data. like below-



Generate data with custom function in tRowGenerator

Go to Repository
Code >> Routines >> Create DemoRoutines and call it in tRowGenerator.


Monday, January 19, 2015

tWaitForSqlData example

You can use row count as a trigger by using this component.


set the configuration as in above image.

And use the below code to access values-

System.out.println("CURRENT_ITERATION");

System.out.println(
((Integer)globalMap.get("tWaitForSqlData_1_CURRENT_ITERATION"))
);

System.out.println("ROW_COUNT");

System.out.println(
((Integer)globalMap.get("tWaitForSqlData_1_ROW_COUNT"))
);


tMysqlTableList Example

List all table name of a database-

1. put a tMysqlConnection component which database table you want to see.
2. put tMysqlTableList
3. put tIterateToFlow
4. put tLogRow
5. put tJava

use below codes with components-

to get table list-
((String)globalMap.get("tMysqlTableList_1_CURRENT_TABLE"))


to get total number of tables-
System.out.println("total no. of tables:"+
((Integer)globalMap.get("tMysqlTableList_1_NB_TABLE"))
);






View live heap status (memory status) of job execution

Hi all,
Follow the following steps to see live heap status:-
Go to Window menu >> preferences >> general, click on show heap status check-box to enable it.

after doing above, you can see your live heap status on bottom right area.


Friday, January 16, 2015

Get row count of a file

We can get row count of any file like csv or txt etc by using component
"tFileRowCount".




And we can get row count by the code-
((Integer)globalMap.get("tFileRowCount_1_COUNT"))

Please note that above code only works with iterate flow,OnComponentOk or OnSubjobOk trigger.

Thursday, January 15, 2015

tSplitRow example

Hi,
I'm going to demonstrate how to use "tSplitRow" component in talend.

We have 3 columns ID, FirstName and LastName as input and I want to output in two columns ID and FullName.

See how we can use tSplitRow for this:-

Sample data:


id first_name last_name
1 Abhay Kumar Thakur
2 Amber Sharma
3 Ajay Kumar Shukla
4 Ajay Dabhi
5 Amit Kumar Sharma
6 Amiya Majumdar
7 Amol Salvi
8 Anamika Khadwal
9 Anil Rawat
10 Animesh Kumar
11 Anirudh Swami

In the image below we created new column with name full_name in SplitRow by editing the component schema. After that in column mapping we concatenate the first_name and last_name.





Finally we will get two rows as follows-

id full_name
15001cf9-93c0-c719-ed04-4c5be60eb19f Abhay Kumar Thakur
3efb5190-1425-df71-26ce-5281e9a8a5f1 Amber Sharma
18dafed1-39ed-69c7-0d0b-4a8a996089f6 Ajay Kumar Shukla
f34a3e53-1081-4789-6a90-4c5f934a363e Ajay Dabhi
a78a358a-dd9e-1b4e-5020-4c5c13faa5fd Amit Kumar Sharma
4d9211f5-e6a8-ec46-9eb9-4e69e95ceec0 Amiya Majumdar
2632210c-cfe1-dffd-cad0-4c5f8a595306 Amol Salvi
e40dd41d-85cf-0719-c682-4ce0e5c7e5e3 Anamika Khadwal
b76a1c10-422b-60c6-e623-4ca03501bf41 Anil Rawat
4256dd3f-65c6-4ad9-12bd-4a8a99b94354 Animesh Kumar
4f2f83de-96e0-eedd-d149-4cab105f1c9c Anirudh Swami

Tuesday, January 13, 2015

How to get number of rows in my job

Let say, I have an input table component and I want to know the number of rows of that,
So I can use below syntax to get that- 
 
((Integer)globalMap.get("tMysqlInput_1_NB_LINE"));
 
Note: This code is only works OnComponentOk trigger. 

How to get currect directory location

You can get the current directory details by using -

System.getProperty("user.dir")

How to get tLoop value

You can get tLoop value by syntax-
globalMap.get("tLoop_1_CURRENT_VALUE")

How to get subjob error

You can catch detailed error of subjob by using below syntax-

((String)globalMap.get("tRunJob_2_CHILD_EXCEPTION_STACKTRACE"))

Use variables in tMap

When we are working with complex expressions we can use variable in tMap.
Open tMap, go to upper center area and click on + button near Var.
See the following image-


You can access the variable value by Var.your-variable in tMap.

Parse date

Syntax how to parse a date:
TalendDate.parseDate("MM/dd/yyyy HH:mm:ss","12/31/9999 23:59:59")