Wednesday, December 9, 2015

How to use a lookup with a huge amount of source data in Talend



Problem statement: We have a file with a list of 48K orders, and we need to fetch the order details from another database table. A normal tMap lookup fails because the lookup data is huge (19,169,567 rows), so instead we pass the order numbers to the lookup's table input component through an IN clause. Since Oracle allows at most 1,000 expressions in an IN list, we can pass only 1,000 order numbers at a time and the lookup has to be done in batches.
Source Table with order numbers: T1
Lookup Table with all details of orders: T2
We need to pick 1,000 order numbers from T1 and pass them to T2, like:
Select * from T2 where ord_no IN (10001,10002,10003…..11000);
So the logic is: start with rows 1 to 1000, then 1001 to 2000, then 2001 to 3000, and so on.
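
Before walking through the Talend job, here is the batching idea in plain Java. This is only a standalone sketch, not part of the job; the class name and the generated order numbers are made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class BatchedLookupSketch {
    public static void main(String[] args) {
        // Stand-in for the 48K order numbers that the real job reads from T1.
        List<Integer> orderNumbers = new ArrayList<>();
        for (int i = 10001; i <= 58000; i++) {
            orderNumbers.add(i);
        }

        int batchSize = 1000; // Oracle allows at most 1,000 expressions in an IN list
        int batches = 0;
        for (int start = 0; start < orderNumbers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, orderNumbers.size());

            // Build "10001,10002,...,11000" for this batch.
            StringBuilder inList = new StringBuilder();
            for (int i = start; i < end; i++) {
                if (inList.length() > 0) inList.append(',');
                inList.append(orderNumbers.get(i));
            }

            // The real job sends one such query per batch to the lookup table T2.
            String sql = "Select * from T2 where ord_no IN (" + inList + ")";
            batches++;
        }
        System.out.println(batches + " lookup queries of up to " + batchSize + " order numbers each");
    }
}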

Here is the job that worked for me.

 

Steps of the job:
1. Load all the data from the file containing the 48K order_numbers into a table (here, the MySQL table `usidtest`).

2. In the brij_dqg_local table input component, write SQL to get the total number of rows loaded from the order_numbers file:
"SELECT count(*) as cnt FROM `usidtest` "

3. In javaRow1, write code that stores the last loop index in a context variable:
context.loopLimit = input_row.cnt / 1000;
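
This is integer division, so if the row count is not an exact multiple of 1,000, the trailing partial batch may not get its own iteration, depending on how the tLoop range is configured. One way to be safe (a small adjustment, assuming cnt is mapped as an integer type) is to round up:

context.loopLimit = (input_row.cnt + 999) / 1000;   // round up so a final partial batch is still processed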

4. In the tLoop component, connected after the OnSubjobOk trigger, set the From/To/Step values so that the loop runs once per batch, starting at 0 and bounded by context.loopLimit.

 

5. tJava1 contains the code below, which prepares the LIMIT clause for the source query (stored in context.wherestr), producing values like the following:

limit 0,1000
limit 1000,1000
limit 2000,1000
limit 3000,1000
...
limit 47000,1000

--------- tJava1 code------------
int n = 1000;   // batch size

// tLoop_1's current value starts at 0, so the offset for this batch is simply
// current value * 1000. MySQL's LIMIT offset is 0-based: offset 0 returns
// rows 1-1000, offset 1000 returns rows 1001-2000, and so on.
int s = n * ((Integer) globalMap.get("tLoop_1_CURRENT_VALUE"));

context.wherestr = "limit " + s + "," + n;
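
To sanity-check the batch boundaries outside Talend, the same arithmetic can be run as a tiny standalone program (the class name is made up; it simply simulates the 48 loop iterations for 48,000 rows):

public class LimitClauseCheck {
    public static void main(String[] args) {
        int n = 1000;                  // batch size, as in tJava1
        int loopLimit = 48000 / 1000;  // as computed in step 3 for 48,000 rows
        for (int currentValue = 0; currentValue < loopLimit; currentValue++) {
            int s = n * currentValue;  // 0, 1000, 2000, ..., 47000
            System.out.println("limit " + s + "," + n);
        }
    }
}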

6. Read the order numbers 1,000 at a time from table T1, using this query in tMysqlInput_1:
"SELECT ord_no FROM `usidtest` "+context.wherestr
On the first iteration this produces "SELECT ord_no FROM `usidtest` limit 0,1000".

7. Next we need to concatenate the 1,000 ord_no values of the current batch into a single string. For this I used the tDenormalize_1 component, which concatenates all 1,000 ord_no values into one field; a sketch of the result follows.
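
A plain-Java sketch of what that concatenation amounts to, assuming the tDenormalize_1 delimiter is set to ',' (quote, comma, quote) so that the surrounding quotes added in step 9 produce a valid quoted IN list; the class and variable names are made up:

public class DenormalizeSketch {
    public static void main(String[] args) {
        // One batch of 1,000 order numbers joined with ',' as the delimiter.
        StringBuilder ordNoBatch = new StringBuilder();
        for (int ordNo = 10001; ordNo <= 11000; ordNo++) {
            if (ordNoBatch.length() > 0) ordNoBatch.append("','");
            ordNoBatch.append(ordNo);
        }

        // Step 9 wraps the value in IN ('...'), turning it into a quoted list.
        String inClause = "IN ('" + ordNoBatch + "')";
        System.out.println(inClause.substring(0, 37) + "...");
        // beginning of the clause, e.g. IN ('10001','10002','10003',...
    }
}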


8. Use tFlowToIterate to pass the concatenated ord_no string, one batch per iteration, to the Oracle lookup on table T2.

9. For the lookup table T2, use a tOracleInput component, which gives us the final output. I used this query:
"SELECT * from T2 WHERE T2.ORDER_NO IN ('"+row3.ord_no+"') "
which produces a query like
SELECT * from T2 WHERE T2.ORDER_NO IN ('10001','10002','10003', … ,'11000')

10. Store the output of every iteration in an in-memory buffer using a tHashOutput component, which keeps appending the rows produced by each execution.
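
As a rough illustration of what that buffering amounts to (plain Java, not the Talend API; all names here are made up):

import java.util.ArrayList;
import java.util.List;

public class HashBufferIdea {
    public static void main(String[] args) {
        // tHashOutput role: a shared in-memory buffer that every iteration appends to.
        List<String[]> buffer = new ArrayList<>();
        for (int iteration = 0; iteration < 48; iteration++) {
            // In the real job these would be the rows returned by the Oracle lookup
            // for the current batch of 1,000 order numbers.
            buffer.add(new String[] { "ord_no_from_batch_" + iteration, "order details" });
        }

        // tHashInput role: once all iterations are done, the whole buffer is read
        // back in one pass and written to the Excel file.
        System.out.println("buffered rows: " + buffer.size());
    }
}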

11. Use an OnSubjobOk trigger to read all the buffered data back with a tHashInput component.

12. Write the final data to an Excel file using tFileOutputExcel_1.

So finally you get an Excel file with the details of all 48K orders that are present in the lookup table T2.









2 comments:

  1. One may wonder why I did not use the traditional Talend tMap for the lookup. To answer this: I tried that approach first, but it was taking too much time.
    It processed 1,720,000 rows in 21 minutes, and the lookup table has 19,169,567 rows in total, so with that model the job would need roughly 19,169,567 / 1,720,000 ≈ 11.1 runs of 21 minutes, i.e. about 3.9 hours, to complete.

  2. This is great! Using the IN clause is the way to go. Thanks

    Since SQL might be easier for some people, throw the result of the SQL below into a tFlowToIterate...


    WITH TRANSACTIONS AS (
        -- the IDs we need to look up (capped at 50,000 rows for this example)
        SELECT NK_TRANSACTION AS TXN_ID
        FROM HUB_TRANSACTION
        WHERE FK_CALENDAR_GMT > 500
        AND ROWNUM < 50000
    ),
    TRANSACTION_GROUP AS (
        -- assign each ID to a group of 200
        SELECT CEIL(ROW_NUMBER() OVER (ORDER BY TXN_ID) / 200) AS NK_GROUP,
               TXN_ID
        FROM TRANSACTIONS
    )
    -- one comma-separated list of up to 200 IDs per group
    SELECT LISTAGG(TXN_ID, ',') WITHIN GROUP (ORDER BY TXN_ID) AS NK_TRANSACTIONS
    FROM TRANSACTION_GROUP
    GROUP BY NK_GROUP
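
    Each row returned by this query is already a comma-separated list of up to 200 transaction IDs, so after the tFlowToIterate the lookup input component can splice it straight into its IN clause. For example (the lookup table TXN_DETAILS and the row name row1 are placeholders; adjust them to your own job):

    "SELECT * FROM TXN_DETAILS WHERE NK_TRANSACTION IN (" + ((String)globalMap.get("row1.NK_TRANSACTIONS")) + ")"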
