Problem statement: We have a file with a list of 48k order numbers, and we need to fetch the order details from another database table. A normal tMap lookup fails because the lookup data is huge (19,169,567 rows), so instead we pass order numbers to a table input component using an IN clause. An IN clause can hold at most 1,000 order numbers at a time (Oracle rejects longer lists with ORA-01795).
Source table with order numbers: T1
Lookup table with all order details: T2
We need to pick 1,000 order numbers from T1 at a time and pass them to T2 like this:
Select * from T2 where ord_no IN (10001,10002,10003,…,11000);
So the logic is: start with order numbers 1 to 1000, then 1001 to 2000, then 2001 to 3000, and so on.
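Before walking through the job, here is a minimal standalone sketch of that batching arithmetic, just to make the paging explicit; the class and variable names are illustrative, and only the chunk size of 1,000 and the 48k total come from the problem statement:

--------- batching sketch (illustrative) ------------
public class BatchSketch {
    public static void main(String[] args) {
        int totalRows = 48000; // order numbers in the file
        int pageSize = 1000;   // max order numbers per IN clause
        int pages = totalRows / pageSize; // 48 full pages here
        for (int page = 0; page < pages; page++) {
            int offset = page * pageSize; // 0, 1000, 2000, ...
            // Each iteration fetches one batch of order numbers
            // that will become one IN (...) lookup against T2.
            System.out.println("limit " + offset + "," + pageSize);
        }
    }
}
-----------------------------------------------------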
Here is the job that worked for me.
Steps of the job:
1. Load all the data from the file containing the 48k order numbers into a staging table (here in MySQL).
2. Write SQL into the brij_dqg_local table input to get the total number of rows loaded from the file:
"SELECT count(*) as cnt FROM `usidtest`"
3. In a tJavaRow, compute the loop limit from that count (a rounding caveat follows after step 4):
context.loopLimit=input_row.cnt/1000;
4. Drive the batches with a tLoop (tLoop_1, referenced in the code below), iterating from 1 up to context.loopLimit.
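One caveat on the loop-limit line in step 3: Java integer division truncates, so if the count were not an exact multiple of 1,000 (48,000 happens to be), the last partial batch would be silently dropped. A defensive variant, sketched here rather than taken from the original job:

--------- loop-limit sketch ------------
// Round up instead of truncating so a trailing partial batch
// (e.g. cnt = 48123) still gets its own tLoop iteration.
context.loopLimit = (input_row.cnt + 999) / 1000;
----------------------------------------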
5. tJava1 has the code below, which prepares the LIMIT clause (stored in context.wherestr) for the paging query, producing in turn (MySQL LIMIT offsets are 0-based, so the first page starts at offset 0):
limit 0,1000
limit 1000,1000
limit 2000,1000
limit 3000,1000
...
limit 47000,1000
--------- tJava1 code------------
// Build the LIMIT clause for the current iteration.
// tLoop_1_CURRENT_VALUE runs from 1 to context.loopLimit, so
// subtracting 1 gives the 0-based page offset MySQL expects.
int n = 1000;
int s = n * (((Integer) globalMap.get("tLoop_1_CURRENT_VALUE")) - 1);
context.wherestr = "limit " + s + "," + n;
"SELECT ord_no FROM `usidtest`
"+context.wherestr
This will produce “SELECT ord_no FROM
`usidtest` limit 1,1000”
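A minimal sketch of that collapsing step, assuming the batch arrives as a Java list of order numbers; the class and method names here are hypothetical, not taken from the job (in Talend this logic could live in a tJavaRow or tJavaFlex that accumulates the values):

--------- join sketch (hypothetical) ------------
import java.util.List;
import java.util.stream.Collectors;

public class InListBuilder {
    // Join order numbers into the body of a quoted IN list:
    // [10001, 10002, 10003] -> 10001','10002','10003
    // Step 9 then wraps it as IN ('" + body + "').
    static String inListBody(List<Long> orderNumbers) {
        return orderNumbers.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("','"));
    }

    public static void main(String[] args) {
        System.out.println(inListBody(List.of(10001L, 10002L, 10003L)));
        // prints: 10001','10002','10003
    }
}
-------------------------------------------------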
9. For lookup table T2, use a tOracleInput component, which gives us the final output. I used this query:
"SELECT * from T2 WHERE T2.ORDER_NO IN ('"+row3.ord_no+"')"
which produces a query like:
"SELECT * from T2 WHERE T2.ORDER_NO IN ('10001','10002','10003',…,'11000')"
11. Use an OnSubjobOk trigger to fetch all the output data from the buffer using tHashInput.
12. Store the final data in an Excel file using tFileOutputExcel_1.
One may wonder why I did not use the traditional Talend tMap for the lookup. To answer this: I had already tried that approach, but it was far too slow. It processed 1,720,000 rows in 21 minutes, and the lookup table holds 19,169,567 rows in total, so that model would need roughly 19,169,567 / 1,720,000 × 21 ≈ 234 minutes, about 3.9 hours, to complete the job.
This is great! Using the IN clause is the way to go. Thanks.
Since SQL might be easier for some people: throw the result of the SQL below into a tFlowToIterate…
WITH TRANSACTIONS AS (
SELECT NK_TRANSACTION AS TXN_ID
FROM HUB_TRANSACTION
WHERE FK_CALENDAR_GMT > 500
AND ROWNUM < 50000
)
, TRANSACTION_GROUP AS (
-- bucket consecutive transaction IDs into groups of 200
SELECT CEIL(ROW_NUMBER() OVER (ORDER BY TXN_ID) / 200) AS NK_GROUP
, TXN_ID
FROM TRANSACTIONS
)
-- one comma-separated list of up to 200 IDs per group
SELECT LISTAGG(TXN_ID, ',') WITHIN GROUP (ORDER BY TXN_ID) AS NK_TRANSACTIONS
FROM TRANSACTION_GROUP
GROUP BY NK_GROUP
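Each resulting NK_TRANSACTIONS row can then drive the lookup query directly. Below is a sketch of the downstream tOracleInput query, assuming the flow into the tFlowToIterate is named row4 (a hypothetical name); tFlowToIterate stores each column value in globalMap under "rowName.columnName". Note that these IDs are numeric and unquoted, unlike the quoted strings in step 9:

--------- lookup query sketch (hypothetical row name) ------------
// One comma-separated batch of up to 200 transaction IDs,
// pasted straight into the IN list as unquoted numbers.
"SELECT * FROM T2 WHERE T2.ORDER_NO IN ("
    + (String) globalMap.get("row4.NK_TRANSACTIONS")
    + ")"
------------------------------------------------------------------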