Friday, February 27, 2015

Talend Job Execution and Scheduling


JobScheduler is a batch program, running on Unix systems as a daemon and on Windows as a service.

SpagoBI server
You can deploy your Jobs easily on a SpagoBI server in order to execute them from your SpagoBI administrator.

https://help.talend.com/display/TALENDOPENSTUDIOFORDATAINTEGRATIONUSERGUIDE52EN/5.7.1+How+to+deploy+a+Job+on+SpagoBI+server


JobScheduler:

JobScheduler (JS) is a workload automation tool. It is used to launch JS objects, such as jobs and/or orders when time, file or calendar events occur.

http://www.sos-berlin.com/modules/cjaycontent


Thursday, February 26, 2015

Add new field in main flow

We face a general problem in our Talend jobs: how to add a new field in the middle of the main flow. We will solve this problem by using the tSplitRow component.


In the picture, the FixedFlowInput component has a field named "first_name" with the value "Brij", and we want to introduce one more field, "last_name", in the middle of our job flow.

We can use the SplitRow component for this. We generally think this component can only split rows, but that is not correct; we can add fields as well.


In the SplitRow component we will add one more field, last_name, and give it any value ("Sharma" in this example).

In the columns mapping we need to re-assign the value for first_name by giving the expression row1.first_name. We can give any expression in the SplitRow component; a code alternative is sketched below.
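
If you prefer code over the mapping grid, a tJavaRow placed in the middle of the flow gives the same result, as long as its output schema contains the extra last_name column. This is only a sketch; the column names come from the example above:

// pass the existing column through unchanged
output_row.first_name = input_row.first_name;
// populate the new column introduced in the middle of the flow
output_row.last_name = "Sharma";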

tMemorizeRows Example in Talend

Sample Data: In the source schema we have a field PRODUCT_CODE.

PRODUCT_CODE
---------------------------
prod01
prod02
prod03
prod04
prod02
prod04


Use case:
Let's say we want to find duplicate PRODUCT_CODE values by using tMemorizeRows.




In the above image, note that the row count to memorize is set to 2, which means this component will keep the last 2 rows in memory.

tJavaRow code:

String s = "";

// PRODUCT_CODE_tMemorizeRows_1 is the array generated by the tMemorizeRows_1 component:
// index 0 holds the current row's value, index 1 the previously memorized row's value.
if (PRODUCT_CODE_tMemorizeRows_1[0].equals(PRODUCT_CODE_tMemorizeRows_1[1])) {
    s = PRODUCT_CODE_tMemorizeRows_1[0];
    System.out.println("OUTPUT...");
    System.out.println(s);
}


PRODUCT_CODE_tMemorizeRows_1[0] will give you the value at index 0 (the current row); index 1 holds the previously memorized row.
Note: the component name tMemorizeRows_1 is appended to the column name. Because only consecutive rows are compared, the input should be sorted on PRODUCT_CODE (for example with a tSortRow) so that duplicates arrive next to each other.

Final Output:
OUTPUT...
prod02
OUTPUT...
prod04

tMap Inner Join Reject in Talend

We all know tMap is used to join rows. Did you notice the "Catch lookup inner join reject" option in the Talend tMap component?

Today I'm going to explain its usage.


SAMPLE JOB:

In the above picture LogRow1 will get the matching rows and LogRow2 will get the rejected rows (the rows that do not match the inner join condition in tMap).



Notice that in the above image, the row2 Join Model is "Inner Join" and the reject output's "Catch lookup inner join reject" option has been set to TRUE.

tJavaFlex Example in Talend

In the given example we create a values array, assign it some values, and emit one output row per value.
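
Since the screenshot is not shown here, below is a minimal sketch of the three code sections of tJavaFlex used as a row generator. The output connection name row1 and its single column value are assumptions:

// Start code: runs once, before any row is produced; open the loop over our array
String[] values = {"alpha", "beta", "gamma"};
for (int i = 0; i < values.length; i++) {

// Main code: runs once per loop iteration and emits one row
    row1.value = values[i];

// End code: runs once, closes the loop opened in the start code
}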



Run If in Talend

We can use a Run If (conditional) trigger in our flow in Talend.

Points to remember (in the given example):





1. "tComponent_5_NB_LINE" can be work with OnSubjobOk/OnComponentOk trigger.
2. Stub tJava component is only placeholder here, there is no code inside that.
3. We can use multiple Run If.
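
A minimal sketch of a Run If condition, assuming the previous component is tFileInputDelimited_1 (replace the name with whatever component's line count you want to test); the expression must evaluate to a boolean:

// follow this trigger only when the previous component produced at least one row
((Integer) globalMap.get("tFileInputDelimited_1_NB_LINE")) > 0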


Wednesday, February 25, 2015

Big Data and Hadoop

Hadoop

Following are some Big Data tools and topics-

Big Data:
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.
source: http://en.wikipedia.org

Apache Hadoop

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually alongwith features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A Scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

HBase

Apache HBase is the Hadoop database, a distributed, scalable, big data store.

When Would I Use Apache HBase?

Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push down via server side Filters
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
  • Extensible jruby-based (JIRB) shell
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
source: http://hbase.apache.org
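
To make the "easy to use Java API" point concrete, here is a minimal sketch of a client round-trip using the HBase 1.x client API. The table name "test", column family "cf" and qualifier "greeting" are assumptions, and the table must already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("test"))) {

            // write one cell: row key "row1", column cf:greeting
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("Hello HBase"));
            table.put(put);

            // read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
            System.out.println(Bytes.toString(value));
        }
    }
}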

Hive

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

source: https://hive.apache.org


Apache Pig


Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.

Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache project.

YARN:


MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.


MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in JavaTM, MapReduce applications need not be written in Java.
source: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Overview
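
To make the programming model concrete, below is a lightly commented sketch of the classic word count job written against the newer org.apache.hadoop.mapreduce API (Hadoop 2.x) rather than the old mapred API used in the linked r1.2.1 tutorial; input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: sum the counts collected for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}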

HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.
source: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction


MapR
MapR is a complete enterprise-grade distribution for Apache Hadoop. The MapR Distribution for Apache Hadoop has been engineered to improve Hadoop’s reliability, performance, and ease of use. The MapR distribution provides a full Hadoop stack that includes the MapR File System (MapR-FS), MapReduce, a complete Hadoop ecosystem, and the MapR Control System user interface. You can use MapR with Apache Hadoop, HDFS, and MapReduce APIs.
The following image displays a high-level view of the MapR Distribution for Apache Hadoop:

source: http://doc.mapr.com/display/MapR/MapR+Overview

MapR NFS

MapR NFS: A radically simpler way to get your data out of a Hadoop cluster.

Cluster
In a computer system, a cluster is a group of servers and other resources that act like a single system and enable high availability and, in some cases, load balancing and parallel processing

Hadoop cluster
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.


Find out even and odd numbers using tMap in Talend

We have a field called id with values 1, 2, 3 ... 10, and we want to find all even and odd ids by using only tMap in Talend.

Let's see how we can do that.

Sample Job-
Generate the id values using the Row-generator component-

Now the main part of the job is the tMap-


We can do the same thing using the "Catch output reject" option of tMap; both approaches are sketched below-
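
Since the screenshots are not shown here, these are the kind of filter expressions assumed on the two tMap output tables (row1 being the incoming connection name):

// expression filter on the "even" output table
row1.id % 2 == 0

// expression filter on the "odd" output table (first approach)
row1.id % 2 != 0

// second approach: leave the "odd" output without a filter and enable
// "Catch output reject" on it, so it receives every row the "even" filter rejects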



Final output-




Tuesday, February 24, 2015

How to Pass the values from one job to another in Talend

Hi,
This post shows how we can transfer data from the current job to another job.



Let's say we have the following sample data (in the admin table input component)-

login_id | password
bishu    | bishwajeetsingh
ashu     | ashusaxena
abhi     | abhimanyusingh
priya    | priyaranjansingh
rahul    | rahulsingh

and we want to pass this data to job2 (another job)

So we create a new job named "job2" and then create the context variables in it (as in the picture below).



Now we link the logRow component (see the first picture above) to our job2 job.
Right-click on the job2 component.


Assign the context variables' values with row2.login_id and row2.password.
That's it; now you can find all your values in job2, as in the sketch below.
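
A minimal sketch of how job2 can read the values it received; the context variables login_id and password are the ones created above, and the code can go into a tJava component inside job2:

// print the values passed from the parent job through the tRunJob context parameters
System.out.println("login_id: " + context.login_id);
System.out.println("password: " + context.password);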


Wednesday, February 18, 2015

Presentation on Talend Data Quality

Hello All,
hope you are doing well.

A few days back I gave a presentation on Talend Data Quality in my office.



Now I want to share that with you all.

I've created a presentation for you which describes the power of Talend Data Quality.
Hope this will help you understand the use cases of Talend Data Quality.

Below is the link to download the presentation:

https://drive.google.com/file/d/0B6Vi7dcHQ2QHTHZmS2psT3Q5MHc/view?usp=sharing 

Parse JSON in Talend

Sample source (JSON)

{
   "next_offset": -1,
   "records": [
      {
         "my_favorite": false,
         "following": true,
         "id": "766f5e38-7bae-5881-0d7c-5425054c7efe",
         "name": "moto",
         "date_entered": "2014-09-26T11:48:51+05:30",
         "date_modified": "2014-09-26T11:48:51+05:30",
         "modified_user_id": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "modified_by_name": "Brij",
         "created_by": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "created_by_name": "Brij",
         "doc_owner": "",
         "user_favorites": "",
         "description": "hello moto crm 7 test by Brij",
         "deleted": false,
         "assigned_user_id": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "assigned_user_name": "Brij",
         "team_count": "",
         "team_name": [
            {
               "id": "1",
               "name": "Global",
               "name_2": "",
               "primary": true
            }
         ],
         "email": [
            {
               "email_address": "brijbhushansh@gmail.com",
               "primary_address": true,
               "reply_to_address": false,
               "invalid_email": false,
               "opt_out": false
            }
         ],
         "email1": "brijbhushansh@gmail.com",
         "email2": "",
         "invalid_email": false,
         "email_opt_out": false,
         "email_addresses_non_primary": "",
         "facebook": "",
         "twitter": "",
         "googleplus": "",
         "account_type": "Analyst",
         "industry": "Engineering",
         "annual_revenue": "222",
         "phone_fax": "23332232",
         "billing_address_street": "A42/6",
         "billing_address_street_2": "",
         "billing_address_street_3": "",
         "billing_address_street_4": "",
         "billing_address_city": "noida",
         "billing_address_state": "up",
         "billing_address_postalcode": "201301",
         "billing_address_country": "India",
         "rating": "8",
         "phone_office": "011-1022094",
         "phone_alternate": "9989999",
         "website": "motorola.com",
         "ownership": "ownshp",
         "employees": "oss",
         "ticker_symbol": "2112",
         "shipping_address_street": "A42/6",
         "shipping_address_street_2": "",
         "shipping_address_street_3": "",
         "shipping_address_street_4": "",
         "shipping_address_city": "noida",
         "shipping_address_state": "up",
         "shipping_address_postalcode": "201301",
         "shipping_address_country": "India",
         "parent_id": "7e51715b-85be-49ce-aa48-53e9adcb8277",
         "sic_code": "1211212",
         "duns_num": "",
         "parent_name": "BNP Paribas",
         "campaign_id": "",
         "campaign_name": "",
         "_acl": {
            "fields": {}
         },
         "_module": "Accounts"
      },
      {
         "my_favorite": false,
         "following": true,
         "id": "72c43463-4b61-8c05-fe21-542176653bbd",
         "name": "bbs a/c",
         "date_entered": "2014-09-23T19:03:45+05:30",
         "date_modified": "2014-09-23T19:03:45+05:30",
         "modified_user_id": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "modified_by_name": "Brij",
         "created_by": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "created_by_name": "Brij",
         "doc_owner": "",
         "user_favorites": "",
         "description": "",
         "deleted": false,
         "assigned_user_id": "caea1f89-7517-58bb-c5ed-53fdad8ab84c",
         "assigned_user_name": "Brij",
         "team_count": "",
         "team_name": [
            {
               "id": "1",
               "name": "Global",
               "name_2": "",
               "primary": true
            }
         ],
         "email": [],
         "email1": "",
         "email2": "",
         "invalid_email": "",
         "email_opt_out": "",
         "email_addresses_non_primary": "",
         "facebook": "",
         "twitter": "",
         "googleplus": "",
         "account_type": "",
         "industry": "",
         "annual_revenue": "",
         "phone_fax": "",
         "billing_address_street": "",
         "billing_address_street_2": "",
         "billing_address_street_3": "",
         "billing_address_street_4": "",
         "billing_address_city": "",
         "billing_address_state": "",
         "billing_address_postalcode": "",
         "billing_address_country": "",
         "rating": "",
         "phone_office": "",
         "phone_alternate": "",
         "website": "",
         "ownership": "",
         "employees": "",
         "ticker_symbol": "",
         "shipping_address_street": "",
         "shipping_address_street_2": "",
         "shipping_address_street_3": "",
         "shipping_address_street_4": "",
         "shipping_address_city": "",
         "shipping_address_state": "",
         "shipping_address_postalcode": "",
         "shipping_address_country": "",
         "parent_id": "",
         "sic_code": "",
         "duns_num": "",
         "parent_name": "",
         "campaign_id": "",
         "campaign_name": "",
         "_acl": {
            "fields": {}
         },
         "_module": "Accounts"
      }
   ]
}


Solution:

We will use tExtractJSONFields to fetch the data; a sketch of its configuration follows-
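
The component settings appear in the screenshot; as a rough sketch of the idea (the JsonPath read mode and the mapped fields are assumptions based on the sample above), the tExtractJSONFields configuration would look roughly like this:

  • Read By: JsonPath
  • Loop Jsonpath query: "$.records[*]"
  • Mapping (Json query, relative to the loop): id -> "id", name -> "name", date_entered -> "date_entered", email1 -> "email1", billing_address_city -> "billing_address_city"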



Tuesday, February 17, 2015

Talend Java code for Timezone conversion (UTC To GMT)

We can use the below Java code in a tJavaRow component-

In the component's Advanced settings (Import section) we must add the following imports-

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;


In main code-


// date_entered looks like "2014-09-26T11:48:51+05:30"
String yourString = input_row.date_entered;

// the "Z" pattern used below expects an offset like "+0530", so remove the last colon
StringBuilder b3 = new StringBuilder(yourString);
b3.replace(yourString.lastIndexOf(":"), yourString.lastIndexOf(":") + 1, "");
yourString = b3.toString();

// parse the timestamp together with its offset into a java.util.Date
Date parsedDate = TalendDate.parseDate("yyyy-MM-dd'T'HH:mm:ssZ", yourString);

// format the same instant in the GMT/UTC time zone
DateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
output_row.date_entered = sdf.format(parsedDate);


Friday, February 13, 2015

Match Analysis in Talend Data Quality

Hi Friends,

We can explore duplicate records effectively by using match analysis.

Open Talend and select the Profiling perspective as shown in the image below.



To create a new analysis, go to Data Profiling > Analysis > New Analysis.


See the picture below: we have ICO1 (Employee ID) and Country Code. One ICO1 can have multiple records. We want to find the duplicate country codes with respect to ICO1.

 Duplicate country code groups are separated by different colors.

So in the above image we can see that the country POLAND has 4 records for ICO1 6392.
The GRP_SIZE column (shown on the first row of each group) gives the number of records in the duplicate group.



The chart above will also help you understand the matching country codes and group counts.

Below are the steps to run this analysis:

First click on the Select Matching Key button, then click on the column on which you want to do the match analysis.



Scroll down and click on the Chart button near the Matching Key tab.
You will see the output. Enjoy!!




Know the Pattern Frequency Stats of your data using Talend Data Quality

Hi all,
I'm going to tell you how you can get data patterns using the Talend Data Quality tool.

We have a column named "nric_no" and we want to know the various data patterns in this column.

Sample data:

nric_no
761229-03-5225
641119-01-5411
A0869179
A0138399
A2900392
AA288814
A 1487544

So here are the steps to execute the pattern analysis using Talend DQ.

Step 1.



Step 2.


Select columns (nric_no)




Select indicators (Pattern frequency table)





Step 3.
 Click the Run button.


Output:




Learn Talend Data Integration Basics by a presentation

Hi All,
I made a presentation on Talend for one of my sessions a few years back, and now I want to share it with you.

I believe this will help you a lot in understanding Talend Data Integration.

Download Presentation

Wednesday, February 4, 2015

Coding Standards

Coding standards are important in any development project, but they are particularly important when many developers are working on the same project. Coding standards help ensure that the code is high quality, has fewer bugs, and can be easily maintained.

The following rules can be used for Core Development Projects (for Framework or Core PHP Development).

1. Indentation
Indentation should consist of 4 spaces. Tabs are not allowed.

Why
  • Gives the author, not the editor, control over the visual indentation.
  • It's important to emphasize the difference between indentation (tabs) and alignment (space). Trouble begins when developers use tabs for alignment.
  • Tabs may look the same as spaces in the editor but they do not behave the same.
  • Tabs may not necessarily behave the same across different editors, but spaces always will.
  • Consistent code viewing on any platform: web, desktop or print
  • It makes code easy to copy and paste for online discussion and sharing since most online viewers render Tab Characters as 4 spaces or more.

To clarify, take a quick look at the following images:

Code with tabs (single character per tab equal to 4 spaces):

Code with tabs (single character per tab equal to 2 spaces): nothing lines up anymore


2. Lines

2.1 Line Length
Target line length should be a maximum of 80 characters. However, longer lines are acceptable in some (rare) circumstances. The maximum length of any line of PHP code is 120 characters. Lines longer than 80 characters SHOULD be split into multiple subsequent lines.

Why
This has nothing to do with monitor or editor limitations when displaying long code lines; there are other valid reasons. For more details, read the following interesting article.

2.2 There MUST NOT be trailing whitespace at the end of non-blank lines.

2.3 There MUST NOT be more than one statement per line.

2.4 Blank lines MAY be added to improve readability and to indicate related blocks of code.

3. PHP Tag
Only <?php ?> is allowed to delimit the PHP code.

Why
Other styles of tags are either deprecated or configuration dependent.
Tip: The closing tag '?>' can be omitted in a pure PHP script to avoid any unnecessary whitespace injection.

4. Naming Conventions

4.1 Constants
Constants should always be all-uppercase, with underscores to separate words. Prefix the constant name with the uppercased name of the class / package it is used in.

To define global constant use
define('<CONSTANT_NAME>', <'value'>);

To define class constant use
class MyClass {
const <CONSTANT_NAME> = <'value'>;
}

e.g
const DB_DATASOURCE_NAME = 'mysql'; // constant of Db class
const SERVICES_AMAZON_S3_LICENSEKEY = 'xxxx'; // constant of Service class
define('PAGE_LIMIT', 10); // global constant

4.2 Variables
Variable names must start with a lowercase letter and follow the camelCaps convention.

For global, local and public class variables following rules apply.
Variable names may only contain alphanumeric characters. Underscores are not permitted. Numbers are permitted but discouraged.

For private and protected class variables, the first character must be an underscore "_"; the rest must be alphanumeric and follow the camelCaps convention.

Verbosity is generally encouraged. Variables should always be as verbose as practical to describe the data that the developer intends to store in them. Terse variable names such as "$i" and "$n" are discouraged for all but the smallest loop contexts.

e.g.
$featuredNews = xxxx; // normal variable
private $isAbnormal; // class variable

4.3 Class Name
Classes should be given descriptive names. Avoid using abbreviations where possible. Class names should always begin with an uppercase letter. Each new word must be capitalized. The class hierarchy is also reflected in the class name, with each level of the hierarchy separated by a single underscore.

e.g
Log // single word class without package
ContentMetadata // multi word class
Vehicle_Car // class in the Vehicle hierarchy
Vehicle_Bike // class in the Vehicle hierarchy

4.4 Function Name
Function names may only contain alphanumeric characters. Underscores are not permitted. Numbers are permitted in function names but are discouraged in most cases.

Function names must always start with a lowercase letter. When a function name consists of more than one word, the first letter of each new word must be capitalized (camelCase).

For private and protected class functions, the name must start with an underscore. This is the only acceptable exception in a function name.

Verbosity is generally encouraged. Function names should be as verbose as is practical to fully describe their purpose and behavior.

Functions in the global scope (a.k.a "floating functions") are permitted but discouraged in most cases. Consider wrapping these functions in a static class.

e.g.
filterInput()
getElementById()
widgetFactory()
public function doCalculations($param);

5. Class Definitions
  • Classes must be named according to naming conventions (refer 4.3)
  • Brace should always be on the line underneath the class name.
  • Every class must have a documentation block that conforms to the phpDocumentor standard.
  • Only one class is permitted in each PHP file.
  • Placing additional code in class files is not permitted, except for includes.

e.g.
/**
 * documentation block
 */
class SampleClass
{
    // content of class
}


Classes that extend other classes or which implement interfaces

class SampleClass extends FooAbstract implements BarInterface
{
    // content of class
}

If as a result of such declarations, the line length exceeds the maximum line length, break the line before the "extends" and/or "implements" keywords, and pad those lines by one indentation level.

class SampleClass
    extends FooAbstract
    implements BarInterface
{
    // content of class
}

6. Function Definition
  • Functions must be named according to naming convention (refer 4.4)
  • Functions / Methods inside classes must always declare their visibility by using private, protected or public modifiers.
  • The brace should always be written on the line underneath the function name.
  • Space between the function name and the opening parenthesis for the arguments is not permitted.
  • Arguments with default values go at the end of argument list.
  • Always attempt to return a meaningful value.
  • Function arguments should be separated by a single trailing space after the comma delimiter.

e.g.
function connect($dsn, $persistent = false)
{
    if (is_array($dsn)) {
        $dsninfo = &$dsn;
    } else {
        $dsninfo = DB::parseDSN($dsn);
    }

    if (!$dsninfo || !$dsninfo['phptype']) {
        return $this->raiseError();
    }

    return true;
}

Functions with many parameters may need to be split onto several lines to keep the 80 characters/line limit. The first parameters may be put onto the same line as the function name if there is enough space. Subsequent parameters on following lines are to be indented 4 spaces. The closing parenthesis and the opening brace are to be put onto the next line, on the same indentation level as the "function" keyword.

function someFunctionWithAVeryLongName($firstParameter = 'something', $secondParameter = 'booooo',
    $third = null, $fourthParameter = false, $fifthParameter = 123.12,
    $sixthParam = true
) {
    //....
}

7. Class Member Variables
  • Member variables must be named according to naming conventions (refer 4.2)
  • The var construct is not permitted.
  • Member variables must declare their visibility by using private, protected or public modifiers.
  • Any variable declared in a class must be listed at the top of the class, above the declaration of any methods
  • Giving access to member variables directly by declaring them as public is permitted but discouraged in favor of accessor methods (set & get).

8. Function Calls
Functions should be called with no spaces between the function name, the opening parenthesis, and the first parameter; spaces between commas and each parameter, and no space between the last parameter, the closing parenthesis, and the semicolon.

e.g.
$var = foo($bar, $baz, $quux);

The coding standards require lines to have a maximum length of 80 characters. Calling functions or methods with many parameters while adhering to this limit can be impossible in such cases, so it is allowed to split parameters in function calls onto several lines.

$this->someObject->subObject->callThisFunctionWithALongName(
    $parameterOne, $parameterTwo,
    $aVeryLongParameterThree
);



$this->someObject->subObject->callThisFunctionWithALongName(
    $this->someOtherFunc(
        $this->someEvenOtherFunc(
            'Help me!',
            array(
                'foo' => 'bar',
                'spam' => 'eggs',
            ),
            23
        ),
        $this->someEvenOtherFunc()
    ),
    $this->wowowowowow(12)
);


9. Control Structure
The general style rules for control structures are as follows:



  • There MUST be one space after the control structure keyword

  • There MUST NOT be a space after the opening parenthesis

  • There MUST NOT be a space before the closing parenthesis

  • There MUST be one space between the closing parenthesis and the opening brace

  • The structure body MUST be indented once

  • The closing brace MUST be on the next line after the body

  • The body of each structure MUST be enclosed by braces. This standardizes how the structures look, and reduces the likelihood of introducing errors as new lines get added to the body.

9.1 if, elseif, else
Note the placement of parentheses, spaces, and braces; and that else and elseif are on the same line as the closing brace from the earlier body.

<?php

if ($expr1) {
    // if body
} elseif ($expr2) {
    // elseif body
} else {
    // else body;
}

9.2 switch, case

A switch structure looks like the following. Note the placement of parentheses, spaces, and braces. The case statement MUST be indented once from switch, and the break keyword (or other terminating keyword) MUST be indented at the same level as the case body. There MUST be a comment such as // no break when fall-through is intentional in a non-empty case body.

<?php

switch ($expr) {
    case 0:
        echo 'First case, with a break';
        break;
    case 1:
        echo 'Second case, which falls through';
        // no break
    case 2:
    case 3:
    case 4:
        echo 'Third case, return instead of break';
        return;
    default:
        echo 'Default case';
        break;
}

9.3 while, do while

A while statement looks like the following. Note the placement of parentheses, spaces, and braces.



<?php

while ($expr) {
    // structure body
}



Similarly, a do while statement looks like the following. Note the placement of parentheses, spaces, and braces.



<?php

do {
    // structure body;
} while ($expr);



9.4 for

A for statement looks like the following. Note the placement of parentheses, spaces, and braces.



<?php

for ($i = 0; $i < 10; $i++) {
    // for body
}



9.5 foreach

A foreach statement looks like the following. Note the placement of parentheses, spaces, and braces.



<?php

foreach ($iterable as $key => $value) {
    // foreach body
}



9.6 try, catch


A try catch block looks like the following. Note the placement of parentheses, spaces, and braces.

<?php

try {
    // try body
} catch (FirstExceptionType $e) {
    // catch body
} catch (OtherExceptionType $e) {
    // catch body
}

Note:
Long if statements may be split onto several lines when the character/line limit would be exceeded. The conditions have to be positioned onto the following line, and indented 4 characters. The logical operators (&&, ||, etc.) should be at the beginning of the line to make it easier to comment (and exclude) the condition. The closing parenthesis and opening brace get their own line at the end of the conditions.
Keeping the operators at the beginning of the line has two advantages: it is trivial to comment out a particular line during development while keeping syntactically correct code (except, of course, the first line), and the logic is kept at the front where it is not forgotten. Scanning such conditions is very easy since they are aligned below each other.

if (($condition1
        || $condition2)
    && $condition3
    && $condition4
) {
    // code here
}

$is_foo = ($condition1 || $condition2);
$is_bar = ($condition3 && $condition4);

if ($is_foo && $is_bar) {
    // ....
}

10. Keywords and True/False/Null

PHP keywords MUST be in lowercase. The PHP constants true, false, and null MUST be in lower case.

11. Code Readability

11.1 Every assignment or comparison operator must be preceded and followed by a space

e.g.
$rows=1; // wrong
$rows = 1; // correct

11.2 The equal signs may be aligned in block-related assignments:

$short = foo($bar);
$longer = foo($baz);

The rule can be broken when the length of the variable name is at least 8 characters longer/shorter than the previous one:

$short = foo($bar);
$thisVariableNameIsVeeeeeeeeeeryLong = foo($baz);

Split long assignments onto several lines

Assignments may be split onto several lines when the character/line limit would be exceeded. The equal sign has to be positioned onto the following line, and indented by 4 characters.

$GLOBALS['TSFE']->additionalHeaderData[$this->strApplicationName]
    = $this->xajax->getJavascript(t3lib_extMgm::siteRelPath('nr_xajax'));

11.3 Array assignments

Non Associative

$days = array(1, 2, 3, 4, 5, 6, 7);

or, if it does not fit on one line:

$months = array(
    'Jan', 'Feb', 'Mar', 'Apr', 'May', 'June',
    'July', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec',
);

Associative Array

$some_array = array(
    'foo' => 'bar',
    'spam' => 'ham',
);


12. Comments

12.1 Documentable comments
Inline documentation comment blocks (docblocks) must be provided. All blocks must be compatible with the phpDocumentor format.
The following docblocks are mandatory, with the specified formats. However, extra doctags (which are allowed as per phpDocumentor) can be used.
12.1.1 File docblock
Every PHP file must have this block at the start of the file.
/**
* Short description for file
*
* Long description for file (if any)...
*/


12.1.2 Class docblock
Every class must have this block above the class definition
/**
* Short description for class
*
* Long description for class (if any)...
*/
12.1.3 Function docblock
Every function must have this block above the function definition.
/**
* description for function
*
* @access
* @param
* @return
*/


12.2 Non Documentable comments
Non-documentation comments are strongly encouraged, and only the following styles of syntax are allowed. Non-documentation comments are for small blocks of code, describing what that block of code is doing.
/* */ and //
Note: avoid obvious comments

// get the country code
$country_code = get_country_code($_SERVER['REMOTE_ADDR']);
// if country code is US
if ($country_code == 'US') {
    // display the form input for state
    echo form_input_state();
}

can be rewritten as

// display state selection for US users
$country_code = get_country_code($_SERVER['REMOTE_ADDR']);
if ($country_code == 'US') {
    echo form_input_state();
}
13. E_STRICT-compatible code

This means that a PHP script must not produce any warnings or errors when PHP's error reporting level is set to E_ALL | E_STRICT.

14. Deprecated Code
Deprecated features of PHP should not be used; which features are deprecated depends on the PHP version applicable to the project.