Monday, June 19, 2017

Querying Apache HBase using Talend Open Studio for Big Data

Recent blog posts have described how to set up authorization for Apache HBase using Apache Ranger. However, those posts only covered writing and reading data using the HBase Shell. In this post, we will show how Talend Open Studio for Big Data can be used to read data stored in Apache HBase. This post is along the same lines as other recent tutorials on reading data from Kafka and HDFS.

1) HBase setup

Follow this tutorial on setting up Apache HBase in standalone mode, and creating a 'data' table with some sample values using the HBase Shell.

2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and name the job "HBaseRead". In the search bar on the right-hand side, enter "hbase" and hit enter. Drag "tHBaseConnection" and "tHBaseInput" from the palette onto the job design workspace, as well as "tLogRow".

"tHBaseConnection" is used to set up the connection to HBase, "tHBaseInput" uses the connection to read data from HBase, and "tLogRow" will log the data that was read so that we can see that the job ran successfully. Right-click on "tHBaseConnection", select "Trigger/On Subjob Ok" and drag the resulting arrow to the "tHBaseInput" component. Now right-click on "tHBaseInput", select "Row/Main" and drag the arrow to "tLogRow".

3) Configure the components

Now let's configure the individual components. Double-click on "tHBaseConnection" and select the distribution "Hortonworks" and version "HDP V2.5.0" (we are using HBase 1.2.6, as set up in an earlier tutorial). We are not using Kerberos here, so we can skip the rest of the security configuration. Now double-click on "tHBaseInput". Select the "Use an existing connection" checkbox. Now hit "Edit Schema" and add two entries to map the column that we created in two different column families: "c1", which matches DB column "col1" of type String, and "c2", which also matches DB column "col1" of type String.


Select "data" for the table name back in tHBaseInput and add a mapping for "c1" to "colfam1", and "c2" to "colfam2".


Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see "val1" and "val2" appear in the console window.
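
To make the schema and family mapping from step 3 more concrete, here is a minimal sketch in plain Java of how the two schema entries resolve to cells in the 'data' table. It is only a toy model: it does not connect to HBase and it is not the code that Talend generates. The class and method names are made up for illustration, the row is represented as a simple map keyed by "family:qualifier", and the "|" field separator is an assumption about how "tLogRow" prints a row.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Toy model of the tHBaseInput mapping (hypothetical names); it does not talk to HBase. */
final class HBaseReadSketch {

    /** One schema entry: Talend column, HBase column family, and DB column (qualifier). */
    static final class ColumnMapping {
        final String schemaColumn;
        final String family;
        final String qualifier;

        ColumnMapping(String schemaColumn, String family, String qualifier) {
            this.schemaColumn = schemaColumn;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** The two schema entries and family mappings configured in the tutorial. */
    static List<ColumnMapping> tutorialMappings() {
        return Arrays.asList(
                new ColumnMapping("c1", "colfam1", "col1"),
                new ColumnMapping("c2", "colfam2", "col1"));
    }

    /** Stand-in for one row of the 'data' table, keyed by "family:qualifier". */
    static Map<String, String> sampleRow() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("colfam1:col1", "val1");
        row.put("colfam2:col1", "val2");
        return row;
    }

    /** Looks up each mapped cell and joins the values with the assumed "|" separator. */
    static String readRow(Map<String, String> row, List<ColumnMapping> mappings) {
        StringJoiner line = new StringJoiner("|");
        for (ColumnMapping m : mappings) {
            String value = row.get(m.family + ":" + m.qualifier);
            line.add(value == null ? "" : value); // a missing cell becomes an empty field
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(readRow(sampleRow(), tutorialMappings()));
    }
}
```

Both schema entries use the same DB column name, "col1", so it is the column family mapping that tells the two cells apart. If "c2" were mapped to "colfam1" by mistake, this model would return "val1" for both fields.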
