Thursday, May 18, 2017

Configuring Kerberos for HDFS in Talend Open Studio for Big Data

A recent series of blog posts showed how to install and configure Apache Hadoop as a single node cluster, how to authenticate users via Kerberos, and how to authorize them via Apache Ranger. Interacting with HDFS via the command-line tools, as shown in those articles, is convenient but limited. Talend offers a freely available product called Talend Open Studio for Big Data, which you can use to interact with HDFS instead (as well as with many other components). In this article we will show how to access data stored in HDFS that is secured with Kerberos, as set up in the previous tutorials.

1) HDFS setup

To begin with, please follow the first tutorial to install Hadoop and to store LICENSE.txt in a '/data' folder. Then follow the fifth tutorial to set up an Apache Kerby based KDC testcase and to configure HDFS to authenticate users via Kerberos. To test that everything is working correctly, run the following on the command line:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/hadoop fs -cat /data/LICENSE.txt
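Before moving on to the Studio, it is worth confirming that the ticket was actually obtained, and that HDFS really rejects unauthenticated access. A quick sanity check (assuming the MIT Kerberos client tools are installed, and using the same paths as above):

```shell
# Show the ticket cache: the default principal should be
# "alice@hadoop.apache.org", with a krbtgt ticket for that realm
klist

# Destroying the ticket should cause subsequent HDFS access to fail
# with a GSSException, confirming that Kerberos is being enforced
kdestroy
bin/hadoop fs -cat /data/LICENSE.txt
```

If the final command still succeeds after `kdestroy`, HDFS is not actually requiring Kerberos and the configuration from the fifth tutorial should be re-checked.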
2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (version 6.4.0 was used for the purposes of this tutorial). Unzip the file once it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and name the job "HDFSKerberosRead". In the search bar under "Palette" on the right-hand side, enter "tHDFS" and hit Enter. Drag "tHDFSConnection" and "tHDFSInput" to the middle of the screen. Do the same for "tLogRow":
We now have all the components we need to read data from HDFS. "tHDFSConnection" will be used to configure the connection to Hadoop. "tHDFSInput" will be used to read the data from "/data", and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right-click on "tHDFSConnection", select "Trigger/On Subjob Ok", and drag the resulting line to "tHDFSInput". Right-click on "tHDFSInput", select "Row/Main", and drag the resulting line to "tLogRow":
3) Configure the components

Now let's configure the individual components. Double-click on "tHDFSConnection". For the "Version", select the "Hortonworks" distribution with version "HDP V2.5.0" (we are using the original Apache distribution in this tutorial, but it suffices to select Hortonworks here). Under "Authentication", tick the checkbox "Use kerberos authentication". For the "Namenode principal" specify "hdfs/localhost@hadoop.apache.org". Select the checkbox "Use a keytab to authenticate". Select "alice" as the "Principal" and "<path.to.kerby.project>/target/alice.keytab" as the "Keytab":
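The principal you enter in the Studio must match a principal actually stored in the keytab. If in doubt, you can list the contents of the keytab with `klist` (assuming the MIT Kerberos tools, and the keytab path used earlier on the command line):

```shell
# List the principals (with timestamps and key version numbers)
# stored in alice's keytab: expect "alice@hadoop.apache.org"
klist -k -t /pathtokerby/target/alice.keytab
```

Note also that the Namenode principal ("hdfs/localhost@hadoop.apache.org") is the service principal of the Namenode itself, and is distinct from the client principal ("alice") used to authenticate.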
Now click on "tHDFSInput". Select the checkbox for "Use an existing connection" + select the "tHDFSConnection" component in the resulting component list. For "File Name" specify the file we want to read: "/data/LICENSE.txt":
Now click on "Edit schema" and hit the "+" button. This will create a "newColumn" column of type "String". We can leave this as it is, because we are not doing anything with the data other than logging it. Save the job. Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and then "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=<path.to.kerby.project>/target/krb5.conf":
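For reference, the krb5.conf generated by the Kerby test project is a standard MIT-style Kerberos configuration. It will look roughly like the following; treat this as an illustrative sketch, since the exact KDC port is assigned by the test project and the real file is generated for you:

```ini
[libdefaults]
    default_realm = hadoop.apache.org

[realms]
    hadoop.apache.org = {
        # The Kerby KDC runs locally; the port is generated by the testcase
        kdc = localhost:<kdc-port>
    }
```

The "-Djava.security.krb5.conf" JVM argument simply tells the JRE's Kerberos implementation to read this file instead of the system-wide /etc/krb5.conf.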

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. If everything is working correctly, you should see the contents of "/data/LICENSE.txt" displayed in the Run window.
