Thursday, September 21, 2017

Configuring Kerberos for Hive in Talend Open Studio for Big Data

Earlier this year, I showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using Kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using Kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using Kerberos.

1) Download Talend Open Studio for Big Data and create a job

Download Talend Open Studio for Big Data (6.4.1 was used for the purposes of this tutorial). Unzip the file once it has downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and call it "HiveKerberosRead". In the search bar under "Palette" on the right-hand side, enter "hive" and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of the screen. Do the same for "tLogRow".

"tHiveConnection" will be used to configure the connection to Hive. "tHiveInput" will be used to perform a query on the "words" table we have created in Hive (as per the earlier tutorial linked above), and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHiveConnection", select "Trigger/On Subjob Ok" and drag the resulting line to "tHiveInput". Right click on "tHiveInput", select "Row/Main" and drag the resulting line to "tLogRow".



2) Configure the components

Now let's configure the individual components. Double click on "tHiveConnection". Select the following configuration options:
  • Distribution: Hortonworks
  • Version: HDP V2.5.0
  • Host: localhost
  • Database: default
  • Select "Use Kerberos Authentication"
  • Hive Principal: hiveserver2/localhost@hadoop.apache.org
  • Namenode Principal: hdfs/localhost@hadoop.apache.org
  • Resource Manager Principal: mapred/localhost@hadoop.apache.org
  • Select "Use a keytab to authenticate"
  • Principal: alice
  • Keytab: Path to "alice.keytab" in the Kerby test project.
  • Unselect "Set Resource Manager"
  • Set Namenode URI: "hdfs://localhost:9000"
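
For readers who want to check the same connection outside the Studio, the host, database and Hive principal above map onto a HiveServer2 JDBC URL. The following is only a minimal sketch: the class and method names are made up for illustration, and port 10000 (the usual HiveServer2 default) is an assumption, as the port is not stated above.

```java
public class HiveKerberosUrl {

    // Builds a HiveServer2 JDBC URL that carries the server's Kerberos principal.
    static String build(String host, int port, String database, String hivePrincipal) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database
                + ";principal=" + hivePrincipal;
    }

    public static void main(String[] args) {
        // Host, database and principal are the values from the list above.
        // Port 10000 is an assumption (the HiveServer2 default).
        String url = build("localhost", 10000, "default",
                "hiveserver2/localhost@hadoop.apache.org");
        System.out.println(url);
    }
}
```

Note that the principal in the URL is that of the Hive server, not of the user ("alice") who authenticates with the keytab.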

Now click on "tHiveInput" and select the following configuration options:
  • Select "Use an existing Connection"
  • Choose the tHiveConnection name from the resulting "Component List".
  • Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int. 
  • Table name: words
  • Query: "select * from words where word == 'Dare'"
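
To make the schema concrete, here is a rough sketch of how one row of the "words" table could be modelled in plain Java. The class name is hypothetical, and the "|" separator is an assumption about how "tLogRow" prints fields in its basic mode, so treat this as an illustration rather than as what the Studio generates.

```java
public class WordRow {

    // Columns from the "Edit schema" step: "word" (String) and "count" (int).
    final String word;
    final int count;

    WordRow(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Joins the two fields with "|" (assumed to be the tLogRow field separator).
    String toLogLine() {
        return word + "|" + count;
    }

    public static void main(String[] args) {
        // The count of 1 is a placeholder value, not real query output.
        System.out.println(new WordRow("Dare", 1).toLogLine());
    }
}
```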

Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf".

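
Outside the Studio, the same setting can be applied programmatically in a standalone Java client. The sketch below uses a hypothetical class and helper name; the property name and the placeholder path are the ones from the JVM argument above.

```java
public class Krb5ConfProperty {

    // Same property name as in the -D JVM argument.
    static final String KRB5_CONF_PROPERTY = "java.security.krb5.conf";

    // Points the JVM's Kerberos implementation at the given krb5.conf file.
    static void useKrb5Conf(String path) {
        System.setProperty(KRB5_CONF_PROPERTY, path);
    }

    public static void main(String[] args) {
        // Placeholder path: substitute the real location of the Kerby project.
        useKrb5Conf("/path.to.kerby.project/target/krb5.conf");
        System.out.println(System.getProperty(KRB5_CONF_PROPERTY));
    }
}
```

The property should be set before the first Kerberos call is made in the JVM, since the configuration is typically read once and then cached.
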
Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. The rows returned by the query should be logged by "tLogRow" in the Run window in the Studio.
