Tuesday, May 23, 2017

Configuring Kerberos for Kafka in Talend Open Studio for Big Data

A recent blog post showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. In this post we will follow a similar setup, to see how to create a job in Talend Open Studio for Big Data to read data from an Apache Kafka topic using kerberos.

1) Kafka setup

Follow a recent tutorial to setup an Apache Kerby based KDC testcase and to configure Apache Kafka to require kerberos for authentication. Create a "test" topic and write some data to it, and verify with the command-line consumer that the data can be read correctly.

2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "KafkaKerberosRead". 
In the search bar under "Palette" on the right hand side enter "kafka" and hit enter. Drag "tKafkaConnection" and "tKafkaInput" to the middle of the screen. Do the same for "tLogRow":
We now have all the components we need to read data from the Kafka topic. "tKafkaConnection" will be used to configure the connection to Kafka. "tKafkaInput" will be used to read the data from the "test" topic, and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tKafkaConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tKafkaInput". Right click on "tKafkaInput" and select "Row/Main" and drag the resulting line to "tLogRow":

3) Configure the components

Now let's configure the individual components. Double click on "tKafkaConnection". If a message appears that informs you that you need to install additional jars, then click on "Install". Select the version of Kafka that corresponds to the version you are using (if it doesn't match then select the most recent version). For the "Zookeeper quorum list" property enter "localhost:2181". For the "broker list" property enter "localhost:9092".

Now we will configure the kerberos related properties of "tKafkaConnection". Select the "Use kerberos authentication" checkbox and some additional configuration properties will appear. For "JAAS configuration path" you need to enter the path of the "client.jaas" file as described in the tutorial to set up the Kafka test-case. You can leave "Kafka brokers principal name" property as the default value ("kafka"). Finally, select the "Set kerberos configuration path" property and enter the path of the "krb5.conf" file supplied in the target directory of the Apache Kerby test-case.



Now click on "tKafkaInput". Select the checkbox for "Use an existing connection" + select the "tKafkaConnection" component in the resulting component list. For "topic name" specify "test". The "Consumer group id" can stay as the default "mygroup".

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. Send some data via the producer to the "test" topic and you should see the data appear in the Run Window in the Studio.

No comments:

Post a Comment