A recent blog post showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. In this post we will follow a similar setup to create a job in Talend Open Studio for Big Data that reads data from an Apache Kafka topic using Kerberos.
1) Kafka setup
Follow a recent tutorial to set up an Apache Kerby based KDC test-case and to configure Apache Kafka to require Kerberos for authentication. Create a "test" topic and write some data to it, and verify with the command-line consumer that the data can be read correctly.
2) Download Talend Open Studio for Big Data and create a job
Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and call it "KafkaKerberosRead".
In the search bar under "Palette" on the right-hand side, enter "kafka" and hit enter. Drag "tKafkaConnection" and "tKafkaInput" to the middle of the screen. Do the same for "tLogRow".
3) Configure the components
Now let's configure the individual components. Double click on "tKafkaConnection". If a message appears informing you that you need to install additional jars, then click on "Install". Select the version of Kafka that corresponds to the version you are using (if there is no exact match then select the most recent version). For the "Zookeeper quorum list" property enter "localhost:2181". For the "broker list" property enter "localhost:9092".
Now we will configure the Kerberos-related properties of "tKafkaConnection". Select the "Use kerberos authentication" checkbox and some additional configuration properties will appear. For "JAAS configuration path" you need to enter the path of the "client.jaas" file as described in the tutorial to set up the Kafka test-case. You can leave the "Kafka brokers principal name" property as the default value ("kafka"). Finally, select the "Set kerberos configuration path" property and enter the path of the "krb5.conf" file supplied in the target directory of the Apache Kerby test-case.
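To make it clearer what these checkboxes correspond to, here is a rough sketch of the equivalent settings for a plain Java Kafka client. It is not the code that Talend generates. The class and method names are hypothetical, the file paths are placeholders, and "SASL_PLAINTEXT" is an assumption that the broker from the test-case does not use TLS. The system property names and Kafka configuration keys themselves are the standard ones.

```java
import java.util.Properties;

// Sketch only: mirrors the Kerberos fields of "tKafkaConnection" as plain
// Java settings. Class name, method names and paths are hypothetical.
class KerberosClientSettings {

    // "JAAS configuration path" and "Set kerberos configuration path"
    // correspond to these two standard JVM system properties.
    static void applyJvmSettings(String jaasPath, String krb5Path) {
        System.setProperty("java.security.auth.login.config", jaasPath);
        System.setProperty("java.security.krb5.conf", krb5Path);
    }

    // Kafka client properties for SASL/GSSAPI (Kerberos) authentication.
    static Properties securityProperties(String brokerPrincipalName) {
        Properties props = new Properties();
        // Assumption: the broker listens on SASL without TLS.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        props.setProperty("sasl.mechanism", "GSSAPI");
        // "Kafka brokers principal name" in the Studio, default "kafka".
        props.setProperty("sasl.kerberos.service.name", brokerPrincipalName);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder paths; substitute the files from your own test-case.
        applyJvmSettings("/path/to/client.jaas", "/path/to/target/krb5.conf");
        Properties props = securityProperties("kafka");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The "Kafka brokers principal name" value must match the first part of the broker's Kerberos principal, which is why the default of "kafka" works when the broker principal starts with "kafka/".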
Now click on "tKafkaInput". Select the checkbox for "Use an existing connection" and select the "tKafkaConnection" component in the resulting component list. For "topic name" specify "test". The "Consumer group id" can stay as the default "mygroup".
Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. Send some data via the producer to the "test" topic and you should see the data appear in the Run Window in the Studio.