Open Source Security: Securing Apache Sqoop

This is the first in a series of posts on how to secure Apache Sqoop. Apache Sqoop is a tool for transferring bulk data, mainly between HDFS and relational databases, though it also supports other projects such as Apache Kafka. In this post we will set up Apache Sqoop for a simple use-case: transferring a file from HDFS to Apache Kafka. Subsequent posts will show how to authorize this data transfer using both Apache Ranger and Apache Sentry.

Note that we will only use Sqoop 2 (current version 1.99.7), as this is the only version that both Sentry and Ranger support. However, this version is not (yet) recommended for production deployment.

1) Set up Apache Hadoop and Apache Kafka

First we will set up Apache Hadoop and Apache Kafka. The use-case is that we want to transfer a file from HDFS (/data/LICENSE.txt) to a Kafka topic (test). Follow part (1) of an earlier tutorial I wrote about installing Apache Hadoop. The following change is also required for 'etc/hadoop/core-site.xml' (in addition to the "fs.defaultFS" setting that is configured in the earlier tutorial):

	<configuration>
	<property>
	<name>hadoop.proxyuser.sqoop2.groups</name>
	<value>*</value>
	</property>
	<property>
	<name>hadoop.proxyuser.sqoop2.hosts</name>
	<value>*</value>
	</property>
	</configuration>

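To confirm that both proxy-user entries are present in the file, a quick check along the following lines can help. This is only a minimal sketch: the function name has_proxyuser_settings is hypothetical, and it simply greps for the two property names. It does not validate the XML or inspect the configured values.

```shell
# Hypothetical helper: succeeds only if both proxy-user properties for the
# given user are named somewhere in the given core-site.xml.
# Note: the dots in the property names are treated as regex wildcards by grep,
# which is good enough for a sanity check.
has_proxyuser_settings() {
    file="$1"
    user="$2"
    grep -q "<name>hadoop.proxyuser.${user}.groups</name>" "$file" &&
    grep -q "<name>hadoop.proxyuser.${user}.hosts</name>" "$file"
}

# Usage (run from the Hadoop home directory):
# has_proxyuser_settings etc/hadoop/core-site.xml sqoop2 && echo "proxy user configured"
```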

Make sure that LICENSE.txt is uploaded to the /data directory as outlined in the tutorial. Now we will set up Apache Kafka. Download Apache Kafka and extract it (1.0.0 was used for the purposes of this tutorial). Start Zookeeper with:

bin/zookeeper-server-start.sh config/zookeeper.properties

Each of these scripts runs in the foreground, so open a second terminal to start the broker, and a third to create a "test" topic:

bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Finally let's set up a consumer for the "test" topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties

2) Set up Apache Sqoop

Download Apache Sqoop and extract it (1.99.7 was used for the purposes of this tutorial).

2.a) Configure + start Sqoop

Before starting Sqoop, edit 'conf/sqoop.properties' and change the following property to point instead to the Hadoop configuration directory (e.g. /path.to.hadoop/etc/hadoop):

org.apache.sqoop.submission.engine.mapreduce.configuration.directory
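If you would rather script this edit than make it by hand, something like the following sketch could work. The function name set_sqoop_hadoop_conf_dir is hypothetical, and the sketch assumes the property appears uncommented at the start of a line in key=value form, and that the directory path contains no '|' or '&' characters (which sed would interpret specially).

```shell
# Hypothetical helper: rewrite the MapReduce configuration-directory property
# in a sqoop.properties file. The result is staged in a separate file and then
# moved into place, so the sketch does not depend on any sed -i dialect.
set_sqoop_hadoop_conf_dir() {
    props="$1"      # e.g. conf/sqoop.properties
    conf_dir="$2"   # e.g. /path.to.hadoop/etc/hadoop
    key="org.apache.sqoop.submission.engine.mapreduce.configuration.directory"
    staged="${props}.staged"
    sed "s|^${key}=.*|${key}=${conf_dir}|" "$props" > "$staged" &&
    mv "$staged" "$props"
}

# Usage (run from the Sqoop home directory):
# set_sqoop_hadoop_conf_dir conf/sqoop.properties /path.to.hadoop/etc/hadoop
```

All other lines of the file are passed through unchanged. If the property is missing or commented out, the file is left as it was, so it is worth checking the result afterwards.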

Then configure and start Apache Sqoop with the following commands:

export HADOOP_HOME=/path.to.hadoop    # path to the Hadoop home directory
bin/sqoop2-tool upgrade
bin/sqoop2-tool verify
bin/sqoop2-server start
# to shut the server down again later: bin/sqoop2-server stop
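A wrong HADOOP_HOME is an easy mistake to make at this step, so it can be worth a quick sanity check before running the Sqoop tools. The sketch below uses a hypothetical function name, looks_like_hadoop_home, and assumes the standard layout used earlier in this post, where the Hadoop configuration lives under etc/hadoop.

```shell
# Hypothetical pre-flight check: succeeds only if the given directory looks
# like a Hadoop home, i.e. it contains etc/hadoop/core-site.xml.
looks_like_hadoop_home() {
    home_dir="$1"
    [ -n "$home_dir" ] && [ -f "$home_dir/etc/hadoop/core-site.xml" ]
}

# Usage:
# looks_like_hadoop_home "$HADOOP_HOME" || echo "HADOOP_HOME is unset or wrong" >&2
```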

2.b) Configure links/job in Sqoop

Now that Sqoop has started we need to configure it to transfer data from HDFS to Kafka. Start the Shell via:

bin/sqoop2-shell

"show connector" lists the connectors that are available. We first need to configure a link for the HDFS connector:

create link -connector hdfs-connector
Name: HDFS
URI: hdfs://localhost:9000
Conf directory: Path to Hadoop conf directory

Similarly, for the Kafka connector:

create link -connector kafka-connector
Name: KAFKA
Kafka brokers: localhost:9092
Zookeeper quorum: localhost:2181

"show link" shows the links we've just created. Now we need to create a job from the HDFS link to the Kafka link as follows (accepting the default values if they are not specified below):

create job -f HDFS -t KAFKA
Name: testjob
Input Directory: /data
Topic: test

We can see the job we've created with "show job". Now let's start the job:

start job -name testjob

You should see the content of the HDFS "/data" directory (i.e. the LICENSE.txt) appear in the window of the Kafka "test" consumer, thus showing that Sqoop has transferred data from HDFS to Kafka.

Friday, January 26, 2018

Securing Apache Sqoop - part I
