Monday, January 29, 2018

Securing Apache Sqoop - part II

This is the second in a series of posts on how to secure Apache Sqoop. The first post looked at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. In this post we will look at securing Apache Sqoop with Apache Ranger, such that only authorized users can interact with it. We will then show how to use the Apache Ranger Admin UI to create authorization policies for Apache Sqoop.

1) Install the Apache Ranger Sqoop plugin

If you have not done so already, please follow the steps in the earlier tutorial to set up Apache Sqoop. First we will install the Apache Ranger Sqoop plugin. Download Apache Ranger and verify that the signature is valid and that the message digests match. As some bugs in the plugin installation process have only been fixed since the last release, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-sqoop-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-sqoop-plugin ${ranger.sqoop.home}
Now go to ${ranger.sqoop.home} and edit "install.properties". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "SqoopTest".
  • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Sqoop installation
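For example, with Apache Sqoop installed in '/opt/sqoop' (a placeholder path; substitute your own installation directory), the relevant lines in "install.properties" would look like:

  POLICY_MGR_URL=http://localhost:6080
  REPOSITORY_NAME=SqoopTest
  COMPONENT_INSTALL_DIR_NAME=/opt/sqoop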
Save "install.properties" and install the plugin as root via "sudo -E ./enable-sqoop-plugin.sh". Make sure that the user you are running Sqoop as has permission to access '/etc/ranger/SqoopTest', which is where the Ranger plugin for Sqoop will download authorization policies created in the Ranger Admin UI.

In the Apache Sqoop directory, copy 'conf/ranger-sqoop-security.xml' to the root of the installation (or else add the 'conf' directory to the Sqoop classpath). Now restart Apache Sqoop and try to list the connectors that were installed:
  • bin/sqoop2-server start
  • bin/sqoop2-shell
  • show connector
You should see an empty list here as you are not authorized to see the connectors. Note that "show job" should still work OK, as you have permission to view jobs that you created.

2) Create authorization policies in the Apache Ranger Admin console

Next we will use the Apache Ranger admin console to create authorization policies for Sqoop. Follow the steps in this tutorial (except use at least Ranger 1.0.0) to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start", open a browser at "http://localhost:6080/", and log on with "admin/admin". Add a new Sqoop service with the following configuration values:
  • Service Name: SqoopTest
  • Username: admin
  • Sqoop URL: http://localhost:12000
Note that "Test Connection" is not going to work here, as the "admin" user is not authorized at this stage to read from the Sqoop 2 server. However, once the service is created and the policies synced to the Ranger plugin in Sqoop (roughly every 30 seconds by default), it should work correctly.

Once the "SqoopTest" service is created, we will create some authorization policies for the user who is using the Sqoop Shell.
Click on "Settings" and "Users/Groups" and add a new user corresponding to the user for whom you wish to create authorization policies. When this is done then click on the "SqoopTest" service and edit the existing policies, adding this user (for example):


Wait 30 seconds for the policies to sync to the Ranger plugin that is co-located with the Sqoop service. Now restart the shell, and "show connector" should list the full range of Sqoop connectors, as authorization has succeeded. Similar policies could be created to allow only certain users to run jobs created by other users.



Friday, January 26, 2018

Securing Apache Sqoop - part I

This is the first in a series of posts on how to secure Apache Sqoop. Apache Sqoop is a tool for bulk data transfer, mainly between HDFS and relational databases, although it also supports other projects such as Apache Kafka. In this post we will look at how to set up Apache Sqoop to perform a simple use-case of transferring a file from HDFS to Apache Kafka. Subsequent posts will show how to authorize this data transfer using both Apache Ranger and Apache Sentry.

Note that we will only use Sqoop 2 (current version 1.99.7), as this is the only version that both Sentry and Ranger support. However, this version is not (yet) recommended for production deployment.

1) Set up Apache Hadoop and Apache Kafka

First we will set up Apache Hadoop and Apache Kafka. The use-case is that we want to transfer a file from HDFS (/data/LICENSE.txt) to a Kafka topic (test). Follow part (1) of an earlier tutorial I wrote about installing Apache Hadoop. The following change is also required for 'etc/hadoop/core-site.xml' (in addition to the "fs.defaultFS" setting that is configured in the earlier tutorial):
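What is needed here is typically the proxy-user configuration that allows the user running the Sqoop server to impersonate other users. A minimal sketch, where "sqoopuser" is a placeholder for the user you will run Sqoop as:

  <property>
    <name>hadoop.proxyuser.sqoopuser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.sqoopuser.groups</name>
    <value>*</value>
  </property>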

Make sure that LICENSE.txt is uploaded to the /data directory as outlined in the tutorial. Now we will set up Apache Kafka. Download Apache Kafka and extract it (1.0.0 was used for the purposes of this tutorial). Start Zookeeper with:
  • bin/zookeeper-server-start.sh config/zookeeper.properties
and start the broker and then create a "test" topic with:
  • bin/kafka-server-start.sh config/server.properties
  • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Finally let's set up a consumer for the "test" topic:
  • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties
2) Set up Apache Sqoop

Download Apache Sqoop and extract it (1.99.7 was used for the purposes of this tutorial).

2.a) Configure + start Sqoop

Before starting Sqoop, edit 'conf/sqoop.properties' and change the following property to point instead to the Hadoop configuration directory (e.g. /path.to.hadoop/etc/hadoop):
  • org.apache.sqoop.submission.engine.mapreduce.configuration.directory
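For example, assuming Hadoop is installed in '/opt/hadoop' (a placeholder path), the line in 'conf/sqoop.properties' would read:
  • org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/etc/hadoop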
Then configure and start Apache Sqoop with the following commands:
  • export HADOOP_HOME=/path/to/hadoop
  • bin/sqoop2-tool upgrade
  • bin/sqoop2-tool verify
  • bin/sqoop2-server start (and "bin/sqoop2-server stop" to shut the server down)
2.b) Configure links/job in Sqoop

Now that Sqoop has started we need to configure it to transfer data from HDFS to Kafka. Start the Shell via:
  • bin/sqoop2-shell
"show connector" lists the connectors that are available. We first need to configure a link for the HDFS connector:
  • create link -connector hdfs-connector
  • Name: HDFS
  • URI: hdfs://localhost:9000
  • Conf directory: Path to Hadoop conf directory
Similarly, for the Kafka connector:
  • create link -connector kafka-connector
  • Name: KAFKA
  • Kafka brokers: localhost:9092
  • Zookeeper quorum: localhost:2181
"show link" shows the links we've just created. Now we need to create a job from the HDFS link to the Kafka link as follows (accepting the default values if they are not specified below):
  • create job -f HDFS -t KAFKA
  • Name: testjob
  • Input Directory: /data
  • Topic: test
We can see the job we've created with "show job". Now let's start the job:
  • start job -name testjob 
You should see the content of the HDFS "/data" directory (i.e. the LICENSE.txt) appear in the window of the Kafka "test" consumer, showing that Sqoop has transferred data from HDFS to Kafka.
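You can also check on the job from the shell; for example (assuming the 1.99.7 shell syntax):
  • status job -name testjob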

Tuesday, January 23, 2018

Securing Apache Solr with Apache Sentry

Last year I wrote a series of posts on securing Apache Solr, firstly using basic authentication and then using Apache Ranger for authorization. In this post we will look at an alternative authorization solution called Apache Sentry. Previously I have blogged about using Apache Sentry to secure Apache Hive and Apache Kafka.

1) Install and deploy a SolrCloud example

Download and extract Apache Solr (7.1.0 was used for the purpose of this tutorial). Now start SolrCloud via:
  • bin/solr -e cloud
Accept all of the default options. This creates a cluster of two nodes, with a collection "gettingstarted" split into two shards and two replicas per shard. A web interface is available after startup at: http://localhost:8983/solr/. Once the cluster is up and running we can post some data to the collection we have created via:
  • bin/post -c gettingstarted example/exampledocs/books.csv
We can then perform a search for all books with author "George R.R. Martin" via:
  • curl http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin
2) Authenticating users to our SolrCloud instance

Now that our SolrCloud instance is up and running, let's look at how we can secure access to it, by using HTTP Basic Authentication to authenticate our REST requests. Download the following security configuration which enables Basic Authentication in Solr:
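A minimal version of this file looks something like the following sketch. The credential strings are of the form "hash salt"; the value below is the well-known hash of "SolrRocks" from the Solr documentation, reused here for both users:

  {
    "authentication": {
      "blockUnknown": true,
      "class": "solr.BasicAuthPlugin",
      "credentials": {
        "alice": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
        "bob": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
      }
    }
  }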
Two users are defined - "alice" and "bob" - both with password "SolrRocks". Now upload this configuration to the Apache Zookeeper instance that is running with Solr:
  • server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json
Now try to run the search query above again using Curl. A 401 error will be returned. Once we specify the correct credentials then the request will work as expected, e.g.:
  • curl -u alice:SolrRocks http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin 
3) Using Apache Sentry for authorization

a) Install the Apache Sentry distribution

Download the binary distribution of Apache Sentry (2.0.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match. Now extract it to ${sentry.home}. Apache Sentry provides an RPC service which stores authorization privileges in a database. For the purposes of this tutorial we will just configure the authorization privileges in a configuration file local to the Solr distribution. Therefore we don't need to do any further configuration to the Apache Sentry distribution at this point.

b) Copy Apache Sentry jars into Apache Solr 

To get Sentry authorization working in Apache Solr, we need to copy some jars from the Sentry distribution into Solr. Copy the following jars from ${sentry.home}/lib into ${solr.home}/server/solr-webapp/webapp/WEB-INF/lib:
  • sentry-binding-solr-2.0.0.jar
  • sentry-core-model-solr-2.0.0.jar
  • sentry-core-model-db-2.0.0.jar
  • sentry-core-common-2.0.0.jar
  • shiro-core-1.4.0.jar
  • sentry-policy*.jar
  • sentry-provider-*
c) Add Apache Sentry configuration files

Next we will configure Apache Solr to use Apache Sentry for authorization. Create a new file in the Solr distribution called "sentry-site.xml" with the following content (substituting the correct directory for "sentry.solr.provider.resource"):
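A minimal sketch, assuming the file-based provider that ships with Sentry ("/path/to/sentry.ini" is a placeholder):

  <configuration>
    <property>
      <name>sentry.provider</name>
      <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
    </property>
    <property>
      <name>sentry.solr.provider.resource</name>
      <value>file:///path/to/sentry.ini</value>
    </property>
  </configuration>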
This is the configuration file for the Sentry plugin for Solr. It essentially says that the authorization privileges are stored in a local file, and that the groups for authenticated users should be retrieved from this file. Finally, we need to specify the authorization privileges. Create a new file in the config directory called "sentry.ini" with the following content:
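A sketch of such a file, granting the "alice" user full access to the "gettingstarted" collection (the group, role, and privilege strings here are illustrative; check the Sentry documentation for the exact privilege syntax):

  [users]
  alice = admin_group

  [groups]
  admin_group = admin_role

  [roles]
  admin_role = collection=gettingstarted->action=*, collection=admin->action=*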

This configuration file contains three separate sections. The "[users]" section maps the authenticated principals to local groups. The "[groups]" section maps the groups to roles, and the "[roles]" section lists the actual privileges.

d) Update security.json to add authorization

Next we need to update the security.json to reference Apache Sentry for authorization. Use the following content, substituting the correct path for the "authorization.sentry.site" parameter. Also change the "superuser" to the user running Solr:
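A sketch of the updated file, assuming the plugin class shipped in the Sentry Solr binding (verify the class name against your Sentry distribution; the paths and the "solr" superuser are placeholders):

  {
    "authentication": {
      "blockUnknown": true,
      "class": "solr.BasicAuthPlugin",
      "credentials": {
        "alice": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
        "bob": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
      }
    },
    "authorization": {
      "class": "org.apache.sentry.binding.solr.authz.SentrySolrPluginImpl",
      "authorization.sentry.site": "file:///path/to/sentry-site.xml",
      "superuser": "solr"
    }
  }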

Upload this file via:
  • server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json

4) Testing authorization

We need to restart Apache Solr to enable authorization with Apache Sentry. Stop Solr via:
  • bin/solr stop -all
Next edit 'bin/solr.in.sh' and add the following properties:
  • SOLR_AUTH_TYPE="basic"
  • SOLR_AUTHENTICATION_OPTS="-Dbasicauth=colm:SolrRocks"
Now restart Apache Solr and test authorization. When "bob" is used, an error should be returned (either 403 or in our case 500, as we have not configured a group for "bob"). "alice" should be able to query the collection, due to the authorization policy we have created for her.
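For example, re-using the query from earlier:
  • curl -u alice:SolrRocks http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin
  • curl -u bob:SolrRocks http://localhost:8983/solr/gettingstarted/query?q=author:George+R.R.+Martin
The first request should return the matching books, while the second should fail with an authorization error.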