Thursday, April 20, 2017

Securing Apache Hadoop Distributed File System (HDFS) - part II

This is the second in a series of posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. In this post we will look at how to use Apache Ranger to authorize access to data stored in HDFS. The Apache Ranger Admin console allows you to create policies which are retrieved and enforced by a HDFS authorization plugin. Apache Ranger allows us to create centralized authorization policies for HDFS, as well as an authorization audit trail stored in SOLR or HDFS.

1) Install the Apache Ranger HDFS plugin

First we will install the Apache Ranger HDFS plugin. Follow the steps in the previous tutorial to setup Apache Hadoop, if you have not done this already. Then download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs that were fixed for the installation process, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-hdfs-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-hdfs-plugin ${ranger.hdfs.home}
Now go to ${ranger.hdfs.home} and edit "". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "HDFSTest".
  • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Hadoop installation
Save "" and install the plugin as root via "sudo ./". The Apache Ranger HDFS plugin should now be successfully installed. Start HDFS with:
  • sbin/
2) Create authorization policies in the Apache Ranger Admin console

Next we will use the Apache Ranger admin console to create authorization policies for our data in HDFS. Follow the steps in this tutorial to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start" and open a browser and go to "http://localhost:6080/" and log on with "admin/admin". Add a new HDFS service with the following configuration values:
  • Service Name: HDFSTest
  • Username: admin
  • Password: admin
  • Namenode URL: hdfs://localhost:9000
Click on "Test Connection" to verify that we can connect successfully to HDFS + then save the new service. Now click on the "HDFSTest" service that we have created. Add a new policy for the "/data" resource path for the user "alice" (create this user if you have not done so already under "Settings, Users/Groups"), with permissions of "read" and "execute".

3) Testing authorization in HDFS

Now let's test the Ranger authorization policy we created above in action. Note that by default the HDFS authorization plugin checks for a Ranger authorization policy that grants access first, and if this fails it falls back to the default POSIX permissions. The Ranger authorization plugin will pull policies from the Admin service every 30 seconds by default. For the "HDFSTest" example above, they are stored in "/etc/ranger/HDFSTest/policycache/" by default. Make sure that the user you are running Hadoop as can access this directory.

Now let's test to see if I can read the data file as follows:
  • bin/hadoop fs -cat /data/LICENSE* (this should work via the underlying POSIX permissions)
  • sudo -u alice bin/hadoop fs -cat /data/LICENSE* (this should work via the Ranger authorization policy)
  • sudo -u bob bin/hadoop fs -cat /data/LICENSE* (this should fail as we don't have an authorization policy for "bob").

No comments:

Post a Comment