Wednesday, November 29, 2017

Authorizing access to Apache Yarn using Apache Ranger

Earlier this year, I wrote a series of blog posts on how to secure access to the Apache Hadoop Distributed File System (HDFS), using tools like Apache Ranger and Apache Atlas. In this post, we will go further and show how to authorize access to Apache Yarn using Apache Ranger. Apache Ranger allows us to create and enforce authorization policies based on who is allowed to submit applications to run on Apache Yarn. It can therefore be used to enforce authorization decisions for Hive on Yarn or Spark on Yarn jobs.

1) Installing Apache Hadoop

First, follow the steps outlined in the earlier tutorial (section 1) on setting up Apache Hadoop, except that in this tutorial we will work with Apache Hadoop 2.8.2. We will also need to follow some additional steps to configure Yarn (see here for the official documentation). Create a new file called 'etc/hadoop/mapred-site.xml' with the following content:
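The snippet below matches the standard configuration from the official single-node setup instructions, and tells MapReduce jobs to run on Yarn:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>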
Next edit 'etc/hadoop/yarn-site.xml' and add:
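Again following the official instructions, the snippet below enables the shuffle service that MapReduce jobs require:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>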
Now we can start Apache Yarn via 'sbin/start-yarn.sh'. We are going to submit jobs as a local user called "alice" to test authorization. First we need to create some directories in HDFS:
  • bin/hdfs dfs -mkdir -p /user/alice/input
  • bin/hdfs dfs -put etc/hadoop/*.xml /user/alice/input
  • bin/hadoop fs -chown -R alice /user/alice
  • bin/hadoop fs -mkdir /tmp
  • bin/hadoop fs -chmod og+w /tmp
Now we can submit an example job as "alice" via:
  • sudo -u alice bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'
The job should run successfully and store the output in '/user/alice/output'. Delete this directory before trying to run the job again ('bin/hadoop fs -rm -r /user/alice/output').
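To inspect the results, you can cat the output files that the job wrote to HDFS, for example:
  • sudo -u alice bin/hadoop fs -cat /user/alice/output/*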

2) Install the Apache Ranger Yarn plugin

Next we will install the Apache Ranger Yarn plugin. Download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs in the installation process that have since been fixed, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-yarn-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-yarn-plugin ${ranger.yarn.home}
Now go to ${ranger.yarn.home} and edit "install.properties". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "YarnTest"
  • COMPONENT_INSTALL_DIR_NAME: Set this to the location of your Apache Hadoop installation
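For example, the relevant lines of "install.properties" might look as follows, where '/opt/hadoop' is just a placeholder for wherever Apache Hadoop is installed on your system:

  POLICY_MGR_URL=http://localhost:6080
  REPOSITORY_NAME=YarnTest
  COMPONENT_INSTALL_DIR_NAME=/opt/hadoop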
Save "install.properties" and install the plugin as root via "sudo -E ./enable-yarn-plugin.sh". Make sure that the user who is running Yarn has the permission to read the policies stored in '/etc/ranger/YarnTest'. There is one additional step to be performed in Hadoop before restarting Yarn. Edit 'etc/hadoop/ranger-yarn-security.xml' and add a property called "ranger.add-yarn-authorization" with value "false". This means that if Ranger policy authorization fails, it doesn't fall back to the default Yarn ACLs (which allow all users to submit jobs to the default queue).

Finally, re-start Yarn and try to resubmit the job as "alice" as per the previous section. You should now see an authorization error: "User alice cannot submit applications to queue root.default".

3) Create authorization policies in the Apache Ranger Admin console

Next we will use the Apache Ranger admin console to create authorization policies for Yarn. Follow the steps in this tutorial to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start", open a browser at "http://localhost:6080/", and log on with "admin/admin". Add a new Yarn service with the following configuration values:
  • Service Name: YarnTest
  • Username: admin
  • Password: admin
  • Yarn REST URL: http://localhost:8088
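Before clicking "Test Connection" in the next step, you can check that the Yarn REST URL is reachable directly, for example:
  • curl http://localhost:8088/ws/v1/cluster/info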
Click on "Test Connection" to verify that we can connect successfully to Yarn + then save the new service. Now click on the "YarnTest" service that we have created. Add a new policy for the "root.default" queue for the user "alice" (create this user if you have not done so already under "Settings, Users/Groups"), with a permission of "submit-app".

Allow up to 30 seconds for the Apache Ranger plugin to download the new authorization policy from the admin service. Then try to re-run the job as "alice". This time it should succeed due to the authorization policy that we have created.
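For example (removing the output directory from the earlier successful run first):
  • bin/hadoop fs -rm -r /user/alice/output
  • sudo -u alice bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'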

Tuesday, November 28, 2017

Installing the Apache Kerby KDC

Apache Kerby is a subproject of the Apache Directory project, and is a complete open-source KDC written entirely in Java. Apache Kerby 1.1.0 has just been released. This release contains two major new features: a GSSAPI module (covered previously here) and cross-realm support (the subject of a forthcoming blog post).

I have previously used Apache Kerby in this blog as a KDC to illustrate some security-based test-cases for big data components such as Apache Hadoop, Hive, Storm, etc., by pointing to some code on GitHub that shows how to launch a Kerby KDC using Apache Maven. This is convenient, as a KDC can be launched with the principals already created via a single Maven command. However, it is not suitable if the KDC is to be used in a standalone setting.

In this post, we will show how to create a Kerby KDC distribution without writing any code.

1) Install and configure the Apache Kerby KDC

The first step is to download the Apache Kerby source code. Unzip the source and build the distribution via:
  • mvn clean install -DskipTests
  • cd kerby-dist
  • mvn package -Pdist
The "kerby-dist" directory contains the KDC distribution in "kdc-dist", as well as the client tools in "tool-dist". Copy both "kdc-dist" and "tool-dist" directories to another location instead of working directly in the Kerby source. In "kdc-dist" create a directory called "keytabs" and "runtime". Then create some keytabs via:
  • sh bin/kdcinit.sh conf keytabs
This will create keytabs for the "kadmin" and "protocol" principals, and store them in the "keytabs" directory. For testing purposes, we will change the port of the KDC from the default "88" to "12345" to avoid having to run the KDC with administrator privileges. Edit "conf/krb5.conf" and "conf/kdc.conf" and change "88" to "12345".
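For example, assuming "88" occurs only as the port number in these files, a quick sed (GNU sed syntax) will do:
  • sed -i 's/88/12345/g' conf/krb5.conf conf/kdc.conf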

The Kerby principals are stored in a backend that is configured in "conf/backend.conf". By default this is a JSON file that is stored in "/tmp/kerby/jsonbackend". However, Kerby also supports other more robust backends, such as LDAP, Mavibot, Zookeeper, etc.
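For reference, the default JSON backend configuration in "conf/backend.conf" looks something like the following (check the file shipped with your distribution for the exact class and property names):

  kdc_identity_backend = org.apache.kerby.kerberos.kdc.identitybackend.JsonIdentityBackend
  backend.json.dir = /tmp/kerby/jsonbackend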

We can start the KDC via:
  • sh bin/start-kdc.sh conf runtime
Let's create a new user called "alice":
  • sh bin/kadmin.sh conf/ -k keytabs/admin.keytab
  • addprinc -pw password alice@EXAMPLE.COM
2) Install and configure the Apache Kerby tool dist

We can check that the KDC has started properly using the MIT kinit tool, if it is installed locally:
  • export KRB5_CONFIG=/path.to.kdc.dist/conf/krb5.conf
  • kinit alice (use "password" for the password when prompted)
Now you can see the ticket for alice using "klist". Apache Kerby also ships a "tool-dist" distribution that contains implementations of "kinit", "klist", etc. First call "kdestroy" to remove the ticket previously obtained for "alice". Then go into the directory where you copied "tool-dist" in the previous section. Edit "conf/krb5.conf" and replace "88" with "12345". We can now obtain a ticket for "alice" via:
  • sh bin/kinit.sh -conf conf alice
  • sh bin/klist.sh