Wednesday, September 20, 2017

Securing Apache Hive - part VI

This is the sixth and final blog post in a series of articles on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. The fifth post looked at an alternative authorization solution called Apache Sentry.

In this post we will switch our attention from authorization to authentication, and show how we can authenticate Apache Hive users via Kerberos.

1) Set up a KDC using Apache Kerby

A GitHub project that uses Apache Kerby to start up a KDC is available here:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test Kerberos with various big data deployments, such as Apache Hadoop etc.
The KDC is a simple JUnit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals for both Apache Hadoop and Apache Hive:
  • hdfs/
  • HTTP/
  • mapred/
  • hiveserver2/
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.

2) Configure Apache Hadoop to use Kerberos

The next step is to configure Apache Hadoop to use Kerberos. As a prerequisite, follow the first tutorial on Apache Hive so that the Hadoop data and Hive table are set up before we add Kerberos to the mix. Next, follow the steps in section (2) of an earlier tutorial that I wrote on configuring Hadoop with Kerberos. Some additional steps are also required when configuring Hadoop for use with Hive.

Edit 'etc/hadoop/core-site.xml' and add:
  • hadoop.proxyuser.hiveserver2.groups: *
  • hadoop.proxyuser.hiveserver2.hosts: localhost
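As a rough sketch, the two proxy-user entries above would look something like this in Hadoop's XML property format. If 'etc/hadoop/core-site.xml' already has a "configuration" element, only the two "property" blocks need to be added inside it.

```xml
<configuration>
  <!-- Allow the hiveserver2 user to impersonate members of any group -->
  <property>
    <name>hadoop.proxyuser.hiveserver2.groups</name>
    <value>*</value>
  </property>
  <!-- Only accept impersonation requests coming from localhost -->
  <property>
    <name>hadoop.proxyuser.hiveserver2.hosts</name>
    <value>localhost</value>
  </property>
</configuration>
```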
The previous tutorial on securing HDFS with Kerberos did not specify any Kerberos configuration for MapReduce, as it was not required. For Apache Hive we need to configure MapReduce appropriately. We will simplify things by using a single principal for the Job Tracker, Task Tracker and Job History. Create a new file 'etc/hadoop/mapred-site.xml' with the following properties:
  • mapreduce.framework.name: classic
  • mapreduce.jobtracker.kerberos.principal: mapred/
  • mapreduce.jobtracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.tasktracker.kerberos.principal: mapred/
  • mapreduce.tasktracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.jobhistory.kerberos.principal:  mapred/
  • mapreduce.jobhistory.keytab.file: Path to Kerby mapred.keytab (see above).
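Put together, a 'etc/hadoop/mapred-site.xml' along these lines is one way to express the list above. This is only a sketch under a few assumptions: the bare "classic" entry is taken to be the value of mapreduce.framework.name, the Task Tracker principal is assumed to use the mapreduce.tasktracker.kerberos.principal key (mirroring the Job Tracker entry), and '/pathtokerby/target/mapred.keytab' is a placeholder for wherever Kerby wrote the keytab. The "mapred/" principal is shown exactly as listed above and needs to be completed with the host part used by the Kerby test.

```xml
<configuration>
  <!-- Assumption: "classic" is the value of mapreduce.framework.name -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>classic</value>
  </property>

  <!-- Job Tracker: principal as listed above (host part not shown there) -->
  <property>
    <name>mapreduce.jobtracker.kerberos.principal</name>
    <value>mapred/</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.keytab.file</name>
    <!-- Placeholder path to the keytab created by Kerby -->
    <value>/pathtokerby/target/mapred.keytab</value>
  </property>

  <!-- Task Tracker: same principal and keytab (key name is an assumption) -->
  <property>
    <name>mapreduce.tasktracker.kerberos.principal</name>
    <value>mapred/</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.keytab.file</name>
    <value>/pathtokerby/target/mapred.keytab</value>
  </property>

  <!-- Job History: same principal and keytab, key names as listed above -->
  <property>
    <name>mapreduce.jobhistory.kerberos.principal</name>
    <value>mapred/</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.keytab.file</name>
    <value>/pathtokerby/target/mapred.keytab</value>
  </property>
</configuration>
```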
Start Kerby by running the JUnit test as described in the first section. Now start HDFS via:
  • sbin/
  • sudo sbin/
3) Configure Apache Hive to use Kerberos

Next we will configure Apache Hive to use Kerberos. Edit 'conf/hiveserver2-site.xml' and add the following properties:
  • hive.server2.authentication: kerberos
  • hive.server2.authentication.kerberos.principal: hiveserver2/
  • hive.server2.authentication.kerberos.keytab: Path to Kerby hiveserver2.keytab (see above).
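For reference, here is a minimal sketch of how these three properties might appear in 'conf/hiveserver2-site.xml'. The keytab path is a placeholder based on the Kerby "target" folder, and the "hiveserver2/" principal is shown as listed above, so the host part still has to be filled in to match the principal that Kerby created.

```xml
<configuration>
  <!-- Switch HiveServer2 authentication to Kerberos -->
  <property>
    <name>hive.server2.authentication</name>
    <value>kerberos</value>
  </property>
  <!-- Service principal as listed above (host part not shown there) -->
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hiveserver2/</value>
  </property>
  <!-- Placeholder path to the hiveserver2 keytab created by Kerby -->
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/pathtokerby/target/hiveserver2.keytab</value>
  </property>
</configuration>
```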
Start Hive via 'bin/hiveserver2'. In a separate window, log on to beeline via the following steps:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/beeline -u "jdbc:hive2://localhost:10000/default;principal=hiveserver2/"
At this point authentication is successful and we should be able to query the "words" table as per the first tutorial.
