Thursday, September 21, 2017

Configuring Kerberos for Hive in Talend Open Studio for Big Data

Earlier this year, I showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using Kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using Kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using Kerberos.

1) Download Talend Open Studio for Big Data and create a job

Download Talend Open Studio for Big Data (6.4.1 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and call the job "HiveKerberosRead". In the search bar under "Palette" on the right-hand side enter "hive" and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of the screen. Do the same for "tLogRow":

"tHiveConnection" will be used to configure the connection to Hive. "tHiveInput" will be used to perform a query on the "words" table we have created in Hive (as per the earlier tutorial linked above), and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHiveConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tHiveInput". Right click on "tHiveInput" and select "Row/Main" and drag the resulting line to "tLogRow":



2) Configure the components

Now let's configure the individual components. Double click on "tHiveConnection". Select the following configuration options:
  • Distribution: Hortonworks
  • Version: HDP V2.5.0
  • Host: localhost
  • Database: default
  • Select "Use Kerberos Authentication"
  • Hive Principal: hiveserver2/localhost@hadoop.apache.org
  • Namenode Principal: hdfs/localhost@hadoop.apache.org
  • Resource Manager Principal: mapred/localhost@hadoop.apache.org
  • Select "Use a keytab to authenticate"
  • Principal: alice
  • Keytab: Path to "alice.keytab" in the Kerby test project.
  • Unselect "Set Resource Manager"
  • Set Namenode URI: "hdfs://localhost:9000"

Now click on "tHiveInput" and select the following configuration options:
  • Select "Use an existing Connection"
  • Choose the tHiveConnection name from the resulting "Component List".
  • Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int. 
  • Table name: words
  • Query: "select * from words where word == 'Dare'"

Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":
Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see the following output in the Run Window in the Studio:
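If you want to sanity-check the connection settings outside the Studio, here is a rough standalone sketch in Java of roughly what the generated job does under the hood. It assumes the Hive JDBC driver and the Hadoop client libraries are on the classpath; the class name and the krb5.conf/keytab paths are placeholders for your own environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class HiveKerberosReadCheck {
    public static void main(String[] args) throws Exception {
        // Point the JVM at the krb5.conf generated by the Kerby test project
        System.setProperty("java.security.krb5.conf", "/path/to/kerby/target/krb5.conf");

        // Log in as "alice" using her keytab (paths are placeholders)
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab("alice", "/path/to/kerby/target/alice.keytab");

        // The principal in the URL is the HiveServer2 service principal, not the client principal
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default;"
            + "principal=hiveserver2/localhost@hadoop.apache.org";

        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("select * from words where word == 'Dare'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getInt(2));
            }
        }
    }
}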

Wednesday, September 20, 2017

Securing Apache Hive - part VI

This is the sixth and final blog post in a series of articles on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. The fifth post looked at an alternative authorization solution called Apache Sentry.

In this post we will switch our attention from authorization to authentication, and show how we can authenticate Apache Hive users via Kerberos.

1) Set up a KDC using Apache Kerby

A github project that uses Apache Kerby to start up a KDC is available here:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test Kerberos with various big data deployments, such as Apache Hadoop.
The KDC is a simple JUnit test that is available here. To run it, just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals for both Apache Hadoop and Apache Hive:
  • hdfs/localhost@hadoop.apache.org
  • HTTP/localhost@hadoop.apache.org
  • mapred/localhost@hadoop.apache.org
  • hiveserver2/localhost@hadoop.apache.org
  • alice@hadoop.apache.org 
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.
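For reference, here is a minimal sketch of what such a Kerby-based test KDC looks like in code, using the SimpleKdcServer API from Apache Kerby. The realm and principal names match the list above, but the passwords and keytab handling are simplified and not taken from the actual test.

import java.io.File;

import org.apache.kerby.kerberos.kerb.server.SimpleKdcServer;

public class TestKdc {
    public static void main(String[] args) throws Exception {
        SimpleKdcServer kerbyServer = new SimpleKdcServer();
        kerbyServer.setKdcRealm("hadoop.apache.org");
        kerbyServer.setAllowUdp(false);
        // Kerby writes a krb5.conf (containing the KDC port) into the work dir
        kerbyServer.setWorkDir(new File("target"));
        kerbyServer.init();
        kerbyServer.start();

        // Create a principal and export its keytab into the target folder
        kerbyServer.createPrincipal("alice@hadoop.apache.org", "alice");
        kerbyServer.exportPrincipal("alice@hadoop.apache.org", new File("target/alice.keytab"));
        // ...repeat for the hdfs, HTTP, mapred and hiveserver2 service principals
    }
}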

2) Configure Apache Hadoop to use Kerberos

The next step is to configure Apache Hadoop to use Kerberos. As a pre-requisite, follow the first tutorial on Apache Hive so that the Hadoop data and Hive table are set up before we apply Kerberos to the mix. Next, follow the steps in section (2) of an earlier tutorial I wrote on configuring Hadoop with Kerberos. Some additional steps are also required when configuring Hadoop for use with Hive.

Edit 'etc/hadoop/core-site.xml' and add:
  • hadoop.proxyuser.hiveserver2.groups: *
  • hadoop.proxyuser.hiveserver2.hosts: localhost
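In XML form, these two entries in 'etc/hadoop/core-site.xml' look like the following (values as above; tighten them for anything other than a local test):

<property>
  <name>hadoop.proxyuser.hiveserver2.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hiveserver2.hosts</name>
  <value>localhost</value>
</property>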
The previous tutorial on securing HDFS with Kerberos did not specify any Kerberos configuration for MapReduce, as it was not required. For Apache Hive we need to configure MapReduce appropriately. We will simplify things by using a single principal for the Job Tracker, Task Tracker and Job History. Create a new file 'etc/hadoop/mapred-site.xml' with the following properties:
  • mapreduce.framework.name: classic
  • mapreduce.jobtracker.kerberos.principal: mapred/localhost@hadoop.apache.org
  • mapreduce.jobtracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.tasktracker.kerberos.principal: mapred/localhost@hadoop.apache.org
  • mapreduce.tasktracker.keytab.file: Path to Kerby mapred.keytab (see above).
  • mapreduce.jobhistory.kerberos.principal:  mapred/localhost@hadoop.apache.org
  • mapreduce.jobhistory.keytab.file: Path to Kerby mapred.keytab (see above).
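As a sketch, the resulting 'etc/hadoop/mapred-site.xml' looks like the following, where the keytab path is a placeholder for wherever the Kerby test project wrote 'mapred.keytab', and the tasktracker and jobhistory entries follow the same pattern as the jobtracker ones:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>classic</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.kerberos.principal</name>
    <value>mapred/localhost@hadoop.apache.org</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.keytab.file</name>
    <value>/path/to/kerby/target/mapred.keytab</value>
  </property>
  <!-- mapreduce.tasktracker.* and mapreduce.jobhistory.* properties follow the same pattern -->
</configuration>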
Start Kerby by running the JUnit test as described in the first section. Now start HDFS via:
  • sbin/start-dfs.sh
  • sudo sbin/start-secure-dns.sh

3) Configure Apache Hive to use Kerberos

Next we will configure Apache Hive to use Kerberos. Edit 'conf/hiveserver2-site.xml' and add the following properties:
  • hive.server2.authentication: kerberos
  • hive.server2.authentication.kerberos.principal: hiveserver2/localhost@hadoop.apache.org
  • hive.server2.authentication.kerberos.keytab: Path to Kerby hiveserver2.keytab (see above).
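In XML form, 'conf/hiveserver2-site.xml' ends up looking like this (the keytab path is a placeholder for the Kerby target directory):

<configuration>
  <property>
    <name>hive.server2.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hiveserver2/localhost@hadoop.apache.org</value>
  </property>
  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/path/to/kerby/target/hiveserver2.keytab</value>
  </property>
</configuration>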
Start Hive via 'bin/hiveserver2'. In a separate window, log on to beeline via the following steps:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/beeline -u "jdbc:hive2://localhost:10000/default;principal=hiveserver2/localhost@hadoop.apache.org"
At this point authentication is successful and we should be able to query the "words" table as per the first tutorial.

Friday, September 15, 2017

Securing Apache Hive - part V

This is the fifth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In this post we will look at an alternative authorization solution called Apache Sentry.

1) Build the Apache Sentry distribution

First we will build and install the Apache Sentry distribution. Download Apache Sentry (1.8.0 was used for the purposes of this tutorial). Verify that the signature is valid and that the message digests match. Now extract and build the source and copy the distribution to a location where you wish to install it:
  • tar zxvf apache-sentry-1.8.0-src.tar.gz
  • cd apache-sentry-1.8.0-src
  • mvn clean install -DskipTests
  • cp -r sentry-dist/target/apache-sentry-1.8.0-bin ${sentry.home}
I previously covered the authorization plugin that Apache Sentry provides for Apache Kafka. In addition, Apache Sentry provides an authorization plugin for Apache Hive. For the purposes of this tutorial we will just configure the authorization privileges in a configuration file local to the Hive Server. Therefore we don't need to do any further configuration to the distribution at this point.

2) Install and configure Apache Hive

Please follow the first tutorial to install and configure Apache Hadoop if you have not already done so. Apache Sentry 1.8.0 does not support Apache Hive 2.1.x, so we will need to download and extract Apache Hive 2.0.1. Set the "HADOOP_HOME" environment variable to point to the Apache Hadoop installation directory above. Then follow the steps as outlined in the first tutorial to create the table in Hive and make sure that a query is successful.

3) Integrate Apache Sentry with Apache Hive

Now we will integrate Apache Sentry with Apache Hive. We need to add three new configuration files to the "conf" directory of Apache Hive.

3.a) Configure Apache Hive to use authorization

Create a file called 'conf/hiveserver2-site.xml' with the content:
Here we are enabling authorization and adding the Sentry authorization plugin.

3.b) Add Sentry plugin configuration

Create a new file in the "conf" directory of Apache Hive called "sentry-site.xml" with the following content:
This is the configuration file for the Sentry plugin for Hive. It essentially says that the authorization privileges are stored in a local file, and that the groups for authenticated users should be retrieved from this file. As we are not using Kerberos, the "testing.mode" configuration parameter must be set to "true".

3.c) Add the authorization privileges for our test-case

Next, we need to specify the authorization privileges. Create a new file in the config directory called "sentry.ini" with the following content:
Here we are granting the user "alice" a role which allows her to perform a "select" on the table "words".
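As a minimal sketch, a file-based policy along these lines can be expressed as follows. The group and role names are just illustrative, and "server1" stands for whatever server name the Sentry plugin is configured with in 'sentry-site.xml':

[users]
# map the authenticated user "alice" to a local group
alice = select_group

[groups]
# grant the group a role
select_group = select_role

[roles]
# the role allows "select" on the "words" table in the "default" database
select_role = server=server1->db=default->table=words->action=select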

3.d) Add Sentry libraries to Hive

Finally, we need to add the Sentry libraries to Hive. Copy the following files from ${sentry.home}/lib  to ${hive.home}/lib:
  • sentry-binding-hive-common-1.8.0.jar
  • sentry-core-model-db-1.8.0.jar
  • sentry*provider*.jar
  • sentry-core-common-1.8.0.jar
  • shiro-core-1.2.3.jar
  • sentry-policy*.jar
  • sentry-service-*.jar
In addition, we need "sentry-binding-hive-v2-1.8.0.jar", which is not bundled with the Apache Sentry distribution. This can be obtained from "http://repo1.maven.org/maven2/org/apache/sentry/sentry-binding-hive-v2/1.8.0/sentry-binding-hive-v2-1.8.0.jar".

4) Test authorization with Apache Hive

Now we can test authorization after restarting Apache Hive. The user 'alice' can query the table according to our policy:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n alice
  • select * from words where word == 'Dare'; (works)
However, the user 'bob' is denied access:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n bob
  • select * from words where word == 'Dare'; (fails)

Thursday, September 14, 2017

Securing Apache Hive - part IV

This is the fourth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query.

In this post we will show how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In the second post, we showed how to create a "resource" based policy for "alice" in Ranger, by granting "alice" the "select" permission for the "words" table. Instead, we can grant a user "bob" the "select" permission for a given "tag", which is synced into Ranger from Apache Atlas. This means that we can avoid managing specific resources in Ranger itself.

1) Start Apache Atlas and create entities/tags for Hive

First let's look at setting up Apache Atlas. Download the latest released version (0.8.1) and extract it. Build the distribution that contains an embedded HBase and Solr instance via:
  • mvn clean package -Pdist,embedded-hbase-solr -DskipTests
The distribution will then be available in 'distro/target/apache-atlas-0.8.1-bin'. To launch Atlas, we need to set some variables to tell it to use the local HBase and Solr instances:
  • export MANAGE_LOCAL_HBASE=true
  • export MANAGE_LOCAL_SOLR=true
Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser and go to 'http://localhost:21000/', logging on with credentials 'admin/admin'. Click on "TAGS" and create a new tag called "words_tag". Unlike for HDFS or Kafka, Atlas doesn't provide an easy way to create a Hive Entity in the UI. Instead we can use the following JSON file to create a Hive Entity for the "words" table that we are using in our example, which is based on the example given here:
You can upload it to Atlas via:
  • curl -v -H 'Accept: application/json, text/plain, */*' -H 'Content-Type: application/json;  charset=UTF-8' -u admin:admin -d @hive-create.json http://localhost:21000/api/atlas/entities
Once the new entity has been uploaded, you can search for it in the Atlas UI. When it is found, click on "+" beside "Tags" and associate the new entity with the "words_tag" tag.

2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger

To create tag based policies in Apache Ranger, we have to import the entity + tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. After building Apache Ranger, extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows:
  • Set TAG_SOURCE_ATLAS_ENABLED to "false"
  • Set TAG_SOURCE_ATLASREST_ENABLED to  "true" 
  • Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
  • Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
Save 'install.properties' and install the tagsync service via "sudo ./setup.sh". Start the Apache Ranger admin service via "sudo ranger-admin start" and then the tagsync service via "sudo ranger-tagsync-services.sh start".

3) Create Tag-based authorization policies in Apache Ranger

Now let's create a tag-based authorization policy in the Apache Ranger admin UI (http://localhost:6080). Click on "Access Manager" and then "Tag based policies". Create a new Tag service called "HiveTagService". Create a new policy for this service called "WordsTagPolicy". In the "TAG" field enter a "w" and the "words_tag" tag should pop up, meaning that it was successfully synced in from Apache Atlas. Create an "Allow" condition for the user "bob" with the "select" permissions for "Hive":
We also need to go back to the Resource based policies and edit the "cl1_hive" service that we created in the second tutorial, selecting the tag service we have created above. Once our new policy (including tags) has synced to '/etc/ranger/cl1_hive/policycache' we can test authorization in Hive. Previously, the user "bob" was denied access to the "words" table, as only "alice" was assigned a resource-based policy for the table. However, "bob" can now access the table via the tag-based authorization policy we have created:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n bob
  • select * from words where word == 'Dare';

Monday, September 11, 2017

Integrating JSON Web Tokens with Kerberos using Apache Kerby

JSON Web Tokens (JWTs) are a standard way of encapsulating a number of claims about a particular subject. Kerberos is a long-established and widely-deployed SSO protocol, used extensively in the Big-Data space in recent years. An interesting question is to examine how a JWT could be used as part of the Kerberos protocol. In this post we will consider one possible use-case, where a JWT is used to convey additional authorization information to the kerberized service provider.

This use-case is based on a document available at HADOOP-10959, called "A Complement and Short Term Solution to TokenAuth Based on Kerberos Pre-Authentication Framework", written by Kai Zheng and Weihua Jiang of Intel (also see here).

1) The test-case

To show how to integrate JWTs with Kerberos we will use a concrete test-case available in my github repo here:
  • cxf-kerberos-kerby: This project contains a number of tests that show how to use Kerberos with Apache CXF, where the KDC used in the tests is based on Apache Kerby
The test-case relevant to this blog entry is the JWTJAXRSAuthenticationTest. Here we have a trivial "double it" JAX-RS service implemented using Apache CXF, which is secured using Kerberos. An Apache Kerby-based KDC is launched, which the client code uses to obtain a service ticket via JAAS (all done transparently by CXF); the ticket is then sent to the service code as part of the Authorization header when making the invocation.

So far this is just a fairly typical example of a kerberized web-service request. What is different is that the service configuration requires a level of authorization above and beyond the kerberos ticket, by insisting that the user must have a particular role to access the web service. This is done by inserting the CXF SimpleAuthorizingInterceptor into the service interceptor chain. An authenticated user must have the "boss" role to access this service. 

So we need some way to convey the role of the user as part of the kerberized request. We can do this using a JWT, as will be explained in the next few sections.

2) High-level overview of JWT use-case with Kerberos
 
As stated above, we need to convey some additional claims about the user to the service. This can be done by including a JWT containing those claims in the Kerberos service ticket. Let's assume that the user is in possession of a JWT that is issued by an IdP that contains a number of claims relating to that user (including the "role" as required by the service in our test-case). The token must be sent to the KDC when obtaining a service ticket.

The KDC must validate the token (checking that the signature is correct, that the signing identity is trusted, etc.). The KDC must then extract some relevant information from the token and insert it somehow into the service ticket. The Kerberos spec defines a structure that can be used for this purpose called AuthorizationData, which consists of a "type" along with some data to be interpreted according to the "type". We can use this structure to insert the encoded JWT as part of the data.

On the receiving side, the service can extract the AuthorizationData structure from the received ticket, parse it to retrieve the JWT, and then obtain whatever claims are desired from the token.

3) Sending a JWT Token to the KDC

Let's take a look at how the test-case works in more detail, starting with the client. The test code retrieves a JWT for "alice" by invoking on the JAX-RS interface of the Apache CXF STS. The token contains the claim that "alice" has the "boss" role, which is required to invoke on the "double it" service. Now we need to send this token to the KDC to retrieve a service ticket for the "double it" service, with the JWT encoded in the ticket.

This cannot be done by the built-in Java GSS implementation. Instead we will use Apache Kerby. Apache Kerby has been covered extensively on this blog (see for example here). As well as providing the implementation for the KDC used in our test-case, Apache Kerby provides a complete GSS implementation that supports tokens in the forthcoming 1.1.0 release. To use the Kerby GSS implementation we need to register the KerbyGssProvider as a Java security provider.
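For example, registering the provider could look like the snippet below. The package name for KerbyGssProvider is my assumption based on the Kerby kerb-gss module; check the Kerby 1.1.0 distribution for its exact location.

import java.security.Security;

import org.apache.kerby.kerberos.kerb.gss.KerbyGssProvider;

public class GssProviderSetup {
    public static void main(String[] args) {
        // Insert the Kerby GSS provider ahead of the JDK's built-in GSS provider,
        // so that the token-aware Kerby implementation is picked up
        Security.insertProviderAt(new KerbyGssProvider(), 1);
    }
}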

To actually pass the JWT we got from the STS to the Kerby GSS layer, we need to use a custom version of the CXF HttpAuthSupplier interface. The KerbyHttpAuthSupplier implementation takes the JWT String, and creates a Kerby KrbToken class using it. This class is added to the private credential list of the current JAAS Subject. This way it will be available to the Kerby GSS layer, which will send the token to the KDC using Kerberos pre-authentication as defined in the document which is linked at the start of this post.

4) Processing the received token in the KDC

The Apache Kerby-based KDC extracts the JWT from the pre-authentication data entry and verifies its signature and that the issuer is trusted. The KDC is configured in the test-case with a certificate to use for this purpose, and also with an issuer String against which the issuer of the JWT must match. If there is an audience claim in the token, then it must match the principal of the service for which we are requesting a ticket.

If the verification of the received JWT passes, then it is inserted into the AuthorizationData structure in the issued service ticket. The type that is used is a custom value defined here, as this behaviour is not yet standardized. The JWT is serialized and added to the data part of the token. Note that this behaviour is fully customizable.

5) Processing the AuthorizationData structure on the service end

After the service successfully authenticates the client, we have to access the AuthorizationData part of the ticket to extract the JWT. This can all be done using the Java APIs; Kerby is not required on the receiving side. The standard CXF interceptor for Kerberos is subclassed in the tests, to set up a custom CXF SecurityContext using the GssContext. By casting it to an ExtendedGSSContext, we can access the AuthorizationData and hence the JWT. The role claim is then extracted from the JWT and used to enforce the standard "isUserInRole" method of the CXF SecurityContext.
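As a rough sketch, the extraction on the service side can be done with the JDK's ExtendedGSSContext API along the following lines (checking the custom entry type and deserializing the JWT are application-specific and omitted here):

import org.ietf.jgss.GSSContext;
import org.ietf.jgss.GSSException;

import com.sun.security.jgss.AuthorizationDataEntry;
import com.sun.security.jgss.ExtendedGSSContext;
import com.sun.security.jgss.InquireType;

public final class AuthzDataExtractor {

    // Return the AuthorizationData entries of an established, kerberos-authenticated
    // GSSContext, or null if the context does not expose them
    public static AuthorizationDataEntry[] getAuthzData(GSSContext context) throws GSSException {
        if (context instanceof ExtendedGSSContext) {
            ExtendedGSSContext extendedContext = (ExtendedGSSContext) context;
            return (AuthorizationDataEntry[])
                extendedContext.inquireSecContext(InquireType.KRB5_GET_AUTHZ_DATA);
        }
        return null;
    }
}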

If you are interested in exploring this topic further, please get involved with the Apache Kerby project, and help us to further improve and expand this integration between JWT and Kerberos.

Thursday, September 7, 2017

Securing Apache Hive - part III

This is the third in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. In this post we will extend the authorization scenario by showing how Apache Ranger can be used to create policies to both mask and filter data returned in the Hive query.

1) Data-masking with Apache Ranger

As a pre-requisite to this tutorial, please follow the previous post to set up Apache Hive and to enforce an authorization policy for the user "alice" using Apache Ranger. Now let's imagine that we would like "alice" to be able to see the "counts", but not the actual words themselves. We can create a data-masking policy in Apache Ranger for this. Open a browser and log in at "http://localhost:6080" using "admin/admin" and click on the "cl1_hive" service that we have created in the previous tutorial.

Click on the "Masking" tab and add a new policy called "WordMaskingPolicy", for the "default" database, "words" table and "word" column. Under the mask conditions, add the user "alice" and choose the "Redact" masking option. Save the policy and wait for it to by synced over to Apache Hive:


Now try to log in to beeline as "alice" and view the first five entries in the table:
  • bin/beeline -u jdbc:hive2://localhost:10000 -n alice
  • select * from words LIMIT 5;
You should see that the characters in the "word" column have been masked (replaced by "x"s).



2) Row-level filtering with Apache Ranger 

Now let's imagine that we are happy for "alice" to view the "words" in the table, but that we would like to restrict her to words that start with a "D". The previous "access" policy we created for her allows her to view all "words" in the table. We can do this by specifying a row-level filter policy. Click on the "Masking" tab in the UI and disable the policy we created in the previous section.

Now click on the "Row-level Filter" tab and create a new policy called "AliceFilterPolicy" on the "default" database, "words" table. Add a Row Filter condition for the user "alice" with row filter "word LIKE 'D%'". Save the policy and wait for it to by synced over to Apache Hive:


Now try to log in to beeline as "alice" as above. "alice" can successfully retrieve all entries where the words start with "D", but no other entries, via:
  • select * from words where word like 'D%';