Earlier this year, I showed how to use Talend Open Studio for Big Data
to access data stored in HDFS, where HDFS had been configured to
authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using Kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using Kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using Kerberos.
1) Download Talend Open Studio for Big Data and create a job
Download
Talend Open Studio for Big Data (6.4.1 was used for the purposes of
this tutorial). Unzip the file when it is downloaded and then start the
Studio using one of the platform-specific scripts. It will prompt you to
download some additional dependencies and to accept the licenses. Click
on "Create a new job" called "HiveKerberosRead". In the search bar under "Palette" on the right hand side enter "hive"
and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of
the screen. Do the same for "tLogRow":
"tHiveConnection" will be used to configure the connection to Hive.
"tHiveInput" will be used to perform a query on the "words" table we have created in Hive (as per the earlier tutorial linked above), and finally
"tLogRow" will just log the data so that we can be sure that it was read
correctly. The next step is to join the components up. Right click on
"tHiveConnection" and select "Trigger/On Subjob Ok" and drag the
resulting line to "tHiveInput". Right click on "tHiveInput" and select
"Row/Main" and drag the resulting line to "tLogRow":
3) Configure the components
Now let's configure the individual components. Double click on
"tHiveConnection". Select the following configuration options:
Keytab: Path to "alice.keytab" in the Kerby test project.
Unselect "Set Resource Manager"
Set Namenode URI: "hdfs://localhost:9000"
Now click on "tHiveInput" and select the following configuration options:
Select "Use an existing Connection"
Choose the tHiveConnection name from the resulting "Component List".
Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int.
Table name: words
Query: "select * from words where word == 'Dare'"
Now the only thing that remains is to point to the krb5.conf file that
is generated by the Kerby project. Click on "Window/Preferences" at the
top of the screen. Select "Talend" and "Run/Debug". Add a new JVM
argument:
"-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":
Now we are ready to run the job. Click on the "Run" tab and then hit the
"Run" button. You should see the following output in the Run Window in the Studio:
This is the sixth and final blog post in a series of articles on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. The fifth post looked at an alternative authorization solution called Apache Sentry.
In this post we will switch our attention from authorization to authentication, and show how we can authenticate Apache Hive users via Kerberos.
1) Set up a KDC using Apache Kerby
A GitHub project that uses Apache Kerby to start up a KDC is available here:
bigdata-kerberos-deployment: This project contains some tests which can be used to test Kerberos with various big data deployments, such as Apache Hadoop.
The KDC is a simple JUnit test that is available here.
To run it just comment out the "org.junit.Ignore" annotation on the
test method. It uses Apache Kerby to define the following principals for both Apache Hadoop and Apache Hive:
hdfs/localhost@hadoop.apache.org
HTTP/localhost@hadoop.apache.org
mapred/localhost@hadoop.apache.org
hiveserver2/localhost@hadoop.apache.org
alice@hadoop.apache.org
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.
2) Configure Apache Hadoop to use Kerberos
The next step is to configure Apache Hadoop to use Kerberos. As a pre-requisite, follow the first tutorial on Apache Hive so that the Hadoop data and Hive table are set up before we apply Kerberos to the mix. Next, follow the steps in section (2) of an earlier tutorial on configuring Hadoop with Kerberos that I wrote. Some additional steps are also required when configuring Hadoop for use with Hive.
Edit 'etc/hadoop/core-site.xml' and add:
hadoop.proxyuser.hiveserver2.groups: *
hadoop.proxyuser.hiveserver2.hosts: localhost
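In XML form, these additions to 'etc/hadoop/core-site.xml' look like this:
<property>
  <name>hadoop.proxyuser.hiveserver2.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hiveserver2.hosts</name>
  <value>localhost</value>
</property>
These proxy-user settings allow the "hiveserver2" service principal to impersonate the end users (such as "alice") who connect to HiveServer2.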
The previous tutorial on securing HDFS with Kerberos did not specify any Kerberos configuration for MapReduce, as it was not required. For Apache Hive we need to configure MapReduce appropriately. We will simplify things by using a single principal for the Job Tracker, Task Tracker and Job History. Create a new file 'etc/hadoop/mapred-site.xml' with the following properties:
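A rough sketch of what this file might contain is shown below, using the "mapred" principal defined above and assuming a "mapred.keytab" file in the Kerby "target" directory (the exact property names can vary between Hadoop versions, so check them against your release):
<property>
  <name>mapreduce.jobtracker.kerberos.principal</name>
  <value>mapred/localhost@hadoop.apache.org</value>
</property>
<!-- The keytab file name is an assumption - use the keytab generated by the Kerby project -->
<property>
  <name>mapreduce.jobtracker.keytab.file</name>
  <value>/path.to.kerby.project/target/mapred.keytab</value>
</property>
<property>
  <name>mapreduce.tasktracker.kerberos.principal</name>
  <value>mapred/localhost@hadoop.apache.org</value>
</property>
<property>
  <name>mapreduce.tasktracker.keytab.file</name>
  <value>/path.to.kerby.project/target/mapred.keytab</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>mapred/localhost@hadoop.apache.org</value>
</property>
<property>
  <name>mapreduce.jobhistory.keytab</name>
  <value>/path.to.kerby.project/target/mapred.keytab</value>
</property>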
This is the fifth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter
data returned in the Hive query. The fourth post looked at how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In this post we will look at an alternative authorization solution called Apache Sentry.
1) Build the Apache Sentry distribution
First we will build and install the Apache Sentry distribution. Download
Apache Sentry (1.8.0 was used for the purposes of this tutorial).
Verify that the signature is valid and that the message digests match.
Now extract and build the source and copy the distribution to a location
where you wish to install it:
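The following is a sketch of these steps for Sentry 1.8.0 (the exact name of the binary distribution under 'sentry-dist/target' may differ slightly):
tar zxvf apache-sentry-1.8.0-src.tar.gz
cd apache-sentry-1.8.0-src
mvn clean install -DskipTests
cp -r sentry-dist/target/apache-sentry-1.8.0-bin ${sentry.home}
Here '${sentry.home}' is just a placeholder for wherever you want to install the distribution.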
I previously covered the authorization plugin that Apache Sentry provides for Apache Kafka. In addition, Apache Sentry provides an authorization plugin for Apache Hive. For the purposes of this
tutorial we will just configure the authorization privileges in a configuration file stored locally on the Hive Server. Therefore we don't need to do any further configuration of the distribution at this point.
2) Install and configure Apache Hive
Please follow the first tutorial to install and configure Apache Hadoop if you have not already done so. Apache Sentry 1.8.0 does not support Apache Hive 2.1.x, so we will need to download
and extract Apache Hive 2.0.1. Set the "HADOOP_HOME" environment variable to point to the
Apache Hadoop installation directory above. Then follow the steps as outlined in the first tutorial to create the table in Hive and make sure that a query is successful.
3) Integrate Apache Sentry with Apache Hive
Now we will integrate Apache Sentry with Apache Hive. We need to add three new configuration files to the "conf" directory of Apache Hive.
3.a) Configure Apache Hive to use authorization
Create a file called 'conf/hiveserver2-site.xml' with the content:
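A minimal sketch of this file is shown below. The authorization manager class name is an assumption based on the Sentry Hive "v2" binding and may differ between Sentry releases, so verify it against the jars in your distribution:
<configuration>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>true</value>
  </property>
  <!-- Assumed class name from the sentry-binding-hive-v2 module -->
  <property>
    <name>hive.security.authorization.manager</name>
    <value>org.apache.sentry.binding.hive.v2.SentryAuthorizerFactory</value>
  </property>
  <property>
    <name>hive.security.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
  </property>
  <property>
    <name>hive.sentry.conf.url</name>
    <value>file:./conf/sentry-site.xml</value>
  </property>
</configuration>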
Here we are enabling authorization and adding the Sentry authorization plugin.
3.b) Add Sentry plugin configuration
Create a new file in the "conf" directory of Apache Hive called "sentry-site.xml" with the following content:
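A rough sketch of this file is shown below; the property and class names are assumptions that have changed between Sentry releases, so check them against your version:
<configuration>
  <!-- Property names below are assumptions - verify against your Sentry release -->
  <property>
    <name>sentry.hive.provider</name>
    <value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
  </property>
  <property>
    <name>sentry.hive.provider.backend</name>
    <value>org.apache.sentry.provider.file.SimpleFileProviderBackend</value>
  </property>
  <property>
    <name>sentry.hive.provider.resource</name>
    <value>file:./conf/sentry.ini</value>
  </property>
  <property>
    <name>sentry.hive.server</name>
    <value>server1</value>
  </property>
  <property>
    <name>sentry.hive.testing.mode</name>
    <value>true</value>
  </property>
</configuration>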
This is the configuration file for the Sentry plugin for Hive. It
essentially says that the authorization privileges are stored in a local
file, and that the groups for authenticated users should be retrieved
from this file. As we are not using Kerberos, the "testing.mode" configuration parameter must be set to "true".
3.c) Add the authorization privileges for our test-case
Next, we need to specify the authorization privileges. Create a new file in the "conf" directory of Apache Hive called "sentry.ini" with the following content:
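A sketch of the policy file is shown below. The group and role names are arbitrary placeholders, and "server1" is an assumption that must match the server name the Hive binding is configured with:
[users]
alice = select_group
[groups]
select_group = select_role
[roles]
select_role = server=server1->db=default->table=words->action=select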
Here we are granting the user "alice" a role which allows her to perform a "select" on the table "words".
3.d) Add Sentry libraries to Hive
Finally, we need to add the Sentry libraries to Hive. Copy the following files from ${sentry.home}/lib to ${hive.home}/lib:
sentry-binding-hive-common-1.8.0.jar
sentry-core-model-db-1.8.0.jar
sentry*provider*.jar
sentry-core-common-1.8.0.jar
shiro-core-1.2.3.jar
sentry-policy*.jar
sentry-service-*.jar
In addition we need the "sentry-binding-hive-v2-1.8.0.jar" which is not bundled with the Apache Sentry distribution. This can be obtained from "http://repo1.maven.org/maven2/org/apache/sentry/sentry-binding-hive-v2/1.8.0/sentry-binding-hive-v2-1.8.0.jar" instead.
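A sketch of this step is shown below, where '${sentry.home}' and '${hive.home}' are placeholders for the two installation directories:
cp ${sentry.home}/lib/sentry-binding-hive-common-1.8.0.jar ${hive.home}/lib
cp ${sentry.home}/lib/sentry-core-model-db-1.8.0.jar ${hive.home}/lib
cp ${sentry.home}/lib/sentry*provider*.jar ${hive.home}/lib
cp ${sentry.home}/lib/sentry-core-common-1.8.0.jar ${hive.home}/lib
cp ${sentry.home}/lib/shiro-core-1.2.3.jar ${hive.home}/lib
cp ${sentry.home}/lib/sentry-policy*.jar ${hive.home}/lib
cp ${sentry.home}/lib/sentry-service-*.jar ${hive.home}/lib
wget http://repo1.maven.org/maven2/org/apache/sentry/sentry-binding-hive-v2/1.8.0/sentry-binding-hive-v2-1.8.0.jar -P ${hive.home}/lib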
4) Test authorization with Apache Hive
Now we
can test authorization after restarting Apache Hive. The user 'alice' can query the table
according to our policy:
bin/beeline -u jdbc:hive2://localhost:10000 -n alice
select * from words where word == 'Dare'; (works)
However, the user 'bob' is denied access:
bin/beeline -u jdbc:hive2://localhost:10000 -n bob
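For example, the same query that succeeded for "alice" should now fail with an authorization error:
select * from words where word == 'Dare'; (access denied)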
This is the fourth in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. The third post looked at how to use Apache Ranger to create policies to both mask and filter
data returned in the Hive query.
In this post we will show how Apache Ranger can create "tag" based authorization policies for Apache Hive using Apache Atlas. In the second post, we showed how to create a "resource" based policy for "alice" in Ranger, by granting "alice" the "select" permission for the "words" table. Instead, we can grant a user "bob" the "select" permission for a given "tag", which is synced into Ranger from Apache Atlas. This means that we can avoid managing specific resources in Ranger itself.
1) Start Apache Atlas and create entities/tags for Hive
First let's look at setting up Apache Atlas. Download
the latest released version (0.8.1) and extract it. Build
the distribution that contains an embedded HBase and Solr instance via:
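Assuming you are working in the extracted Atlas source directory, the build command looks something like this:
mvn clean package -DskipTests -Pdist,embedded-hbase-solr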
The distribution will then be available in
'distro/target/apache-atlas-0.8.1-bin'. To launch Atlas, we
need to set some variables to tell it to use the local HBase and Solr
instances:
export MANAGE_LOCAL_HBASE=true
export MANAGE_LOCAL_SOLR=true
Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser
and go to 'http://localhost:21000/', logging on with credentials
'admin/admin'. Click on "TAGS" and create a new tag called
"words_tag". Unlike for HDFS or Kafka, Atlas doesn't provide an
easy way to create a Hive Entity in the UI. Instead we can use the following JSON file to create a Hive Entity for the "words" table that we are using in our example, which is based on the example given here:
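Assuming you have saved the entity definition locally (the file name "words-hive-entity.json" is just a placeholder), it can be uploaded via the Atlas REST API, for example:
curl -v -u admin:admin -H "Content-Type: application/json" -d @words-hive-entity.json http://localhost:21000/api/atlas/entities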
Once the new entity has been uploaded, you can search for it in the Atlas UI. Once it is found, click on "+" beside "Tags" and associate the new entity with the "words_tag" tag.
2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger
To create tag-based policies in Apache Ranger, we have to import the entity + tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. After building Apache Ranger, extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows (a consolidated snippet is shown after this list):
Set TAG_SOURCE_ATLAS_ENABLED to "false"
Set TAG_SOURCE_ATLASREST_ENABLED to "true"
Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
Save 'install.properties' and install the tagsync service via "sudo
./setup.sh". Start the Apache Ranger admin service via "sudo
ranger-admin start" and then the tagsync service via "sudo
ranger-tagsync-services.sh
start".
3) Create Tag-based authorization policies in Apache Ranger
Now let's create a tag-based authorization policy in the Apache Ranger
admin UI (http://localhost:6080). Click on "Access Manager" and then "Tag based policies".
Create a new Tag service called "HiveTagService". Create a new policy
for this service called "WordsTagPolicy". In the "TAG" field enter a
"w" and the "words_tag" tag should pop up, meaning that it was successfully
synced in from Apache Atlas. Create an "Allow" condition for the user
"bob" with the "select" permissions for "Hive":
We also need to go back to the Resource based
policies and edit "cl1_hive" that we created in the second tutorial, and select the tag service we have
created above. Once our new policy (including tags) has synced to '/etc/ranger/cl1_hive/policycache' we
can test authorization in Hive. Previously, the user "bob" was denied access to the "words" table, as only "alice" was assigned a resource-based policy for the table. However, "bob" can now access the table via the tag-based authorization policy we have created:
bin/beeline -u jdbc:hive2://localhost:10000 -n bob
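For example, the query used earlier in this series should now succeed for "bob":
select * from words where word == 'Dare';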
JSON Web Tokens (JWTs) are a standard way of encapsulating a number of claims about a particular subject. Kerberos is a long-established and widely-deployed SSO protocol, used extensively in the Big-Data space in recent years. An interesting question is to examine how a JWT could be used as part of the Kerberos protocol. In this post we will consider one possible use-case, where a JWT is used to convey additional authorization information to the kerberized service provider.
This use-case is based on a document available at HADOOP-10959, called "A Complement and Short Term Solution to TokenAuth Based on Kerberos Pre-Authentication Framework", written by Kai Zheng and Weihua Jiang of Intel (also see here).
1) The test-case
To show how to integrate JWTs with Kerberos we will use a concrete test-case available in my GitHub repo here:
cxf-kerberos-kerby: This project contains a number of tests that show how to use Kerberos with
Apache CXF, where the KDC used in the tests is based on Apache Kerby
The test-case relevant to this blog entry is the JWTJAXRSAuthenticationTest. Here we have a trivial "double it" JAX-RS service implemented using Apache CXF, which is secured using Kerberos. An Apache Kerby-based KDC is launched which the client code uses to obtain a service ticket using JAAS (all done transparently by CXF), which is sent to the service code as part of the Authorization header when making the invocation. So far this is just a fairly typical example of a kerberized web-service request. What is different is that the service configuration requires a level of authorization above and beyond the kerberos ticket, by insisting that the user must have a particular role to access the web service. This is done by inserting the CXF SimpleAuthorizingInterceptor into the service interceptor chain. An authenticated user must have the "boss" role to access this service. So we need somehow to convey the role of the user as part of the kerberized request. We can do this using a JWT as will be explained in the next few sections.
2) High-level overview of JWT use-case with Kerberos
As stated above, we need to convey some additional claims about the user to the service. This can be done by including a JWT containing those claims in the Kerberos service ticket. Let's assume that the user is in possession of a JWT that is issued by an IdP that contains a number of claims relating to that user (including the "role" as required by the service in our test-case). The token must be sent to the KDC when obtaining a service ticket.
The KDC must validate the token (checking the signature is correct, and that the signing identity is trusted, etc.). The KDC must then extract some relevant information from the token and insert it somehow into the service ticket. The Kerberos spec defines a structure that can be used for this purpose called AuthorizationData, which consists of a "type" along with some data to be interpreted according to the "type". We can use this structure to insert the encoded JWT as part of the data.
On the receiving side, the service can extract the AuthorizationData structure from the received ticket and parse it accordingly to retrieve the JWT, and obtain whatever claims are desired from this token accordingly.
3) Sending a JWT Token to the KDC
Let's take a look at how the test-case works in more detail, starting with the client. The test code retrieves a JWT for "alice" by invoking on the JAX-RS interface of the Apache CXF STS. The token contains the claim that "alice" has the "boss" role, which is required to invoke on the "double it" service. Now we need to send this token to the KDC to retrieve a service ticket for the "double it" service, with the JWT encoded in the ticket. This cannot be done by the built-in Java GSS implementation. Instead we will use Apache Kerby. Apache Kerby has been covered extensively on this blog (see for example here). As well as providing the implementation for the KDC used in our test-case, Apache Kerby provides a complete GSS implementation that supports tokens in the forthcoming 1.1.0 release. To use the Kerby GSS implementation we need to register the KerbyGssProvider as a Java security provider. To actually pass the JWT we got from the STS to the Kerby GSS layer, we need to use a custom version of the CXF HttpAuthSupplier interface. The KerbyHttpAuthSupplier implementation takes the JWT String, and creates a Kerby KrbToken class using it. This class is added to the private credential list of the current JAAS Subject. This way it will be available to the Kerby GSS layer, which will send the token to the KDC using Kerberos pre-authentication as defined in the document which is linked at the start of this post.
4) Processing the received token in the KDC
The Apache Kerby-based KDC extracts the JWT token from the pre-authentication data entry and verifies that it is signed and that the issuer is trusted. The KDC is configured in the test-case with a certificate to use for this purpose, and also with an issuer String against which the issuer of the JWT must match. If there is an audience claim in the token, then it must match the principal of the service for which we are requesting a ticket. If the verification of the received JWT passes, then it is inserted into the AuthorizationData structure in the issued service ticket. The type that is used is a custom value defined here, as this behaviour is not yet standardized. The JWT is serialized and added to the data part of the token. Note that this behaviour is fully customizable.
5) Processing the AuthorizationData structure on the service end
After the service successfully authenticates the client, we have to access the AuthorizationData part of the ticket to extract the JWT. This can all be done using the standard Java APIs; Kerby is not required on the receiving side. The standard CXF interceptor for Kerberos is subclassed in the tests, to set up a custom CXF SecurityContext using the GssContext. By casting it to an ExtendedGSSContext, we can access the AuthorizationData and hence the JWT. The role claim is then extracted from the JWT and used to enforce the standard "isUserInRole" method of the CXF SecurityContext.
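As a rough illustration (not the actual test code), extracting the JWT from an established acceptor-side GSSContext using the standard Java APIs might look something like the sketch below, where JWT_AUTHZ_DATA_TYPE is a placeholder that must match the custom authorization-data type inserted by the KDC:

import java.nio.charset.StandardCharsets;

import org.ietf.jgss.GSSContext;
import org.ietf.jgss.GSSException;

import com.sun.security.jgss.AuthorizationDataEntry;
import com.sun.security.jgss.ExtendedGSSContext;
import com.sun.security.jgss.InquireType;

public final class JwtAuthzDataExtractor {

    // Placeholder: must match the (custom, non-standardized) authorization-data
    // type that the KDC uses when inserting the JWT into the service ticket.
    private static final int JWT_AUTHZ_DATA_TYPE = 81;

    private JwtAuthzDataExtractor() {
    }

    // Returns the serialized JWT from the ticket's AuthorizationData, or null if not found.
    public static String extractJwt(GSSContext context) throws GSSException {
        if (!(context instanceof ExtendedGSSContext)) {
            return null;
        }
        ExtendedGSSContext extContext = (ExtendedGSSContext) context;
        AuthorizationDataEntry[] authzEntries = (AuthorizationDataEntry[])
            extContext.inquireSecContext(InquireType.KRB5_GET_AUTHZ_DATA);
        if (authzEntries == null) {
            return null;
        }
        for (AuthorizationDataEntry entry : authzEntries) {
            if (entry.getType() == JWT_AUTHZ_DATA_TYPE) {
                return new String(entry.getData(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }
}

The returned String can then be parsed as a JWT and its role claim used when the "isUserInRole" check is made.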
If you are interested in exploring this topic further, please get involved with the Apache Kerby project, and help us to further improve and expand this integration between JWT and Kerberos.
This is the third in a series of blog posts on securing Apache Hive. The first post looked at installing Apache Hive and doing some queries on data stored in HDFS. The second post looked at how to add authorization to the previous tutorial using Apache Ranger. In this post we will extend the authorization scenario by showing how Apache Ranger can be used to create policies to both mask and filter data returned in the Hive query.
1) Data-masking with Apache Ranger
As a pre-requisite to this tutorial, please follow the previous post to set up Apache Hive and to enforce an authorization policy for the user "alice" using Apache Ranger. Now let's imagine that we would like "alice" to be able to see the "counts", but not the actual words themselves. We can create a data-masking policy in Apache Ranger for this. Open a browser and log in at "http://localhost:6080" using "admin/admin" and click on the "cl1_hive" service that we have created in the previous tutorial.
Click on the "Masking" tab and add a new policy called "WordMaskingPolicy", for the "default" database, "words" table and "word" column. Under the mask conditions, add the user "alice" and choose the "Redact" masking option. Save the policy and wait for it to by synced over to Apache Hive:
Now try to login to beeline as "alice" and view the first five entries in the table:
bin/beeline -u jdbc:hive2://localhost:10000 -n alice
select * from words LIMIT 5;
You should see that the characters in the "word" column have been masked (replaced by "x"s).
2) Row-level filtering with Apache Ranger
Now let's imagine that we are happy for "alice" to view the "words" in the table, but that we would like to restrict her to words that start with a "D". The previous "access" policy we created for her allows her to view all "words" in the table. We can do this by specifying a row-level filter policy. Click on the "Masking" tab in the UI and disable the policy we created in the previous section.
Now click on the "Row-level Filter" tab and create a new policy called "AliceFilterPolicy" on the "default" database, "words" table. Add a Row Filter condition for the user "alice" with the row filter "word LIKE 'D%'". Save the policy and wait for it to be synced over to Apache Hive:
Now try to log in to beeline as "alice" as above. "alice" can successfully retrieve all entries where the words start with "D", but no other entries, via:
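For example, an unrestricted query should now return only the rows whose "word" starts with "D":
select * from words;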