Friday, May 26, 2017

Securing Apache Storm - part I

This is the first tutorial in a planned three-part series on securing Apache Storm. In this post we will look at setting up a simple Storm cluster that authenticates users via Kerberos, and how to run a simple topology on it. Future posts will cover authorization using Apache Ranger. For more information on how to set up Kerberos for Apache Storm, please see the following documentation.

1) Set up a KDC using Apache Kerby

As with other Kerberos-related tutorials that I have written on this blog, we will use a GitHub project I wrote that uses Apache Kerby to start up a KDC:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
  • zookeeper/localhost@storm.apache.org
  • zookeeper-client@storm.apache.org
  • storm/localhost@storm.apache.org
  • storm-client@storm.apache.org
  • alice@storm.apache.org
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.

2) Download and configure Apache Zookeeper

Apache Storm uses Apache Zookeeper to help coordinate the cluster. Download Apache Zookeeper (this tutorial used 3.4.10) and extract it to a local directory. Configure Zookeeper to use Kerberos by adding a new file 'conf/zoo.cfg' with the following properties:
  • dataDir=/tmp/zookeeper
  • clientPort=2181
  • authProvider.1 = org.apache.zookeeper.server.auth.SASLAuthenticationProvider
  • requireClientAuthScheme=sasl 
  • jaasLoginRenew=3600000 
Now create 'conf/zookeeper.jaas' with the following content:

Server {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper.keytab" storeKey=true principal="zookeeper/localhost";
};

Before launching Zookeeper, we need to point to the JAAS configuration file above and also to the krb5.conf file generated in the Kerby test-case above. Add a new file 'conf/java.env' that exports the SERVER_JVMFLAGS property with the following values:
  • -Djava.security.auth.login.config=/path.to.zookeeper/conf/zookeeper.jaas
  • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
Start Zookeeper via:
  • bin/zkServer.sh start
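
As a rough sketch of the java.env step described above: the Zookeeper start scripts source 'conf/java.env' if it exists, so the file only needs to export SERVER_JVMFLAGS with both JVM arguments on one line. The ZK_HOME variable is an assumption introduced here for illustration (it defaults to the current directory), and the '/path.to...' values are the same placeholders used in this post.

```shell
# Assumption: ZK_HOME is a hypothetical variable pointing at the extracted
# Zookeeper directory; it defaults to the current directory.
ZK_HOME="${ZK_HOME:-.}"
mkdir -p "$ZK_HOME/conf"

# Write conf/java.env. The quoted 'EOF' keeps the content literal.
cat > "$ZK_HOME/conf/java.env" <<'EOF'
export SERVER_JVMFLAGS="-Djava.security.auth.login.config=/path.to.zookeeper/conf/zookeeper.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"
EOF

# Show the result before starting Zookeeper.
cat "$ZK_HOME/conf/java.env"
```

Run this before 'bin/zkServer.sh start', replacing the placeholder paths with your own.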
3) Download and configure Apache Storm

Now download and extract the Apache Storm distribution (1.1.0 was used in this tutorial). Edit 'conf/storm.yaml' and set the following properties:
  • For "storm.zookeeper.servers" add "- localhost"
  • nimbus.seeds: ["localhost"]
  • storm.thrift.transport: "org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin"
  • java.security.auth.login.config: "/path.to.storm/conf/storm.jaas"
  • storm.zookeeper.superACL: "sasl:storm"
  • nimbus.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf" 
  • ui.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf" 
  • supervisor.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"
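
Put together, the properties above might be appended to 'conf/storm.yaml' as in the following sketch. It assumes that the distribution's storm.yaml has these keys commented out (otherwise edit the existing entries instead of appending). STORM_HOME and KRB_OPTS are hypothetical helper variables introduced only for this example.

```shell
# Assumption: STORM_HOME is a hypothetical variable pointing at the extracted
# Storm directory; it defaults to the current directory.
STORM_HOME="${STORM_HOME:-.}"
mkdir -p "$STORM_HOME/conf"

# JVM flags shared by nimbus, the UI and the supervisor (placeholder paths).
KRB_OPTS="-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"

# Append the settings. The unquoted EOF lets $KRB_OPTS expand.
cat >> "$STORM_HOME/conf/storm.yaml" <<EOF
storm.zookeeper.servers:
  - "localhost"
nimbus.seeds: ["localhost"]
storm.thrift.transport: "org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin"
java.security.auth.login.config: "/path.to.storm/conf/storm.jaas"
storm.zookeeper.superACL: "sasl:storm"
nimbus.childopts: "$KRB_OPTS"
ui.childopts: "$KRB_OPTS"
supervisor.childopts: "$KRB_OPTS"
EOF
```

Keeping the shared flags in one variable avoids the three childopts entries drifting apart when a path changes.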
Create a file called 'conf/storm.jaas' with the content:

Client {
    com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper_client.keytab" storeKey=true principal="zookeeper-client";
};

StormClient {  
    com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/storm_client.keytab" storeKey=true principal="storm-client" serviceName="storm";
};

StormServer {
    com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/storm.keytab" storeKey=true principal="storm/localhost@storm.apache.org";
};

'Client' is used to communicate with Zookeeper, 'StormClient' is used by the supervisor nodes and 'StormServer' is used by nimbus. Now start Nimbus and a supervisor node via:
  • bin/storm nimbus
  • bin/storm supervisor
4) Deploy a Topology

Now that the Storm cluster is up and running, the next task is to deploy a Topology to it. For this we will need to use another Storm distribution, so extract Storm again to another directory. Edit 'conf/storm.yaml' and set the following properties:
  • For "storm.zookeeper.servers" add "- localhost"
  • nimbus.seeds: ["localhost"]
  • storm.thrift.transport: "org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin"
  • java.security.auth.login.config: "/path.to.storm.client/conf/storm.jaas"
Create a file called 'conf/storm.jaas' with the content:

StormClient {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useTicketCache=true serviceName="storm";
};

Note that we are not using keytabs here, but instead a ticket cache. Now edit 'conf/storm_env.ini' and add:
  • STORM_JAR_JVM_OPTS:-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
Now that we have everything set up, it's time to deploy a topology to our cluster. I have a simple Storm topology that wires a WordSpout + WordCounterBolt into a topology that can be used for this in github here. Check this project out from github + build it via "mvn assembly:assembly". We will need a Kerberos ticket stored in our ticket cache to deploy the job:
  • export KRB5_CONFIG=/path.to.kerby.project/target/krb5.conf
  • kinit -k -t /path.to.kerby.project/target/alice.keytab alice
Finally we can submit our topology:
  • bin/storm jar /path.to.storm.project/target/bigdata-storm-demo-1.0-jar-with-dependencies.jar  org.apache.coheigea.bigdata.storm.StormMain /path.to.storm.project/target/test-classes/words.txt
If you take a look at the logs in the nimbus distribution you should see that the topology has run correctly, e.g. 'logs/workers-artifacts/mytopology-1-1495813912/6700/worker.log'.

Tuesday, May 23, 2017

Configuring Kerberos for Kafka in Talend Open Studio for Big Data

A recent blog post showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. In this post we will follow a similar setup, to see how to create a job in Talend Open Studio for Big Data to read data from an Apache Kafka topic using kerberos.

1) Kafka setup

Follow a recent tutorial to set up an Apache Kerby based KDC testcase and to configure Apache Kafka to require kerberos for authentication. Create a "test" topic and write some data to it, and verify with the command-line consumer that the data can be read correctly.

2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "KafkaKerberosRead". 
In the search bar under "Palette" on the right hand side enter "kafka" and hit enter. Drag "tKafkaConnection" and "tKafkaInput" to the middle of the screen. Do the same for "tLogRow":
We now have all the components we need to read data from the Kafka topic. "tKafkaConnection" will be used to configure the connection to Kafka. "tKafkaInput" will be used to read the data from the "test" topic, and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tKafkaConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tKafkaInput". Right click on "tKafkaInput" and select "Row/Main" and drag the resulting line to "tLogRow":

3) Configure the components

Now let's configure the individual components. Double click on "tKafkaConnection". If a message appears that informs you that you need to install additional jars, then click on "Install". Select the version of Kafka that corresponds to the version you are using (if it doesn't match then select the most recent version). For the "Zookeeper quorum list" property enter "localhost:2181". For the "broker list" property enter "localhost:9092".

Now we will configure the kerberos related properties of "tKafkaConnection". Select the "Use kerberos authentication" checkbox and some additional configuration properties will appear. For "JAAS configuration path" you need to enter the path of the "client.jaas" file as described in the tutorial to set up the Kafka test-case. You can leave "Kafka brokers principal name" property as the default value ("kafka"). Finally, select the "Set kerberos configuration path" property and enter the path of the "krb5.conf" file supplied in the target directory of the Apache Kerby test-case.



Now click on "tKafkaInput". Select the checkbox for "Use an existing connection" + select the "tKafkaConnection" component in the resulting component list. For "topic name" specify "test". The "Consumer group id" can stay as the default "mygroup".

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. Send some data via the producer to the "test" topic and you should see the data appear in the Run Window in the Studio.

Monday, May 22, 2017

Security advisories issued for Apache CXF Fediz

Two security advisories were recently issued for Apache CXF Fediz. In addition to fixing these issues, the recent releases of Fediz impose tighter security constraints in some areas by default compared to older releases. In this post I will document the advisories and the other security-related changes in the recent Fediz releases.

1) Security Advisories

The first security advisory is CVE-2017-7661: "The Apache CXF Fediz Jetty and Spring plugins are vulnerable to CSRF attacks.". Essentially, both the Jetty 8/9 and Spring Security 2/3 plugins are subject to a CSRF-style vulnerability when the user doesn't complete the authentication process. In addition, the Jetty plugins are vulnerable even if the user does first complete the authentication process, but only the root context is available as part of this attack.

The second advisory is CVE-2017-7662: "The Apache CXF Fediz OIDC Client Registration Service is vulnerable to CSRF attacks". The OIDC client registration service is a simple web application that allows the creation of clients for OpenId Connect, as well as a number of other administrative tasks. It is vulnerable to CSRF attacks, where a malicious application could take advantage of an existing session to make changes to the OpenId Connect clients that are stored in the IdP.

2) Fediz IdP security constraints

This section only concerns the WS-Federation (and SAML-SSO) IdP in Fediz. The WS-Federation RP application sends its address via the 'wreply' parameter to the IdP. For SAML SSO, the address to reply to is taken from the consumer service URL of the SAML SSO Request. Previously, the Apache CXF Fediz IdP contained an optional 'passiveRequestorEndpointConstraint' configuration value in the 'ApplicationEntity', which allows the admin to specify a regular expression constraint on the 'wreply' URL.

From Fediz 1.4.0, 1.3.2 and 1.2.4, a new configuration option is available in the 'ApplicationEntity' called 'passiveRequestorEndpoint'. If specified, this is directly matched against the 'wreply' parameter. In a change that breaks backwards compatibility, but that is necessary for security reasons, one of 'passiveRequestorEndpointConstraint' or 'passiveRequestorEndpoint' must be specified in the 'ApplicationEntity' configuration. This ensures that the user cannot be redirected to a malicious client. Similarly, new configuration options are available called 'logoutEndpoint' and 'logoutEndpointConstraint' which validate the 'wreply' parameter in the case of redirecting the user after logging out, one of which must be specified.

3) Fediz RP security constraints

This section only concerns the WS-Federation RP plugins available in Fediz. When the user tries to log out of the Fediz RP application, a 'wreply' parameter can be specified to give the address that the Fediz IdP can redirect to after logout is complete. The old functionality was that if 'wreply' was not specified, then the RP plugin instead used the value from the 'logoutRedirectTo' configuration parameter.

From Fediz 1.4.0, 1.3.2 and 1.2.4, a new configuration option is available called 'logoutRedirectToConstraint'. If a 'wreply' parameter is presented, then it must match the regular expression that is specified for 'logoutRedirectToConstraint', otherwise the 'wreply' value is ignored and it falls back to 'logoutRedirectTo'. 
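
The fallback rule just described can be sketched in a few lines of shell. This is only an illustration of the decision, not the Fediz code (which is written in Java): the logout_target function, its arguments and the example URLs are all hypothetical, and it assumes the constraint has to match the whole 'wreply' value.

```shell
# Decide where to redirect after logout, following the rule described above.
# logout_target is a hypothetical helper for illustration only.
#   $1 = 'wreply' from the logout request (may be empty)
#   $2 = logoutRedirectToConstraint (a regular expression)
#   $3 = logoutRedirectTo (the configured fallback)
logout_target() {
    wreply=$1
    constraint=$2
    fallback=$3
    # grep -x only succeeds if the whole value matches the expression
    # (assumption: the constraint is applied as a full match).
    if [ -n "$wreply" ] && printf '%s\n' "$wreply" | grep -Eqx -- "$constraint"; then
        printf '%s\n' "$wreply"
    else
        printf '%s\n' "$fallback"
    fi
}

# A 'wreply' that satisfies the constraint is used as-is.
logout_target "https://localhost:8443/app/bye" "https://localhost:8443/.*" "https://localhost:8443/app/index.html"

# A 'wreply' pointing elsewhere is ignored in favour of logoutRedirectTo.
logout_target "https://evil.example.com/" "https://localhost:8443/.*" "https://localhost:8443/app/index.html"
```

Matching the whole value matters here: a constraint that is only searched for somewhere inside the URL could be satisfied by an attacker embedding the trusted address in a query string.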

Thursday, May 18, 2017

Configuring Kerberos for HDFS in Talend Open Studio for Big Data

A recent series of blog posts showed how to install and configure Apache Hadoop as a single node cluster, and how to authenticate users via Kerberos and authorize them via Apache Ranger. Interacting with HDFS via the command line tools as shown in the article is convenient but limited. Talend offers a freely-available product called Talend Open Studio for Big Data which you can use to interact with HDFS instead (and many other components as well). In this article we will show how to access data stored in HDFS that is secured with Kerberos as per the previous tutorials.

1) HDFS setup

To begin with please follow the first tutorial to install Hadoop and to store the LICENSE.txt in a '/data' folder. Then follow the fifth tutorial to set up an Apache Kerby based KDC testcase and configure HDFS to authenticate users via Kerberos. To test everything is working correctly on the command line do:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/hadoop fs -cat /data/LICENSE.txt
2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "HDFSKerberosRead". In the search bar under "Palette" on the right hand side enter "tHDFS" and hit enter. Drag "tHDFSConnection" and "tHDFSInput" to the middle of the screen. Do the same for "tLogRow":
We now have all the components we need to read data from HDFS. "tHDFSConnection" will be used to configure the connection to Hadoop. "tHDFSInput" will be used to read the data from "/data" and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHDFSConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tHDFSInput". Right click on "tHDFSInput" and select "Row/Main" and drag the resulting line to "tLogRow":
3) Configure the components

Now let's configure the individual components. Double click on "tHDFSConnection". For the "version", select the "Hortonworks" Distribution with version HDP V2.5.0 (we are using the original Apache distribution as part of this tutorial, but it suffices to select Hortonworks here). Under "Authentication" tick the checkbox called "Use kerberos authentication". For the Namenode principal specify "hdfs/localhost@hadoop.apache.org". Select the checkbox marked "Use a keytab to authenticate". Select "alice" as the principal and "<path.to.kerby.project>/target/alice.keytab" as the "Keytab":
Now click on "tHDFSInput". Select the checkbox for "Use an existing connection" + select the "tHDFSConnection" component in the resulting component list. For "File Name" specify the file we want to read: "/data/LICENSE.txt":
Now click on "Edit schema" and hit the "+" button. This will create a "newColumn" column of type "String". We can leave this as it is, because we are not doing anything with the data other than logging it. Save the job. Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. If everything is working correctly, you should see the contents of "/data/LICENSE.txt" displayed in the Run window.

Monday, May 15, 2017

Securing Apache Kafka with Kerberos

Last year, I wrote a series of blog articles based on securing Apache Kafka. The articles covered how to secure access to the Apache Kafka broker using TLS client authentication, and how to implement authorization policies using Apache Ranger and Apache Sentry. Recently I wrote another article giving a practical demonstration of how to secure HDFS using Kerberos. In this post I will look at how to secure Apache Kafka using Kerberos, using a test-case based on Apache Kerby. For more information on securing Kafka with kerberos, see the Kafka security documentation.

1) Set up a KDC using Apache Kerby

A github project that uses Apache Kerby to start up a KDC is available here:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
  • zookeeper/localhost@kafka.apache.org
  • kafka/localhost@kafka.apache.org
  • client@kafka.apache.org
Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.

2) Configure Apache Zookeeper

Download Apache Kafka and extract it (0.10.2.1 was used for the purposes of this tutorial). Edit 'config/zookeeper.properties' and add the following properties:
  • authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
  • requireClientAuthScheme=sasl 
  • jaasLoginRenew=3600000
Now create 'config/zookeeper.jaas' with the following content:

Server {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper.keytab" storeKey=true principal="zookeeper/localhost";
};

Before launching Zookeeper, we need to point to the JAAS configuration file above and also to the krb5.conf file generated in the Kerby test-case above. This can be done by setting the "KAFKA_OPTS" environment variable with the JVM arguments:
  • -Djava.security.auth.login.config=/path.to.zookeeper/config/zookeeper.jaas 
  • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
Now start Zookeeper via:
  • bin/zookeeper-server-start.sh config/zookeeper.properties 
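
For the KAFKA_OPTS step above, a minimal sketch is to export the variable in the same terminal that will run the start script, with both JVM arguments in a single string. The paths are the placeholders used in this post.

```shell
# KAFKA_OPTS is picked up by the Kafka start scripts and passed to the JVM.
# Both arguments go into one space-separated string (placeholder paths).
export KAFKA_OPTS="-Djava.security.auth.login.config=/path.to.zookeeper/config/zookeeper.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"

# Confirm what a start script launched from this shell will see.
echo "$KAFKA_OPTS"
```

Because the export only affects the current shell, Zookeeper, the broker and the clients can each run in their own terminal with a different JAAS file.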
3) Configure Apache Kafka broker

Create 'config/kafka.jaas' with the content:

KafkaServer {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/kafka.keytab" storeKey=true principal="kafka/localhost";
};

Client {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/kafka.keytab" storeKey=true principal="kafka/localhost";
};

The "Client" section is used to talk to Zookeeper. Now edit  'config/server.properties' and add the following properties:
  • listeners=SASL_PLAINTEXT://localhost:9092
  • security.inter.broker.protocol=SASL_PLAINTEXT 
  • sasl.mechanism.inter.broker.protocol=GSSAPI 
  • sasl.enabled.mechanisms=GSSAPI 
  • sasl.kerberos.service.name=kafka 
We will just concentrate on using SASL for authentication, and hence we are using "SASL_PLAINTEXT" as the protocol. For "SASL_SSL" please follow the keystore generation as outlined in the following article. Again, we need to set the "KAFKA_OPTS" environment variable with the JVM arguments:
  • -Djava.security.auth.login.config=/path.to.kafka/config/kafka.jaas 
  • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
Now we can start the server and create a topic as follows:
  • bin/kafka-server-start.sh config/server.properties
  • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
4) Configure Apache Kafka producers/consumers

To make the test-case simpler we added a single principal "client" in the KDC for both the producer and consumer. Create a file called "config/client.jaas" with the content:

KafkaClient {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/client.keytab" storeKey=true principal="client";
};

Edit *both* 'config/producer.properties' and 'config/consumer.properties' and add:
  • security.protocol=SASL_PLAINTEXT
  • sasl.mechanism=GSSAPI 
  • sasl.kerberos.service.name=kafka
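
Since the same three properties go into both files, one way to avoid editing them by hand is a small loop, sketched below. KAFKA_HOME is a hypothetical variable introduced for this example (defaulting to the current directory); the property names and values are the ones listed above.

```shell
# Assumption: KAFKA_HOME is a hypothetical variable pointing at the extracted
# Kafka directory; it defaults to the current directory.
KAFKA_HOME="${KAFKA_HOME:-.}"
mkdir -p "$KAFKA_HOME/config"

# Append the SASL/Kerberos client settings to both property files.
for f in producer.properties consumer.properties; do
    cat >> "$KAFKA_HOME/config/$f" <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
EOF
done
```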
Now set the "KAFKA_OPTS" environment variable with the JVM arguments:
  • -Djava.security.auth.login.config=/path.to.kafka/config/client.jaas 
  • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
We should now be all set. Start the producer and consumer via:
  • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test --producer.config config/producer.properties
  • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties --new-consumer

Tuesday, May 9, 2017

Securing Apache Hadoop Distributed File System (HDFS) - part VI

This is the sixth and final article in a series of posts on securing HDFS. In the second and third posts we looked at how to use Apache Ranger to authorize access to data stored in HDFS. In the fifth post, we looked at how to configure HDFS to authenticate users via Kerberos. In this post we will combine both scenarios, that is we will use Apache Ranger to authorize access to HDFS, which is secured using Kerberos.

1) Authenticating to Apache Ranger

Follow the fifth tutorial to set up HDFS using Kerberos for authentication. Then follow the second tutorial to install the Apache Ranger HDFS plugin. The Ranger HDFS plugin will not be able to download new policies from Apache Ranger, as we have not configured Ranger to be able to authenticate clients via Kerberos. Edit 'conf/ranger-admin-site.xml' in the Apache Ranger Admin service and edit the following properties:
  • ranger.spnego.kerberos.principal: HTTP/localhost@hadoop.apache.org
  • ranger.spnego.kerberos.keytab: Path to Kerby ranger.keytab
  • hadoop.security.authentication: kerberos
Now we need to configure Kerberos to use the krb5.conf file generated by Apache Kerby:
  • export JAVA_OPTS="-Djava.security.krb5.conf=<path to Kerby target/krb5.conf>"
Start the Apache Ranger admin service ('sudo -E ranger-admin start' to pass the JAVA_OPTS variable through) and edit the "cl1_hadoop" service that was created in the second tutorial. Under "Add New Configurations" add the following:
  • policy.download.auth.users: hdfs
The Ranger HDFS plugin should now be able to download the policies from the Ranger Admin service and apply authorization accordingly.

2) Authenticating to HDFS

As we have configured HDFS to require Kerberos, we won't be able to see the HDFS directories in the Ranger Admin service when creating policies any more, without making some changes to enable the Ranger Admin service to authenticate to HDFS. Edit 'conf/ranger-admin-site.xml' in the Apache Ranger Admin service and edit the following properties:
  • ranger.lookup.kerberos.principal: ranger/localhost@hadoop.apache.org
  • ranger.lookup.kerberos.keytab: Path to Kerby ranger.keytab
Edit the 'cl1_hadoop' service that we created in the second tutorial and click on 'Test Connection'. This should fail as Ranger is not configured to authenticate to HDFS. Add the following properties:
  • Authentication Type: Kerberos
  • dfs.datanode.kerberos.principal: hdfs/localhost
  • dfs.namenode.kerberos.principal: hdfs/localhost
  • dfs.secondary.namenode.kerberos.principal: hdfs/localhost
Now 'Test Connection' should be successful.

Friday, May 5, 2017

Using SASL to secure the data transfer protocol in Apache Hadoop

The previous blog article showed how to set up a pseudo-distributed Apache Hadoop cluster such that clients are authenticated using Kerberos. The DataNode that we configured authenticates itself by using privileged ports configured in the properties "dfs.datanode.address" and "dfs.datanode.http.address". This requires building and configuring JSVC as well as making sure that we can ssh to localhost without a password as root. An alternative solution (as noted in the article) is to use SASL to secure the data transfer protocol. Here we will briefly show how to do this, building on the configuration given in the previous post.

1) Configuring Hadoop to use SASL for the data transfer protocol

Follow section (2) of the previous post to configure Hadoop to authenticate users via Kerberos. We need to make the following changes to 'etc/hadoop/hdfs-site.xml':
  • dfs.datanode.address: Change the port number here to be a non-privileged port.
  • dfs.datanode.http.address: Change the port number here to be a non-privileged port.
We also need to add the following properties to 'etc/hadoop/hdfs-site.xml':
  • dfs.data.transfer.protection: integrity.
  • dfs.http.policy: HTTPS_ONLY.
Edit 'etc/hadoop/hadoop-env.sh' and comment out the values we added for:
  • HADOOP_SECURE_DN_USER
  • JSVC_HOME
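
The hdfs-site.xml properties are listed above as "name: value" pairs, but the file itself expects XML 'property' elements. As a convenience, the following sketch prints the two new properties in that form so they can be pasted inside the 'configuration' element. The hadoop_property function is a hypothetical helper, not part of Hadoop, and it does not escape XML special characters.

```shell
# Print one Hadoop <property> element for a name/value pair.
# hadoop_property is a hypothetical helper written for this example.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# The two properties added for SASL data transfer protection.
hadoop_property dfs.data.transfer.protection integrity
hadoop_property dfs.http.policy HTTPS_ONLY
```

The same helper works for the port changes and for the 'ssl-server.xml' entries, since all of the Hadoop configuration files share this layout.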
2) Configure SSL keys in ssl-server.xml

The next step is to configure some SSL keys in 'etc/hadoop/ssl-server.xml'. We'll use some sample keys that are used in Apache CXF to run the systests for the purposes of this demo. Download cxfca.jks and bob.jks into 'etc/hadoop'. Now edit 'etc/hadoop/ssl-server.xml' and define the following properties:
  • ssl.server.truststore.location: etc/hadoop/cxfca.jks
  • ssl.server.truststore.password: password
  • ssl.server.keystore.location: etc/hadoop/bob.jks
  • ssl.server.keystore.password: password
  • ssl.server.keystore.keypassword: password
3) Launch Kerby and HDFS and test authorization

Now that we have hopefully configured everything correctly it's time to launch the Kerby based KDC and HDFS. Start Kerby by running the JUnit test as described in the first section of the previous article. Now start HDFS via:
  • sbin/start-dfs.sh
Note that 'sudo sbin/start-secure-dns.sh' is not required as we are now using SASL for the data transfer protocol. Now we can read the file we added to "/data" in the previous article as "alice":
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/hadoop fs -cat /data/LICENSE.txt

Thursday, May 4, 2017

Securing Apache Hadoop Distributed File System (HDFS) - part V

This is the fifth in a series of blog posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. The second post looked at how to use Apache Ranger to authorize access to data stored in HDFS. The third post looked at how Apache Ranger can create "tag" based authorization policies for HDFS using Apache Atlas. The fourth post looked at how to implement transparent encryption for HDFS using Apache Ranger. Up to now, we have not shown how to authenticate users, concentrating only on authorizing local access to HDFS. In this post we will show how to configure HDFS to authenticate users via Kerberos.

1) Set up a KDC using Apache Kerby

If we are going to configure Apache Hadoop to use Kerberos to authenticate users, then we need a Kerberos Key Distribution Center (KDC). Typically most documentation revolves around installing the MIT Kerberos server, adding principals, and creating keytabs etc. However, in this post we will show a simpler way of getting started by using a pre-configured maven project that uses Apache Kerby. Apache Kerby is a subproject of the Apache Directory project, and is a complete open-source KDC written entirely in Java.

A github project that uses Apache Kerby to start up a KDC is available here:
  • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
  • alice@hadoop.apache.org
  • bob@hadoop.apache.org
  • hdfs/localhost@hadoop.apache.org
  • HTTP/localhost@hadoop.apache.org
Keytabs are created in the "target" folder for "alice", "bob" and "hdfs" (where the latter has both the hdfs/localhost + HTTP/localhost principals included). Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory. So all we need to do is to point Hadoop to the keytabs that were generated and the krb5.conf, and it should be able to communicate correctly with the Kerby-based KDC.

2) Configure Hadoop to authenticate users via Kerberos

Download and configure Apache Hadoop as per the first tutorial. For now, we will not enable the Ranger authorization plugin, but rather secure access to the "/data" directory using ACLs, as described in section (3) of the first tutorial, such that "alice" has permission to read the file stored in "/data" but "bob" does not. The next step is to configure Hadoop to authenticate users via Kerberos.

Edit 'etc/hadoop/core-site.xml' and add the following property name/value:
  • hadoop.security.authentication: kerberos
Next edit 'etc/hadoop/hdfs-site.xml' and add the following property name/values to configure Kerberos for the namenode:
  • dfs.namenode.keytab.file: Path to Kerby hdfs.keytab (see above).
  • dfs.namenode.kerberos.principal: hdfs/localhost@hadoop.apache.org
  • dfs.namenode.kerberos.internal.spnego.principal: HTTP/localhost@hadoop.apache.org
Add the exact same property name/values for the secondary namenode, except using the property name "secondary.namenode" instead of "namenode". We also need to configure Kerberos for the datanode:
  • dfs.datanode.data.dir.perm: 700
  • dfs.datanode.address: 0.0.0.0:1004
  • dfs.datanode.http.address: 0.0.0.0:1006
  • dfs.web.authentication.kerberos.principal: HTTP/localhost@hadoop.apache.org
  • dfs.datanode.keytab.file: Path to Kerby hdfs.keytab (see above).
  • dfs.datanode.kerberos.principal: hdfs/localhost@hadoop.apache.org
  • dfs.block.access.token.enable: true 
As we are not using SASL to secure the data transfer protocol (see here), we need to download and configure JSVC into JSVC_HOME. Then edit 'etc/hadoop/hadoop-env.sh' and add the following properties:
  • export HADOOP_SECURE_DN_USER=(the user you are running HDFS as)
  • export JSVC_HOME=(path to JSVC as above)
  • export HADOOP_OPTS="-Djava.security.krb5.conf=<path to Kerby target/krb5.conf>"
You also need to make sure that you can ssh to localhost as "root" without specifying a password.

3) Launch Kerby and HDFS and test authorization

Now that we have hopefully configured everything correctly it's time to launch the Kerby based KDC and HDFS. Start Kerby by running the JUnit test as described in the first section. Now start HDFS via:
  • sbin/start-dfs.sh
  • sudo sbin/start-secure-dns.sh
Now let's try to read the file in "/data" using "bin/hadoop fs -cat /data/LICENSE.txt". You should see an exception as we have no credentials. Let's try to read as "alice" now:
  • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
  • kinit -k -t /pathtokerby/target/alice.keytab alice
  • bin/hadoop fs -cat /data/LICENSE.txt
This should be successful. However the following should result in a "Permission denied" message:
  • kdestroy
  • kinit -k -t /pathtokerby/target/bob.keytab bob
  • bin/hadoop fs -cat /data/LICENSE.txt