This is the documentation for Cloudera Enterprise 5.8.x. Documentation for other versions is available at Cloudera Documentation.

Synchronizing HDFS ACLs and Sentry Permissions

This topic introduces an HDFS-Sentry plugin that allows you to configure synchronization of Sentry privileges with HDFS ACLs for specific HDFS directories.

Previously, when Sentry was used to secure data in Hive or Impala, it was difficult to securely share the same HDFS data files with other components such as Pig, MapReduce, Spark, and HDFS clients. You had two options:
  • Set ownership of the entire Hive warehouse to hive:hive and deny other components any access to the data. This is secure, but it does not allow sharing.
  • Use HDFS ACLs and synchronize Sentry privileges and HDFS ACLs manually. For example, if a user only has the Sentry SELECT privilege on a table, that user should only be able to read the table data files, and not write to those HDFS files.

Introduction

To solve the problem stated above, CDH 5.3 introduces integration of Sentry and HDFS permissions that automatically keeps HDFS ACLs in sync with the privileges configured with Sentry. This feature offers the easiest way to share data between Hive, Impala, and other components such as MapReduce and Pig, while setting permissions for that data with just one set of rules through Sentry. It maintains the ability of Hive and Impala to set permissions on views, in addition to tables, while access to data outside of Hive and Impala (for example, reading files off HDFS) requires table permissions. HDFS permissions for some or all of the files that are part of tables defined in the Hive Metastore will now be controlled by Sentry.

This change consists of three components:
  • An HDFS NameNode plugin
  • A Sentry-Hive Metastore plugin
  • A Sentry Service plugin

With synchronization enabled, Sentry will translate permissions on tables to the appropriate corresponding HDFS ACL on the underlying table files in HDFS. For example, if a user group is assigned to a Sentry role that has SELECT permission on a particular table, then that user group will also have read access to the HDFS files that are part of that table. When you list those files in HDFS, this permission will be listed as an HDFS ACL.

Note that when Sentry was enabled, the hive user/group was given ownership of all files/directories in the Hive warehouse (/user/hive/warehouse). Hence, the resulting synchronized Sentry permissions will reflect this fact. If you skipped that step, Sentry permissions will be based on the existing Hive warehouse ACLs. Sentry will not automatically grant ownership to the hive user.

The mapping of Sentry privileges to HDFS ACLs is as follows:

  • SELECT privilege -> Read access on the file.
  • INSERT privilege -> Write access on the file.
  • ALL privilege -> Read and Write access on the file.

Note that you must explicitly specify the path prefix to the Hive warehouse (default: /user/hive/warehouse) and any other directories that must be managed by Sentry. This procedure is described in the following sections.
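
The privilege mapping listed above can be written as a small lookup. This is only an illustrative sketch: the function name sentry_privilege_to_hdfs_access is hypothetical, and it prints the access described in this topic as plain words rather than the literal ACL string that HDFS displays, which may differ (for example, by including an execute bit).

```shell
# Hypothetical helper (not part of Sentry or HDFS): map a Sentry privilege
# name to the HDFS access that this topic says it translates to.
sentry_privilege_to_hdfs_access() {
  case "$1" in
    SELECT) echo "read" ;;
    INSERT) echo "write" ;;
    ALL)    echo "read,write" ;;
    *)      echo "unknown privilege: $1" >&2; return 1 ;;
  esac
}
```

For example, sentry_privilege_to_hdfs_access SELECT prints read, and an unrecognized privilege name returns a non-zero status.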
  Important:
  • With synchronization enabled, your ability to set HDFS permissions for those files is disabled. Permissions for those particular files can be set only through Sentry, and when examined through HDFS these permissions appear as HDFS ACLs. A configurable set of users, such as hive and impala, will have full access to the files automatically. This ensures that a key requirement of using Sentry with Hive and Impala — giving these processes full access to regulate permissions on underlying data files — is met automatically.
  • Tables that are not associated with Sentry, that is, have no user with Sentry privileges to access them, will retain their old ACLs.

  • Synchronized privileges are not persisted to HDFS. This means that when this feature is disabled, HDFS privileges will return to their original values.

  • Setting HDFS ACLs on Sentry-managed paths will not affect the original HDFS ACLs. That is, if you set an ACL for a Hive object that also falls under the Sentry-managed path prefixes, no action will be taken. If the path does not point to a Hive object managed by Sentry, HDFS ACLs will be set as expected.

    Removing HDFS ACLs from paths works the same way. If you attempt to remove an ACL associated with a Hive object managed by Sentry, no action will be taken. In all other cases, the ACL will be removed as expected.

  • With HDFS-Sentry sync enabled, if the NameNode plugin is unable to communicate with the Sentry Service for a particular period of time (configurable by the sentry.authorization-provider.cache-stale-threshold.ms property), permissions for all directories under Sentry-managed path prefixes, irrespective of whether those file paths correspond to Hive warehouse objects, will be set to hive:hive.

  • Sentry HDFS synchronization does not support Hive metastore HA.

  • Column-level access control for access from Spark SQL is not supported by the HDFS-Sentry plugin.

Prerequisites

  • CDH 5.3.0 (or higher)
  • (Strongly Recommended) Implement Kerberos authentication on your cluster.
The following conditions must also be true when enabling Sentry-HDFS synchronization. Failure to comply with any of these will result in validation errors.
  • You must use the Sentry service, not policy file-based authorization.
  • Enabling HDFS Extended Access Control Lists (ACLs) is required.
  • There must be exactly one Sentry service dependent on HDFS.
  • The Sentry service must have exactly one Sentry Server role.
  • The Sentry service must have exactly one dependent Hive service.
  • The Hive service must have exactly one Hive Metastore role (that is, High Availability should not be enabled).
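
As a quick sanity check for the HDFS ACL prerequisite, the following sketch reads the dfs.namenode.acls.enabled key with hdfs getconf. The function name hdfs_acls_enabled is hypothetical. Note the assumption that the command reports the configuration visible on the host where it runs, so run it on the NameNode host if you want to inspect the NameNode's own settings.

```shell
# Hypothetical helper: succeed (status 0) only if the local HDFS configuration
# reports dfs.namenode.acls.enabled=true.
hdfs_acls_enabled() {
  local value
  value="$(hdfs getconf -confKey dfs.namenode.acls.enabled 2>/dev/null)" || return 1
  [ "$value" = "true" ]
}
```

A possible use is: hdfs_acls_enabled && echo "ACLs enabled" || echo "ACLs not enabled".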

Enabling the HDFS-Sentry Plugin Using Cloudera Manager

  1. Go to the HDFS service.
  2. Click the Configuration tab.
  3. Select Scope > HDFS (Service-Wide).
  4. Select Category > All.
  5. Type Check HDFS Permissions in the Search box.
  6. Select Check HDFS Permissions.
  7. Select Enable Sentry Synchronization.
  8. Locate the Sentry Synchronization Path Prefixes property or search for it by typing its name in the Search box.
  9. Edit the Sentry Synchronization Path Prefixes property to list HDFS path prefixes where Sentry permissions should be enforced. Multiple HDFS path prefixes can be specified. By default, this property points to /user/hive/warehouse and must always be non-empty. HDFS privilege synchronization will not occur for tables located outside the HDFS regions listed here.
  10. Click Save Changes.
  11. Restart the cluster. Note that it may take an additional two minutes after cluster restart for privilege synchronization to take effect.
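
Because synchronization applies only to tables located under the configured path prefixes, it can help to check a table location against the prefix list before expecting ACLs to appear. The sketch below is illustrative only: the function name is hypothetical, and it assumes prefixes are compared on whole path components and supplied as a comma-separated list without surrounding spaces.

```shell
# Hypothetical helper: succeed if PATH ($1) equals, or lies beneath, one of the
# comma-separated prefixes in $2. Matching on whole path components is an
# assumption made for this sketch.
is_under_sentry_prefix() {
  local path="${1%/}" prefixes="$2" prefix
  local IFS=','
  for prefix in $prefixes; do
    prefix="${prefix%/}"
    case "$path" in
      "$prefix"|"$prefix"/*) return 0 ;;
    esac
  done
  return 1
}
```

For example, is_under_sentry_prefix /user/hive/warehouse/sales.db/orders /user/hive/warehouse,/data/external succeeds, while a path such as /tmp/staging does not. The paths /data/external, sales.db/orders, and /tmp/staging are made up for illustration.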

Enabling the HDFS-Sentry Plugin Using the Command Line

  Important:
  • You can use either Cloudera Manager or the following command-line instructions to complete this configuration.
  • This information applies specifically to CDH 5.8.x. If you use an earlier version of CDH, see the documentation for that version located at Cloudera Documentation.

To enable the Sentry plugins on an unmanaged cluster, you must explicitly allow the hdfs user to interact with Sentry, and install the plugin packages as described in the following sections.

Allowing the hdfs user to connect with Sentry

For an unmanaged cluster, add hdfs to the sentry.service.allow.connect property in sentry-site.xml.
<property>
    <name>sentry.service.allow.connect</name>
    <value>impala,hive,hue,hdfs</value>
</property>
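
To confirm that the edit took effect, a rough check such as the following can be used. It is a sketch only: the function name is hypothetical, and it assumes the <name> and <value> elements sit on consecutive lines, as in the snippet above. It is not a general XML parser.

```shell
# Hypothetical helper: succeed if USER ($1) is listed in the
# sentry.service.allow.connect value of the given sentry-site.xml ($2).
# Assumes <value> is on the line right after <name>.
sentry_allows_user() {
  local user="$1" file="$2" value
  value="$(grep -A1 '<name>sentry.service.allow.connect</name>' "$file" \
            | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')"
  case ",$value," in
    *",$user,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, sentry_allows_user hdfs /path/to/sentry-site.xml succeeds once hdfs has been added to the list. Replace the path with the location of sentry-site.xml on your host.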

Installing the HDFS-Sentry Plugin

  Note: Install Cloudera Repository

Before using the instructions on this page to install the package, install the Cloudera yum, zypper/YaST or apt repository, and install or upgrade CDH 5 and make sure it is functioning correctly. For instructions, see Installing the Latest CDH 5 Release.

Use the following instructions, depending on your operating system, to install the sentry-hdfs-plugin package. The package must be installed (at a minimum) on the following hosts:
  • The hosts running the NameNode and Secondary NameNode
  • The host running the Hive Metastore
  • The host running the Sentry Service

OS                  Command
RHEL-compatible     $ sudo yum install sentry-hdfs-plugin
SLES                $ sudo zypper install sentry-hdfs-plugin
Ubuntu or Debian    $ sudo apt-get install sentry-hdfs-plugin

Configuring the HDFS NameNode Plugin

Add the following properties to the hdfs-site.xml file on the NameNode host.
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>

<property>
<name>dfs.namenode.authorization.provider.class</name>
<value>org.apache.sentry.hdfs.SentryAuthorizationProvider</value>
</property>

<property>
<name>dfs.permissions</name>
<value>true</value>
</property>

<!-- Comma-separated list of HDFS path prefixes where Sentry permissions should be enforced. -->
<!-- Privilege synchronization will occur only for tables located in HDFS regions specified here. -->
<property>
<name>sentry.authorization-provider.hdfs-path-prefixes</name>
<value>/user/hive/warehouse</value>  
</property>

<property>
<name>sentry.hdfs.service.security.mode</name>
<value>kerberos</value>  
</property>

<property>
<name>sentry.hdfs.service.server.principal</name>
<value>SENTRY_SERVER_PRINCIPAL</value> <!-- for example: sentry/_HOST@VPC.CLOUDERA.COM -->
</property>

<property>
<name>sentry.hdfs.service.client.server.rpc-port</name>
<value>SENTRY_SERVER_PORT</value>
</property>

<property>
<name>sentry.hdfs.service.client.server.rpc-address</name>
<value>SENTRY_SERVER_HOST</value>  
</property>
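
After editing hdfs-site.xml, a simple presence check can catch a property that was left out. The following is a minimal sketch with a hypothetical function name. It looks only for the property names listed above and does not validate their values or the XML structure.

```shell
# Hypothetical helper: print the name of every NameNode plugin property from
# this section that does not appear in the given hdfs-site.xml ($1).
missing_sentry_hdfs_properties() {
  local file="$1" name
  for name in \
      dfs.namenode.acls.enabled \
      dfs.namenode.authorization.provider.class \
      dfs.permissions \
      sentry.authorization-provider.hdfs-path-prefixes \
      sentry.hdfs.service.security.mode \
      sentry.hdfs.service.server.principal \
      sentry.hdfs.service.client.server.rpc-port \
      sentry.hdfs.service.client.server.rpc-address; do
    grep -qF "<name>${name}</name>" "$file" || echo "$name"
  done
}
```

Running missing_sentry_hdfs_properties against your hdfs-site.xml prints nothing when all eight properties are present.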

Configuring the Hive Metastore Plugin

Add the following properties to hive-site.xml on the Hive Metastore Server host.
<property>
<name>sentry.metastore.plugins</name>
<value>org.apache.sentry.hdfs.MetastorePlugin</value>
</property>

<property>
<name>sentry.hdfs.service.client.server.rpc-port</name>
<value>SENTRY_SERVER_PORT</value>
</property>

<property>
<name>sentry.hdfs.service.client.server.rpc-address</name>
<value>SENTRY_SERVER_HOSTNAME</value>
</property>

<property>
<name>sentry.hdfs.service.client.server.rpc-connection-timeout</name>
<value>200000</value>
</property>

<property>
<name>sentry.hdfs.service.security.mode</name>
<value>kerberos</value>
</property>

<property>
<name>sentry.hdfs.service.server.principal</name>
<value>SENTRY_SERVER_PRINCIPAL</value> <!-- for example: sentry/_HOST@VPC.CLOUDERA.COM -->
</property>

Configuring the Sentry Service Plugin

Add the following properties to the sentry-site.xml file on the NameNode host.
<property>
<name>sentry.service.processor.factories</name>
<value>org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessorFactory,
org.apache.sentry.hdfs.SentryHDFSServiceProcessorFactory</value>
</property>

<property>
<name>sentry.policy.store.plugins</name>
<value>org.apache.sentry.hdfs.SentryPlugin</value>
</property>
  Important: Once all the configuration changes are complete, restart your cluster. Note that it may take an additional two minutes after cluster restart for privilege synchronization to take effect.

Testing the Sentry Synchronization Plugins

The following tasks should help you make sure that Sentry-HDFS synchronization has been enabled and configured correctly:

For a folder that has been enabled for the plugin, such as the Hive warehouse, try accessing the files in that folder from outside Hive and Impala. To do this, you need to know which tables those HDFS files belong to and the Sentry permissions on those tables.

View or modify the Sentry permission settings on those tables using one of the following tools:
  • (Recommended) Hue's Security application
  • HiveServer2 CLI
  • Impala CLI

Then access the table files directly in HDFS. For example:
  • List the files inside the folder and verify that the file permissions shown in HDFS (including ACLs) match what was configured in Sentry.
  • Run a MapReduce, Pig, or Spark job that accesses those files. Pick any tool other than HiveServer2 and Impala.
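
The ACL check described above can be sketched with the standard hdfs dfs -getfacl command. The function name group_acl_entry, the group name analysts, and the table path used below are hypothetical. The sketch assumes that Sentry-derived permissions appear as named group entries in the getfacl output.

```shell
# Hypothetical helper: print the named-group ACL entry for GROUP ($2) on the
# HDFS path ($1). Returns non-zero if the group has no entry on that path.
group_acl_entry() {
  local path="$1" group="$2"
  hdfs dfs -getfacl "$path" | grep "^group:${group}:"
}
```

For example, after granting a role with SELECT on a table to the analysts group, group_acl_entry /user/hive/warehouse/sales.db/orders analysts should print an entry for that group. If nothing is printed, allow a couple of minutes for synchronization and confirm that the table location falls under a configured path prefix.
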
Page generated July 8, 2016.