The recently published survey from SANS, “Enabling Big Data by Removing Security and Compliance Barriers,” revealed that a majority of organizations are using Big Data systems to process highly sensitive data sets. The findings of this survey are consistent with the use cases Cloudera customers have deployed for many years, taking advantage of Cloudera’s continued investments in comprehensive, compliance-ready security that make these use cases possible. In fact, Cloudera is the only Hadoop distribution to have passed a compliance audit, and our customer MasterCard has been operating a PCI-certified enterprise data hub since 2014.
That said, even Neil Young would agree that security never sleeps, so Cloudera is continuing to push the security envelope and enable additional use cases. One of the most common requests is to enable fine-grained access control to structured data that is consistently enforced across multiple compute frameworks, including Impala, Spark, and MapReduce. The challenge with this today is driven by the file-based nature of HDFS POSIX controls and Extended Access Control Lists (ACLs): a user either has access to the entire data set in a file or no access at all. To address this, Cloudera is introducing RecordService, a new core security layer that centrally enforces fine-grained access control policy. Complementing Apache Sentry, which provides unified policy management, Cloudera now delivers unified row- and column-based security, and dynamic data masking, to every Hadoop access path. This combination of RecordService and Sentry allows security administrators to define fine-grained access control policies that will be uniformly enforced for Impala, Spark, Pig, Hive, MapReduce and Solr, with no performance impact.
Let’s explore three Spark use cases that, without RecordService, could only be achieved by executing a long, difficult-to-maintain set of workarounds, but can now be quickly implemented using RecordService and Sentry:
- Restricting access at the column level based upon user role
- Restricting access at the row and column level based upon user role
- Dynamic data masking based upon user role
For all three examples, we will use a simple set of structured transaction information.
1) Restricting access at the column-level based upon user role
For our first authorization use case, we have two classes of analysts. One class, Analyst I, is tasked with using Spark to analyze transactions based upon Transaction Type, Country Code, and Amount. Analyst I should never have access to Customer ID or Account Number information. The second analyst class, Analyst II, is tasked with using Spark to analyze transactions by Customer ID and Account Number. Analyst II will have access to the entire table.
Without RecordService and Sentry, security architects would be required to make a copy of the Transactions table that excludes Customer ID and Account Number and then provide Analyst I access to the copy. Every time the Transactions table is updated, the copy would need to be updated as well. Failure to keep these files in sync, or race conditions during synchronization (e.g., an Analyst I accesses the file while it is being updated), would have to be mitigated to ensure the validity of Analyst I’s work.
By contrast, to implement these controls with Sentry and RecordService, a security architect would simply define the appropriate policies in Sentry and link these policies to the appropriate groups in Active Directory or LDAP. The policies are then uniformly and consistently enforced as data requests pass through RecordService, whether an analyst is using Impala, Spark, or some other compute framework.
There is no longer a need to create a second copy of the data, which eliminates the synchronization and race condition issues. Clean and simple.
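To make the idea concrete, here is a minimal Python sketch of column-level restriction applied at read time over a single copy of the data. It only illustrates the semantics described above; it is not the RecordService or Sentry API, and the role names, the `COLUMN_POLICY` table, the `read_transactions` helper, and the sample rows are all hypothetical.

```python
# Hypothetical sample rows; the field names mirror the columns named in the text.
TRANSACTIONS = [
    {"customer_id": "C001", "account_number": "000011112222",
     "transaction_type": "purchase", "country_code": "US", "amount": 25.00},
    {"customer_id": "C002", "account_number": "333344445555",
     "transaction_type": "refund", "country_code": "UK", "amount": 12.50},
]

# Hypothetical policy table: role -> columns that role may read.
COLUMN_POLICY = {
    "analyst_i": ("transaction_type", "country_code", "amount"),
    "analyst_ii": ("customer_id", "account_number",
                   "transaction_type", "country_code", "amount"),
}


def read_transactions(role, rows=TRANSACTIONS):
    """Return every row, projected onto the columns the role may see."""
    allowed = COLUMN_POLICY[role]  # raises KeyError for an unknown role
    return [{column: row[column] for column in allowed} for row in rows]
```

Both roles read from the same `TRANSACTIONS` list, so there is no second copy to keep in sync; only the projection differs per role.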
2) Restricting access at the row- and column-level based upon user role
Our second authorization use case builds upon the first. In this example, internal corporate controls mandate that Analyst I and Analyst II access to transaction data must also be restricted by Country Code. Assuming there are three country codes in the data (US, EU, and UK), we end up with the following analyst classifications:
- Analyst I – US, Analyst I – EU, Analyst I – UK
- Analyst II – US, Analyst II – EU, Analyst II – UK
The access for Analyst I – US and Analyst II – US is illustrated below.
As one would imagine, the workaround here is similar to the workaround for our first use case, but even harder to maintain. Instead of one copy of Transactions, you would have to create six distinct copies, as none of the analyst groups should ever have access to the entire data set. More copying, more synchronization, more race conditions. In addition, access controls for the data copies must be put in place to ensure that, for example, Analyst I – US is never given access to a file intended only for Analyst II – UK.
Again, implementation in Sentry is straightforward: define policies for each Analyst class, link policies to appropriate groups, and those policies will be uniformly enforced. So in addition to eliminating the copying, synchronization, and race condition issues, this use case also demonstrates that Sentry and RecordService not only provide column- and row-level access restrictions to data sets, but also provide a scalable, maintainable strategy for handling the multitude of access permutations present inside today’s enterprise.
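As a rough illustration of how the six classifications reduce to one rule shape (a column list plus a country filter), consider the Python sketch below. It is not Sentry policy syntax or the RecordService API; the `POLICIES` table, the `read_transactions` helper, and the sample rows are hypothetical and exist only to show the semantics.

```python
from itertools import product

# Columns each analyst class may read (field names follow the text).
COLUMNS_BY_CLASS = {
    "analyst_i": ("transaction_type", "country_code", "amount"),
    "analyst_ii": ("customer_id", "account_number",
                   "transaction_type", "country_code", "amount"),
}
COUNTRY_CODES = ("US", "EU", "UK")

# Hypothetical policy table: one entry per (class, country) pair, six in total.
POLICIES = {
    (analyst_class, country): {
        "columns": COLUMNS_BY_CLASS[analyst_class],
        "country_code": country,
    }
    for analyst_class, country in product(COLUMNS_BY_CLASS, COUNTRY_CODES)
}


def read_transactions(analyst_class, country, rows):
    """Filter rows by the policy's country, then project the allowed columns."""
    policy = POLICIES[(analyst_class, country)]
    return [
        {column: row[column] for column in policy["columns"]}
        for row in rows
        if row["country_code"] == policy["country_code"]
    ]


# Hypothetical sample rows, invented for illustration.
SAMPLE_ROWS = [
    {"customer_id": "C001", "account_number": "000011112222",
     "transaction_type": "purchase", "country_code": "US", "amount": 25.00},
    {"customer_id": "C002", "account_number": "333344445555",
     "transaction_type": "refund", "country_code": "UK", "amount": 12.50},
]
```

Calling `read_transactions("analyst_i", "US", SAMPLE_ROWS)` yields only the US row without Customer ID or Account Number, while `read_transactions("analyst_ii", "UK", SAMPLE_ROWS)` yields the full UK row. Adding a country code grows the policy table, not the number of data copies.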
3) Dynamic data masking based upon user role
For our third authorization use case, we have a business application that uses Spark to pull lists of transactions by Customer ID. This business application is used by customer support representatives, and again there are two levels of access. Customer Agent I can see the entire set of transaction information for a given customer, but can only see a redacted view of the Account Number (e.g., XXXXXXXX1234). Customer Agent II can see the entire set of transaction information, including the full Account Number. Additionally, there is an Account Services team that requires the same access as the Customer Agent I team.
Delivering this without Sentry and RecordService involves all of the copying, synchronization, and race conditions mentioned in the previous use cases, and additionally requires that the process that creates the copies mask the Account Number in each copy.
Implementation with Sentry and RecordService requires the creation of two policies: one policy that is used for both Customer Agent I and Account Services, and a second policy for Customer Agent II. This use case highlights yet another benefit of Sentry, which is the ability to define policies that can be used for multiple groups, as opposed to having to configure access for each group independently.
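The masking rule and the policy sharing can be sketched in a few lines of Python. This is an assumption-laden illustration, not RecordService or Sentry code: the policy names, the group names, and the `mask_account_number` and `transactions_for_customer` helpers are hypothetical, and the rule of keeping the last four characters is inferred from the XXXXXXXX1234 example above.

```python
def mask_account_number(account_number, visible=4):
    """Replace all but the last `visible` characters with 'X'."""
    hidden = max(len(account_number) - visible, 0)
    return "X" * hidden + account_number[hidden:]


# Two hypothetical policies: each maps an account number to what the reader sees.
POLICIES = {
    "masked_account": mask_account_number,
    "full_account": lambda account_number: account_number,
}

# Three groups share two policies, mirroring the use case in the text.
GROUP_POLICY = {
    "customer_agent_i": "masked_account",
    "account_services": "masked_account",
    "customer_agent_ii": "full_account",
}


def transactions_for_customer(group, customer_id, rows):
    """Return one customer's rows with the group's account-number policy applied."""
    view = POLICIES[GROUP_POLICY[group]]
    return [
        {**row, "account_number": view(row["account_number"])}
        for row in rows
        if row["customer_id"] == customer_id
    ]


# Hypothetical sample rows, invented for illustration.
SAMPLE_ROWS = [
    {"customer_id": "C001", "account_number": "123456781234",
     "transaction_type": "purchase", "country_code": "US", "amount": 25.00},
    {"customer_id": "C002", "account_number": "333344445555",
     "transaction_type": "refund", "country_code": "UK", "amount": 12.50},
]
```

Because masking is applied when the rows are read, the stored data is never altered, and onboarding another group with the same needs is a one-line change to `GROUP_POLICY`.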
In order for organizations to derive maximum value from their data, without the burden of creating and maintaining multiple copies, organizations must use a Hadoop platform with the requisite security controls. Sentry and RecordService provide organizations with the unified, fine-grained access controls and centralized policy management needed to enable multiple audiences with varying access levels to operate on a single data set, using any of the access paths available. This in turn makes the data stored in Cloudera’s enterprise data hub more valuable as it can be safely shared with more users in the organization.
RecordService is available as a public beta under the Apache open source license, with intent to donate to the ASF incubator. To get started, download it now at cloudera.com/downloads. You can also start contributing to this project at http://github.com/cloudera/recordservice.
For more details on the motivations and design behind RecordService, check out the Developer Blog.