General Data Protection Regulation (GDPR) and Data Science


If your organization collects data about people in the European Union (EU), you should know about the General Data Protection Regulation (GDPR). GDPR defines and strengthens data protection for consumers and harmonizes data security rules within the EU. The regulation was adopted on April 27, 2016. It takes effect in less than a year, on May 25, 2018.

Much of the commentary about GDPR focuses on how the new rules affect collection and management of personally identifiable information (PII) about consumers. However, GDPR will also change how organizations practice data science. That is the subject of this blog post.

One caveat before we begin. GDPR is complicated. In some areas, GDPR defines high-level outcomes, but delegates detailed compliance rules to a new entity, the European Data Protection Board. GDPR intersects with many national laws and regulations; organizations that conduct business in the United Kingdom must also assess the unknown impacts of Brexit. The information contained in this document is not intended to be and should not be construed to be legal advice, and we recommend that organizations subject to GDPR engage expert management and legal counsel in developing a compliance plan.

GDPR and Data Science

GDPR affects data science practice in three areas. First, GDPR imposes limits on data processing and consumer profiling. Second, for organizations that use automated decision-making, GDPR creates a “right to an explanation” for consumers. Third, GDPR holds firms accountable for bias and discrimination in automated decisions.  

Data processing and profiling. GDPR imposes controls on data processing and consumer profiling; these rules supplement the requirements for data collection and management. GDPR defines profiling as:

Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular, to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements.

In general, organizations may process personal data when they can demonstrate a legitimate business purpose (such as a customer or employment relationship) that does not conflict with the consumer’s rights and freedoms. Organizations must inform consumers about profiling and its consequences, and provide them with the opportunity to opt out.

The Right to an Explanation. GDPR grants consumers the right “not to be subject to a decision…which is based solely on automated processing and which produces legal effects (on the subject).” Some experts characterize this rule as a “right to an explanation.” GDPR does not precisely define the scope of decisions covered by this section. The United Kingdom’s Information Commissioner’s Office (ICO) says that the right is “very likely” to apply to credit applications, recruitment, and insurance decisions. Other agencies, law courts or the European Data Protection Board may define the scope differently.

Bias and Discrimination. When organizations use automated decision-making, they must prevent discriminatory effects based on racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status, or sexual orientation, and must avoid measures that have such an effect. Moreover, they may not use these special categories of personal data in automated decisions except under defined circumstances.

How GDPR Affects Data Science

How will the new rules affect the way data science teams do their work? Let’s examine the impact in three key areas.

Data Processing and Profiling. The new rules allow organizations to process personal data for specific business purposes, to fulfill contractual commitments, and to comply with national laws. A credit card issuer may process personal data to determine a cardholder’s available credit; a bank may screen transactions for money laundering as directed by regulators. Consumers may not opt out of processing and profiling performed under these “safe harbors.”

However, organizations may not use personal data for purposes other than its original purpose without securing additional permission from the consumer. This requirement could limit the amount of data available for exploratory data science.

GDPR’s constraints on data processing and profiling apply only to data that identifies an individual consumer.

The principles of data protection should therefore not apply to … personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

The clear implication is that organizations subject to GDPR must build robust anonymization into data engineering and data science processes.
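As a rough illustration of what building anonymization into a data pipeline can involve, the sketch below drops direct identifiers, coarsens quasi-identifiers, and then measures the smallest group of records that look alike (a k-anonymity check). The field names, the ten-year age bands, and the two-character postcode truncation are hypothetical choices made for this example. Passing a check like this does not by itself establish that data is anonymous in the legal sense.

```python
from collections import Counter

# Hypothetical field names, chosen only for this illustration.
DIRECT_IDENTIFIERS = {"name", "email"}
QUASI_IDENTIFIERS = ("age_band", "postcode_area")


def generalize(record):
    """Drop direct identifiers and coarsen quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    decade = (out.pop("age") // 10) * 10
    out["age_band"] = f"{decade}-{decade + 9}"
    out["postcode_area"] = out.pop("postcode")[:2]
    return out


def smallest_group(records, keys=QUASI_IDENTIFIERS):
    """Size of the smallest set of records sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Made-up records for the example.
raw = [
    {"name": "A", "email": "a@example.com", "age": 34, "postcode": "SW1A 1AA", "spend": 120},
    {"name": "B", "email": "b@example.com", "age": 37, "postcode": "SW1P 3BU", "spend": 80},
    {"name": "C", "email": "c@example.com", "age": 52, "postcode": "M1 1AE", "spend": 45},
    {"name": "D", "email": "d@example.com", "age": 58, "postcode": "M1 4BT", "spend": 60},
]

released = [generalize(r) for r in raw]
k = smallest_group(released)  # every released record shares its quasi-identifiers with k - 1 others
```

A pipeline could refuse to release a data set to analytic users when k falls below a threshold the organization has agreed on with its legal counsel.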

Explainable Decisions. There is some controversy about the impact of this provision. Some cheer it; others disapprove; still others deny that GDPR creates such a right. One expert in EU law argues that the requirement may force data scientists to stop using opaque techniques (such as deep learning), which can be hard to explain and interpret.

There is no question that GDPR will affect how organizations handle certain decisions. The impact on data scientists, however, may be exaggerated:

— The “right to an explanation” is limited in scope. As noted above, one regulator interprets the law to cover credit applications, recruitment, and insurance decisions. Other regulators or law courts may interpret the rules differently, but it’s clear that the right applies in specific settings.

— In many jurisdictions, a “right to an explanation” already exists and has existed for years. For example, regulations governing credit decisions in the United Kingdom are similar to those in the United States, where issuers must provide an explanation for adverse credit decisions based on credit bureau information. GDPR expands the scope of these rules, but tools for compliance are commercially available today.

— Lending institutions and other businesses that must decline some customer requests already understand that explaining an adverse decision clearly is good business practice.

— The need to deliver an explanation affects decision engines but need not influence the choice of methods data scientists use for model training. Techniques available today make it possible to “reverse-engineer” interpretable explanations for model scores even if the data scientist uses an opaque method to train the model.
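One simple way to derive explanations from an opaque model, sketched here under stated assumptions, is to perturb one input at a time: replace each feature with a reference value and see how much the score improves. The scoring function, the feature names, and the baseline values below are all hypothetical stand-ins, and this is only one of several techniques for explaining model scores. It ignores interactions between features, which more sophisticated methods try to account for.

```python
def opaque_score(applicant):
    """Stand-in for a trained model; any callable that returns a score would do."""
    score = 0.5
    score += 0.3 if applicant["income"] >= 40000 else -0.2
    score -= 0.1 * applicant["missed_payments"]
    score += 0.05 if applicant["years_at_address"] >= 3 else 0.0
    return score


def reason_codes(model, applicant, baseline, top_n=2):
    """Rank features by how much the score rises when each is set to its baseline value."""
    actual = model(applicant)
    gains = {}
    for feature, reference_value in baseline.items():
        probe = dict(applicant, **{feature: reference_value})
        gains[feature] = model(probe) - actual
    ranked = sorted(gains, key=gains.get, reverse=True)
    # Keep only features whose baseline value would have improved the score.
    return [f for f in ranked if gains[f] > 0][:top_n]


# Hypothetical reference profile and applicant.
baseline = {"income": 55000, "missed_payments": 0, "years_at_address": 5}
applicant = {"income": 28000, "missed_payments": 3, "years_at_address": 4}

reasons = reason_codes(opaque_score, applicant, baseline)
```

Because the function treats the model as a black box, the same approach works whether the score comes from a scorecard or a neural network. The decision engine can then map each returned feature name to a plain-language reason for the consumer.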

A challenge for data science is that “how” a decision was made may not always be transparent, particularly as deep learning and AI techniques are applied. This means there are good reasons for data scientists to consider using interpretable techniques. Financial services giant Capital One considers them to be a potent weapon against hidden bias (discussed below). However, one should not conclude that GDPR will force data scientists to limit the techniques they use to train predictive models.

Bias and Discrimination. GDPR requires organizations to avoid discriminatory effects in automated decisions. This rule places an extra burden of due diligence on data scientists who build predictive models, and on the procedures organizations use to approve predictive models for production.

Organizations that use automated decision-making must:

  • Ensure fair and transparent processing
  • Use appropriate mathematical and statistical procedures
  • Establish measures to ensure the accuracy of subject data employed in decisions

GDPR restricts the use of special categories of personal data (such as racial or ethnic origin, and other enumerated classes) in automated decisions. However, it is not sufficient to just avoid using this data. The mandate against discriminatory outcomes means data scientists must also take steps to prevent indirect bias from proxy variables, multicollinearity or other causes. For example, an automated decision that uses a seemingly neutral characteristic, such as a consumer’s residential neighborhood, may inadvertently discriminate against ethnic minorities.
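To make the neighborhood example concrete, here is a minimal sketch of an outcome audit. The decision rule uses only the neighborhood, yet approval rates still differ sharply between two groups because group membership and neighborhood are correlated in the toy data. The data, the group labels, and the function names are invented for this illustration, and a rate comparison like this is a starting point for investigation, not a legal test.

```python
from collections import defaultdict


def rate_by_group(rows, group_key, outcome_key):
    """Share of positive outcomes within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += 1 if row[outcome_key] else 0
    return {g: positives[g] / totals[g] for g in totals}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return 1.0 if highest == 0 else min(rates.values()) / highest


# Toy audit data. The group label is held back for testing only and is
# never passed to the decision rule.
people = [
    ("north", "x"), ("north", "x"), ("north", "x"), ("north", "y"),
    ("south", "y"), ("south", "y"), ("south", "y"), ("south", "x"),
]

# A "neutral" rule that looks only at the neighborhood.
decisions = [
    {"neighborhood": n, "group": g, "approved": n == "north"}
    for n, g in people
]

rates = rate_by_group(decisions, "group", "approved")
ratio = rate_ratio(rates)  # well below 1.0 here, despite the rule never seeing "group"
```

Running the same audit with "neighborhood" as the group key shows where the disparity comes from, which helps decide whether the variable is acting as a proxy.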

Data scientists must also take affirmative steps to confirm that the data they use when they develop predictive models is accurate; “garbage in/garbage out,” or GIGO, is not a defense. They must also consider whether biased training data on past outcomes can bias models. As a result, data scientists will need to concern themselves with data lineage, to trace the flow of data through all processing steps from source to target. GDPR will also drive greater concern for reproducibility, or the ability to accurately replicate a predictive modeling project.
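The idea behind lineage and reproducibility can be sketched in a few lines: fingerprint the data before and after every processing step, and keep an append-only log of those fingerprints. The class and function names below are hypothetical, and a production lineage tool records far more (users, timestamps, code versions). This sketch only shows the core bookkeeping.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable SHA-256 digest of a list of JSON-serializable records."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class LineageLog:
    """Append-only record of each processing step and the data it consumed and produced."""

    def __init__(self):
        self.steps = []

    def run(self, name, func, rows):
        result = func(rows)
        self.steps.append(
            {"step": name, "input": fingerprint(rows), "output": fingerprint(result)}
        )
        return result


def drop_incomplete(rows):
    """Remove records with no income value."""
    return [r for r in rows if r["income"] is not None]


def add_income_band(rows):
    """Derive a coarse income band feature."""
    return [dict(r, band="high" if r["income"] >= 50000 else "standard") for r in rows]


# Made-up source data for the example.
source = [
    {"id": 1, "income": 62000},
    {"id": 2, "income": None},
    {"id": 3, "income": 31000},
]

log = LineageLog()
clean = log.run("drop_incomplete", drop_incomplete, source)
features = log.run("add_income_band", add_income_band, clean)
```

Each step’s output fingerprint matches the next step’s input fingerprint, so the chain from source to model input can be checked end to end. Rerunning the same steps on the same source should reproduce the same fingerprints; if it does not, something in the process is not reproducible.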

You can learn more about data lineage with Cloudera Navigator here.

Your Next Steps

If you do business in the European Union, now is the time to start planning for GDPR. There is much to be done: evaluating the data you collect, implementing compliance procedures, assessing your processing operations and so forth. If you are currently using machine learning for profiling and automated decisions, there are five things you need to do now.

  • Limit access to personally identifiable information (PII) about consumers. Implement robust anonymization, so that by default analytic users cannot access PII. Define an exception process that permits access to PII in exceptional cases under proper security.  
  • For predictive models that currently use PII, ask:
    • Is this data analytically necessary? Does it deliver unique information value to a predictive model?
    • Does the predictive model support a permitted use case, such as anti-money laundering?
  • Implement a process to handle consumer questions and concerns about automated decisions.
  • Establish a data science process that minimizes the risk of errors and bias.
    • Train data scientists on methods and procedures that ensure proper model development, testing, and validation.
    • Consider whether the training data itself has a “built-in” bias.
    • Rigorously test predictive models.
    • Implement data lineage for all data used in the process.
    • Ensure full reproducibility for every project.
  • Define a review and acceptance process for customer-facing predictive models that is independent of the model developers.

Even if your organization is not subject to GDPR, consider implementing these practices anyway. It’s the right way to do business.

Learn more about Cloudera Data Science here.

Simplify your response to GDPR. Learn more here.


Every organization should determine its own needs with regard to GDPR and then evaluate solutions for suitability to those needs. The information contained in this document is not intended to be and should not be construed to be legal advice.  Organizations subject to GDPR must not rely on the information herein and they should obtain legal advice from their own legal counsel or other professional legal services provider.

