The GDPR, or General Data Protection Regulation, is an EU regulation that became enforceable on May 25th, 2018. Since it was adopted in 2016, companies that do business in Europe have invested heavily to achieve compliance. Your company, like many others, probably hired a Data Protection Officer and ran training sessions so that staff understood the new rules; you put in place new processes to document and classify the data you hold, established consent procedures, conducted information audits, and reviewed your data governance process. According to the Financial Times, companies globally spent billions of dollars preparing for the enforcement of the regulation. And according to a PwC report, 88% of companies spend more than $1 million, and 40% spend more than $10 million, on maintaining GDPR compliance.
And yet.
Google, H&M, TIM, British Airways, and Marriott, to name a few, have been hit with hefty fines for failing to comply with the regulation. You’ve probably also heard of the record Amazon GDPR fine. On July 30th, 2021, news broke that Luxembourg’s National Commission for Data Protection (CNPD) had hit Amazon with a record-breaking €746 million ($887 million) GDPR fine over its use of customer data for targeted advertising. The fine was unprecedented: at the time it was the largest GDPR fine ever issued, and more than double the amount of all other GDPR fines combined.
So what is GDPR? What engineering challenges does it bring? And how do you build the proper data infrastructure to ensure GDPR compliance?
I’ll keep it short. GDPR is a regulation that requires businesses to protect the personal data and privacy of individuals in the EU for transactions that occur within EU member states. It became enforceable on May 25th, 2018, and aims to give every one of them the right to know and decide how their data is used, stored, protected, transferred, and deleted.
Here is an excellent summary of what GDPR means and the broader scope of what it covers.
To understand the challenges GDPR brings to your data infrastructure, let’s go over some of the framework's pillars. Under GDPR, EU citizens are given a particular set of rights:
The right to be forgotten. Users’ data must be completely erased upon their request. In the era of auto-scheduled backups, non-volatile storage systems, and all-pervasive caching, this represents a real engineering challenge.
The right to access, or the right to request. Users have the right to retrieve all the information a company has collected about them in an exportable, universally readable format, which is another technical challenge to overcome.
The right to restrict processing. While keeping the data (upon user consent and for “necessary” business operations) is allowed, additional explicit user consent is still required to process it. Think about how you are going to exclude certain records when writing SQL code…
The right to rectification. Users have the right to change their PII in your system as they see fit. From an engineering perspective, this means your company needs a way of tagging and enumerating PII-related data.
The right to be informed. Users need to be informed when their data is being collected and for what purposes. In case of a data breach, users need to be notified as soon as possible, and data protection authorities must be informed within 72 hours.
Taking all this into account and mindful of other broader engineering challenges GDPR might bring, let’s dive into the obstacles Data Engineers face and how to resolve them.
Let’s start from here. An essential requirement for compliance with GDPR is locating, enumerating, and accessing all user data classified as PII. You also need to think about PII security measures (encryption and fine-grained security access for PII), but that’s another complex topic.
Gone are the days when you could hoard all sorts of data and hope for the best. A clear scope for data collection, storage, and flow through all the processes within your organization is paramount to building a technical foundation for GDPR compliance.
The idea here is to be able to remove ALL records related to a user, even when they are distributed across several databases, tables, and systems, whenever that user makes the request. Pseudonymization of data (GDPR articles 6, 25, 32, 89) can be part of the solution, but we’ll dive into that later.
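To make the problem concrete, here is a minimal sketch of a registry-driven erasure using Python’s built-in `sqlite3`. The table names (`users`, `orders`, `support_tickets`) and the `PII_REGISTRY` mapping are hypothetical, and everything lives in one in-memory database; a real setup would also have to reach other databases, caches, and backups, which this does not attempt.

```python
import sqlite3

# Hypothetical registry: table name -> column that identifies the user.
PII_REGISTRY = {
    "users": "id",
    "orders": "user_id",
    "support_tickets": "user_id",
}

def erase_user(conn: sqlite3.Connection, user_id: int) -> dict:
    """Delete every row tied to user_id in all registered tables, atomically."""
    deleted = {}
    with conn:  # commits on success, rolls back if any statement fails
        for table, column in PII_REGISTRY.items():
            # Table and column names come from the trusted registry above,
            # never from user input; the user id is bound as a parameter.
            cur = conn.execute(
                f"DELETE FROM {table} WHERE {column} = ?", (user_id,)
            )
            deleted[table] = cur.rowcount
    return deleted

def demo() -> dict:
    """Build a tiny example database and erase user 1 from it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
        CREATE TABLE support_tickets (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
        INSERT INTO users VALUES (1, 'ana@example.com'), (2, 'bo@example.com');
        INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'lamp'), (12, 2, 'pen');
        INSERT INTO support_tickets VALUES (100, 1, 'Where is my order?');
    """)
    return erase_user(conn, 1)
```

The per-table row counts returned by `erase_user` double as a simple audit trail of what was removed. The weak point of this design is the registry itself: a table that is never registered is never cleaned.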
Let’s take a look at some technical challenges that present themselves here:
This is arguably the least challenging requirement, although preventing an automated system from processing the data it stores may seem like an arduous and almost illogical task. Here are a few ideas:
Under GDPR, when collecting user data for a particular business use case, users will frequently have to understand and agree to whatever your organization wants to use their data for. This can be an absolute nightmare if your organization, like many, doesn’t fully understand the data it has, where it is being stored, or its specific business use case once it’s been stored. Often organizations “hoard” data hoping that the use cases will define themselves later. To ensure GDPR compliance, companies will need to understand precisely why they are collecting data and get into the habit of tagging that data at the time of collection.
Another difficulty here is defining the scope of use. If, for example, a user only consents to “marketing purposes,” you need to track and enforce that restriction from collection to use (cf. Data Lineage).
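As a rough illustration of tagging at collection time, the sketch below attaches the consented purposes to each record and filters on them before use. `TaggedRecord`, `collect`, and `records_for` are hypothetical names, and the purposes are plain strings; this is one possible shape for the idea, not a prescribed design.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TaggedRecord:
    """A collected record that carries the purposes its owner agreed to."""
    user_id: int
    payload: dict
    allowed_purposes: frozenset = field(default_factory=frozenset)

def collect(user_id, payload, consented_purposes):
    """Attach the consented purposes to the data at collection time."""
    return TaggedRecord(user_id, dict(payload), frozenset(consented_purposes))

def records_for(purpose, records):
    """Return only the records whose owner consented to this purpose."""
    return [r for r in records if purpose in r.allowed_purposes]

def demo_records():
    """Two example users with different consent scopes."""
    return [
        collect(1, {"email": "ana@example.com"}, {"marketing"}),
        collect(2, {"email": "bo@example.com"}, {"marketing", "analytics"}),
    ]
```

With `demo_records()`, a job that calls `records_for("analytics", ...)` only ever sees user 2, because user 1 consented to marketing alone.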
This might seem like an obvious rule, but it isn’t always diligently followed. A user should be able to access and edit all the personal data you’ve collected about them, including PII you have fetched from third parties (Salesforce, Apple login, Facebook ID, etc.). As a general rule, all personal data should be editable through the UI.
Let me start by saying that when it comes to building a data infrastructure that complies perfectly with GDPR, there is no magic recipe or one-size-fits-all approach. However, some best practices and tools can improve overall governance.
In no particular order, the tools below can be used in conjunction with each other, or to complement an existing infrastructure, to achieve data protection and GDPR compliance. Perhaps the key thing to keep in mind when building your data infrastructure is that you need to be able to quickly access ALL the PII within your organization, tag it, and segregate it into a separate table, making it easy to scope and delete if necessary.
Data Lineage allows organizations to trace the movement of data from its source to its point of use, providing visibility into how it has changed along the way. It is already widely used in heavily regulated industries such as banking, insurance, and healthcare to maintain data-related regulatory compliance. It is particularly relevant to the “right to be forgotten” and “right to be informed” pillars: by visualizing and mapping metadata, Data Lineage shows which tables PII resides in and presents the entire lineage and interdependencies from a specific BI report back through the tables and ETL processes. This lets you (1) keep a record of how the business uses the data and (2) assess the impact of a row deletion and ensure it is orchestrated in a GDPR-compliant manner.
Data Lineage also eases some of the challenges related to “the right to access”.
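To show what assessing the impact of a deletion can look like, here is a small sketch that walks a lineage graph breadth-first to list every asset derived from a source table. The `LINEAGE` dictionary and its asset names are made up for the example; an actual lineage tool would build this graph from query logs or ETL metadata.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets derived from it.
LINEAGE = {
    "raw.users": ["staging.users_clean"],
    "staging.users_clean": ["warehouse.dim_customer", "warehouse.fct_orders"],
    "warehouse.dim_customer": ["bi.churn_report"],
    "warehouse.fct_orders": ["bi.revenue_dashboard", "bi.churn_report"],
}

def downstream_assets(source, lineage):
    """Breadth-first walk returning every asset derived from source, in visit order."""
    visited = {source}
    ordered = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in visited:  # an asset reachable twice is listed once
                visited.add(child)
                ordered.append(child)
                queue.append(child)
    return ordered
```

Calling `downstream_assets("raw.users", LINEAGE)` lists the staging table, both warehouse tables, and the two BI assets, which is the set of places a deletion in `raw.users` would need to propagate to.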
A Data Catalog is a detailed inventory of all data assets in an organization and their metadata. It is designed to help data professionals quickly find the most appropriate data for any analytical business purpose by facilitating the access and classification of data at scale.
For GDPR purposes, the catalog solution must be able to automate data profiling and data tagging, so that the organization can feed those tags into a metadata processing engine and ultimately anonymize the data all the way from the raw files to the initial data load.
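Automated profiling can be pictured as sampling each column and tagging it when most values look like a known kind of PII. The sketch below does this with two deliberately naive regular expressions; `PII_PATTERNS`, `tag_columns`, and the 0.8 threshold are assumptions made for the example, and commercial catalogs rely on far richer detection.

```python
import re

# Hypothetical, deliberately simple detectors; real profilers use richer rules.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}

def tag_columns(rows, threshold=0.8):
    """Tag a column when at least `threshold` of its non-empty values match a pattern."""
    tags = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [str(r[col]) for r in rows if r.get(col) not in (None, "")]
        if not values:
            continue
        for tag, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                tags[col] = tag
                break  # first matching tag wins
    return tags

SAMPLE_ROWS = [
    {"id": 1, "contact": "ana@example.com", "mobile": "+33 6 12 34 56 78", "plan": "pro"},
    {"id": 2, "contact": "bo@example.org", "mobile": "+44 20 7946 0958", "plan": "free"},
]
```

On `SAMPLE_ROWS`, only `contact` and `mobile` are tagged. Note that the phone pattern would also match long numeric identifiers, which is why tags produced this way are best treated as suggestions for a human to confirm.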
Using a combination of Lineage and Catalog should allow you to quickly find, access, and tag the PII within your storage systems, and to orchestrate deletions effectively by visualizing and mapping metadata and interdependencies between data assets.
When it comes to consent, GDPR requires explicit permission from the user to process their data. Practically speaking, there should be an explicit “check box” in the UI for each processing activity. From a data storage perspective, you should keep each consent flag in its own column in the database. If the user unchecks a box in their profile to withdraw consent, a lookup mechanism should let you identify the processes that rely on that user’s PII and exclude their data from them.
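Here is a minimal sketch of that idea with Python’s built-in `sqlite3`: one consent column per processing activity, and queries that filter on it. The `users` schema, the `consent_marketing` and `consent_analytics` columns, and the helper names are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from processing activity to its consent column.
CONSENT_COLUMNS = {
    "marketing": "consent_marketing",
    "analytics": "consent_analytics",
}

def build_demo_db():
    """Create an in-memory users table with one consent flag per activity."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT NOT NULL,
            consent_marketing INTEGER NOT NULL DEFAULT 0,
            consent_analytics INTEGER NOT NULL DEFAULT 0
        );
        INSERT INTO users VALUES
            (1, 'ana@example.com', 1, 1),
            (2, 'bo@example.com', 0, 1),
            (3, 'cy@example.com', 1, 0);
    """)
    return conn

def emails_for(conn, activity):
    """Return emails of users who currently consent to the given activity."""
    column = CONSENT_COLUMNS[activity]  # whitelist lookup; unknown activity raises KeyError
    rows = conn.execute(f"SELECT email FROM users WHERE {column} = 1 ORDER BY id")
    return [email for (email,) in rows]

def withdraw_consent(conn, user_id, activity):
    """Record that the user unchecked the box for this activity."""
    column = CONSENT_COLUMNS[activity]
    with conn:
        conn.execute(f"UPDATE users SET {column} = 0 WHERE id = ?", (user_id,))
```

Because every consumer goes through `emails_for`, a withdrawal takes effect on the next query without touching the downstream jobs.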
As defined in article 4 of GDPR, pseudonymization is a data management and de-identification process by which personally identifiable information (PII) can no longer be attributed to a specific data subject without the use of additional information. When appropriately implemented, pseudonymization ensures a certain level of protection during the processing of personal data. Although pseudonymized data is not exempt from data privacy requirements, since re-identification remains possible, it can prove advantageous for comprehensive data analysis, particularly in the data warehouse and analytics domain, provided it is done correctly and consistently across all data processes.
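One common way to implement this, shown here only as a sketch, is a keyed hash: the identifier is replaced by an HMAC-SHA256 digest, and the secret key plays the role of the “additional information” that must be stored separately. The `pseudonymize` function and its normalization step are assumptions made for the example, not something GDPR prescribes.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed pseudonym for value (HMAC-SHA256, hex-encoded)."""
    # Normalize so that the same identifier always maps to the same pseudonym.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

def pseudonymize_rows(rows, column, secret_key):
    """Return copies of rows with one column replaced by its pseudonym."""
    return [{**row, column: pseudonymize(row[column], secret_key)} for row in rows]
```

Because the mapping is deterministic for a given key, joins and counts across warehouse tables keep working on the pseudonym. Anyone holding the key can recompute the mapping, so the key needs to be stored and access-controlled apart from the data.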
When it comes to carrying out PII deletions, a good practice is to conduct them in batches: flag and date the PII to be erased as part of your data management process, then run batch deletions once a month so that each request is handled within the one-month (roughly 30-day) window set by GDPR.
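The flag-and-date approach can be sketched as follows. `plan_batch` is a hypothetical helper that takes the dates on which erasure requests were flagged and returns what to delete in the current run, plus any requests that have already slipped past the deadline. Treating the window as exactly 30 days is a simplifying assumption for the example.

```python
from datetime import date, timedelta

# Assumption for this sketch: treat the response window as 30 days.
RESPONSE_WINDOW = timedelta(days=30)

def plan_batch(pending, batch_date):
    """Split pending erasure requests into those to delete now and those overdue.

    `pending` maps user_id -> date on which the erasure request was flagged.
    """
    to_delete = sorted(
        uid for uid, flagged in pending.items() if flagged <= batch_date
    )
    overdue = sorted(
        uid for uid, flagged in pending.items()
        if flagged + RESPONSE_WINDOW < batch_date
    )
    return to_delete, overdue

PENDING = {
    7: date(2024, 1, 3),
    8: date(2024, 1, 28),
    9: date(2023, 12, 1),   # flagged two months before the batch
    10: date(2024, 2, 5),   # flagged after the batch date
}
```

Running `plan_batch(PENDING, date(2024, 2, 1))` schedules users 7, 8, and 9 for deletion and reports user 9 as overdue. A non-empty overdue list is a sign that the batch cadence is too slow for the deadline.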
By leveraging your metadata management and lineage engines, you should be able to automate the identification and localization of personal data within your data warehouse and storage systems and query the metadata table to generate reports for individuals who make the request.
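As an illustration, the sketch below keeps a small metadata table listing where PII lives and uses it to assemble a JSON report for one user. It uses Python’s built-in `sqlite3` and `json`; the `pii_catalog` table, its columns, and the sample schema are hypothetical.

```python
import json
import sqlite3

def build_demo_db():
    """Create sample tables plus a metadata table describing where PII lives."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup_ip TEXT);
        CREATE TABLE shipping (id INTEGER PRIMARY KEY, user_id INTEGER, address TEXT);
        -- Hypothetical metadata table: where PII lives and how to find its owner.
        CREATE TABLE pii_catalog (table_name TEXT, user_column TEXT, pii_column TEXT);
        INSERT INTO users VALUES (1, 'ana@example.com', '203.0.113.7');
        INSERT INTO shipping VALUES
            (1, 1, '1 Rue de Rivoli, Paris'),
            (2, 1, '5 Main St, Lyon');
        INSERT INTO pii_catalog VALUES
            ('users', 'id', 'email'),
            ('users', 'id', 'signup_ip'),
            ('shipping', 'user_id', 'address');
    """)
    return conn

def export_user_data(conn, user_id):
    """Build a JSON report of every catalogued PII value held for one user."""
    report = {}
    catalog = conn.execute(
        "SELECT table_name, user_column, pii_column FROM pii_catalog ORDER BY rowid"
    ).fetchall()
    for table, user_col, pii_col in catalog:
        # Identifiers come from the internal catalog, not from the requester;
        # only the user id is bound as a query parameter.
        rows = conn.execute(
            f"SELECT {pii_col} FROM {table} WHERE {user_col} = ? ORDER BY rowid",
            (user_id,),
        ).fetchall()
        report[f"{table}.{pii_col}"] = [value for (value,) in rows]
    return json.dumps(report, indent=2, sort_keys=True)
```

JSON is used here because it is both exportable and widely readable. The report is only as complete as the catalog, so any PII column missing from `pii_catalog` is silently left out.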
It goes without saying: data is every modern company’s greatest asset, and therefore, regulating Data is indirectly shaping the economy and the way modern-day organizations conduct their businesses.
The introduction of the European Union’s General Data Protection Regulation (GDPR) in 2018 pioneered data protection regulation by introducing a new set of rules for global companies operating in the EU. The California Consumer Privacy Act (CCPA) followed on the 1st of January 2020, and similar legislation has also been passed in countries such as China, New Zealand, Canada, and South Africa.
Despite being EU legislation, the GDPR has far-reaching implications. Since it became enforceable on May 25th, 2018, companies like Amazon, Google, H&M, and British Airways, to name a few, have been hit with hefty fines for failing to comply.
Two main factors make it challenging to support GDPR in Big Data technologies: the immutable nature of storage and the infeasibility of partitioning datasets by a single user.
Although there is no one-size-fits-all approach, building technologies such as lineage and metadata management into your data infrastructure can prove effective in keeping track of the PII within your organization and easing GDPR compliance, while still letting you rely on data to drive your decision making.