Data is the lifeblood of businesses today. It’s used to make informed decisions on everything from what products to work on to where to allocate resources. However, if that data is inaccurate or low quality, it can lead to disastrous consequences for companies. In this blog post, we’ll explore how poor-quality data can impact businesses and some steps you can take to mitigate those risks.
Let’s start from the beginning.
Data-driven businesses are those that base their decisions on data and analytics. When you use data to guide your business decisions, you are relying on observable facts rather than gut feeling. Sounds easy, right? Far from it. EY recently conducted research on how companies use data across functional areas. The results showed that while 81% of organizations think data should be at the core of business decisions, they are still approaching it in a suboptimal way, thereby limiting the business value of the information they gather. In fact, the same report showed that only 31% of the organizations surveyed had significantly restructured their operations to accommodate the new needs that large amounts of data bring.
Adopting a data-driven approach can be beneficial for organizations in multiple ways. Some of the benefits of being a data-driven organization are:
Overall, evidence-based decisions can make companies more confident in their choices and lead them to create better products and services. So, if data has become one of the most powerful tools companies can use, what is stopping them from fully adopting a data-driven approach?
There are many challenges that companies face when trying to adopt a data-driven approach. They can depend on company size, data governance, culture, and data literacy among employees. Let’s go through these challenges in detail.
Let’s dive deeper into the issue of data quality.
The specific cost of data quality issues varies from organization to organization, but according to a 2021 Gartner report, bad data costs businesses around $12.9 million per year on average. In other words, every year companies waste valuable resources on data quality problems: money that could be invested in improving the business in other ways.
Data quality is a measure of how accurate and consistent your data is. Data quality issues can occur at any stage of the data pipeline, from ingestion to BI tools.
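To make this a little more concrete, here is a minimal sketch in Python (using pandas) of what a first pass at measuring quality could look like. The table and column names (order_id, amount, created_at) are hypothetical, and the checks only illustrate the kind of signals you can compute, not a complete framework.

```python
# A minimal sketch of basic data quality indicators, assuming a pandas DataFrame
# of orders with hypothetical columns "order_id", "amount", and "created_at"
# (where "created_at" is a timezone-aware datetime column).
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Return simple indicators of how clean, complete, and consistent the data is."""
    return {
        # Completeness: share of missing values per column
        "null_rate": df.isna().mean().to_dict(),
        # Uniqueness: duplicated primary keys usually point to an ingestion problem
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        # Validity: negative amounts are inconsistent with what the column should hold
        "negative_amounts": int((df["amount"] < 0).sum()),
        # Freshness: how stale the most recent record is, in hours
        "hours_since_last_record": (
            pd.Timestamp.now(tz="UTC") - df["created_at"].max()
        ).total_seconds() / 3600,
    }
```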
As data consumption increases within organizations, it becomes more and more crucial to be able to trust that the data is reliable. Unreliable data can quickly become detrimental to the business, leading to missed opportunities, financial losses, customer dissatisfaction, failures to achieve regulatory compliance, inaccurate decision-making, and more. There are many ways to assess and improve data quality, but it ultimately comes down to ensuring that your data is clean, complete, and consistent. There are a few metrics you can use to understand whether you can trust your data:
Failing to assess and improve data quality can have many negative effects on the business. Some examples are:
At Sifflet, we have come up with a concept called Data Entropy, which captures all the chaos and disorder that many data practitioners have to deal with, especially as data platforms grow more complex and business expectations around data and data infrastructure keep rising.
Entropy in data can manifest itself in different ways:
To sum up, Data Entropy is caused by a mix of technology and people/culture in organizations.
So, is data entropy inevitable in a data platform?
The short answer to this question is yes. In the past few years, the data ecosystem has undergone the revolution brought by the adoption of the modern data stack, which can be defined as a collection of technologies used to transform raw data into actionable business insights (e.g., data warehouse, ELT, transformation, BI, reverse ETL). The complexity of the modern data stack inevitably creates room for data entropy to increase.
The modern data stack gives data practitioners more flexibility to do more with data, but that flexibility comes at a price: you need to control, or at least try to reduce, the entropy surrounding your data workflows. There are two main ways to do that:
Plenty of data errors and issues can be prevented. But more important than the errors themselves is the stage at which you catch them, before they propagate downstream and cause negative outcomes for the business and the data team. How early you catch these anomalies largely determines the impact they have on the rest of the organization.
In an ideal world, everything would be preventable. In this ideal world, you would have a 360° view of all of your data assets and would know, at all times, who is using what data and who changed what. But the reality is that, even in the most modern organizations, data platforms have become too complex to obtain this kind of overview.
But what are the ways in which you can start preventing data issues?
The most basic way to go about this is to implement manual checks to get ahead of data incidents. You can start by implementing tests at the orchestration layer, checking ingestion patterns, looking at schemas, and so on. Obviously, the earlier the checks are implemented in the data lifecycle, the better, because catching the problem at the source avoids further propagation. Unfortunately, this is not always straightforward, and in most cases it is not enough. In an environment where data consumers want more and more control over their data assets, catching problems at the source is only half of the job: you still have to follow the whole workflow of the data and understand how an issue is likely to propagate downstream.
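As a rough illustration of such a check, the sketch below shows a plain Python validation step that an orchestrator (Airflow, Dagster, or similar) could run right after ingestion and before any downstream transformation. The expected schema and row-count threshold are assumptions made for the example.

```python
# Illustrative ingestion-time check, meant to run as an orchestration task before
# any downstream model is built. Schema and thresholds are hypothetical.
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "amount": "float64",
    "created_at": "datetime64[ns, UTC]",
}
MIN_EXPECTED_ROWS = 1_000  # rough lower bound for a typical daily load

def validate_ingested_table(df) -> None:
    """Fail the pipeline at the source instead of letting a bad load propagate downstream."""
    # Schema check: a column renamed or retyped upstream should stop the run
    actual_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual_schema != EXPECTED_SCHEMA:
        raise ValueError(f"Schema drift: expected {EXPECTED_SCHEMA}, got {actual_schema}")

    # Ingestion-pattern check: an unusually small load often means a broken source
    if len(df) < MIN_EXPECTED_ROWS:
        raise ValueError(f"Only {len(df)} rows ingested, expected at least {MIN_EXPECTED_ROWS}")
```

Wrapping a function like this in an orchestration task means a bad load fails loudly and early, rather than silently feeding the dashboards built on top of it.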
It’s important to start by saying that there is no single right approach to tackle this. Different practices work for different organizations. For instance, at GoCardless, the data team implemented the concept of Data Contracts — which is an example of data quality checks implemented early on in the data lifecycle.
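To give a feel for the idea (this is not GoCardless's actual implementation), a data contract can be sketched as a schema the producing team publishes and enforces before data ever leaves its service. The event name and fields below are made up for illustration.

```python
# An illustrative data contract: the producing team declares the fields and types it
# commits to, and every record is validated before being emitted to consumers.
PAYMENT_CREATED_CONTRACT = {
    "event": "payment_created",  # hypothetical event name
    "fields": {
        "payment_id": str,
        "amount_cents": int,
        "currency": str,
        "created_at": str,       # ISO 8601 timestamp, kept as a string in this sketch
    },
}

def enforce_contract(record: dict, contract: dict) -> dict:
    """Reject records that break the contract instead of letting them reach consumers."""
    fields = contract["fields"]
    missing = fields.keys() - record.keys()
    if missing:
        raise ValueError(f"Contract violation: missing fields {sorted(missing)}")
    for name, expected_type in fields.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"Contract violation: {name} should be {expected_type.__name__}")
    return record
```

Catching a violation here, at the producer boundary, is exactly the kind of early intervention that keeps issues from propagating into the warehouse and everything downstream of it.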
There are also other examples in which companies adopt a fully decentralized approach, implementing concepts like the Data Mesh. And here, ensuring data quality becomes the responsibility of the data consumer.
As previously mentioned, there is no right or wrong approach. The best practice to adopt depends on the organization's resources, how the team is set up, and the ratio of data engineers to data consumers. One thing to keep in mind, though: the best data quality programs are the ones adopted by every data practitioner, from data engineers to data producers.
In the current economic environment, where companies are downsizing in one way or another while facing a lot of crucial decisions, it becomes essential to stop making decisions that are not backed by reliable data.
On top of that, as expectations around data, data platforms, and data teams increase, companies show less and less tolerance for data incidents. Data entropy, the uncertainty and disorder within the data team, therefore needs to be reduced to unlock the full potential of what data can help the business achieve, while fostering and nurturing a data-driven culture.
These challenges can be very overwhelming for businesses. However, there are some actions every enterprise can take to successfully start embedding data in every business decision:
Although data entropy and data quality issues are inevitable in modern data ecosystems, there are ways to ensure that poor-quality data does not impact your business. Starting with a data governance framework, setting appropriate goals, and investing in data literacy are necessary steps for any organization that wants to become fully data-driven. Ultimately, the key to success lies in making sure that good data quality practices are adopted and followed by all the members of the team, from data engineers to decision-makers.