Over the past decade, data-driven decision-making has proven instrumental to the growth of modern organizations. As a result, modern data environments are constantly evolving and becoming more and more complex. This surge in the complexity of the Modern Data Stack adds unpredictability and risk of failure to data pipelines. The idea of “garbage in, garbage out” still applies to today’s data environments, but because data issues can occur at any stage of the pipeline, from ingestion through storage and processing to BI tools, the concept alone is no longer sufficient to deal with data quality issues efficiently. Issues anywhere along that path can make the process of turning data into actionable insights slow and expensive, pushing organizations to compromise on their data. Compromising on data quality and data governance, however, is dangerous for organizations for many reasons.
In general, more data means you have a better shot at understanding and differentiating your business. But more data also means more risk of data failures and, hence, decreased data quality, worse business decision-making, and, ultimately, an erosion of trust in data. Fortunately, in the past few years, a new approach to dealing with data quality, trust, and reliability has emerged: data observability.
The aim of this blog is to introduce the concept of Data Observability and explain why organizations need to integrate a Full Data Stack Observability tool into their data stack. We’ll do so by answering a few key questions.
Originally borrowed from control theory, the term “observability” was adopted by software engineering before spreading to the data world. The first software observability tools rose from the foundation laid by cloud services like AWS. Software observability tools like Datadog and New Relic gave engineering teams the ability to collect metrics across their systems, providing a complete picture of those systems’ health. The premise behind software observability is straightforward: as the cloud makes it possible to host more and more infrastructure components, such as databases, servers, and API endpoints, it becomes crucial to carefully monitor this complex infrastructure and know when something goes wrong. Today, software engineers cannot imagine not having a centralized view across their systems. Software observability tools have radically transformed the world of software, putting the importance of visibility across all systems front and center.
The data space is currently going through the same revolution. As mentioned in the introduction, the set of tools data teams use keeps growing and becoming more complex, creating more opportunities for data to break and making it harder to gain visibility into the different parts of the data pipeline. These problems have become prevalent in organizations, and data users are dissatisfied with data quality.
In short, the emerging category of data observability aims to solve one core problem: monitoring data so that teams can quickly locate where an issue happened and get the context to resolve it promptly. Data Observability can therefore be defined as the ability of organizations to gain actionable insights into the health status of their data.
Although Data Observability originated in Software Observability, there are some significant differences to keep in mind. Software Observability is built on three pillars: Metrics, Traces, and Logs.
These pillars, however, don’t quite match the critical aspects that constitute the Data Engineering workflow. A new framework needs to be introduced to capture all the complexities of Data and Data Infrastructure; we suggest:
Data engineers often use tests to detect and prevent potential data quality issues. This approach worked fine until companies started consuming so much data that testing was no longer enough. Testing has become inefficient because data quality issues are harder to detect and predict. Even with hundreds of tests covering the predictable data issues, teams cannot cover the virtually infinite ways data can break throughout the pipeline, nor do they get the context to understand data issues and learn from them, leaving them in a constant state of firefighting. Observability, on the other hand, is scalable, delivers end-to-end coverage, and provides the context needed (thanks to lineage) to get ahead of data catastrophes and become proactive about data quality.
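To make the contrast concrete, here is a minimal sketch of the kind of hand-written quality test a data team might maintain; the table and column names (orders, order_id, amount) are hypothetical, and each rule only catches a failure mode someone already anticipated.

```python
import pandas as pd

def run_quality_tests(orders: pd.DataFrame) -> list[str]:
    """Hand-written checks for known, predictable failure modes."""
    failures = []
    if orders["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")
    if orders["amount"].isna().mean() > 0.01:
        failures.append("more than 1% of amount values are null")
    if (orders["amount"] < 0).any():
        failures.append("amount contains negative values")
    return failures

# Each rule covers one anticipated issue; anything unanticipated, such as a
# silent schema change upstream or a stale load, slips straight through.
sample = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, -5.0]})
print(run_quality_tests(sample))
```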
Data observability and data quality monitoring are often used interchangeably, but they are two different things. Or rather, one enables the other: Data Observability enables data quality monitoring. Data quality monitoring alerts users when data assets or datasets don’t match pre-established metrics or parameters, and this process suffers from the same limitations as testing. Although data quality monitoring gives you some visibility into the quality status of your data assets and attributes, it gives you no way to troubleshoot potential issues quickly. So neither testing nor traditional data quality monitoring can deal with the challenges of the Modern Data Stack on their own. This is where data observability comes in.
Data observability constantly collects signals across the entire data stack (logs, jobs, datasets, pipelines, BI dashboards, data science models, and so on), enabling monitoring and anomaly detection at scale. In other words, Data Observability acts as an overseeing layer for the data stack, ensuring that the data is reliable and traceable at every stage of the pipeline, regardless of which processing point it resides in.
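As an illustration of what collecting signals across the stack can look like, here is a minimal sketch that normalizes observations from different components into one common shape; the sources, asset names, and metric names are assumptions made up for the example, not Sifflet’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Signal:
    source: str      # e.g. "warehouse", "orchestrator", "bi_tool"
    asset: str       # table, job, or dashboard identifier
    metric: str      # e.g. "row_count", "last_refresh_age_s", "job_duration_s"
    value: float
    observed_at: datetime

def collect_warehouse_signals() -> list[Signal]:
    # A real collector would query the warehouse's information schema or
    # query logs; hard-coded values stand in for that here.
    return [Signal("warehouse", "analytics.orders", "row_count", 1_204_332,
                   datetime.now(timezone.utc))]

def collect_orchestrator_signals() -> list[Signal]:
    # Likewise, this would normally come from the scheduler's run metadata.
    return [Signal("orchestrator", "load_orders_daily", "job_duration_s", 342.0,
                   datetime.now(timezone.utc))]

# A single, normalized stream of signals is what makes monitoring and
# anomaly detection possible across otherwise siloed tools.
all_signals = collect_warehouse_signals() + collect_orchestrator_signals()
print(len(all_signals), "signals collected")
```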
Data Observability should be perceived as an overseeing layer that makes your Modern Data Stack more effective and ensures that data is reliable regardless of where it sits. We’ve created Sifflet, the first Full Data Stack Observability platform, to give organizations automated data reliability at every step of the pipeline.
In the Full Data Stack Observability approach, each component of the Modern Data Stack is perceived as a compartment that serves a purpose in the data journey. Each compartment has its own operating logic and releases information that can be leveraged to understand and observe the metadata, the data itself, the lineage, and the resulting data objects (metrics, charts, dashboards, etc.). To this end, the extensive lineage between data assets and objects across the data stack is the backbone of the Full Data Stack Observability framework.
To add some context to this definition, let’s look at some of the most critical use cases for Full Data Stack Observability.
Anomaly detection: this happens at both the metadata and the data level. The idea is to introduce a set of metrics that can help define the health state of a data platform. Some standard, business-agnostic metrics include data freshness, volume (row counts), and schema changes.
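For illustration, here is a minimal sketch of how such a metric can be turned into an anomaly signal, assuming a simple baseline built from recent history; the threshold and the row-count figures are made up for the example.

```python
import statistics

def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest observation if it deviates strongly from recent history."""
    if len(history) < 7:
        return False  # not enough history to establish a baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Daily row counts for a table: a sudden drop in volume gets flagged
# without anyone having written an explicit test for that scenario.
row_counts = [10_120, 10_340, 9_980, 10_405, 10_220, 10_510, 10_300]
print(is_anomalous(row_counts, 4_200))   # True: volume anomaly
print(is_anomalous(row_counts, 10_280))  # False: within the normal range
```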
Lineage: lineage represents the dependencies between the data assets within an organization. As data volumes grow and data platforms become more complex, keeping track of how one asset relates to another becomes impossible to do manually. But why is keeping track of these dependencies even relevant? Let’s look at a couple of scenarios that a data practitioner at your average data-mature organization deals with daily:
The following use cases are closely tied to the two above; in fact, they result from combining them. In other words, say you have implemented an anomaly detection model that can also consume and produce lineage information. You receive an alert notifying you that something broke; what do you do?
Incident Management: the first thing you want to do is assess the impact of the anomaly. What does it mean for data consumers? What are its potential implications? What dashboards, charts, or ML models is it feeding into? Who else should be alerted? Extensive lineage with column-level detail helps answer these questions. Think of what Brian could’ve done to avoid the crisis.
Root Cause Analysis: now that the relevant stakeholders know how their workflows are impacted, data engineers need to get to the bottom of the issue. A decent lineage model will show you the upstream dependencies (to the left of the warehouse), so you can better understand where the problem stems from. A great lineage model will also link to the applications, jobs, and orchestrators that produced the anomalous data asset, as sketched below. Jacob and his team would’ve been able to quickly understand the root cause and rectify the numbers before the CEO’s press conference.
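Both of these use cases boil down to walking the lineage graph in one direction or the other: downstream for impact assessment, upstream for root cause analysis. Here is a minimal sketch under the assumption that lineage is stored as a simple directed graph; the asset names are hypothetical, and a real lineage model would also carry column-level detail.

```python
from collections import defaultdict

# Toy lineage: edges point from an upstream asset to its downstream consumers.
edges = {
    "raw.payments":      ["staging.payments"],
    "staging.payments":  ["analytics.revenue"],
    "analytics.revenue": ["dashboard.exec_kpis", "ml.churn_model"],
}

def downstream(asset: str, graph: dict) -> set:
    """Everything impacted by an incident on `asset` (incident management)."""
    impacted, stack = set(), [asset]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

def upstream(asset: str, graph: dict) -> set:
    """Everything `asset` depends on (root cause analysis)."""
    reversed_graph = defaultdict(list)
    for parent, children in graph.items():
        for child in children:
            reversed_graph[child].append(parent)
    return downstream(asset, reversed_graph)

print(downstream("staging.payments", edges))  # which dashboards and models to alert
print(upstream("analytics.revenue", edges))   # where to look for the root cause
```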
Post-Mortem: what happened, and how do we learn from it? Think of a data incident as a fire that can quickly spread and have both material and non-material repercussions. With testing and basic data quality monitoring, you are in firefighting mode; with a Full Data Stack Observability approach, you learn to identify fire hazards, put out the occasional fire, reduce its potential impact, and move from firefighting to fire prevention. Conducting a purposeful post-mortem analysis is key to achieving sustainable health of your data assets.
The complexity of a data platform is no longer an excuse for poor data quality. Modern leaders should embrace this increasing complexity because it means the business is growing and there is more data to leverage. However, without the proper tooling and processes in place, whatever value can be extracted from data is quickly offset.
Full Data Stack Observability is an overseeing layer of the data stack that ensures data is reliable at every step of the enterprise data pipeline. Although Data Observability frameworks draw a lot of inspiration from Software Observability and APM, some fundamental differences have called for industry-defining frameworks and practices of their own over the past couple of years. A Full Data Stack Observability approach combines metrics, ingestion-to-BI lineage, and metadata to provide data engineers and data consumers with actionable insights, so they can monitor and reduce the impact of data incidents and actively increase the reliability of their data assets.
We believe that adopting Full Data Stack Observability is the best way to unlock the unlimited possibilities of data-driven decision-making without constantly worrying about the reliability of the data. Do you want to know more about Sifflet’s Full Data Stack Observability approach? Would you like to see it applied to your specific use case? Book a demo or get in touch for a two-week free trial!