dbt at Enterprise Scale: A Sifflet-Powered Approach
dbt has revolutionized the Modern Data Stack by introducing software engineering best practices to data engineering. Its adoption has significantly boosted productivity and streamlined workflows for data teams. However, scaling dbt to manage thousands of models brings unique challenges, such as ownership ambiguity, rising costs, technical debt, and declining data quality. This article explores how Sifflet, with its robust observability capabilities and a well-architected approach, serves as the perfect companion to scale dbt environments gracefully and efficiently.
dbt, a cornerstone of the Modern Data Stack, brought much-needed standardization and software engineering best practices to the world of data engineering. As data teams rapidly adopted it, they saw significant productivity boosts and big improvements to their development workflow. However, scaling dbt proved to be a different challenge altogether. Beyond a certain point, most teams found that managing thousands of models led to ownership ambiguity, spiraling costs, accumulating technical debt, and a decline in data quality.
At Sifflet, we believe that a well-architected, dbt-centric data platform, combined with a robust observability component, can scale gracefully without these pitfalls and maintain efficiency even at thousands of models. We built Sifflet to play this very role within your data stack, and after the overhaul of our dbt integration last year, detailed in a previous article, Sifflet is now an ideal companion to your dbt workflows.
In this article, we’ll explore four key pillars, powered by Sifflet, that enable you to successfully scale your dbt environment to thousands of models and dozens of use cases.
Laying the Foundation: Standards, Standards, Standards
Defining and enforcing standards is a critical, yet often overlooked, step in successfully scaling your dbt environment (and your entire data platform). When starting out, it's tempting to prioritize speed of delivery and the immediate benefits of dbt, so you may, for example, never define what should be present in your project's YAML files. However, as the number of models (and their complexity) grows, a lack of standards creates a chaotic environment that becomes increasingly difficult to maintain and manage.
These standards can range from the types of transformations performed at each layer of your platform to mandatory descriptions, tags, criticality tiers, clear ownership assignments, and required monitors for key assets. The specific standards and the strictness with which you enforce them will depend on your particular use case. But without clearly defined ownership, asset criticality, and operational status, you're building on shaky ground.
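To make this concrete, here's what such a standard might look like in a dbt properties file. The model name, tags, and meta keys below are illustrative conventions you would define for your own project; dbt simply stores and exposes them as metadata.

```yaml
# models/marts/finance/_finance__models.yml
# Example convention (model and team names are illustrative): every model
# must declare a description, domain and criticality tags, an owner,
# and a criticality tier. The `meta` keys are team-defined, not built-in dbt fields.
version: 2

models:
  - name: fct_payments
    description: "One row per settled payment, enriched with fee breakdown."
    config:
      tags: ["finance", "critical"]
    meta:
      owner: "finance-data-team"
      criticality: "tier-1"
```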
Sifflet is designed to support this standards-driven approach. First, it provides centralized access to all your metadata (including custom dbt metadata), creating a single source of truth to document and catalog your assets with all relevant context. Second, our data sharing feature enables you to monitor compliance with your defined standards. For example, you can track the number of monitors on critical assets (identified via tags) and configure a Sifflet monitor to notify you if a team allows more than 20% of their critical assets to go unmonitored. These monitors help ensure everyone adheres to the established rules and prevent undetected deviations from your defined standards.
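As a rough illustration of the kind of rule described above, a coverage check could look something like the sketch below. The field names are hypothetical and don't reflect Sifflet's actual monitor configuration; they simply capture the intent: flag any team that lets more than 20% of its critical assets go unmonitored.

```yaml
# Hypothetical sketch only: field names are illustrative, not Sifflet's actual schema.
# Intent: alert when more than 20% of a team's critical assets (identified via tags)
# have no monitor attached.
monitor:
  name: critical_asset_monitoring_coverage
  scope:
    tags: ["critical"]        # critical assets are identified via tags
    group_by: owner_team      # evaluate the rule per owning team
  metric: unmonitored_asset_ratio
  threshold:
    max: 0.20                 # fail if more than 20% of critical assets are unmonitored
  notify:
    - slack: "#data-platform-alerts"
```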
Making Sense of the Lineage Maze
While dbt docs provide a good starting point for understanding data flow and dependencies, their value diminishes as your data platform grows more complex, spanning multiple tools and systems. Once you inevitably reach that point, a dbt-only lineage becomes insufficient.
To effectively identify the root cause of an incident or pinpoint a pipeline bottleneck, you need a comprehensive lineage graph. This graph should integrate actionable metadata from all components of your stack into a single, streamlined view, allowing you to easily trace the path your data takes as it moves across systems and logical layers.
Sifflet automatically builds this unified lineage by connecting your dbt models to their corresponding datasets in your data platform, pulling in their execution status, and incorporating metadata from all other integrated components (from data creation to consumption). This consolidated view simplifies troubleshooting, impact analysis, and data discovery.
From dbt Tests to Bulletproof Observability
While dbt tests serve as an ideal initial line of defense, ensuring basic data quality checks from day one, scaling requires a more flexible and robust observability strategy that aligns with your platform standards (enforcing standards is the first pillar, after all).
That's where Sifflet's monitoring and alerting capabilities come into play. Sifflet offers a rich toolbox of monitors that can be applied at scale through our Data-Quality-as-Code framework (complementing your dbt YAML with Sifflet YAML configurations), along with extensive alerting capabilities built on robust integrations with notification and issue-tracking tools. These monitors extend the capabilities of dbt tests and provide comprehensive coverage across your data platform.
Moreover, you can ingest your existing dbt tests into Sifflet, effectively giving them the same features as native Sifflet monitors: sophisticated alerting, intelligent grouping, root cause analysis tools, and incident management capabilities.
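For example, plain dbt tests like the ones below (with illustrative model and column names) can be ingested by Sifflet and then behave like native monitors, with the full alerting and incident management workflow attached.

```yaml
# Standard dbt tests; model and column names are illustrative.
version: 2

models:
  - name: fct_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["pending", "settled", "refunded"]
```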
Spotting Performance Bottlenecks Before They Cripple You
As your dbt footprint grows, keeping tabs on cost, performance, and usage becomes increasingly challenging. Without a proper cost and performance monitoring strategy, it can take weeks (if not months) to detect models that have become silent bottlenecks, are unnecessarily expensive, or are simply redundant.
Sifflet helps you uncover these hidden issues with two key features. First, our dedicated dbt runs tab lets you track the execution of your models, their cost, and their runtime – all within an intuitive UI. This allows you to quickly pinpoint performance issues, as well as areas for improvement or cost reduction.
Second, our data sharing feature enables you to perform even more sophisticated analytics. Thanks to the historical dbt runs data that Sifflet provides, you can build custom reports and dashboards in your preferred BI tool to monitor dbt spending and performance and even define monitors based on Sifflet's dbt performance data. This allows you to automatically identify pipeline optimization opportunities, track performance trends over time, and get alerted whenever a performance or cost threshold is crossed.
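As a sketch of what such an alert could express (again using hypothetical field names rather than Sifflet's actual configuration format), think of a threshold on a model's run time or cost that triggers a notification as soon as it is crossed:

```yaml
# Hypothetical sketch only: field names are illustrative, not Sifflet's actual schema.
# Intent: alert when a model's run time or estimated daily cost crosses a threshold.
monitor:
  name: fct_payments_runtime_and_cost
  target: dbt_run_history          # historical dbt run data made available via data sharing
  filter:
    model: fct_payments
  thresholds:
    max_runtime_minutes: 30        # alert if a run takes longer than 30 minutes
    max_daily_cost_usd: 50         # alert if daily warehouse spend exceeds $50
  notify:
    - slack: "#analytics-engineering"
```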
How Mollie Manages Their dbt Environment at Scale with Sifflet
Mollie, a Sifflet customer with a large dbt environment and a mature data architecture, leverages Sifflet's dbt-related features to successfully manage their data platform at scale. Koen Mevissen, Data Platform Manager at Mollie, said the following about the release of our recent dbt features:
The Takeaway
Scaling dbt doesn't have to be a painful process.
By laying a solid foundation of standards, leveraging a unified lineage view, supercharging your testing with observability, and gaining granular visibility into performance, you can manage your dbt projects confidently and efficiently as they grow to thousands of models and dozens of use cases.
With its feature-rich dbt integration and wide range of related capabilities, Sifflet is an ideal partner on that journey, providing the tools and insights you need to build a data platform that's not just big, but also reliable, cost-effective, and well-governed.