Categorizing your data is crucial to ensure it is properly monitored and protected. When it comes to complying with privacy regulation, having all your sensitive data identified as PII so that the appropriate masking policies are applied is for instance not something you want to leave to chance. Making your way to well categorized data can, however, be very time consuming, especially when dealing with large volumes of data or complex datasets.
To help our customers optimize their time spent managing data assets, we are excited to announce classification tags and AI Suggestions for classification tags.
Classification tags help you categorize your data catalog assets according to key data characteristics. This categorization reduces the time needed for data teams to grasp the structure and content of their data, allowing them to ensure the right monitoring and protection methods are applied to their data as a result. Powered by machine learning, Sifflet automatically recommends which classifications tags to apply, making the categorization process all the more efficient.
In this post, we will discuss the benefits of a strong data categorization strategy and how Sifflet can help simplify this classification process.
Sifflet’s data catalog serves as the source of truth of your data ecosystem. Because it integrates with technologies across the Modern Data Stack, the data catalog automatically inherits all the metadata required to build a centralized repository that data stakeholders can leverage to easily locate and work with their data assets. The collected metadata includes information such as data assets’ and fields’ descriptions, data assets’ level of usage, lineage, dataset schema versions, and more. This existing context can then be enriched directly on Sifflet: inherited descriptions can for instance be complemented with insights from the various teams using the data asset, data assets can be assigned to some specific domains and/or tagged per environment, etc.
With classification tags, Sifflet takes data cataloging to the next level by allowing teams to categorize their data catalog assets and monitors according to key data characteristics such as PII or low cardinality.
Data analysts often spend more time than they would like querying data to understand their data assets’ structure. The `Low Cardinality` classification tag now allows data teams to flag all fields that have a limited number of unique values. Leveraging these tags reduces the time required by data analysts to identify dimensional fields by making it immediately clear to them what the structure of the data is like and how it should then be queried.
Classification tags can be added to your data assets and monitors manually but this fresh release also comes with AI Suggestions for classification tags. Sifflet indeed leverages machine learning to automatically suggest the addition of classification tags to the relevant dataset columns.
AI Suggestions for classification tags alleviate the burden of laborious manual work typically associated with data categorization, saving you time and resources. By automating the process, Sifflet ensures that categorization is more comprehensive and consistent, reducing the potential for human error and enhancing the overall quality and reliability of data management efforts.
The first step towards well monitored data is well categorized data.
For example, tagging all appropriate fields as `Low Cardinality` makes it a lot easier to then know which monitors should be created on these fields.
An account table with a `plan` column tagged as `Low Cardinality` prompts you to configure the adequate monitors. Configuring a Low Cardinality monitor on the `plan` column would for instance ensure that in case the software engineering team modifies plans’ names, the right teams get alerted ahead of time so they can modify dashboards that leverage this `plan` field downstream. Similarly, creating a Distribution Change monitor on a `newsletter_subscribed` field after spotting its `Low Cardinality` tag ensures business teams get alerted in case the proportion of unsubscribed users drastically decreases, indicating a decision that negatively impacted newsletter subscription.
`Low Cardinality` tags also make it immediately clear dimensions that might be relevant to leverage when configuring your monitor.
For instance, as a data stakeholder of a retail company looking to monitor the completeness of a sales dataset, seeing the `country` column flagged as `Low Cardinality` makes it apparent that the monitor query might need to be grouped by the `country` field. Sales volumes often differ quite a bit from one country to another and grouping the monitor by the `country` dimension is key to ensure teams do not miss out on variations that might otherwise have been hidden by a higher level monitor.
Data categorization also plays a crucial role in simplifying compliance with data protection and privacy regulations, while also providing valuable insights into potential risks. Properly tagged data assets make it a lot easier for data governance teams to identify where sensitive data lives among the company’s data fleet.
Thanks to Sifflet `PII` classification tags, you can for instance immediately understand which columns of a table contain or do not contain sensitive data.
A robust data classification then empowers data teams to take action to ensure the right protection is applied to their sensitive data assets. A column tagged `PII` should hence be a signal to verify at the data warehouse level that the proper masking policy is for instance in place to safeguard the content of this field.
If you want to learn more about how Sifflet can help you catalog your data assets and simplify their categorization, you can check out our documentation or reach out for a demo.