Introduction

Ensuring data quality is paramount for businesses relying on data-driven decision-making. As data volumes grow and sources diversify, manual quality checks become increasingly impractical and error-prone. This is where automated data quality checks come into play, offering a scalable solution to maintain data integrity and reliability.

At my organization, which collects large volumes of public web data, we’ve developed a robust system for automated data quality checks using two powerful open-source tools: Dagster and Great Expectations. These tools are the cornerstone of our approach to data quality management, allowing us to efficiently validate and monitor our data pipelines at scale.

In this article, I’ll explain how we use Dagster, an open-source data orchestrator, and Great Expectations, a data validation framework, to implement comprehensive automated data quality checks. I’ll also explore the benefits of this approach and provide practical insights into our implementation process, including a GitLab demo, to help you understand how these tools can enhance your own data quality assurance practices.

Let’s discuss each of them in more detail before moving to practical examples.

Learning Outcomes

  • Understand the importance of automated data quality checks in data-driven decision-making.
  • Learn how to implement data quality checks using Dagster and Great Expectations.
  • Explore different testing strategies for static and dynamic data.
  • Gain insights into the benefits of real-time monitoring and compliance in data quality management.
  • Discover practical steps to set up and run a demo project for automated data quality validation.


Understanding Dagster: An Open-Source Data Orchestrator

Used for ETL, analytics, and machine learning workflows, Dagster lets you build, schedule, and monitor data pipelines. This Python-based tool allows data scientists and engineers to easily debug runs, inspect assets, or get details about their status, metadata, or dependencies.

As a result, Dagster makes your data pipelines more reliable, scalable, and maintainable. It can be deployed on Azure, Google Cloud, AWS, and alongside many other tools you may already be using. Airflow and Prefect are often named as Dagster’s competitors; I personally see more pros in Dagster, and you can find plenty of comparisons online before committing.
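
To give a flavor of the programming model, here is a minimal, illustrative Dagster snippet. The asset names and data are made up for this example and are not tied to any real pipeline:

python
from dagster import Definitions, asset


@asset
def raw_profiles() -> list[dict]:
    # In a real pipeline this would pull data from a scraper, API, or warehouse.
    return [{"name": "Example Company", "followers": 42}]


@asset
def follower_stats(raw_profiles: list[dict]) -> dict:
    # Downstream asset that depends on raw_profiles; Dagster tracks the dependency.
    return {"total_followers": sum(p["followers"] for p in raw_profiles)}


# Definitions makes the assets visible to the Dagster UI, schedules, and sensors.
defs = Definitions(assets=[raw_profiles, follower_stats])

Running dagster dev against a module like this shows both assets and their dependency in the UI, which is where the debugging and monitoring features mentioned above come into play.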


Exploring Great Expectations: A Data Validation Framework

A great tool with a great name, Great Expectations is an open-source platform for maintaining data quality. This Python library uses “Expectation” as its term for an assertion about data.

Great Expectations provides validations based on schema and values; examples of such rules include max or min value checks and count validations. It can also generate expectations automatically from the input data. This feature usually requires some tweaking, but it definitely saves time.
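
As a small illustration of what such schema and value rules look like in code, here is a sketch using the classic pandas-backed API; the exact calls vary between Great Expectations versions, and the data is made up:

python
import great_expectations as gx
import pandas as pd

# Toy data standing in for scraped records.
df = pd.DataFrame({"name": ["Acme", "Globex"], "followers": [1200, 15]})
validator = gx.from_pandas(df)

# Value-based rules: no missing names, follower counts never negative.
validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_be_between("followers", min_value=0)
# Count-based rule: the table should not be empty.
validator.expect_table_row_count_to_be_between(min_value=1)

results = validator.validate()
print(results.success)  # True if every expectation passed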

Another useful aspect is that Great Expectations can be integrated with Google Cloud, Snowflake, Azure, and over 20 other tools. While it can be challenging for data users without technical knowledge, it’s nevertheless worth attempting.


Why are Automated Data Quality Checks Necessary?

Automated quality checks bring multiple benefits to businesses that handle large volumes of critical data. When information must be accurate, complete, and consistent, automation beats error-prone manual checking. Let’s take a quick look at the five main reasons why your organization might need automated data quality checks.

Data integrity

Your organization can collect reliable data by enforcing a set of predefined quality criteria. This reduces the chance of wrong assumptions and error-prone decisions that aren’t grounded in data. Tools like Great Expectations and Dagster can be very helpful here.

Error minimization

While there’s no way to eradicate the possibility of errors, you can minimize the chance of them occurring with automated data quality checks. Most importantly, this will help identify anomalies earlier in the pipeline, saving precious resources. In other words, error minimization prevents tactical mistakes from becoming strategic.

Efficiency

Checking data manually is often time-consuming and may require more than one employee on the job. With automation, your data team can focus on more important tasks, such as finding insights and preparing reports.

Real-time monitoring

Automation enables real-time tracking, so you can detect issues before they grow into bigger problems. Manual checking takes longer and rarely catches an error at the earliest possible stage.

Compliance

Most companies that deal with public web data know about privacy-related regulations. Similarly, there may be a need for data quality compliance, especially if the data is later used in critical areas such as pharmaceuticals or defense. With automated data quality checks in place, you can provide specific evidence about the quality of your information, and the client only has to review the data quality rules rather than the data itself.

How to Test Data Quality?

As a public web data provider, we need a well-oiled automated data quality check mechanism. So how do we do it? First, we differentiate our tests by the type of data. The test naming might seem somewhat confusing because it was originally conceived for internal use, but it helps us understand what we’re testing.

We have two types of data:

  • Static data. Static means that we don’t scrape the data in real-time but rather use a static fixture.
  • Dynamic data. Dynamic means that we scrape the data from the web in real-time.

Then, we further differentiate our tests by the type of data quality check:

  • Fixture tests. These tests use fixtures to check the data quality.
  • Coverage tests. These tests use a bunch of rules to check the data quality.

Let’s take a look at each of these tests in more detail.

Static Fixture Tests

As mentioned earlier, these tests belong to the static data category, meaning we don’t scrape the data in real-time. Instead, we use a static fixture that we have saved previously.

A static fixture is input data that we have saved previously. In most cases, it’s an HTML file of a web page that we want to scrape. For every static fixture, we have a corresponding expected output. This expected output is the data that we expect to get from the parser.

Steps for Static Fixture Tests

The test works like this:

  • The parser receives the static fixture as an input.
  • The parser processes the fixture and returns the output.
  • The test checks if the output is the same as the expected output. This is not a simple JSON comparison because some fields are expected to change (such as the last updated date), but it is still a simple process.

We run this test in our CI/CD pipeline on merge requests to check if the changes we made to the parser are valid and if the parser works as expected. If the test fails, we know we have broken something and need to fix it.

Static fixture tests are the most basic tests both in terms of process complexity and implementation because they only need to run the parser with a static fixture and compare the output with the expected output using a rather simple Python script.
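
For illustration, such a comparison script might look like the sketch below. parse_html, the fixture paths, and the volatile field names are hypothetical stand-ins, not our actual code:

python
import json
from pathlib import Path

from my_parser import parse_html  # hypothetical parser module

# Fields that are expected to change between runs and are excluded from the comparison.
VOLATILE_FIELDS = {"last_updated"}


def test_parser_against_static_fixture():
    fixture_html = Path("fixtures/company_page.html").read_text()
    expected = json.loads(Path("fixtures/company_page.expected.json").read_text())

    output = parse_html(fixture_html)

    # Drop volatile fields from both sides before comparing.
    for field in VOLATILE_FIELDS:
        output.pop(field, None)
        expected.pop(field, None)

    assert output == expected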

Nevertheless, they are still really important because they are the first line of defense against breaking changes.

However, a static fixture test cannot check whether scraping is working as expected or whether the page layout remains the same. This is where the dynamic tests category comes in.

Dynamic Fixture Tests

Basically, dynamic fixture tests are the same as static fixture tests, but instead of using a static fixture as an input, we scrape the data in real-time. This way, we check not only the parser but also the scraper and the layout.

Dynamic fixture tests are more complex than static fixture tests because they need to scrape the data in real-time and then run the parser with the scraped data. This means that we need to launch both the scraper and the parser in the test run and manage the data flow between them. This is where Dagster comes in.

Dagster is an orchestrator that helps us to manage the data flow between the scraper and the parser.

Steps for Dynamic Fixture Tests

There are four main steps in the process:

  • Seed the queue with the URLs we want to scrape
  • Scrape
  • Parse
  • Check the parsed document against the saved fixture

The last step is the same as in static fixture tests, and the only difference is that instead of using a static fixture, we scrape the data during the test run.
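
A simplified sketch of how these four steps can be wired together as a Dagster job is shown below; the op names follow the steps above, but the bodies are stubs rather than our production scraper and parser:

python
from dagster import job, op


@op
def seed_urls() -> list[str]:
    # Seed the queue with the profile URLs we control.
    return ["https://example.com/profile/1"]


@op
def scrape(urls: list[str]) -> list[str]:
    # Fetch each page in real time; stubbed here.
    return ["<html>...</html>" for _ in urls]


@op
def parse(pages: list[str]) -> list[dict]:
    # Run the parser over the scraped HTML; stubbed here.
    return [{"name": "Example"} for _ in pages]


@op
def check_against_fixture(parsed: list[dict]) -> None:
    # In practice the expected documents are loaded from saved fixtures.
    expected = [{"name": "Example"}]
    assert parsed == expected


@job
def dynamic_fixture_test():
    check_against_fixture(parse(scrape(seed_urls())))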

Dynamic fixture tests play a very important role in our data quality assurance process because they check both the scraper and the parser. Also, they help us understand if the page layout has changed, which is impossible with static fixture tests. This is why we run dynamic fixture tests in a scheduled manner instead of running them on every merge request in the CI/CD pipeline.

However, dynamic fixture tests have one big limitation: they can only check the data quality of profiles we control. If we don’t control the profile used in the test, we can’t know what data to expect, because it can change at any time. This means that dynamic fixture tests can only check the data quality for websites on which we have a profile. To overcome this limitation, we have dynamic coverage tests.

Dynamic Coverage Tests

Dynamic coverage tests also belong to the dynamic data category, but they differ from dynamic fixture tests in what they check. Dynamic fixture tests validate profiles we control, which is limiting because controlled profiles aren’t possible for every target; dynamic coverage tests can check data quality without controlling the profile. This is possible because dynamic coverage tests don’t check exact values; instead, they validate the data against a set of rules we have defined. This is where Great Expectations comes in.

Dynamic coverage tests are the most complex tests in our data quality assurance process. Dagster orchestrates them just as it does dynamic fixture tests; however, we use Great Expectations instead of a simple Python script to execute the validation.

At first, we need to select the profiles we want to test. Usually, we select profiles from our database that have high field coverage. We do this because we want to ensure the test covers as many fields as possible. Then, we use Great Expectations to generate the rules using the selected profiles. These rules are basically the constraints that we want to check against the data. Here are some examples:

  • All profiles must have a name.
  • At least 50% of the profiles must have a last name.
  • The education count value cannot be lower than 0.
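
Roughly speaking, rules like these translate into expectations such as the following. This is a sketch using the classic pandas-backed Great Expectations API with made-up data; the real suites in our pipeline are generated, not hand-written:

python
import great_expectations as gx
import pandas as pd

profiles = pd.DataFrame(
    {"name": ["Alice", "Bob"], "last_name": ["Smith", None], "education_count": [2, 0]}
)
validator = gx.from_pandas(profiles)

validator.expect_column_values_to_not_be_null("name")                      # every profile has a name
validator.expect_column_values_to_not_be_null("last_name", mostly=0.5)     # at least 50% have a last name
validator.expect_column_min_to_be_between("education_count", min_value=0)  # count never below 0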

Steps for Dynamic Coverage Tests

After we have generated the rules, called expectations in Great Expectations, we can run the test pipeline, which consists of the following steps:

  • Seed the queue with the URLs we want to scrape
  • Scrape
  • Parse
  • Validate parsed documents using Great Expectations

This way, we can check the data quality of profiles over which we have no control. Dynamic coverage tests are the most important tests in our data quality assurance process because they check the whole pipeline from scraping to parsing and validate the data quality of profiles over which we have no control. This is why we run dynamic coverage tests in a scheduled manner for every target we have.

However, implementing dynamic coverage tests from scratch can be challenging because it requires some knowledge about Great Expectations and Dagster. This is why we have prepared a demo project showing how to use Great Expectations and Dagster to implement automated data quality checks.

Implementing Automated Data Quality Checks

In this GitLab repository, you can find a demo of how to use Dagster and Great Expectations to test data quality. The dynamic coverage test graph has more steps, such as seed_urls, scrape, parse, and so on, but for the sake of simplicity, some operations are omitted in this demo. However, it contains the most important part of the dynamic coverage test: data quality validation. The demo graph consists of the following operations:

  • load_items: loads the data from the file and parses it as JSON objects.
  • load_structure: loads the data structure from the file.
  • get_flat_items: flattens the data.
  • load_dfs: loads the data as Spark DataFrames using the structure from the load_structure operation.
  • ge_validation: executes the Great Expectations validation for every DataFrame.
  • post_ge_validation: checks whether the Great Expectations validation passed or failed.

While some of the operations are self-explanatory, let’s examine some that might require further detail.

Generating a Structure

The load_structure operation itself is not complicated. However, what is important is the type of structure. It’s represented as a Spark schema because we will use it to load the data into Spark DataFrames, which is what Great Expectations works with. Every nested object in the Pydantic model is represented as an individual Spark schema because Great Expectations doesn’t handle nested data well.

For example, a Pydantic model like this:

python
class CompanyHeadquarters(BaseModel):
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: CompanyHeadquarters

This would be represented as two Spark schemas:

json
{
    "company": {
        "fields": [
            {
                "metadata": {},
                "name": "name",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    },
    "company_headquarters": {
        "fields": [
            {
                "metadata": {},
                "name": "city",
                "nullable": false,
                "type": "string"
            },
            {
                "metadata": {},
                "name": "country",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    }
}
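
For illustration, here is a rough sketch of how such a mapping could be derived from a Pydantic (v2) model. This is an assumption-laden illustration, not the demo’s actual gx structure implementation, and it only handles string fields:

python
from pydantic import BaseModel
from pyspark.sql.types import StringType, StructField, StructType


class CompanyHeadquarters(BaseModel):
    city: str
    country: str


class Company(BaseModel):
    name: str
    headquarters: CompanyHeadquarters


def pydantic_to_spark_schemas(model: type[BaseModel], name: str) -> dict[str, StructType]:
    """Build one flat Spark schema per (nested) Pydantic model, keyed by a flattened name."""
    type_map = {str: StringType()}  # extend with more types as needed
    schemas: dict[str, StructType] = {}
    fields = []
    for field_name, field in model.model_fields.items():
        annotation = field.annotation
        if isinstance(annotation, type) and issubclass(annotation, BaseModel):
            # Nested models become their own schema, e.g. "company_headquarters".
            schemas.update(pydantic_to_spark_schemas(annotation, f"{name}_{field_name}"))
        else:
            fields.append(StructField(field_name, type_map[annotation], nullable=False))
    schemas[name] = StructType(fields)
    return schemas


schemas = pydantic_to_spark_schemas(Company, "company")
# schemas["company"] and schemas["company_headquarters"] mirror the JSON structure above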

The demo already contains data, structure, and expectations for Owler company data. However, if you want to generate a structure for your own data, you can do so as follows. Run the following command to generate an example of the Spark structure:

docker run -it --rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx structure"

This command generates the Spark structure for the Pydantic model and saves it as example_spark_structure.json in the gx_demo/data directory.

Preparing and Validating Data

After we have the structure loaded, we need to prepare the data for validation. That leads us to the get_flat_items operation, which is responsible for flattening the data. We need to flatten the data because each nested object will be represented as a row in a separate Spark DataFrame. So, if we have a list of companies that looks like this:

json
[
    {
        "name": "Company 1",
        "headquarters": {
            "city": "City 1",
            "country": "Country 1"
        }
    },
    {
        "name": "Company 2",
        "headquarters": {
            "city": "City 2",
            "country": "Country 2"
        }
    }
]

After flattening, the data will look like this:

json
{
    "company": [
        {
            "name": "Company 1"
        },
        {
            "name": "Company 2"
        }
    ],
    "company_headquarters": [
        {
            "city": "City 1",
            "country": "Country 1"
        },
        {
            "city": "City 2",
            "country": "Country 2"
        }
    ]
}
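
Conceptually, the flattening step boils down to something like the sketch below; the function name and the simplified handling are assumptions for illustration, not the demo’s get_flat_items implementation:

python
def flatten_items(items: list[dict]) -> dict[str, list[dict]]:
    """Split each company into flat rows, one list per nested object."""
    flat: dict[str, list[dict]] = {"company": [], "company_headquarters": []}
    for item in items:
        flat["company"].append({"name": item["name"]})
        flat["company_headquarters"].append(dict(item["headquarters"]))
    return flat


companies = [
    {"name": "Company 1", "headquarters": {"city": "City 1", "country": "Country 1"}},
    {"name": "Company 2", "headquarters": {"city": "City 2", "country": "Country 2"}},
]
flat = flatten_items(companies)  # matches the "after flattening" JSON above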

Then, in the load_dfs operation, the flattened data from get_flat_items is loaded into separate Spark DataFrames based on the structure loaded in load_structure.

The load_dfs operation uses DynamicOut, which allows us to create a dynamic graph based on the structure that we loaded in the load_structure operation.

Basically, we create a separate Spark DataFrame for every nested object in the structure. Dagster then creates a separate ge_validation operation for every DataFrame, parallelizing the Great Expectations validation. Parallelization is useful not only because it speeds up the process but also because the resulting graph can support any kind of data structure.

So, if we scrape a new target, we can easily add a new structure, and the graph will be able to handle it.
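
A rough sketch of this DynamicOut pattern is shown below; the op names, signatures, and stubbed bodies are assumptions for illustration, not the demo’s exact code:

python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def load_dfs(flat_items: dict):
    # One dynamic output per nested object; in the demo each value is a Spark
    # DataFrame built from the corresponding schema.
    for name, rows in flat_items.items():
        yield DynamicOutput(value=rows, mapping_key=name)


@op
def ge_validation(context, rows):
    # Placeholder for running the Great Expectations suite against one DataFrame.
    context.log.info(f"validating {len(rows)} rows")
    return {"success": True}


@op
def load_flat_items() -> dict:
    return {"company": [{"name": "Company 1"}], "company_headquarters": [{"city": "City 1"}]}


@job
def demo_coverage_sketch():
    # .map fans ge_validation out over every dynamically emitted DataFrame.
    load_dfs(load_flat_items()).map(ge_validation)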

Generate Expectations

Like the structure, expectations are already generated in the demo. However, this section shows you how to generate the structure and expectations for your own data.

Make sure to delete previously generated expectations if you’re generating new ones with the same name. To generate expectations for the gx_demo/data/owler_company.json data, run the following command using the gx_demo Docker image:

docker run -it --rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx expectations /gx_demo/data/owler_company_spark_structure.json /gx_demo/data/owler_company.json owler company"

The command above generates expectations for the data (gx_demo/data/owler_company.json) based on the flattened data structure (gx_demo/data/owler_company_spark_structure.json). In this case, we have 1,000 records of Owler company data. It’s structured as a list of objects, where each object represents a company.

After running the above command, the expectation suites will be generated in the gx_demo/great_expectations/expectations/owler directory. There will be as many expectation suites as the number of nested objects in the data, in this case, 13.

Each suite will contain expectations for the data in the corresponding nested object. The expectations are generated based on the structure of the data and the data itself. Keep in mind that after Great Expectations generates the expectation suite, which contains expectations for the data, some manual work might be needed to tweak or improve some of the expectations.

Generated Expectations for Followers

Let’s take a look at the 6 generated expectations for the followers field in the company suite:

  • expect_column_min_to_be_between
  • expect_column_max_to_be_between
  • expect_column_mean_to_be_between
  • expect_column_median_to_be_between
  • expect_column_values_to_not_be_null
  • expect_column_values_to_be_in_type_list

We know that the followers field represents the number of followers of the company. Knowing that, we can say that this field can change over time, so we can’t expect the maximum value, mean, or median to be the same.

However, we can expect the minimum value to be no lower than 0 and the values to be integers. We can also expect the values not to be null, because if a company has no followers, the value should be 0. So we need to get rid of the expectations that are not suitable for this field: expect_column_max_to_be_between, expect_column_mean_to_be_between, and expect_column_median_to_be_between.

However, every field is different, and the expectations might need to be adjusted accordingly. For example, the field completeness_score represents the company’s completeness score. For this field, it makes sense to expect the values to be between 0 and 100, so we can keep not only expect_column_min_to_be_between but also expect_column_max_to_be_between.
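
Inside a generated suite file, a kept (and tweaked) expectation for completeness_score would look roughly like this. The snippet is hand-written to illustrate the suite JSON format, not copied from the demo:

json
{
    "expectation_type": "expect_column_max_to_be_between",
    "kwargs": {
        "column": "completeness_score",
        "min_value": 0,
        "max_value": 100
    },
    "meta": {}
}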

Take a look at the Gallery of Expectations to see what kind of expectations you can use for your data.

Running the Demo

To see everything in action, go to the root of the project and run the following commands:

docker build -t gx_demo .
docker compose up

After running the above commands, the Dagit (Dagster UI) will be available at localhost:3000. Run the demo_coverage job with the default configuration from the launchpad. After the job execution, you should see dynamically generated ge_validation operations for every nested object.


In this case, the data passed all the checks, and everything is green. If data validation for any nested object fails, the corresponding postprocess_ge_validation operation is marked as failed (and shown in red instead of green). Say the company_ceo validation failed: the postprocess_ge_validation[company_ceo] operation would be marked as failed. To see exactly which expectations failed, click on the ge_validation[company_ceo] operation and open “Expectation Results” by clicking the “[Show Markdown]” link. This opens the validation results overview modal with all the data about the company_ceo dataset.

Conclusion

Depending on the stage of the data pipeline, there are many ways to test data quality. However, it is essential to have a well-oiled automated data quality check mechanism to ensure the accuracy and reliability of your data. Tools like Great Expectations and Dagster aren’t strictly necessary (our static fixture tests don’t use either of them), but they greatly help build a more robust data quality assurance process. Whether you’re looking to enhance your existing data quality processes or build a new system from scratch, we hope this guide has provided valuable insights.

Key Takeaways

  • Data quality is crucial for accurate decision-making and avoiding costly errors in analytics.
  • Dagster enables seamless orchestration and automation of data pipelines with built-in support for monitoring and scheduling.
  • Great Expectations provides a flexible, open-source framework to define, test, and validate data quality expectations.
  • Combining Dagster with Great Expectations allows for automated, real-time data quality checks and monitoring within data pipelines.
  • A robust data quality process ensures compliance and builds trust in the insights derived from data-driven workflows.

Frequently Asked Questions

Q1. What is Dagster used for?

A. Dagster is used for orchestrating, automating, and managing data pipelines, helping ensure smooth data workflows.

Q2. What are Great Expectations in data pipelines?

A. Great Expectations is a tool for defining, validating, and monitoring data quality expectations to ensure data integrity.

Q3. How do Dagster and Great Expectations work together?

A. Dagster integrates with Great Expectations to enable automated data quality checks within data pipelines, enhancing reliability.

Q4. Why is data quality important in analytics?

A. Good data quality ensures accurate insights, helps avoid costly errors, and supports better decision-making in analytics.
