SparkDQ — Data Quality Validation#

Most data quality frameworks weren’t designed with PySpark in mind. They aren’t Spark-native and often lack proper support for declarative pipelines. Instead of integrating seamlessly, they require you to build custom wrappers around them just to fit into production workflows. This adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing — so you can’t react dynamically or fail early when data issues occur.

SparkDQ takes a different approach. It is built specifically for PySpark, so you can define and run data quality checks directly inside your Spark pipelines, using Python. Whether you are validating incoming data, verifying outputs before persistence, or enforcing assumptions in your dataflow, SparkDQ helps you catch issues early without adding complexity.

Quickstart Examples#

Define checks as dictionaries that can be loaded from YAML/JSON files, stored in databases, or generated by APIs — perfect for CI/CD pipelines and data contracts.

from pyspark.sql import SparkSession

from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

# Declarative configuration via dictionary
# Could be loaded from YAML, JSON, or any external system
check_definitions = [
    {"check-id": "my-null-check", "check": "null-check", "columns": ["name"]},
]
check_set = CheckSet()
check_set.add_checks_from_dicts(check_definitions)

# Run all configured checks against the DataFrame
result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
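
Because the check configuration is plain data, it does not have to live in your code at all. The following is a minimal sketch that loads the same definition from a YAML file; it uses PyYAML purely for illustration (not a SparkDQ dependency) and assumes a checks.yml file with the structure shown in the comments.

import yaml  # PyYAML, used here only for illustration

from sparkdq.management import CheckSet

# checks.yml is assumed to contain the same structure as the dictionaries above:
# - check-id: my-null-check
#   check: null-check
#   columns:
#     - name
with open("checks.yml") as f:
    check_definitions = yaml.safe_load(f)

check_set = CheckSet()
check_set.add_checks_from_dicts(check_definitions)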

Prefer Python-native development? You can also define checks using Python classes, giving you full type safety, IDE autocompletion, and early validation of your configuration. See the docs for examples of both approaches.
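
As an illustration only, the class-based style might look like the sketch below. The import path, the NullCheckConfig class name, and the add_check method are assumptions; refer to the docs for the exact names.

# Sketch of the class-based style. NullCheckConfig and add_check are assumed
# names; consult the SparkDQ docs for the actual check config classes.
from sparkdq.checks import NullCheckConfig  # assumed import path
from sparkdq.management import CheckSet

check_set = CheckSet()
check_set.add_check(
    NullCheckConfig(check_id="my-null-check", columns=["name"])
)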

Installation#

For Local Development / Standalone Clusters#

Install with PySpark included:

pip install sparkdq[spark]

For Databricks / Managed Platforms#

Install without PySpark (runtime provided by platform):

pip install sparkdq

The framework supports Python 3.10+ and is fully tested with PySpark 3.5.x. SparkDQ will automatically check for PySpark availability on import and provide clear error messages if PySpark is missing in your environment.

Why SparkDQ?#

  • Robust Validation Layer: Clean separation of check definition, execution, and reporting

  • Declarative or Programmatic: Define checks via config files or directly in Python

  • Severity-Aware: Built-in distinction between warning and critical violations (see the sketch after this list)

  • Row & Aggregate Logic: Supports both record-level and dataset-wide constraints

  • Typed & Tested: Built with type safety, testability, and extensibility in mind

  • Zero Overhead: Pure PySpark, no heavy dependencies
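
To make the severity distinction concrete, here is a minimal sketch of how a severity level might be attached to a declarative check. The "severity" key and its values ("warning", "critical") are assumptions about the config schema; see the check reference for the exact field.

from sparkdq.management import CheckSet

check_set = CheckSet()
check_set.add_checks_from_dicts([
    # Nulls in "name" only raise a warning (assumed severity value).
    {"check-id": "warn-null-name", "check": "null-check", "columns": ["name"], "severity": "warning"},
    # Nulls in "id" are treated as critical (assumed severity value).
    {"check-id": "block-null-id", "check": "null-check", "columns": ["id"], "severity": "critical"},
])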

Typical Use Cases#

SparkDQ is built for modern data platforms that demand trust, transparency, and resilience. It helps teams enforce quality standards early and consistently — across ingestion, transformation, and delivery layers.

  • Data Ingestion: Validate raw data as it enters your platform with schema validation, completeness detection, format validation, and early failure detection (see the sketch after this list)

  • Lakehouse Quality: Enforce rules before persisting to storage including Delta/Iceberg/Hudi table validation, partition checks, and data freshness validation

  • ML & Analytics: Assert conditions before model training with feature quality checks, training data validation, bias detection, and model I/O validation

  • Pipeline Monitoring: Flag violations in production workflows through real-time alerts, SLA compliance monitoring, data drift detection, and automated incident response
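
For the ingestion case, a common pattern is to stop the pipeline as soon as validation fails. The sketch below reuses the check_set and df from the quickstart; the attribute read from result.summary() is hypothetical, so adapt it to the actual result API of BatchDQEngine.

from sparkdq.engine import BatchDQEngine

# Fail-early sketch: abort the job when validation reports failing records.
result = BatchDQEngine(check_set).run_batch(df)
summary = result.summary()
# `failed_records` is a hypothetical attribute used for illustration.
if summary.failed_records > 0:
    raise RuntimeError(f"Data quality validation failed: {summary}")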

Let’s Build Better Data Together#

⭐️ Found this useful? Give it a star and help spread the word!

📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.

🤝 Want to contribute? Check out CONTRIBUTING.md to get started.