SparkDQ: Data Quality Validation for Apache Spark

SparkDQ is a lightweight data quality framework built natively for PySpark — no JVM bridge like PyDeequ, no complexity overhead like Great Expectations, and no platform lock-in like Databricks dqx. Define checks declaratively via YAML/JSON or through a type-safe Python API, validate at row and aggregate level in a single pass, and extend the framework via a plugin system without touching the core.

Installation

Local Development / Standalone Clusters

Install with PySpark included (the quotes keep shells such as zsh from interpreting the brackets):

pip install "sparkdq[spark]"

Databricks / Managed Platforms

Install without PySpark (the runtime is provided by the platform):

pip install sparkdq

The framework supports Python 3.11+ and is fully tested with PySpark 3.5.x.
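To check an environment against the requirements above before installing, a small standard-library sketch like the following may help. The helper names `python_is_supported` and `installed_version` are hypothetical and not part of SparkDQ; the only facts taken from this page are the Python 3.11 minimum and the `pyspark` package name.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 11)  # minimum Python version stated in this document


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    # None here means PySpark must come from the platform or the [spark] extra.
    print("pyspark version:", installed_version("pyspark"))
```

On a managed platform, `installed_version("pyspark")` should report the version shipped by the runtime, which can then be compared with the tested 3.5.x line.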

Quickstart

Declarative, with checks defined as plain dictionaries:

from pyspark.sql import SparkSession
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = CheckSet()
check_set.add_checks_from_dicts([
    {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]},
])

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%
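Because `add_checks_from_dicts` takes a list of dictionaries, the same definitions can be kept in a JSON file and parsed with the standard library. The following is a minimal sketch under stated assumptions: the top-level `"checks"` key and the helper `load_check_dicts` are hypothetical and not a layout prescribed by SparkDQ. Only the dictionary shape is taken from the example above.

```python
import json

# Hypothetical config content; in practice this would be read from a file.
# The top-level "checks" key is an assumption made for this sketch.
CONFIG_JSON = """
{
  "checks": [
    {"check": "null-check", "check-id": "no-null-name", "columns": ["name"]}
  ]
}
"""


def load_check_dicts(raw: str) -> list:
    """Parse a JSON document and return its list of check definitions."""
    document = json.loads(raw)
    checks = document.get("checks", [])
    if not isinstance(checks, list):
        raise ValueError("'checks' must be a list of check definitions")
    return checks


check_dicts = load_check_dicts(CONFIG_JSON)

# With SparkDQ installed, the parsed definitions could then be registered
# exactly as in the quickstart:
#   check_set = CheckSet()
#   check_set.add_checks_from_dicts(check_dicts)
```

Keeping the definitions outside the code means checks can be reviewed and changed without touching the pipeline itself.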

Python-native, with full type safety and IDE autocompletion:

from pyspark.sql import SparkSession
from sparkdq.checks import NullCheckConfig
from sparkdq.core import Severity
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        {"id": 1, "name": "Alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "Bob"},
    ]
)

check_set = (
    CheckSet()
    .add_check(NullCheckConfig(check_id="null-check", columns=["name"], severity=Severity.CRITICAL))
)

result = BatchDQEngine(check_set).run_batch(df)
print(result.summary())
# Validation Summary (2024-01-01 00:00:00)
# Total records:   3
# Passed records:  2
# Failed records:  1
# Warnings:        0
# Pass rate:       67.00%

SparkDQ ships with 30+ built-in checks across null validation, numeric ranges, string patterns, date boundaries, schema enforcement, uniqueness, and referential integrity.

Why SparkDQ?

Extensible by design — Add custom checks via a simple plugin system, no changes to the core required
Declarative or Pythonic — YAML/JSON configs or type-safe Python, your choice
Severity-aware — Distinguish between hard failures (CRITICAL) and soft constraints (WARNING)
Row-level and aggregate — Validate individual records and entire datasets in a single pass
Minimal footprint — Only Pydantic required, PySpark is provided by your platform
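To make the ideas of pluggable checks and severity levels concrete, here is a conceptual sketch in plain Python. It is not SparkDQ's implementation or API: the `Severity` enum is a stand-in for the one in `sparkdq.core`, and `register_check`, `REGISTRY`, and `validate` are hypothetical names. It also assumes that a row fails only when a CRITICAL check fails, and that the warning count is the number of rows with at least one WARNING violation.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Severity(Enum):
    """Stand-in for a severity enum (assumption, not SparkDQ's class)."""
    CRITICAL = "critical"
    WARNING = "warning"


Row = dict
CheckFn = Callable[[Row], bool]  # returns True when the row passes

# Hypothetical registry: check name -> (check function, severity)
REGISTRY: Dict[str, Tuple[CheckFn, Severity]] = {}


def register_check(name: str, severity: Severity):
    """Decorator that adds a check without modifying the validation loop."""
    def decorator(fn: CheckFn) -> CheckFn:
        REGISTRY[name] = (fn, severity)
        return fn
    return decorator


@register_check("no-null-name", Severity.CRITICAL)
def name_is_present(row: Row) -> bool:
    return row.get("name") is not None


@register_check("short-name", Severity.WARNING)
def name_is_short(row: Row) -> bool:
    name = row.get("name")
    return name is None or len(name) <= 4


def validate(rows: List[Row]) -> Dict[str, int]:
    """Evaluate every registered check on each row in one pass over the data."""
    passed = failed = warnings = 0
    for row in rows:
        critical_failed = False
        warned = False
        for check, severity in REGISTRY.values():
            if check(row):
                continue
            if severity is Severity.CRITICAL:
                critical_failed = True
            else:
                warned = True
        if critical_failed:
            failed += 1
        else:
            passed += 1
        if warned:
            warnings += 1
    return {"total": len(rows), "passed": passed, "failed": failed, "warnings": warnings}


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": None},
    {"id": 3, "name": "Bob"},
]
summary = validate(rows)
```

In this sketch the row with a missing name fails the CRITICAL check, while "Alice" only triggers the WARNING check and still counts as passed. Registering a further check needs nothing more than another decorated function.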

Support the Project

SparkDQ is open source and community-driven. If you find it useful, here's how you can help:

Star the repository to show your support and help others discover it
Report bugs or issues to help us improve
Share ideas or feedback — every suggestion counts
Contribute code or docs and become part of the project