Schema Check

Check: schema-check

Purpose: Verifies that the DataFrame matches an expected schema in terms of column names and Spark data types. Optionally enforces strict matching by rejecting unexpected extra columns.

Supported Data Types

The following Spark types are supported and must be specified as lowercase strings:

string, boolean, int, bigint, float, double, date, timestamp, binary, array, map, struct, decimal(precision,scale), e.g. decimal(10,2)

Important

  • For decimal types, both precision and scale must be specified inside parentheses.

  • No other spellings are accepted (e.g., integer instead of int, or decimal(10.2) with a dot instead of a comma); see the short example below.
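
As an illustration of these rules, the snippet below contrasts accepted type strings with spellings that would be rejected; the column names are placeholders invented for the example:

# Accepted: lowercase Spark type names; decimal carries precision and scale
expected_schema = {
    "amount": "decimal(10,2)",   # precision 10, scale 2, separated by a comma
    "quantity": "int",           # "integer" would be rejected
}
# Rejected: "integer", "decimal(10.2)" (dot instead of comma), "decimal" (missing precision and scale)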

Python Configuration

from sparkdq.checks import SchemaCheckConfig
from sparkdq.core import Severity

SchemaCheckConfig(
    check_id="enforce_schema_contract",
    expected_schema={
        "id": "int",
        "name": "string",
        "amount": "decimal(10,2)",
        "created_at": "timestamp"
    },
    strict=True,  # reject columns that are not part of the expected schema
    severity=Severity.CRITICAL
)
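
For context, the sketch below reproduces with plain PySpark the kind of comparison this check performs. It is an illustration of the schema contract rather than the framework's implementation; the Spark session, sample data, and variable names are invented for the example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample frame in which "amount" arrives as double instead of decimal(10,2)
df = spark.createDataFrame(
    [(1, "alice", 9.99)],
    "id int, name string, amount double",
)

expected = {"id": "int", "name": "string", "amount": "decimal(10,2)"}
actual = {field.name: field.dataType.simpleString() for field in df.schema.fields}

missing = set(expected) - set(actual)      # expected columns absent from the DataFrame
unexpected = set(actual) - set(expected)   # extra columns, relevant when strict matching is enabled
mismatched = {
    column: (expected[column], actual[column])
    for column in expected
    if column in actual and actual[column] != expected[column]
}

print(missing, unexpected, mismatched)
# set() set() {'amount': ('decimal(10,2)', 'double')}

With strict=True, entries in unexpected would also fail the check; with strict=False they are ignored.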

Declarative Configuration

- check: schema-check
  check-id: enforce_schema_contract
  expected-schema:
    id: int
    name: string
    amount: decimal(10,2)
    created_at: timestamp
  strict: true
  severity: critical
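
The declarative block maps one-to-one onto the Python parameters (kebab-case keys correspond to snake_case arguments). sparkdq loads declarative configuration files itself; purely to illustrate that mapping, the sketch below parses the same block with PyYAML and constructs the equivalent SchemaCheckConfig by hand. The parsing code is an assumption made for the example, not the framework's loader:

import yaml

from sparkdq.checks import SchemaCheckConfig
from sparkdq.core import Severity

declarative = """
- check: schema-check
  check-id: enforce_schema_contract
  expected-schema:
    id: int
    name: string
    amount: decimal(10,2)
    created_at: timestamp
  strict: true
  severity: critical
"""

raw = yaml.safe_load(declarative)[0]

config = SchemaCheckConfig(
    check_id=raw["check-id"],
    expected_schema=raw["expected-schema"],   # nested mapping becomes a plain dict of type strings
    strict=raw["strict"],
    severity=Severity.CRITICAL,               # corresponds to "severity: critical" in the YAML
)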

Typical Use Cases

  • ✅ Ensure schema consistency between ingestion, transformation, and consumption stages.

  • ✅ Detect missing or renamed columns early in the pipeline.

  • ✅ Catch incorrect data types that may lead to casting errors or incorrect aggregations.

  • ✅ Enforce a strict schema contract in production pipelines to prevent silent data corruption.