sparkdq.core

class AggregateEvaluationResult(passed: bool, metrics: Dict[str, Any])[source]

Bases: object

Encapsulates the outcome of an aggregate-level data quality check.

This class holds both the result (passed) and additional diagnostic metrics that were computed as part of the check evaluation. These metrics provide context for understanding why a check passed or failed, and are especially useful during debugging or reporting.

For example, in a CountMinCheck, the metrics might include:

  • actual_count: the number of rows actually found

  • expected_min_count: the configured minimum count threshold

Such context allows users to understand the degree of deviation from expectations when a check fails.

passed

Indicates whether the check condition was satisfied.

Type:

bool

metrics

Additional computed values or diagnostic information relevant to the check (e.g., actual vs. expected counts, computed averages, standard deviations, etc.).

Type:

Dict[str, Any]

Example

>>> result = AggregateEvaluationResult(
...     passed=False,
...     metrics={
...         "actual_count": 42,
...         "expected_min_count": 100
...     }
... )
>>> result.passed
False
>>> result.metrics["actual_count"]
42
to_dict() → Dict[str, Any][source]

Serializes the evaluation result to a dictionary.

Returns:

Dictionary representation of the evaluation result.

Return type:

Dict[str, Any]
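The sketch below illustrates to_dict() with a minimal stand-in that mirrors the documented interface; the real class lives in sparkdq.core, and the exact key layout of the returned dictionary is an assumption here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AggregateEvaluationResult:
    """Minimal stand-in mirroring the documented sparkdq.core interface."""

    passed: bool
    metrics: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Serialize the result so it can be logged or written to a report.
        # The real implementation's key names may differ.
        return {"passed": self.passed, "metrics": self.metrics}


result = AggregateEvaluationResult(
    passed=False,
    metrics={"actual_count": 42, "expected_min_count": 100},
)
print(result.to_dict())
```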

class BaseAggregateCheck(check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: BaseCheck

Abstract base class for aggregate-level data quality checks.

Aggregate checks evaluate entire datasets and typically calculate metrics, apply thresholds, or return dataset-level validation results.

Subclasses must implement the _evaluate_logic() method.

evaluate(df: DataFrame) → AggregateCheckResult[source]

Evaluates the check and returns a structured result with metadata.

This method serves as the standard entry point for executing aggregate-level checks. It delegates to _evaluate_logic() and wraps the outcome in an AggregateCheckResult.

Parameters:

df (DataFrame) – The dataset to validate.

Returns:

The complete check result including metadata and evaluation outcome.

Return type:

AggregateCheckResult
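A subclass implements _evaluate_logic() and returns an AggregateEvaluationResult, which evaluate() then wraps with metadata. The sketch below shows the pattern with plain-Python stand-ins so no live Spark session is needed; the CountMinCheck shape and its constructor parameters are assumptions modeled on the example above.

```python
from typing import Any, Dict


class AggregateEvaluationResult:
    """Stand-in for sparkdq.core's result class."""

    def __init__(self, passed: bool, metrics: Dict[str, Any]):
        self.passed = passed
        self.metrics = metrics


class CountMinCheck:
    """Hypothetical aggregate check: the dataset must have at least min_count rows."""

    def __init__(self, check_id: str, min_count: int):
        self.check_id = check_id
        self.min_count = min_count

    def _evaluate_logic(self, df) -> AggregateEvaluationResult:
        actual = df.count()  # the same call works on a real pyspark DataFrame
        return AggregateEvaluationResult(
            passed=actual >= self.min_count,
            metrics={"actual_count": actual, "expected_min_count": self.min_count},
        )


class FakeDataFrame:
    """Tiny stub standing in for pyspark.sql.DataFrame."""

    def __init__(self, n: int):
        self._n = n

    def count(self) -> int:
        return self._n


result = CountMinCheck("orders-min-count", min_count=100)._evaluate_logic(FakeDataFrame(42))
print(result.passed, result.metrics)
```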

class BaseAggregateCheckConfig(*, check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: ABC, BaseCheckConfig

Base class for aggregate-level check configurations.

Aggregate checks evaluate global dataset properties, such as row counts or min/max thresholds. Subclasses must define a check_class that inherits from BaseAggregateCheck.

Raises:

TypeError – If check_class is missing or not a subclass of BaseAggregateCheck.

model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
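The check_class guard described above can be sketched in plain Python as follows; the real base class builds on pydantic's BaseCheckConfig, so this stand-in (using __init_subclass__) only illustrates the validation rule, and the concrete class names are hypothetical.

```python
class BaseAggregateCheck:
    """Stand-in for sparkdq.core's aggregate check base class."""


class BaseRowCheck:
    """Stand-in for the row-level check base class."""


class BaseAggregateCheckConfig:
    check_class = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject configs whose check_class is missing or from the wrong hierarchy.
        if not (isinstance(cls.check_class, type)
                and issubclass(cls.check_class, BaseAggregateCheck)):
            raise TypeError(
                f"{cls.__name__} must define a check_class that inherits "
                "from BaseAggregateCheck"
            )


class CountMinCheck(BaseAggregateCheck):
    pass


class CountMinCheckConfig(BaseAggregateCheckConfig):
    check_class = CountMinCheck  # accepted: subclass of BaseAggregateCheck


try:
    class BadConfig(BaseAggregateCheckConfig):
        check_class = BaseRowCheck  # wrong hierarchy -> TypeError
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```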

class BaseRowCheck(check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: BaseCheck

Abstract base class for row-level data quality checks.

Row-level checks operate on individual records and identify invalid rows directly within the dataset. Subclasses must implement the validate() method.

abstract validate(df: DataFrame) → DataFrame[source]

Applies the check to the input DataFrame.

Parameters:

df (DataFrame) – Input dataset to validate.

Returns:

DataFrame with additional markers or columns indicating validation failures.

Return type:

DataFrame

with_check_result_column(df: DataFrame, condition: Column) → DataFrame[source]

Adds the check result as a new column to the DataFrame.

The result column will be named according to the check ID and contain the boolean evaluation for each row.

Parameters:
  • df (DataFrame) – The input Spark DataFrame.

  • condition (Column) – The boolean Spark expression representing pass/fail per row.

Returns:

The original DataFrame extended with the check result column.

Return type:

DataFrame
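On a real Spark DataFrame, adding the result column presumably amounts to df.withColumn(check_id, condition). The sketch below mimics the validate/with_check_result_column flow with a tiny stub in place of pyspark.sql.DataFrame; the NullCheck class and its column parameter are hypothetical.

```python
class FakeDataFrame:
    """Stub standing in for pyspark.sql.DataFrame (rows as dicts)."""

    def __init__(self, rows):
        self.rows = rows

    def withColumn(self, name, values):
        # Return a new frame with one extra column, like Spark's withColumn.
        return FakeDataFrame([{**row, name: v} for row, v in zip(self.rows, values)])


class NullCheck:
    """Hypothetical row check: a row passes when `column` is not None."""

    def __init__(self, check_id: str, column: str):
        self.check_id = check_id
        self.column = column

    def validate(self, df):
        # On real Spark this condition would be: F.col(self.column).isNotNull()
        condition = [row[self.column] is not None for row in df.rows]
        return self.with_check_result_column(df, condition)

    def with_check_result_column(self, df, condition):
        # The result column is named after the check ID, one boolean per row.
        return df.withColumn(self.check_id, condition)


df = FakeDataFrame([{"email": "a@b.c"}, {"email": None}])
checked = NullCheck("email-not-null", "email").validate(df)
print([row["email-not-null"] for row in checked.rows])
```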

class BaseRowCheckConfig(*, check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: ABC, BaseCheckConfig

Base class for row-level check configurations.

Row checks operate on individual records and typically require parameters such as column names. Subclasses must define a check_class that inherits from BaseRowCheck.

Raises:

TypeError – If check_class is missing or not a subclass of BaseRowCheck.

model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

class Severity(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Severity levels for data quality checks.

The severity level determines how a failed check should be treated, particularly in automated data pipelines (e.g., fail the pipeline or log a warning).

Levels:

  • CRITICAL: The check is considered blocking if it fails.

  • WARNING: The check is non-blocking but may trigger alerts or monitoring actions.
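A pipeline might branch on severity as sketched below. The enum here is a stand-in mirroring the two documented levels (the member values are assumptions), and handle_failed_check is a hypothetical hook, not part of sparkdq.

```python
from enum import Enum


class Severity(Enum):
    """Stand-in mirroring sparkdq.core.Severity; member values are assumed."""

    CRITICAL = "critical"
    WARNING = "warning"


def handle_failed_check(check_id: str, severity: Severity) -> str:
    # Hypothetical pipeline hook: block on CRITICAL, log on WARNING.
    if severity is Severity.CRITICAL:
        return f"pipeline aborted: check {check_id!r} failed"
    return f"warning logged: check {check_id!r} failed"


print(handle_failed_check("row-count-min", Severity.CRITICAL))
print(handle_failed_check("email-not-null", Severity.WARNING))
```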