sparkdq.core

class AggregateEvaluationResult(passed: bool, metrics: Dict[str, Any])[source]

Bases: object

Immutable container for aggregate-level data quality check evaluation outcomes.

Encapsulates both the binary validation result and the supporting metrics that explain the evaluation process. The metrics provide essential context for understanding validation outcomes, enabling detailed debugging and comprehensive reporting of data quality assessment results.

The separation of validation outcome from supporting metrics enables flexible consumption patterns while ensuring complete traceability of the evaluation process and its underlying calculations.

passed

Binary indicator of whether the validation criteria were satisfied.

Type:

bool

metrics

Comprehensive diagnostic information including computed values, thresholds, and contextual data that explain the validation outcome.

Type:

Dict[str, Any]

to_dict() → Dict[str, Any][source]

Convert the evaluation result to a serializable dictionary format.

Produces a clean dictionary representation suitable for JSON serialization, logging, or integration with external reporting systems.

Returns:

Complete evaluation result in dictionary format, preserving all validation outcomes and diagnostic metrics.

Return type:

Dict[str, Any]
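
A minimal usage sketch; the metric keys below are illustrative, not names prescribed by sparkdq:

    from sparkdq.core import AggregateEvaluationResult

    # Result of a hypothetical minimum-row-count validation.
    result = AggregateEvaluationResult(
        passed=False,
        metrics={"actual_row_count": 42, "min_count": 100},
    )

    # to_dict() yields a JSON-friendly payload for logging or reporting;
    # the exact key layout is determined by the library.
    payload = result.to_dict()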

class BaseAggregateCheck(check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: BaseCheck

Abstract foundation for dataset-level data quality validation checks.

Defines the interface for checks that evaluate global properties of entire datasets, such as row counts, statistical distributions, or cross-dataset relationships. These checks produce single validation outcomes per dataset along with detailed metrics explaining the validation results.

Aggregate checks are essential for validating dataset-wide constraints and business rules that cannot be evaluated at the individual record level.

evaluate(df: DataFrame) → AggregateCheckResult[source]

Execute the check and produce a comprehensive result with full metadata.

Serves as the primary interface for aggregate check execution, orchestrating the validation logic and enriching the results with complete metadata for reporting and audit purposes. This method ensures consistent result formatting across all aggregate check implementations.

Parameters:

df (DataFrame) – The dataset to validate against this check’s criteria.

Returns:

Complete validation result including check metadata, configuration parameters, and detailed evaluation outcomes.

Return type:

AggregateCheckResult
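
A sketch of a custom aggregate check. Only the public evaluate() entry point is documented on this page, so the internal hook it delegates to is assumed here under the placeholder name _evaluate_logic; adapt it to the actual base-class interface:

    from pyspark.sql import DataFrame

    from sparkdq.core import AggregateEvaluationResult, BaseAggregateCheck, Severity

    class MinRowCountCheck(BaseAggregateCheck):
        """Hypothetical check: the dataset must contain at least min_count rows."""

        def __init__(self, check_id: str, min_count: int,
                     severity: Severity = Severity.CRITICAL) -> None:
            super().__init__(check_id=check_id, severity=severity)
            self.min_count = min_count

        # Placeholder hook name (assumption): the abstract method that
        # evaluate() wraps is not shown on this page.
        def _evaluate_logic(self, df: DataFrame) -> AggregateEvaluationResult:
            actual = df.count()
            return AggregateEvaluationResult(
                passed=actual >= self.min_count,
                metrics={"actual_row_count": actual, "min_count": self.min_count},
            )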

class BaseAggregateCheckConfig(*, check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: ABC, BaseCheckConfig

Abstract configuration foundation for dataset-level data quality checks.

Specializes the base configuration interface for checks that evaluate global properties of entire datasets. This class enforces proper association with aggregate-level check implementations and provides appropriate validation and instantiation capabilities.

Aggregate-level configurations typically include parameters such as statistical thresholds, count boundaries, and dataset-wide business rules.

Raises:

TypeError – When the subclass fails to define a valid check_class or associates with a non-aggregate-level check implementation.

model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
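
A configuration sketch reusing the MinRowCountCheck from the previous example. The check_class association is required by the TypeError contract above; the field declaration assumes BaseCheckConfig is a pydantic model, as the model_config attribute suggests:

    from sparkdq.core import BaseAggregateCheckConfig

    class MinRowCountCheckConfig(BaseAggregateCheckConfig):
        """Hypothetical config paired with the MinRowCountCheck sketched above."""

        # Must point at an aggregate-level check implementation; omitting
        # it, or referencing a row-level check, raises TypeError.
        check_class = MinRowCountCheck

        # Check-specific parameter, declared as a pydantic field.
        min_count: int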

class BaseRowCheck(check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: BaseCheck

Abstract foundation for record-level data quality validation checks.

Defines the interface for checks that evaluate individual records within a dataset, enabling fine-grained validation at the row level. These checks typically append boolean result columns to the input DataFrame, marking each record as valid or invalid based on the implemented validation logic.

Row-level checks are particularly effective for identifying specific problematic records, enabling downstream filtering, correction, or detailed error reporting on a per-record basis.

abstract validate(df: DataFrame) → DataFrame[source]

Execute the validation logic against the provided dataset.

Implementations must apply their specific validation rules to the input DataFrame and return an augmented version containing validation results. The returned DataFrame should preserve all original data while adding validation markers.

Parameters:

df (DataFrame) – The dataset to validate against this check’s criteria.

Returns:

Enhanced dataset containing original data plus validation results, typically as additional boolean columns indicating pass/fail status per record.

Return type:

DataFrame

with_check_result_column(df: DataFrame, condition: Column) → DataFrame[source]

Augment the DataFrame with a validation result column.

Appends a boolean column containing the validation outcome for each record, using the check’s unique identifier as the column name. This standardized approach ensures consistent result formatting across all row-level checks.

Parameters:
  • df (DataFrame) – The dataset to augment with validation results.

  • condition (Column) – Boolean Spark expression evaluating to True for failed records and False for valid records.

Returns:

Original dataset enhanced with a named boolean column indicating validation status for each record.

Return type:

DataFrame
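
A concrete sketch of a row-level check; the NotNullCheck name and its column parameter are illustrative:

    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    from sparkdq.core import BaseRowCheck, Severity

    class NotNullCheck(BaseRowCheck):
        """Hypothetical check flagging records whose target column is null."""

        def __init__(self, check_id: str, column: str,
                     severity: Severity = Severity.CRITICAL) -> None:
            super().__init__(check_id=check_id, severity=severity)
            self.column = column

        def validate(self, df: DataFrame) -> DataFrame:
            # Per the contract above, the condition is True for failed
            # records and False for valid ones.
            failed = F.col(self.column).isNull()
            return self.with_check_result_column(df, failed)

    # Illustrative usage: the result carries a boolean column named
    # after the check_id ("email_not_null").
    # checked = NotNullCheck("email_not_null", column="email").validate(df)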

class BaseRowCheckConfig(*, check_id: str, severity: Severity = Severity.CRITICAL)[source]

Bases: ABC, BaseCheckConfig

Abstract configuration foundation for record-level data quality checks.

Specializes the base configuration interface for checks that operate on individual records within a dataset. This class enforces that subclasses are properly associated with row-level check implementations and provides the appropriate validation and instantiation logic.

Row-level configurations typically include parameters such as column specifications, validation thresholds, and record-specific business rules.

Raises:

TypeError – When the subclass fails to define a valid check_class or associates with a non-row-level check implementation.

model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_by_alias': True, 'validate_by_name': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
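
A matching configuration sketch for the NotNullCheck above, under the same pydantic-model assumption as the aggregate example:

    from sparkdq.core import BaseRowCheckConfig

    class NotNullCheckConfig(BaseRowCheckConfig):
        """Hypothetical config paired with the NotNullCheck sketched above."""

        # Must reference a row-level check implementation, per the
        # TypeError contract above.
        check_class = NotNullCheck

        column: str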

class IntegrityCheckMixin[source]

Bases: object

Mixin enabling cross-dataset integrity validation capabilities.

Provides infrastructure for checks that require validation against external reference datasets, such as foreign key constraints, referential integrity, or cross-dataset consistency validations. The mixin handles reference dataset lifecycle management, including injection, storage, and retrieval operations.

This design enables complex validation scenarios while maintaining clean separation between the validation logic and reference data management, supporting both simple lookups and complex multi-dataset relationships.

_reference_datasets

Internal registry mapping reference dataset names to their corresponding Spark DataFrames.

Type:

ReferenceDatasetDict

get_reference_df(name: str) → DataFrame[source]

Retrieve a reference dataset by its registered identifier.

Provides access to previously injected reference datasets during check execution. This method ensures type-safe access to reference data while providing clear error handling for missing or misconfigured references.

Parameters:

name (str) – The identifier of the reference dataset to retrieve.

Returns:

The Spark DataFrame associated with the specified identifier.

Return type:

DataFrame

Raises:

MissingReferenceDatasetError – When the requested reference dataset identifier is not found in the current registry.

inject_reference_datasets(datasets: dict[str, DataFrame]) → None[source]

Register reference datasets for use during validation execution.

Establishes the reference dataset registry that will be available during check execution. This method is typically invoked by the validation engine as part of the check preparation phase, ensuring all required reference data is accessible when validation logic executes.

Parameters:

datasets (ReferenceDatasetDict) – Registry mapping reference dataset identifiers to their corresponding Spark DataFrames.
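
A sketch of a referential-integrity check combining BaseRowCheck with this mixin. It assumes the reference dataset exposes the same column name as the validated dataset; all check, column, and dataset names are illustrative:

    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    from sparkdq.core import BaseRowCheck, IntegrityCheckMixin, Severity

    class ForeignKeyCheck(BaseRowCheck, IntegrityCheckMixin):
        """Hypothetical check: column values must exist in a reference dataset."""

        def __init__(self, check_id: str, column: str, reference: str,
                     severity: Severity = Severity.CRITICAL) -> None:
            super().__init__(check_id=check_id, severity=severity)
            self.column = column
            self.reference = reference

        def validate(self, df: DataFrame) -> DataFrame:
            # Raises MissingReferenceDatasetError if the name was never injected.
            ref_df = self.get_reference_df(self.reference)
            # Assumes the reference dataset uses the same column name.
            ref_keys = ref_df.select(F.col(self.column).alias("__ref_key")).distinct()
            joined = df.join(ref_keys, df[self.column] == ref_keys["__ref_key"], "left")
            # True marks failed records: non-null values with no reference match.
            failed = F.col("__ref_key").isNull() & F.col(self.column).isNotNull()
            return self.with_check_result_column(joined, failed).drop("__ref_key")

    # Illustrative wiring (normally performed by the validation engine):
    # check = ForeignKeyCheck("fk_customer", column="customer_id", reference="customers")
    # check.inject_reference_datasets({"customers": customers_df})
    # validated = check.validate(orders_df)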

class Severity(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumeration of data quality check severity classifications.

Provides standardized severity levels that determine how validation failures are handled within data processing pipelines. The severity classification enables sophisticated failure handling strategies, allowing teams to implement nuanced responses to different types of data quality issues.

The design supports both blocking and non-blocking failure modes, enabling flexible pipeline behavior that can adapt to different operational requirements and business contexts.

Levels:

CRITICAL: Validation failures that should block pipeline execution and require immediate attention before processing can continue.

WARNING: Validation failures that indicate potential issues but should not prevent pipeline execution, typically triggering monitoring alerts or logging for later investigation.
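
A short sketch of how severity is supplied at check construction time, reusing the hypothetical NotNullCheck from the row-level example above:

    from sparkdq.core import Severity

    # NotNullCheck is the illustrative class sketched earlier on this page.
    advisory_check = NotNullCheck(
        check_id="email_not_null",
        column="email",
        severity=Severity.WARNING,   # report the failure, keep the pipeline running
    )
    blocking_check = NotNullCheck(
        check_id="customer_id_not_null",
        column="customer_id",
        severity=Severity.CRITICAL,  # block the pipeline on failure
    )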