sparkdq.core¶

The sparkdq.core subpackage defines the core interfaces and data structures used throughout the SparkDQ framework.

core ¶

BaseAggregateCheck ¶

Bases: BaseCheck

Abstract foundation for dataset-level data quality validation checks.

Defines the interface for checks that evaluate global properties of entire datasets, such as row counts, statistical distributions, or cross-dataset relationships. These checks produce single validation outcomes per dataset along with detailed metrics explaining the validation results.

Aggregate checks are essential for validating dataset-wide constraints and business rules that cannot be evaluated at the individual record level.

Source code in sparkdq/core/base_check.py

class BaseAggregateCheck(BaseCheck):
    """
    Abstract foundation for dataset-level data quality validation checks.

    Defines the interface for checks that evaluate global properties of entire datasets,
    such as row counts, statistical distributions, or cross-dataset relationships.
    These checks produce single validation outcomes per dataset along with detailed
    metrics explaining the validation results.

    Aggregate checks are essential for validating dataset-wide constraints and
    business rules that cannot be evaluated at the individual record level.
    """

    def _evaluate_logic(self, df: DataFrame) -> AggregateEvaluationResult:
        """
        Implement the core validation logic for this aggregate check.

        Contains the specific business logic and computational operations required
        to evaluate the dataset against this check's criteria. Implementations
        should perform necessary calculations and return both the validation
        outcome and supporting metrics.

        Subclasses must override this method unless they also inherit ObservableAggregateCheck,
        which provides a default implementation based on aggregations().

        Args:
            df (DataFrame): The dataset to evaluate against this check's criteria.

        Returns:
            AggregateEvaluationResult: Validation outcome including pass/fail status
                and detailed metrics explaining the evaluation results.
        """
        raise NotImplementedError(f"{type(self).__name__} must implement _evaluate_logic().")

    def evaluate(self, df: DataFrame) -> AggregateCheckResult:
        """
        Execute the check and produce a comprehensive result with full metadata.

        Serves as the primary interface for aggregate check execution, orchestrating
        the validation logic and enriching the results with complete metadata for
        reporting and audit purposes. This method ensures consistent result formatting
        across all aggregate check implementations.

        Args:
            df (DataFrame): The dataset to validate against this check's criteria.

        Returns:
            AggregateCheckResult: Complete validation result including check metadata,
                configuration parameters, and detailed evaluation outcomes.
        """
        result = self._evaluate_logic(df)
        return AggregateCheckResult(
            check=self.name,
            check_id=self.check_id,
            severity=self.severity,
            parameters=self._parameters(),
            result=result,
        )

evaluate ¶

evaluate(df: DataFrame) -> AggregateCheckResult

Execute the check and produce a comprehensive result with full metadata.

Serves as the primary interface for aggregate check execution, orchestrating the validation logic and enriching the results with complete metadata for reporting and audit purposes. This method ensures consistent result formatting across all aggregate check implementations.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The dataset to validate against this check's criteria.	required

Returns:

Name	Type	Description
`AggregateCheckResult`	`AggregateCheckResult`	Complete validation result including check metadata, configuration parameters, and detailed evaluation outcomes.

Source code in sparkdq/core/base_check.py

def evaluate(self, df: DataFrame) -> AggregateCheckResult:
    """
    Execute the check and produce a comprehensive result with full metadata.

    Serves as the primary interface for aggregate check execution, orchestrating
    the validation logic and enriching the results with complete metadata for
    reporting and audit purposes. This method ensures consistent result formatting
    across all aggregate check implementations.

    Args:
        df (DataFrame): The dataset to validate against this check's criteria.

    Returns:
        AggregateCheckResult: Complete validation result including check metadata,
            configuration parameters, and detailed evaluation outcomes.
    """
    result = self._evaluate_logic(df)
    return AggregateCheckResult(
        check=self.name,
        check_id=self.check_id,
        severity=self.severity,
        parameters=self._parameters(),
        result=result,
    )
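
The split between evaluate() and _evaluate_logic() can be illustrated without Spark. The following is a minimal sketch, not the sparkdq implementation: EvalResult, CheckResult, SketchAggregateCheck, and MinRowCountCheck are hypothetical stand-ins, and a plain list of dicts plays the role of the DataFrame.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Stand-in for a Spark DataFrame (assumption made so the sketch runs without Spark).
Rows = List[Dict[str, Any]]


@dataclass(frozen=True)
class EvalResult:  # hypothetical stand-in for AggregateEvaluationResult
    passed: bool
    metrics: Dict[str, Any]


@dataclass(frozen=True)
class CheckResult:  # hypothetical stand-in for AggregateCheckResult
    check: str
    check_id: str
    severity: str
    parameters: Dict[str, Any]
    result: EvalResult


class SketchAggregateCheck:
    """Mirrors the evaluate() / _evaluate_logic() split documented above."""

    def __init__(self, check_id: str, severity: str = "critical") -> None:
        self.check_id = check_id
        self.severity = severity

    @property
    def name(self) -> str:
        return type(self).__name__

    def _parameters(self) -> Dict[str, Any]:
        return {}

    def _evaluate_logic(self, rows: Rows) -> EvalResult:
        raise NotImplementedError(f"{type(self).__name__} must implement _evaluate_logic().")

    def evaluate(self, rows: Rows) -> CheckResult:
        # The subclass supplies the outcome; the base class attaches the metadata.
        result = self._evaluate_logic(rows)
        return CheckResult(self.name, self.check_id, self.severity, self._parameters(), result)


class MinRowCountCheck(SketchAggregateCheck):  # hypothetical example check
    def __init__(self, check_id: str, min_count: int) -> None:
        super().__init__(check_id)
        self.min_count = min_count

    def _parameters(self) -> Dict[str, Any]:
        return {"min_count": self.min_count}

    def _evaluate_logic(self, rows: Rows) -> EvalResult:
        actual = len(rows)
        return EvalResult(passed=actual >= self.min_count, metrics={"actual_row_count": actual})


outcome = MinRowCountCheck("min-rows", min_count=3).evaluate([{"id": 1}, {"id": 2}])
```

In this sketch the subclass only decides pass/fail and reports metrics; the check name, identifier, severity, and parameters are added in one place, which is what keeps result formatting uniform across checks.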

BaseRowCheck ¶

Bases: BaseCheck

Abstract foundation for record-level data quality validation checks.

Defines the interface for checks that evaluate individual records within a dataset, enabling fine-grained validation at the row level. These checks typically append boolean result columns to the input DataFrame, marking each record as valid or invalid based on the implemented validation logic.

Row-level checks are particularly effective for identifying specific problematic records, enabling downstream filtering, correction, or detailed error reporting on a per-record basis.

Source code in sparkdq/core/base_check.py

class BaseRowCheck(BaseCheck):
    """
    Abstract foundation for record-level data quality validation checks.

    Defines the interface for checks that evaluate individual records within a dataset,
    enabling fine-grained validation at the row level. These checks typically append
    boolean result columns to the input DataFrame, marking each record as valid or
    invalid based on the implemented validation logic.

    Row-level checks are particularly effective for identifying specific problematic
    records, enabling downstream filtering, correction, or detailed error reporting
    on a per-record basis.
    """

    @abstractmethod
    def validate(self, df: DataFrame) -> DataFrame:
        """
        Execute the validation logic against the provided dataset.

        Implementations must apply their specific validation rules to the input DataFrame
        and return an augmented version containing validation results. The returned
        DataFrame should preserve all original data while adding validation markers.

        Args:
            df (DataFrame): The dataset to validate against this check's criteria.

        Returns:
            DataFrame: Enhanced dataset containing original data plus validation results,
                typically as additional boolean columns indicating pass/fail status per record.
        """
        ...

    def with_check_result_column(self, df: DataFrame, condition: Column) -> DataFrame:
        """
        Augment the DataFrame with a validation result column.

        Appends a boolean column containing the validation outcome for each record,
        using the check's unique identifier as the column name. This standardized
        approach ensures consistent result formatting across all row-level checks.

        Args:
            df (DataFrame): The dataset to augment with validation results.
            condition (Column): Boolean Spark expression evaluating to True for failed
                records and False for valid records.

        Returns:
            DataFrame: Original dataset enhanced with a named boolean column indicating
                validation status for each record.
        """
        return df.withColumn(self.check_id, condition)

validate `abstractmethod` ¶

validate(df: DataFrame) -> DataFrame

Execute the validation logic against the provided dataset.

Implementations must apply their specific validation rules to the input DataFrame and return an augmented version containing validation results. The returned DataFrame should preserve all original data while adding validation markers.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The dataset to validate against this check's criteria.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Enhanced dataset containing original data plus validation results, typically as additional boolean columns indicating pass/fail status per record.

Source code in sparkdq/core/base_check.py

@abstractmethod
def validate(self, df: DataFrame) -> DataFrame:
    """
    Execute the validation logic against the provided dataset.

    Implementations must apply their specific validation rules to the input DataFrame
    and return an augmented version containing validation results. The returned
    DataFrame should preserve all original data while adding validation markers.

    Args:
        df (DataFrame): The dataset to validate against this check's criteria.

    Returns:
        DataFrame: Enhanced dataset containing original data plus validation results,
            typically as additional boolean columns indicating pass/fail status per record.
    """
    ...

with_check_result_column ¶

with_check_result_column(
    df: DataFrame, condition: Column
) -> DataFrame

Augment the DataFrame with a validation result column.

Appends a boolean column containing the validation outcome for each record, using the check's unique identifier as the column name. This standardized approach ensures consistent result formatting across all row-level checks.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The dataset to augment with validation results.	required
`condition`	`Column`	Boolean Spark expression evaluating to True for failed records and False for valid records.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	Original dataset enhanced with a named boolean column indicating validation status for each record.

Source code in sparkdq/core/base_check.py

def with_check_result_column(self, df: DataFrame, condition: Column) -> DataFrame:
    """
    Augment the DataFrame with a validation result column.

    Appends a boolean column containing the validation outcome for each record,
    using the check's unique identifier as the column name. This standardized
    approach ensures consistent result formatting across all row-level checks.

    Args:
        df (DataFrame): The dataset to augment with validation results.
        condition (Column): Boolean Spark expression evaluating to True for failed
            records and False for valid records.

    Returns:
        DataFrame: Original dataset enhanced with a named boolean column indicating
            validation status for each record.
    """
    return df.withColumn(self.check_id, condition)
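
The result-column convention (one boolean column per check, named after the check identifier, True for failed records) can be sketched in plain Python. This is an illustration only: the list of dicts stands in for a DataFrame, the callable stands in for a Spark Column expression, and the free function below is a hypothetical analogue of the method, not part of sparkdq.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rows = List[Row]  # stand-in for a Spark DataFrame (assumption)


def with_check_result_column(rows: Rows, check_id: str, condition: Callable[[Row], bool]) -> Rows:
    """Return copies of the rows with an extra boolean key named after the check.

    Follows the documented convention: True marks a FAILED record.
    """
    return [{**row, check_id: bool(condition(row))} for row in rows]


rows = [{"email": "a@example.com"}, {"email": None}]

# A hypothetical not-null check: the condition is True where the value is missing.
flagged = with_check_result_column(rows, "email-not-null", lambda r: r["email"] is None)

# Downstream code can filter on the marker to isolate the problematic records.
failed = [r for r in flagged if r["email-not-null"]]
```

Because the marker uses the failure polarity, a downstream filter on the check's column selects the invalid records directly, and the original data is preserved alongside it.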

IntegrityCheckMixin ¶

Mixin enabling cross-dataset integrity validation capabilities.

Provides infrastructure for checks that require validation against external reference datasets, such as foreign key constraints, referential integrity, or cross-dataset consistency validations. The mixin handles reference dataset lifecycle management, including injection, storage, and retrieval operations.

This design enables complex validation scenarios while maintaining clean separation between the validation logic and reference data management, supporting both simple lookups and complex multi-dataset relationships.

Attributes:

Name	Type	Description
`_reference_datasets`	`ReferenceDatasetDict`	Internal registry mapping reference dataset names to their corresponding Spark DataFrames.

Source code in sparkdq/core/base_check.py

class IntegrityCheckMixin:
    """
    Mixin enabling cross-dataset integrity validation capabilities.

    Provides infrastructure for checks that require validation against external
    reference datasets, such as foreign key constraints, referential integrity,
    or cross-dataset consistency validations. The mixin handles reference dataset
    lifecycle management, including injection, storage, and retrieval operations.

    This design enables complex validation scenarios while maintaining clean
    separation between the validation logic and reference data management,
    supporting both simple lookups and complex multi-dataset relationships.

    Attributes:
        _reference_datasets (ReferenceDatasetDict): Internal registry mapping
            reference dataset names to their corresponding Spark DataFrames.
    """

    _reference_datasets: ReferenceDatasetDict = {}

    def inject_reference_datasets(self, datasets: ReferenceDatasetDict) -> None:
        """
        Register reference datasets for use during validation execution.

        Establishes the reference dataset registry that will be available during
        check execution. This method is typically invoked by the validation engine
        as part of the check preparation phase, ensuring all required reference
        data is accessible when validation logic executes.

        Args:
            datasets (ReferenceDatasetDict): Registry mapping reference dataset
                identifiers to their corresponding Spark DataFrames.
        """
        self._reference_datasets = datasets

    def get_reference_df(self, name: str) -> DataFrame:
        """
        Retrieve a reference dataset by its registered identifier.

        Provides access to previously injected reference datasets during check
        execution. This method ensures type-safe access to reference data while
        providing clear error handling for missing or misconfigured references.

        Args:
            name (str): The identifier of the reference dataset to retrieve.

        Returns:
            DataFrame: The Spark DataFrame associated with the specified identifier.

        Raises:
            MissingReferenceDatasetError: When the requested reference dataset
                identifier is not found in the current registry.
        """
        try:
            return self._reference_datasets[name]
        except KeyError:
            raise MissingReferenceDatasetError(name)

inject_reference_datasets ¶

inject_reference_datasets(
    datasets: ReferenceDatasetDict,
) -> None

Register reference datasets for use during validation execution.

Establishes the reference dataset registry that will be available during check execution. This method is typically invoked by the validation engine as part of the check preparation phase, ensuring all required reference data is accessible when validation logic executes.

Parameters:

Name	Type	Description	Default
`datasets`	`ReferenceDatasetDict`	Registry mapping reference dataset identifiers to their corresponding Spark DataFrames.	required

Source code in sparkdq/core/base_check.py

def inject_reference_datasets(self, datasets: ReferenceDatasetDict) -> None:
    """
    Register reference datasets for use during validation execution.

    Establishes the reference dataset registry that will be available during
    check execution. This method is typically invoked by the validation engine
    as part of the check preparation phase, ensuring all required reference
    data is accessible when validation logic executes.

    Args:
        datasets (ReferenceDatasetDict): Registry mapping reference dataset
            identifiers to their corresponding Spark DataFrames.
    """
    self._reference_datasets = datasets

get_reference_df ¶

get_reference_df(name: str) -> DataFrame

Retrieve a reference dataset by its registered identifier.

Provides access to previously injected reference datasets during check execution. This method ensures type-safe access to reference data while providing clear error handling for missing or misconfigured references.

Parameters:

Name	Type	Description	Default
`name`	`str`	The identifier of the reference dataset to retrieve.	required

Returns:

Name	Type	Description
`DataFrame`	`DataFrame`	The Spark DataFrame associated with the specified identifier.

Raises:

Type	Description
`MissingReferenceDatasetError`	When the requested reference dataset identifier is not found in the current registry.

Source code in sparkdq/core/base_check.py

def get_reference_df(self, name: str) -> DataFrame:
    """
    Retrieve a reference dataset by its registered identifier.

    Provides access to previously injected reference datasets during check
    execution. This method ensures type-safe access to reference data while
    providing clear error handling for missing or misconfigured references.

    Args:
        name (str): The identifier of the reference dataset to retrieve.

    Returns:
        DataFrame: The Spark DataFrame associated with the specified identifier.

    Raises:
        MissingReferenceDatasetError: When the requested reference dataset
            identifier is not found in the current registry.
    """
    try:
        return self._reference_datasets[name]
    except KeyError:
        raise MissingReferenceDatasetError(name)
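
The inject-then-retrieve lifecycle can be sketched as follows. This is a simplified mirror of the listing above that runs without Spark: ReferenceMixinSketch and the locally defined MissingReferenceDatasetError are stand-ins, and plain lists take the place of DataFrames.

```python
from typing import Any, Dict


class MissingReferenceDatasetError(Exception):
    """Stand-in for sparkdq's exception of the same name (its real base class is not shown here)."""


class ReferenceMixinSketch:
    # Class-level default, as in the listing above.
    _reference_datasets: Dict[str, Any] = {}

    def inject_reference_datasets(self, datasets: Dict[str, Any]) -> None:
        # Rebinds the attribute on the instance; the class-level dict is not mutated.
        self._reference_datasets = datasets

    def get_reference_df(self, name: str) -> Any:
        try:
            return self._reference_datasets[name]
        except KeyError:
            raise MissingReferenceDatasetError(name) from None


check = ReferenceMixinSketch()
check.inject_reference_datasets({"customers": [{"customer_id": 1}]})
customers = check.get_reference_df("customers")

try:
    check.get_reference_df("orders")
    missing_name = None
except MissingReferenceDatasetError as exc:
    missing_name = exc.args[0]

# A second instance that never received an injection still sees the empty default.
untouched = ReferenceMixinSketch()._reference_datasets
```

Since injection assigns a new dict to the instance rather than updating the shared class attribute, one check's reference datasets do not leak into another check in this sketch.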

BaseAggregateCheckConfig ¶

Bases: ABC, BaseCheckConfig

Abstract configuration foundation for dataset-level data quality checks.

Specializes the base configuration interface for checks that evaluate global properties of entire datasets. This class enforces proper association with aggregate-level check implementations and provides appropriate validation and instantiation capabilities.

Aggregate-level configurations typically include parameters such as statistical thresholds, count boundaries, and dataset-wide business rules.

Raises:

Type	Description
`TypeError`	When the subclass fails to define a valid check_class or associates with a non-aggregate-level check implementation.

Source code in sparkdq/core/base_config.py

class BaseAggregateCheckConfig(ABC, BaseCheckConfig):
    """
    Abstract configuration foundation for dataset-level data quality checks.

    Specializes the base configuration interface for checks that evaluate global
    properties of entire datasets. This class enforces proper association with
    aggregate-level check implementations and provides appropriate validation
    and instantiation capabilities.

    Aggregate-level configurations typically include parameters such as statistical
    thresholds, count boundaries, and dataset-wide business rules.

    Raises:
        TypeError: When the subclass fails to define a valid check_class or associates
            with a non-aggregate-level check implementation.
    """

    def __init_subclass__(cls) -> None:
        """
        Enforce proper check class association during subclass definition.

        Validates that the configuration subclass correctly associates with an
        aggregate-level check implementation, preventing runtime errors and
        ensuring type safety across the configuration system.
        """
        super().__init_subclass__()
        if not hasattr(cls, "check_class") or cls.check_class is None:
            raise TypeError(f"{cls.__name__} must define a 'check_class'.")
        if not issubclass(cls.check_class, BaseAggregateCheck):
            raise TypeError(f"{cls.__name__}.check_class must be a subclass of BaseAggregateCheck.")

__init_subclass__ ¶

__init_subclass__() -> None

Enforce proper check class association during subclass definition.

Validates that the configuration subclass correctly associates with an aggregate-level check implementation, preventing runtime errors and ensuring type safety across the configuration system.

Source code in sparkdq/core/base_config.py

def __init_subclass__(cls) -> None:
    """
    Enforce proper check class association during subclass definition.

    Validates that the configuration subclass correctly associates with an
    aggregate-level check implementation, preventing runtime errors and
    ensuring type safety across the configuration system.
    """
    super().__init_subclass__()
    if not hasattr(cls, "check_class") or cls.check_class is None:
        raise TypeError(f"{cls.__name__} must define a 'check_class'.")
    if not issubclass(cls.check_class, BaseAggregateCheck):
        raise TypeError(f"{cls.__name__}.check_class must be a subclass of BaseAggregateCheck.")
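
The effect of the __init_subclass__ hook is that a misconfigured configuration class fails when it is defined, not when it is first used. The sketch below reproduces the two guards with hypothetical stand-in classes (AggregateCheckSketch, RowCheckSketch, AggregateConfigSketch); it does not import sparkdq.

```python
from typing import List


class AggregateCheckSketch:  # stand-in for BaseAggregateCheck
    pass


class RowCheckSketch:  # stand-in for BaseRowCheck
    pass


class AggregateConfigSketch:  # stand-in for BaseAggregateCheckConfig
    def __init_subclass__(cls) -> None:
        super().__init_subclass__()
        if not hasattr(cls, "check_class") or cls.check_class is None:
            raise TypeError(f"{cls.__name__} must define a 'check_class'.")
        if not issubclass(cls.check_class, AggregateCheckSketch):
            raise TypeError(f"{cls.__name__}.check_class must be a subclass of AggregateCheckSketch.")


class RowCountCheck(AggregateCheckSketch):  # hypothetical aggregate check
    pass


class RowCountConfig(AggregateConfigSketch):  # valid: defined without error
    check_class = RowCountCheck


errors: List[str] = []

try:
    class MissingCheckClass(AggregateConfigSketch):  # no check_class at all
        pass
except TypeError as exc:
    errors.append(str(exc))

try:
    class WrongLevel(AggregateConfigSketch):  # points at a row-level check
        check_class = RowCheckSketch
except TypeError as exc:
    errors.append(str(exc))
```

Both invalid definitions raise inside the class statement itself, so the mistake surfaces at import time of the module that declares the configuration.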

BaseRowCheckConfig ¶

Bases: ABC, BaseCheckConfig

Abstract configuration foundation for record-level data quality checks.

Specializes the base configuration interface for checks that operate on individual records within a dataset. This class enforces that subclasses are properly associated with row-level check implementations and provides the appropriate validation and instantiation logic.

Row-level configurations typically include parameters such as column specifications, validation thresholds, and record-specific business rules.

Raises:

Type	Description
`TypeError`	When the subclass fails to define a valid check_class or associates with a non-row-level check implementation.

Source code in sparkdq/core/base_config.py

class BaseRowCheckConfig(ABC, BaseCheckConfig):
    """
    Abstract configuration foundation for record-level data quality checks.

    Specializes the base configuration interface for checks that operate on individual
    records within a dataset. This class enforces that subclasses are properly
    associated with row-level check implementations and provides the appropriate
    validation and instantiation logic.

    Row-level configurations typically include parameters such as column specifications,
    validation thresholds, and record-specific business rules.

    Raises:
        TypeError: When the subclass fails to define a valid check_class or associates
            with a non-row-level check implementation.
    """

    def __init_subclass__(cls) -> None:
        """
        Enforce proper check class association during subclass definition.

        Validates that the configuration subclass correctly associates with a
        row-level check implementation, preventing runtime errors and ensuring
        type safety across the configuration system.
        """
        super().__init_subclass__()
        if not hasattr(cls, "check_class") or cls.check_class is None:
            raise TypeError(f"{cls.__name__} must define a 'check_class'.")
        if not issubclass(cls.check_class, BaseRowCheck):
            raise TypeError(f"{cls.__name__}.check_class must be a subclass of BaseRowCheck.")

__init_subclass__ ¶

__init_subclass__() -> None

Enforce proper check class association during subclass definition.

Validates that the configuration subclass correctly associates with a row-level check implementation, preventing runtime errors and ensuring type safety across the configuration system.

Source code in sparkdq/core/base_config.py

def __init_subclass__(cls) -> None:
    """
    Enforce proper check class association during subclass definition.

    Validates that the configuration subclass correctly associates with a
    row-level check implementation, preventing runtime errors and ensuring
    type safety across the configuration system.
    """
    super().__init_subclass__()
    if not hasattr(cls, "check_class") or cls.check_class is None:
        raise TypeError(f"{cls.__name__} must define a 'check_class'.")
    if not issubclass(cls.check_class, BaseRowCheck):
        raise TypeError(f"{cls.__name__}.check_class must be a subclass of BaseRowCheck.")

AggregateEvaluationResult `dataclass` ¶

Immutable container for aggregate-level data quality check evaluation outcomes.

Encapsulates both the binary validation result and the supporting metrics that explain the evaluation process. The metrics provide essential context for understanding validation outcomes, enabling detailed debugging and comprehensive reporting of data quality assessment results.

The separation of validation outcome from supporting metrics enables flexible consumption patterns while ensuring complete traceability of the evaluation process and its underlying calculations.

Attributes:

Name	Type	Description
`passed`	`bool`	Binary indicator of whether the validation criteria were satisfied.
`metrics`	`Dict[str, Any]`	Comprehensive diagnostic information including computed values, thresholds, and contextual data that explain the validation outcome.

Source code in sparkdq/core/check_results.py

@dataclass(frozen=True)
class AggregateEvaluationResult:
    """
    Immutable container for aggregate-level data quality check evaluation outcomes.

    Encapsulates both the binary validation result and the supporting metrics that
    explain the evaluation process. The metrics provide essential context for
    understanding validation outcomes, enabling detailed debugging and comprehensive
    reporting of data quality assessment results.

    The separation of validation outcome from supporting metrics enables flexible
    consumption patterns while ensuring complete traceability of the evaluation
    process and its underlying calculations.

    Attributes:
        passed (bool): Binary indicator of whether the validation criteria were satisfied.
        metrics (Dict[str, Any]): Comprehensive diagnostic information including
            computed values, thresholds, and contextual data that explain the
            validation outcome.
    """

    passed: bool
    metrics: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]:
        """
        Convert the evaluation result to a serializable dictionary format.

        Produces a clean dictionary representation suitable for JSON serialization,
        logging, or integration with external reporting systems.

        Returns:
            Dict[str, Any]: Complete evaluation result in dictionary format,
                preserving all validation outcomes and diagnostic metrics.
        """
        return asdict(self)

to_dict ¶

to_dict() -> Dict[str, Any]

Convert the evaluation result to a serializable dictionary format.

Produces a clean dictionary representation suitable for JSON serialization, logging, or integration with external reporting systems.

Returns:

Type	Description
`Dict[str, Any]`	Complete evaluation result in dictionary format, preserving all validation outcomes and diagnostic metrics.

Source code in sparkdq/core/check_results.py

def to_dict(self) -> Dict[str, Any]:
    """
    Convert the evaluation result to a serializable dictionary format.

    Produces a clean dictionary representation suitable for JSON serialization,
    logging, or integration with external reporting systems.

    Returns:
        Dict[str, Any]: Complete evaluation result in dictionary format,
            preserving all validation outcomes and diagnostic metrics.
    """
    return asdict(self)
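
The dataclass above depends only on the standard library, so its behavior can be tried in isolation. The sketch below re-declares it locally (docstrings omitted) instead of importing sparkdq, and uses made-up metric names.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AggregateEvaluationResult:  # local copy of the listing above
    passed: bool
    metrics: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


# Illustrative values; "row_count" and "threshold" are hypothetical metric names.
result = AggregateEvaluationResult(passed=False, metrics={"row_count": 0, "threshold": 1})

# frozen=True rejects attribute reassignment.
try:
    result.passed = True
    reassigned = True
except FrozenInstanceError:
    reassigned = False

# asdict() builds new containers, so the payload is independent of the instance.
payload = result.to_dict()
as_json = json.dumps(payload, sort_keys=True)
```

Note that frozen=True only blocks rebinding the fields; the metrics dict held by the instance remains a regular mutable dict, so JSON serialization succeeds only if the metric values themselves are JSON-compatible.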

ObservableAggregateCheck ¶

Bases: BaseAggregateCheck

Abstract base class for aggregate checks that participate in single-pass agg() execution.

When a BatchCheckRunner detects checks implementing this ABC, it collects all their aggregation expressions and executes them in a single df.agg() call. Results are then distributed back to each check via _evaluate_from_agg_results().

Subclasses must implement:

- aggregations(): return a dict of {metric_name: Column expression}.
- _evaluate_from_agg_results(): compute pass/fail from the resolved values.

The metric_name keys in aggregations() are scoped per check instance by the runner, so duplicate names across different checks do not collide.

Example::

class MyCheck(ObservableAggregateCheck):
    def aggregations(self) -> dict[str, Column]:
        return {"total": F.count("*")}

    def _evaluate_from_agg_results(self, results: dict[str, Any]) -> AggregateEvaluationResult:
        total = results["total"]
        return AggregateEvaluationResult(passed=total > 0, metrics={"total": total})

Source code in sparkdq/core/observable_check.py

class ObservableAggregateCheck(BaseAggregateCheck):
    """
    Abstract base class for aggregate checks that participate in single-pass agg() execution.

    When a BatchCheckRunner detects checks implementing this ABC, it collects all
    their aggregation expressions and executes them in a single df.agg() call.
    Results are then distributed back to each check via _evaluate_from_agg_results().

    Subclasses must implement:
    - aggregations(): return a dict of {metric_name: Column expression}.
    - _evaluate_from_agg_results(): compute pass/fail from the resolved values.

    The metric_name keys in aggregations() are scoped per check instance by the runner,
    so duplicate names across different checks do not collide.

    Example::

        class MyCheck(ObservableAggregateCheck):
            def aggregations(self) -> dict[str, Column]:
                return {"total": F.count("*")}

            def _evaluate_from_agg_results(self, results: dict[str, Any]) -> AggregateEvaluationResult:
                total = results["total"]
                return AggregateEvaluationResult(passed=total > 0, metrics={"total": total})
    """

    @abstractmethod
    def aggregations(self) -> dict[str, Column]:
        """
        Declare the Spark aggregation expressions needed by this check.

        Keys must be unique within a single check instance. The runner scopes
        them by check index to avoid collisions across checks.

        Returns:
            dict[str, Column]: Mapping of metric name to Spark Column expression.
        """
        ...

    @abstractmethod
    def _evaluate_from_agg_results(self, results: dict[str, Any]) -> AggregateEvaluationResult:
        """
        Evaluate the check result from pre-computed aggregation values.

        Args:
            results (dict[str, Any]): Resolved metric values keyed by the names
                returned from aggregations(). Types depend on the Spark aggregation
                (e.g. int for count, float for avg, datetime for max timestamp).

        Returns:
            AggregateEvaluationResult: Check outcome with pass/fail status and metrics.
        """
        ...

    def _evaluate_logic(self, df: DataFrame) -> AggregateEvaluationResult:
        """
        Default implementation that runs aggregations() in a single df.agg() call.

        Subclasses do not need to override this method unless they require custom
        pre-processing before aggregation (e.g. column existence validation).
        The BatchCheckRunner bypasses _evaluate_logic() entirely for observable checks
        and calls _evaluate_from_agg_results() directly with batched results.

        Args:
            df (DataFrame): The dataset to evaluate.

        Returns:
            AggregateEvaluationResult: Check outcome derived from the aggregation results.
        """
        row = df.agg(*[expr.alias(k) for k, expr in self.aggregations().items()]).first()
        return self._evaluate_from_agg_results(row.asDict())  # type: ignore[union-attr]

aggregations `abstractmethod` ¶

aggregations() -> dict[str, Column]

Declare the Spark aggregation expressions needed by this check.

Keys must be unique within a single check instance. The runner scopes them by check index to avoid collisions across checks.

Returns:

Type	Description
`dict[str, Column]`	Mapping of metric name to Spark Column expression.

Source code in sparkdq/core/observable_check.py

@abstractmethod
def aggregations(self) -> dict[str, Column]:
    """
    Declare the Spark aggregation expressions needed by this check.

    Keys must be unique within a single check instance. The runner scopes
    them by check index to avoid collisions across checks.

    Returns:
        dict[str, Column]: Mapping of metric name to Spark Column expression.
    """
    ...
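
The collect, scope, and distribute flow described for observable checks can be sketched in plain Python. Everything here is an assumption made for illustration: the fold pairs stand in for Spark aggregate Column expressions, run_batched is a hypothetical analogue of the runner, and the "c{index}__" key prefix is an invented scoping scheme, since the actual format used by BatchCheckRunner is not shown in this page.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# Stand-in for a Spark aggregate Column: (initial value, step function).
Agg = Tuple[Any, Callable[[Any, Row], Any]]
Outcome = Tuple[bool, Dict[str, Any]]  # (passed, metrics)


class RowCountCheck:  # hypothetical check
    def aggregations(self) -> Dict[str, Agg]:
        return {"total": (0, lambda acc, row: acc + 1)}

    def evaluate_from_agg_results(self, results: Dict[str, Any]) -> Outcome:
        return results["total"] > 0, {"total": results["total"]}


class NullCountCheck:  # hypothetical check that reuses the metric name "total"
    def __init__(self, column: str) -> None:
        self.column = column

    def aggregations(self) -> Dict[str, Agg]:
        return {"total": (0, lambda acc, row: acc + (row[self.column] is None))}

    def evaluate_from_agg_results(self, results: Dict[str, Any]) -> Outcome:
        return results["total"] == 0, {"total": results["total"]}


def run_batched(checks: List[Any], rows: List[Row]) -> List[Outcome]:
    # 1. Collect every aggregation, scoping keys by check index to avoid collisions.
    scoped = {
        f"c{i}__{name}": agg
        for i, check in enumerate(checks)
        for name, agg in check.aggregations().items()
    }
    # 2. Compute all of them in a single pass over the data.
    state = {key: init for key, (init, _) in scoped.items()}
    for row in rows:
        for key, (_, step) in scoped.items():
            state[key] = step(state[key], row)
    # 3. Hand each check back its own metrics under the unscoped names.
    outcomes = []
    for i, check in enumerate(checks):
        prefix = f"c{i}__"
        own = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
        outcomes.append(check.evaluate_from_agg_results(own))
    return outcomes


rows = [{"email": "a@example.com"}, {"email": None}, {"email": None}]
outcomes = run_batched([RowCountCheck(), NullCountCheck("email")], rows)
```

Both checks declare a metric called "total", yet each receives only its own value, and the data is traversed once regardless of how many checks participate.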

Severity ¶

Bases: Enum

Enumeration of data quality check severity classifications.

Provides standardized severity levels that determine how validation failures are handled within data processing pipelines. The severity classification enables sophisticated failure handling strategies, allowing teams to implement nuanced responses to different types of data quality issues.

The design supports both blocking and non-blocking failure modes, enabling flexible pipeline behavior that can adapt to different operational requirements and business contexts.

Levels

- CRITICAL: Validation failures that should block pipeline execution and require immediate attention before processing can continue.
- WARNING: Validation failures that indicate potential issues but should not prevent pipeline execution, typically triggering monitoring alerts or logging for later investigation.

Source code in sparkdq/core/severity.py

class Severity(Enum):
    """
    Enumeration of data quality check severity classifications.

    Provides standardized severity levels that determine how validation failures
    are handled within data processing pipelines. The severity classification
    enables sophisticated failure handling strategies, allowing teams to implement
    nuanced responses to different types of data quality issues.

    The design supports both blocking and non-blocking failure modes, enabling
    flexible pipeline behavior that can adapt to different operational requirements
    and business contexts.

    Levels:
        CRITICAL: Validation failures that should block pipeline execution and
            require immediate attention before processing can continue.
        WARNING: Validation failures that indicate potential issues but should
            not prevent pipeline execution, typically triggering monitoring
            alerts or logging for later investigation.
    """

    CRITICAL = "critical"
    WARNING = "warning"

    def __str__(self) -> str:
        """
        Provide the canonical string representation of this severity level.

        Returns:
            str: Lowercase severity identifier suitable for serialization,
                logging, and external system integration.
        """
        return self.value

__str__ ¶

__str__() -> str

Provide the canonical string representation of this severity level.

Returns:

Name	Type	Description
`str`	`str`	Lowercase severity identifier suitable for serialization, logging, and external system integration.

Source code in sparkdq/core/severity.py

def __str__(self) -> str:
    """
    Provide the canonical string representation of this severity level.

    Returns:
        str: Lowercase severity identifier suitable for serialization,
            logging, and external system integration.
    """
    return self.value
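
The enum above uses only the standard library, so a local copy is enough to show how the string form and the two levels might be consumed. The failure list and the blocking rule below are hypothetical; how the engine itself reacts to each severity is not shown in this page.

```python
from enum import Enum
from typing import List, Tuple


class Severity(Enum):  # local copy of the listing above
    CRITICAL = "critical"
    WARNING = "warning"

    def __str__(self) -> str:
        return self.value


# Hypothetical failed checks, as (check name, severity) pairs.
failures: List[Tuple[str, Severity]] = [
    ("row-count", Severity.CRITICAL),
    ("null-ratio", Severity.WARNING),
]

# Only CRITICAL failures are treated as blocking in this sketch.
blocking = [name for name, sev in failures if sev is Severity.CRITICAL]

# str() yields the lowercase value, convenient for logs and serialization.
log_lines = [f"[{str(sev)}] {name} failed" for name, sev in failures]

# The same lowercase value round-trips back to the enum member.
parsed = Severity("warning")
```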
