sparkdq.engine
- class BatchDQEngine(check_set: CheckSet | None = None, fail_levels: List[Severity] = [Severity.CRITICAL])
Bases: BaseDQEngine
Engine for executing data quality checks on Spark DataFrames in batch mode.
This engine applies both row-level and aggregate-level checks using the BatchCheckRunner, and annotates the DataFrame with error metadata.
- run_batch(df: DataFrame) → BatchValidationResult
Run all registered checks against the given DataFrame.
This method applies both row-level and aggregate-level checks and returns a validation result containing the annotated DataFrame and the aggregated check results.
- Parameters:
df (DataFrame) – The input Spark DataFrame to validate.
- Returns:
Object containing the validated DataFrame, aggregate check results, and the original input schema.
- Return type:
BatchValidationResult
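A minimal usage sketch; the sample DataFrame is illustrative, and check_set is left at its None default (in practice you would pass a populated CheckSet so that run_batch has checks to execute):

    from pyspark.sql import SparkSession
    from sparkdq.engine import BatchDQEngine  # module path per this page

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])

    # check_set=None is the documented default; fail_levels defaults to
    # [Severity.CRITICAL], so only critical checks decide pass/fail.
    engine = BatchDQEngine()
    result = engine.run_batch(df)  # -> BatchValidationResult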
- class BatchValidationResult(df: DataFrame, aggregate_results: List[AggregateCheckResult], input_columns: List[str], timestamp: datetime = <factory>)
Bases: object
Encapsulates the results of a batch data quality validation run.
Includes:
- The validated Spark DataFrame, annotated with validation metadata.
- A list of results from aggregate-level checks.
- The original input column names, used to restore the pre-validation structure.
This class provides convenience methods to access only the passing or failing rows, making it easier to route or analyze validated data downstream.
- df
The DataFrame after validation, including:
_dq_passed (bool): Row-level pass/fail status.
_dq_errors (array): Structured errors from failed checks.
_dq_aggregate_errors (array, optional): Errors from failed aggregates.
- Type:
DataFrame
- aggregate_results
Results of all aggregate checks.
- Type:
List[AggregateCheckResult]
- input_columns
Names of the original input columns.
- Type:
List[str]
- timestamp
Timestamp of when the validation result was created.
- Type:
datetime
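The attributes above can be inspected directly; a short sketch, continuing from the run_batch example and assuming result is a BatchValidationResult:

    # Row-level verdicts and structured errors, as described above.
    result.df.select("_dq_passed", "_dq_errors").show(truncate=False)

    # Aggregate-level outcomes plus the bookkeeping attributes.
    for agg_result in result.aggregate_results:
        print(agg_result)
    print(result.input_columns, result.timestamp)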
- fail_df() → DataFrame
Return only the rows that failed one or more critical checks.
This includes validation metadata such as _dq_errors, _dq_passed, and, if present, _dq_aggregate_errors. Additionally, a _dq_validation_ts column is added for downstream auditing or tracking.
- Returns:
DataFrame containing invalid rows and relevant error metadata.
- Return type:
DataFrame
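A common pattern is to route the failing rows, together with their error metadata, to a quarantine sink; a sketch in which the output path is a placeholder:

    failed = result.fail_df()

    # Persist invalid rows along with _dq_errors and _dq_validation_ts so
    # they can be audited later; the path below is a placeholder.
    failed.write.mode("append").parquet("/quarantine/my_dataset")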
- pass_df() → DataFrame
Return only the rows that passed all critical checks.
This method filters for rows where _dq_passed is True and restores the original column structure from the input DataFrame.
- Returns:
DataFrame containing only valid rows, with original schema.
- Return type:
DataFrame
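Because pass_df() restores the input schema, valid rows can be handed downstream without exposing any _dq_* metadata; a sketch with a placeholder path, relying on the documented contract that the original column structure is restored:

    valid = result.pass_df()

    # Per the description above, the restored columns should match the
    # captured input_columns.
    assert valid.columns == result.input_columns
    valid.write.mode("append").parquet("/clean/my_dataset")  # placeholder path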
- summary() → ValidationSummary
Create a summary of the validation results.
This includes total record count, pass/fail statistics, and number of rows with warning-level errors.
- Returns:
Structured summary of the validation outcome.
- Return type:
ValidationSummary
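ValidationSummary's exact fields are documented on its own reference page; as a sketch, printing the object surfaces the figures described above:

    summary = result.summary()
    # Expected to cover total record count, pass/fail statistics, and the
    # number of rows with warning-level errors, per the description above.
    print(summary)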
- warn_df() → DataFrame
Return rows that passed all critical checks but contain warning-level violations.
These are rows where _dq_passed is True, but the _dq_errors array contains at least one entry with severity == WARNING.
- Returns:
Filtered DataFrame of rows with warnings.
- Return type:
DataFrame
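Warning rows are typically kept in the main flow but surfaced for review; a brief sketch:

    warned = result.warn_df()

    # These rows passed every critical check but carry at least one
    # WARNING-severity entry in _dq_errors; count them for monitoring.
    print(warned.count())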