sparkdq.engine¶
The sparkdq.engine subpackage contains the execution logic for data quality validation.
engine ¶
BatchDQEngine ¶
Bases: BaseDQEngine
Engine for executing data quality checks on Spark DataFrames in batch mode.
This engine applies both row-level and aggregate-level checks using the
BatchCheckRunner, and annotates the DataFrame with error metadata.
Source code in sparkdq/engine/batch/dq_engine.py
run_batch ¶
run_batch(
df: DataFrame,
reference_datasets: Optional[
ReferenceDatasetDict
] = None,
) -> BatchValidationResult
Run all registered checks against the given DataFrame.
This method applies both row-level and aggregate-level checks and returns a validation result containing the annotated DataFrame and the aggregated check results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
df
|
DataFrame
|
The input Spark DataFrame to validate. |
required |
reference_datasets
|
ReferenceDatasetDict
|
A dictionary of named reference DataFrames used by integrity checks.
Required for checks that compare values against external datasets
(e.g., foreign key validation). Each key should match the
|
None
|
Returns:
| Name | Type | Description |
|---|---|---|
BatchValidationResult |
BatchValidationResult
|
Object containing the validated DataFrame, |
BatchValidationResult
|
aggregate check results, and the original input schema. |
Source code in sparkdq/engine/batch/dq_engine.py
BatchValidationResult
dataclass
¶
Encapsulates the results of a batch data quality validation run.
Includes: - The validated Spark DataFrame, annotated with validation metadata. - A list of results from aggregate-level checks. - The original input column names, used to restore the pre-validation structure.
This class provides convenience methods to access only the passing or failing rows, making it easier to route or analyze validated data downstream.
Attributes:
| Name | Type | Description |
|---|---|---|
df |
DataFrame
|
The DataFrame after validation, including:
|
aggregate_results |
List[AggregateCheckResult]
|
Results of all aggregate checks. |
input_columns |
List[str]
|
Names of the original input columns. |
timestamp |
datetime
|
Timestamp of when the validation result was created. |
Source code in sparkdq/engine/batch/validation_result.py
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 | |
pass_df ¶
Return only the rows that passed all critical checks.
This method filters for rows where _dq_passed is true and restores
the original column structure from the input DataFrame.
Returns:
| Name | Type | Description |
|---|---|---|
DataFrame |
DataFrame
|
DataFrame containing only valid rows, with original schema. |
Source code in sparkdq/engine/batch/validation_result.py
fail_df ¶
Return only the rows that failed one or more critical checks.
This includes validation metadata such as _dq_errors, _dq_passed, and,
if present, _dq_aggregate_errors. Additionally, a _dq_validation_ts column
is added for downstream auditing or tracking.
Returns:
| Name | Type | Description |
|---|---|---|
DataFrame |
DataFrame
|
DataFrame containing invalid rows and relevant error metadata. |
Source code in sparkdq/engine/batch/validation_result.py
warn_df ¶
Returns rows that passed all critical checks but contain warning-level violations.
These are rows where _dq_passed is True, but the _dq_errors array contains
at least one entry with severity == WARNING.
Returns:
| Name | Type | Description |
|---|---|---|
DataFrame |
DataFrame
|
Filtered DataFrame of rows with warnings. |
Source code in sparkdq/engine/batch/validation_result.py
summary ¶
Create a summary of the validation results.
This includes total record count, pass/fail statistics, and number of rows with warning-level errors.
Returns:
| Name | Type | Description |
|---|---|---|
ValidationSummary |
ValidationSummary
|
Structured summary of the validation outcome. |