Aggregate Checks

Columns Are Complete

Validates that every column in a specified set is fully populated. If a null is detected in any of those columns, the entire DataFrame is marked as invalid.
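The all-or-nothing behavior described above can be illustrated with a minimal plain-Python sketch. It is not the library's API: the function name `columns_are_complete` is hypothetical, and the DataFrame is modeled as a list of dicts with `None` standing in for null.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def columns_are_complete(rows: List[Row], columns: List[str]) -> bool:
    """Return True only if none of the listed columns holds None in any row."""
    # A single null in any listed column invalidates the whole dataset.
    return all(row.get(col) is not None for row in rows for col in columns)


rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
]
print(columns_are_complete(rows, ["id"]))          # True
print(columns_are_complete(rows, ["id", "name"]))  # False
```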

Column Presence

Verifies the existence of required columns in the DataFrame, independent of their data types.
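As a rough illustration, a presence check compares column names only and never inspects types. The helper name `missing_columns` below is hypothetical and the column list stands in for a DataFrame's columns.

```python
from typing import List


def missing_columns(actual: List[str], required: List[str]) -> List[str]:
    """Return the required column names that are absent, in the order given."""
    present = set(actual)
    return [name for name in required if name not in present]


actual = ["id", "name", "created_at"]
print(missing_columns(actual, ["id", "name"]))   # []
print(missing_columns(actual, ["id", "email"]))  # ['email']
```

An empty result means the check passes.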

Completeness Ratio

Validates that the ratio of non-null values in a column meets a minimum threshold, enabling soft completeness validation and early detection of partially missing data.
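A minimal sketch of the ratio computation, in plain Python rather than the library's own API. The names `completeness_ratio` and `meets_min_completeness` are hypothetical, and treating an empty dataset as having ratio 0.0 is an assumption, not documented behavior.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def completeness_ratio(rows: List[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 0.0  # assumption: an empty dataset counts as 0% complete
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def meets_min_completeness(rows: List[Row], column: str, min_ratio: float) -> bool:
    """True if the non-null ratio is at least `min_ratio`."""
    return completeness_ratio(rows, column) >= min_ratio


rows = [{"email": "a@x.io"}, {"email": None}, {"email": "b@x.io"}, {"email": "c@x.io"}]
print(completeness_ratio(rows, "email"))            # 0.75
print(meets_min_completeness(rows, "email", 0.75))  # True
print(meets_min_completeness(rows, "email", 0.8))   # False
```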

Count Min

Ensures that the DataFrame contains at least a defined minimum number of rows.

Count Max

Ensures that the DataFrame does not exceed a defined maximum number of rows.

Count Between

Ensures that the number of rows in the DataFrame falls within a defined inclusive range.

Count Exact

Ensures that the DataFrame contains exactly the specified number of rows.
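The four row-count checks (Count Min, Count Max, Count Between, Count Exact) all compare a single row count against bounds. The sketch below shows that relationship with a hypothetical helper, `row_count_ok`. It assumes both bounds are inclusive, which the source states only for Count Between.

```python
from typing import Optional


def row_count_ok(
    count: int,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
) -> bool:
    """True if `count` lies within the given bounds (assumed inclusive)."""
    if min_count is not None and count < min_count:
        return False
    if max_count is not None and count > max_count:
        return False
    return True


n = 100
print(row_count_ok(n, min_count=50))                 # Count Min     -> True
print(row_count_ok(n, max_count=99))                 # Count Max     -> False
print(row_count_ok(n, min_count=100, max_count=200)) # Count Between -> True
print(row_count_ok(n, min_count=100, max_count=100)) # Count Exact   -> True
```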

Distinct Ratio

Validates that the ratio of distinct non-null values in a column exceeds a defined threshold, helping to detect overly uniform or low-cardinality fields.
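The description does not say what the distinct count is divided by. The sketch below assumes the denominator is the total row count and that "exceeds" means strictly greater than. Both are assumptions, and `distinct_ratio` is a hypothetical name, not the library's API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def distinct_ratio(rows: List[Row], column: str) -> float:
    """Distinct non-null values in `column` divided by the total row count."""
    if not rows:
        return 0.0  # assumption: an empty dataset has ratio 0
    distinct = {row.get(column) for row in rows if row.get(column) is not None}
    return len(distinct) / len(rows)  # assumption: denominator is all rows


rows = [{"country": "DE"}, {"country": "DE"}, {"country": "FR"}, {"country": None}]
ratio = distinct_ratio(rows, "country")
print(ratio)        # 0.5
print(ratio > 0.4)  # True
print(ratio > 0.5)  # False, because the comparison is strict
```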

Freshness Check

Validates that the most recent timestamp in a given column is within a defined freshness window relative to the current system time, helping detect outdated or stale data.
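A freshness check compares the latest timestamp against the current time minus an allowed age. The sketch below uses only the standard library. The name `is_fresh`, the `max_age` parameter, and the choice to treat a column with no timestamps as stale are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def is_fresh(
    timestamps: List[Optional[datetime]],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True if the most recent timestamp is no older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    present = [ts for ts in timestamps if ts is not None]
    if not present:
        return False  # assumption: no timestamps at all counts as stale
    return now - max(present) <= max_age


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
timestamps = [
    datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),  # latest: 6 hours old
    None,
]
print(is_fresh(timestamps, timedelta(hours=12), now=now))  # True
print(is_fresh(timestamps, timedelta(hours=1), now=now))   # False
```

All timestamps here are timezone-aware. Mixing naive and aware `datetime` values raises a `TypeError` in Python, so a real pipeline would need to normalize them first.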

Schema Check

Ensures that a DataFrame matches an expected schema by verifying column names and data types, with optional strict enforcement against unexpected columns.
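The difference between strict and non-strict enforcement can be sketched with schemas modeled as name-to-type mappings. The function `schema_matches`, its `strict` flag, and the type strings are hypothetical stand-ins, not the library's actual signature.

```python
from typing import Dict


def schema_matches(
    actual: Dict[str, str],
    expected: Dict[str, str],
    strict: bool = False,
) -> bool:
    """True if every expected column exists with the expected type.

    With strict=True, columns not listed in `expected` also cause a failure.
    """
    for name, dtype in expected.items():
        if actual.get(name) != dtype:
            return False  # column missing or type mismatch
    if strict and set(actual) - set(expected):
        return False  # unexpected extra columns
    return True


actual = {"id": "int", "name": "string", "extra": "double"}
expected = {"id": "int", "name": "string"}
print(schema_matches(actual, expected))               # True
print(schema_matches(actual, expected, strict=True))  # False, 'extra' is unexpected
print(schema_matches(actual, {"id": "string"}))       # False, type mismatch
```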

Unique Ratio

Validates that a specified column maintains a minimum ratio of unique (non-null) values, helping to detect excessive duplication and assess data entropy or feature distinctiveness.

Unique Rows

Validates that all rows in a DataFrame are unique, either across all columns or across a defined subset of columns, helping to detect unintended duplication and enforce row-level uniqueness.
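A minimal sketch of row-level uniqueness, again in plain Python with rows as dicts. The name `rows_are_unique` and its `subset` parameter are hypothetical, and the sketch assumes all values are hashable.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def rows_are_unique(rows: List[Row], subset: Optional[List[str]] = None) -> bool:
    """True if no two rows share the same values in the compared columns."""
    # Compare all columns unless a subset is given.
    keys = subset if subset is not None else sorted({k for row in rows for k in row})
    counts = Counter(tuple(row.get(k) for k in keys) for row in rows)
    return all(n == 1 for n in counts.values())


rows = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Berlin"},
]
print(rows_are_unique(rows))                   # True, the ids differ
print(rows_are_unique(rows, subset=["city"]))  # False, 'Berlin' repeats
```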