Contributing to SparkDQ
Thank you for your interest in contributing to SparkDQ. This document outlines the process for reporting issues, proposing changes, and submitting pull requests.
Project Scope
SparkDQ provides declarative data quality checks for PySpark workloads. Contributions should align with the project's focus on:
- Simple, expressive, and testable check definitions
- Row-level and aggregate-level validations
- YAML/Pydantic-based configuration with strong test coverage
If you are unsure whether a contribution fits the project's direction, open an issue to discuss it before investing significant effort.
Reporting Issues
Use GitHub Issues to report bugs, request features, or suggest improvements. Before opening a new issue, search existing issues to avoid duplicates.
When reporting a bug, include:
- A minimal reproducible example
- The Python and PySpark versions in use
- The full error message or unexpected behavior
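The version details requested above can be gathered with a short script. This is a minimal sketch, not part of SparkDQ: the function name `environment_report` is hypothetical, and it relies only on the standard library, so it also works when PySpark is not installed.

```python
import platform
from importlib import metadata


def environment_report(packages: tuple[str, ...] = ("pyspark",)) -> dict[str, str]:
    """Collect the interpreter version and installed package versions."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the report is always complete.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Print one "name: version" line per entry, ready to paste into an issue.
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue alongside the reproducible example and the full traceback covers all three items in the list.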
Development Workflow
1. Fork and clone the repository
2. Create a feature branch
3. Install dependencies
Dependencies are managed via uv.
4. Implement your changes
Include unit tests for all new functionality. Update the documentation if the change affects public APIs or user-facing behavior.
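As an illustration of pairing new functionality with tests, the sketch below defines a small helper and two test functions in the style that pytest discovers. The helper `null_ratio` is hypothetical and written in plain Python; it is not SparkDQ's API, and the treatment of empty input is an assumption made for the example.

```python
from typing import Optional, Sequence


def null_ratio(values: Sequence[Optional[object]]) -> float:
    """Hypothetical helper: return the fraction of None entries in a sequence."""
    if not values:
        # Assumption: an empty input is treated as containing no nulls.
        return 0.0
    return sum(value is None for value in values) / len(values)


def test_null_ratio_counts_none_values() -> None:
    # Two of the four entries are None.
    assert null_ratio([1, None, 3, None]) == 0.5


def test_null_ratio_handles_empty_input() -> None:
    # The edge case gets its own test so a regression is easy to locate.
    assert null_ratio([]) == 0.0
```

Covering the edge case separately from the typical case keeps each test focused on one behavior.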
5. Run the test suite
All tests must pass before submitting a pull request.
6. Commit your changes
Follow the Conventional Commits specification:
git commit -m "feat: add null-ratio aggregate check"
git commit -m "fix: handle empty DataFrame in row-level engine"
git commit -m "docs: update custom check implementation guide"
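To check a commit subject against this format before committing, a small validator can help. The sketch below follows the `type(scope)!: description` shape from the Conventional Commits specification. The list of allowed types is an assumption based on commonly used values and may differ from what the project accepts, and `is_conventional_subject` is a hypothetical name.

```python
import re

# Assumption: a commonly used set of types; the project may accept others.
ALLOWED_TYPES = (
    "feat", "fix", "docs", "test", "refactor",
    "chore", "ci", "build", "perf", "style",
)

SUBJECT_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)"             # commit type, e.g. "feat"
    r"(?:\((?P<scope>[^()\s]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"              # optional breaking-change marker
    r": (?P<description>\S.*)$"      # colon, one space, non-empty description
)


def is_conventional_subject(subject: str) -> bool:
    """Return True if the subject line matches the expected format."""
    match = SUBJECT_PATTERN.match(subject)
    if match is None:
        return False
    return match.group("type") in ALLOWED_TYPES
```

For example, `is_conventional_subject("feat: add null-ratio aggregate check")` returns `True`, while `is_conventional_subject("Added a new check")` returns `False`.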
7. Open a pull request
Push your branch and open a pull request against main. In the pull request description, include:
- A summary of the change and its motivation
- References to any related issues
- Notes on testing approach if non-obvious
Code Style
- Code formatting is enforced via ruff
- Type annotations are required for all public interfaces
- Docstrings follow the Google style convention
- Pull requests should be focused and minimal in scope
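The annotation and docstring conventions above might look like the following in practice. The function `is_within_range` is a hypothetical example chosen for illustration, not an existing SparkDQ interface.

```python
def is_within_range(
    value: float,
    lower: float,
    upper: float,
    inclusive: bool = True,
) -> bool:
    """Check whether a value lies between two bounds.

    Args:
        value: The number to test.
        lower: The lower bound.
        upper: The upper bound.
        inclusive: Whether the bounds themselves count as valid.

    Returns:
        True if the value lies within the bounds, False otherwise.

    Raises:
        ValueError: If ``lower`` is greater than ``upper``.
    """
    if lower > upper:
        raise ValueError(f"lower ({lower}) must not exceed upper ({upper})")
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper
```

Every parameter and the return value carry a type annotation, and the docstring uses the `Args`, `Returns`, and `Raises` sections of the Google style.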
Good First Issues
Issues labeled "good first issue" are suitable entry points for new contributors.