Specification driven datagen #372
base: master
Conversation
Codecov Report ❌

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   92.12%   92.28%   +0.16%
==========================================
  Files          47       55       +8
  Lines        4217     4538     +321
  Branches      766      836      +70
==========================================
+ Hits         3885     4188     +303
- Misses        186      195       +9
- Partials      146      155       +9
```

View full report in Codecov by Sentry.
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
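To make the declarative shape concrete, here is a hypothetical sketch of what a spec might look like; the class names (DatagenSpec, TableDefinition, ColumnDefinition) come from the file summary below, but the field names and constructor signatures are assumptions, not confirmed API:

```python
# Hypothetical sketch only: field names such as `tables`, `name`, `type`,
# and `primary_key` are illustrative assumptions about the new spec API.
from dbldatagen.spec.column_spec import ColumnDefinition
from dbldatagen.spec.generator_spec import DatagenSpec, TableDefinition

spec = DatagenSpec(
    tables={
        "users": TableDefinition(
            number_of_rows=1000,   # must be a positive integer
            partitions=None,       # None falls back to Spark's default parallelism
            columns=[
                ColumnDefinition(name="user_id", type="long", primary_key=True),
                ColumnDefinition(name="email", type="string"),
            ],
        )
    }
)
```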
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
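Given the ValidationResult description above, the error-collection flow presumably looks something like the following sketch; the method and attribute names (validate, has_errors, errors, warnings) are guesses based on the summary, not confirmed API:

```python
# Hypothetical validation flow: errors are collected into a
# ValidationResult rather than raised one at a time.
result = spec.validate()  # assumed to return a ValidationResult

if result.has_errors:
    for error in result.errors:
        print(f"spec error: {error}")
for warning in result.warnings:
    print(f"spec warning: {warning}")
```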
```python
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")
elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
```
We should use utils.write_data_to_output for this.
The method in utils relies on OutputDataset. See the other comments above.
We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?
IMO, the output target in the spec is better suited here, since it performs validations for UC schemas and file-path volumes. config.OutputDataset is generic, and producing specific errors would essentially mean copying this logic over to the new spec folder, or the other way round. Is there something I'm missing here?
OutputDataset could extend BaseModel and perform the same validations.
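A minimal sketch of that suggestion, assuming Pydantic V2 syntax (the PR's compat layer would presumably smooth over V1); OutputDataset's real fields and validation rules may differ:

```python
# Hypothetical: config.OutputDataset as a Pydantic model carrying the
# same UC / file-path validations as the spec's output targets.
from pydantic import BaseModel, field_validator


class OutputDataset(BaseModel):
    catalog: str | None = None
    schema_name: str | None = None
    base_path: str | None = None
    output_format: str = "delta"

    @field_validator("base_path")
    @classmethod
    def check_volume_path(cls, v: str | None) -> str | None:
        # Illustrative rule: require Unity Catalog volume paths for file output
        if v is not None and not v.startswith("/Volumes/"):
            raise ValueError("base_path must be a UC volume path (/Volumes/...)")
        return v
```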
* added use of ABC to mark TextGenerator as abstract
* Lint text generators module
* Add persistence methods
* Add tests and docs; Update PR template
* Update hatch installation for push action
* Refactor
* Update method names and signatures

---------

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.
@anupkalburgi how hard would it be to extend this implementation to load the spec from YAML files, e.g. with pydantic-yaml? I'm looking at PRs like #376 and thinking it could be easier to provide these generators as a "standard" generation notebook that receives a file name as a parameter and generates the data...
That is the core idea behind this PR: as long as we can get the input format (YAML/JSON/Python dict) into the Pydantic model, we can pass it as the config to the generator, which takes the Pydantic object. Examples and helper methods will be added in subsequent PRs.
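For illustration, a minimal sketch of that flow, assuming the YAML keys mirror the DatagenSpec fields (the file name and keys are hypothetical):

```python
import yaml  # pydantic-yaml would also work; plain PyYAML is enough here

from dbldatagen.spec.generator_spec import DatagenSpec

with open("user_datagen_spec.yaml") as f:
    raw = yaml.safe_load(f)  # a plain dict, so JSON or a Python dict works too

spec = DatagenSpec.model_validate(raw)  # Pydantic V2; V1 would use DatagenSpec.parse_obj(raw)
```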
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
ghanse left a comment:
Left a few comments. We also need to add documentation.
```python
:param number_of_rows: Total number of data rows to generate for this table.
    Must be a positive integer
:param partitions: Number of Spark partitions to use when generating data.
    If None, defaults to Spark's default parallelism setting.
    More partitions can improve generation speed for large datasets
:param columns: List of ColumnDefinition objects specifying the columns to generate
    in this table. At least one column must be specified
```
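For reference, a hypothetical instantiation exercising these documented parameters (the values and column fields are illustrative):

```python
# Hypothetical use of the documented TableDefinition parameters.
table = TableDefinition(
    number_of_rows=10_000_000,  # must be a positive integer
    partitions=64,              # None would fall back to Spark's default parallelism
    columns=[ColumnDefinition(name="id", type="long")],  # at least one column required
)
```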
Would a user need to set the same property for every column spec? For example, a random seed value?
Changes
Linked issues
Resolves #..
Requirements