
Conversation

@anupkalburgi
Collaborator

Changes

Linked issues

Resolves #..

Requirements

  • manually tested
  • updated documentation
  • updated demos
  • updated tests


codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 95.01558% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.28%. Comparing base (4687b8c) to head (d527300).

Files with missing lines Patch % Lines
dbldatagen/spec/generator_spec_impl.py 91.89% 3 Missing and 6 partials ⚠️
dbldatagen/spec/generator_spec.py 95.41% 3 Missing and 2 partials ⚠️
dbldatagen/spec/compat.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   92.12%   92.28%   +0.16%     
==========================================
  Files          47       55       +8     
  Lines        4217     4538     +321     
  Branches      766      836      +70     
==========================================
+ Hits         3885     4188     +303     
- Misses        186      195       +9     
- Partials      146      155       +9     

☔ View full report in Codecov by Sentry.

@anupkalburgi anupkalburgi marked this pull request as ready for review December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from a team as code owners December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from nfx and renardeinside and removed request for a team December 2, 2025 14:31
@ghanse ghanse requested a review from Copilot December 4, 2025 18:04
Copilot AI left a comment

Pull request overview

This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.

Key Changes:

  • New spec-based API with Pydantic models for defining data generation configurations
  • Comprehensive validation framework with error collection and reporting
  • Pydantic V1/V2 compatibility layer for broad environment support
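
A compatibility layer of this shape is commonly implemented with a version switch. The sketch below is hypothetical and only illustrates the pattern; the PR's actual compat.py may expose a different surface:

```python
import pydantic
from pydantic import BaseModel

# pydantic.VERSION is a string in both major versions, e.g. "1.10.13" or "2.7.1"
PYDANTIC_V2 = pydantic.VERSION.startswith("2")

def model_to_dict(model: BaseModel) -> dict:
    # V2 renamed .dict() to .model_dump(); dispatch on the installed version
    return model.model_dump() if PYDANTIC_V2 else model.dict()

def model_from_dict(model_cls, data: dict):
    # V2 renamed .parse_obj() to .model_validate()
    return model_cls.model_validate(data) if PYDANTIC_V2 else model_cls.parse_obj(data)

# Minimal model to exercise the shims
class Point(BaseModel):
    x: int
    y: int
```

Callers then use `model_to_dict` / `model_from_dict` everywhere instead of the version-specific methods, so the rest of the codebase never branches on the Pydantic version.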

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file

  • tests/test_specs.py: Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations
  • tests/test_datasets_with_specs.py: Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications
  • tests/test_datagen_specs.py: Tests for DatagenSpec creation, validation, and generator options
  • pyproject.toml: Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores
  • makefile: Updated to use Pydantic version-specific test environments and removed .venv target
  • examples/datagen_from_specs/basic_user_datagen_spec.py: Example DatagenSpec factory for generating basic user data with pre-configured specs
  • examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py: Complex example with OHLC stock data generation including time-series and volatility modeling
  • examples/datagen_from_specs/README.md: Documentation for Pydantic-based dataset specifications with usage examples
  • dbldatagen/spec/validation.py: ValidationResult class for collecting and reporting validation errors and warnings
  • dbldatagen/spec/output_targets.py: Pydantic models for UCSchemaTarget and FilePathTarget output destinations
  • dbldatagen/spec/generator_spec_impl.py: Generator class implementing the spec-to-DataFrame conversion logic
  • dbldatagen/spec/generator_spec.py: Core DatagenSpec and TableDefinition models with comprehensive validation
  • dbldatagen/spec/compat.py: Pydantic V1/V2 compatibility layer enabling cross-version support
  • dbldatagen/spec/column_spec.py: ColumnDefinition model with validation for primary keys and constraints
  • dbldatagen/spec/__init__.py: Module initialization with lazy imports to avoid heavy dependencies
  • README.md: Updated feature list and formatting to mention new Pydantic-based API
  • CHANGELOG.md: Added entry for Pydantic-based specification API feature


Comment on lines +263 to +272
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
Collaborator

We should use utils.write_data_to_output for this.

Collaborator Author

The method in utils relies on OutputDataset. See the other comments above.

Collaborator

We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?

Collaborator Author

IMO, the output target in spec is better suited for the spec, as it validates UC paths and file path volumes. config.OutputDataset is generic, and getting specific errors would mean essentially copying it over to the new spec folder, or the other way round. Is there something I am missing here?

Collaborator

OutputDataset could extend BaseModel and perform the same validations.
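
Such a refactor could look roughly like this (a hypothetical sketch assuming Pydantic V2; the field names mirror the UCSchemaTarget fields quoted earlier in the thread, everything else is illustrative):

```python
from pydantic import BaseModel, field_validator

class UCSchemaTarget(BaseModel):
    # Unity Catalog destination: tables are written as catalog.schema.table
    catalog: str
    schema_: str  # trailing underscore avoids clashing with BaseModel.schema

    @field_validator("catalog", "schema_")
    @classmethod
    def _must_be_simple_name(cls, v: str) -> str:
        # Reject empty names and already-qualified names such as "main.default"
        if not v or "." in v:
            raise ValueError(f"expected a simple identifier, got {v!r}")
        return v
```

The point of the sketch: the existing OutputDataset could inherit from BaseModel the same way and attach target-specific validators, so the spec module would not need parallel classes.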

ghanse and others added 2 commits December 10, 2025 14:36
* added use of ABC to mark TextGenerator as abstract

* Lint text generators module

* Add persistence methods

* Add tests and docs; Update PR template

* Update hatch installation for push action

* Refactor

* Update method names and signatures

---------

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
@anupkalburgi anupkalburgi changed the title Ak/spec Specification driven datagneration Dec 18, 2025
@anupkalburgi anupkalburgi changed the title Specification driven datagneration Specification driven datagen Dec 18, 2025
@alexott alexott requested a review from Copilot December 23, 2025 13:54
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.



Contributor

alexott commented Dec 23, 2025

@anupkalburgi how hard would it be to extend this implementation to load specs from YAML files? I.e., with pydantic-yaml.

I'm looking into PRs like #376 and thinking that it could be easier to provide these generators as a "standard" generation notebook that will receive a file name as a parameter and generate data...

@anupkalburgi
Collaborator Author

> @anupkalburgi how hard would it be to extend this implementation to load specs from YAML files? I.e., with pydantic-yaml.
>
> I'm looking into PRs like #376 and thinking that it could be easier to provide these generators as a "standard" generation notebook that will receive a file name as a parameter and generate data...

That is the core idea behind this PR: as long as we can get the input format (YAML/JSON/Python dict) into the Pydantic model, we can pass it as the config to the generator, which takes the Pydantic object. Will be adding examples and helper methods in subsequent PRs.
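
As a rough illustration of that flow (hypothetical stand-in models, not the PR's actual classes, which carry many more fields):

```python
from pydantic import BaseModel

# Stand-in mini-models for illustration only; the real DatagenSpec /
# TableDefinition / ColumnDefinition live in dbldatagen.spec.
class ColumnDefinition(BaseModel):
    name: str
    type: str

class TableDefinition(BaseModel):
    number_of_rows: int
    columns: list[ColumnDefinition]

# Any format that parses to a dict plugs in the same way: for YAML,
# `config = yaml.safe_load(f)` (PyYAML / pydantic-yaml) would replace the
# literal below; for JSON, `config = json.load(f)`.
config = {
    "number_of_rows": 1000,
    "columns": [{"name": "id", "type": "long"}],
}

# Nested dicts are validated and coerced into the nested models
table = TableDefinition(**config)
```

Once the dict validates into the model, the generator only ever sees the typed Pydantic object, regardless of which on-disk format it came from.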

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.



@alexott alexott requested a review from GeekSheikh January 13, 2026 16:11
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.



@ghanse ghanse left a comment

Left a few comments. We also need to add documentation.

Comment on lines +25 to +31
:param number_of_rows: Total number of data rows to generate for this table.
    Must be a positive integer
:param partitions: Number of Spark partitions to use when generating data.
    If None, defaults to Spark's default parallelism setting.
    More partitions can improve generation speed for large datasets
:param columns: List of ColumnDefinition objects specifying the columns to generate
    in this table. At least one column must be specified
Collaborator

Would a user need to set the same property for every column spec? For example, a random seed value?

Collaborator

OutputDataset could extend BaseModel and perform the same validations.

@anupkalburgi anupkalburgi requested a review from ghanse January 23, 2026 21:28

8 participants