Specification driven datagen #372
base: master
Conversation
Codecov Report ❌

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   92.12%   92.28%   +0.16%
==========================================
  Files          47       55       +8
  Lines        4217     4538     +321
  Branches      766      836      +70
==========================================
+ Hits         3885     4188     +303
- Misses        186      195       +9
- Partials      146      155       +9
```

View full report in Codecov by Sentry.
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
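To make the declarative shape concrete, here is a hypothetical sketch of what a spec might look like; the class names (DatagenSpec, TableDefinition, ColumnDefinition) come from the file summary below, but the field names and constructor signatures are assumptions, not confirmed API:

```python
# Hypothetical sketch only: field names such as `tables`, `name`, `type`,
# and `primary_key` are illustrative assumptions about the new spec API.
from dbldatagen.spec.column_spec import ColumnDefinition
from dbldatagen.spec.generator_spec import DatagenSpec, TableDefinition

spec = DatagenSpec(
    tables={
        "users": TableDefinition(
            number_of_rows=1000,   # must be a positive integer
            partitions=None,       # None falls back to Spark's default parallelism
            columns=[
                ColumnDefinition(name="user_id", type="long", primary_key=True),
                ColumnDefinition(name="email", type="string"),
            ],
        )
    }
)
```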
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
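Given the ValidationResult description above, the error-collection flow presumably looks something like the following sketch; the method and attribute names (validate, has_errors, errors, warnings) are guesses based on the summary, not confirmed API:

```python
# Hypothetical validation flow: errors are collected into a
# ValidationResult rather than raised one at a time.
result = spec.validate()  # assumed to return a ValidationResult

if result.has_errors:
    for error in result.errors:
        print(f"spec error: {error}")
for warning in result.warnings:
    print(f"spec warning: {warning}")
```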
```python
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")
elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
```
We should use utils.write_data_to_output for this.
The method in utils relies on OutputDataset. See the other comments above.
We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?
IMO, the output target in the spec is better suited here, since it performs validations for UC schemas and file-path volumes. config.OutputDataset is generic, and producing specific errors would essentially mean copying this logic over to the new spec folder, or the other way round. Is there something I'm missing here?
OutputDataset could extend BaseModel and perform the same validations.
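A minimal sketch of that suggestion, assuming Pydantic V2 syntax (the PR's compat layer would presumably smooth over V1); OutputDataset's real fields and validation rules may differ:

```python
# Hypothetical: config.OutputDataset as a Pydantic model carrying the
# same UC / file-path validations as the spec's output targets.
from pydantic import BaseModel, field_validator


class OutputDataset(BaseModel):
    catalog: str | None = None
    schema_name: str | None = None
    base_path: str | None = None
    output_format: str = "delta"

    @field_validator("base_path")
    @classmethod
    def check_volume_path(cls, v: str | None) -> str | None:
        # Illustrative rule: require Unity Catalog volume paths for file output
        if v is not None and not v.startswith("/Volumes/"):
            raise ValueError("base_path must be a UC volume path (/Volumes/...)")
        return v
```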
* added use of ABC to mark TextGenerator as abstract
* Lint text generators module
* Add persistence methods
* Add tests and docs; Update PR template
* Update hatch installation for push action
* Refactor
* Update method names and signatures

---------

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.
@anupkalburgi how hard would it be to extend this implementation to load the spec from YAML files, e.g. with pydantic-yaml? I'm looking at PRs like #376 and thinking it could be easier to provide these generators as a "standard" generation notebook that receives a file name as a parameter and generates the data...
That is the core idea behind this PR: as long as we can get the input format (YAML/JSON/Python dict) into the Pydantic model, we can pass it as the config to the generator, which takes the Pydantic object. Examples and helper methods will be added in subsequent PRs.
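For illustration, a minimal sketch of that flow, assuming the YAML keys mirror the DatagenSpec fields (the file name and keys are hypothetical):

```python
import yaml  # pydantic-yaml would also work; plain PyYAML is enough here

from dbldatagen.spec.generator_spec import DatagenSpec

with open("user_datagen_spec.yaml") as f:
    raw = yaml.safe_load(f)  # a plain dict, so JSON or a Python dict works too

spec = DatagenSpec.model_validate(raw)  # Pydantic V2; V1 would use DatagenSpec.parse_obj(raw)
```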
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
ghanse left a comment:
Left a few comments. We also need to add documentation.
```python
:param number_of_rows: Total number of data rows to generate for this table.
    Must be a positive integer
:param partitions: Number of Spark partitions to use when generating data.
    If None, defaults to Spark's default parallelism setting.
    More partitions can improve generation speed for large datasets
:param columns: List of ColumnDefinition objects specifying the columns to generate
    in this table. At least one column must be specified
```
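For reference, a hypothetical instantiation exercising these documented parameters (the values and column fields are illustrative):

```python
# Hypothetical use of the documented TableDefinition parameters.
table = TableDefinition(
    number_of_rows=10_000_000,  # must be a positive integer
    partitions=64,              # None would fall back to Spark's default parallelism
    columns=[ColumnDefinition(name="id", type="long")],  # at least one column required
)
```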
Would a user need to set the same property for every column spec? For example, a random seed value?
Changes
Linked issues
Resolves #..
Requirements