-
Notifications
You must be signed in to change notification settings - Fork 72
Add distributed data loader project and core interfaces #440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+1,012
−1
Merged
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
41a0559
Add distributed data loader project and core interfaces
robreeves feb7fde
Remove docstrings from __init__ files
robreeves 2706fcf
update TableTransformer description
robreeves 4cfb892
Use Mapping for context parameter to signal read-only intent
robreeves 7d3a214
Use __iter__ instead of __call__ for DataLoaderSplit iteration
robreeves 1cca354
Use Sequence[str] for columns parameter to preserve order
robreeves a9f9ec2
Use Sequence for DataLoaderSplits._splits parameter
robreeves 15a7be9
Use astral-sh/setup-uv action and explicit Python 3.12
robreeves d691a70
Make columns and context optional for simpler API
robreeves cf35960
Simplify API: remove DataLoaderSplits, add table_properties to splits
robreeves 0d70798
Refactor API: add DataLoaderContext, use __iter__ and string fields
robreeves File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,9 @@ hs_err_pid* | |
| *.iws | ||
| .idea | ||
|
|
||
| # VS Code | ||
| .vscode/ | ||
|
|
||
| # LinkedIn / Gradle / Hardware | ||
| .cache | ||
| .gradle | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Python | ||
| __pycache__/ | ||
| *.py[cod] | ||
| *.so | ||
|
|
||
| # Virtual environments | ||
| .venv/ | ||
| venv/ | ||
|
|
||
| # Distribution / packaging | ||
| build/ | ||
| dist/ | ||
| *.egg-info/ | ||
|
|
||
| # Testing | ||
| .pytest_cache/ | ||
| .coverage | ||
| htmlcov/ | ||
|
|
||
| # IDEs | ||
| .vscode/ | ||
| .idea/ | ||
|
|
||
| # OS | ||
| .DS_Store |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Claude Instructions for OpenHouse DataLoader | ||
|
|
||
| ## Project Overview | ||
|
|
||
| Python library for distributed data loading of OpenHouse tables. Uses DataFusion for query execution and PyIceberg for table access. | ||
|
|
||
| ## Common Commands | ||
|
|
||
| ```bash | ||
| make sync # Install dependencies | ||
| make check # Run lint + format checks | ||
| make test # Run tests | ||
| make all # Run all checks and tests | ||
| make format # Auto-format code | ||
| ``` | ||
|
|
||
| ## Workflows | ||
| When making a change run `make all` to ensure all tests and checks pass | ||
|
|
||
| ## Project Structure | ||
|
|
||
| ``` | ||
| src/openhouse/dataloader/ | ||
| ├── __init__.py # Public API exports | ||
| ├── data_loader.py # Main API: OpenHouseDataLoader | ||
| ├── data_loader_splits.py # DataLoaderSplits (iterable of splits) | ||
| ├── data_loader_split.py # DataLoaderSplit (single callable split) | ||
| ├── table_identifier.py # TableIdentifier dataclass | ||
| ├── table_transformer.py # TableTransformer ABC (internal) | ||
| └── udf_registry.py # UDFRegistry ABC (internal) | ||
| ``` | ||
|
|
||
| ## Public API | ||
|
|
||
| Only these are exported in `__init__.py`: | ||
| - `OpenHouseDataLoader` - Main entry point | ||
| - `TableIdentifier` - Table reference (database, table, branch) | ||
|
|
||
| Internal modules (TableTransformer, UDFRegistry) can be imported directly if needed but expose DataFusion types. | ||
|
|
||
| ## Code Style | ||
|
|
||
| - Uses `ruff` for linting and formatting | ||
| - Line length: 120 | ||
| - Python 3.12+ | ||
| - Use modern type hints (`list`, `dict`, `X | None` instead of `List`, `Dict`, `Optional`) | ||
| - Use `raise NotImplementedError` for unimplemented methods in concrete classes | ||
| - Use `pass` for abstract methods decorated with `@abstractmethod` | ||
|
|
||
| ## Versioning | ||
|
|
||
| - Version is in `pyproject.toml` (single source of truth) | ||
| - `__version__` in `__init__.py` reads from package metadata at runtime | ||
| - Major.minor aligns with OpenHouse monorepo, patch is independent |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| .PHONY: help sync clean lint format format-check check test all | ||
|
|
||
| help: | ||
| @echo "Available commands:" | ||
| @echo " make sync - Sync dependencies using uv" | ||
| @echo " make lint - Run ruff linter" | ||
| @echo " make format - Format code with ruff" | ||
| @echo " make check - Run all checks (lint + format check)" | ||
| @echo " make test - Run tests with pytest" | ||
| @echo " make all - Run all checks and tests" | ||
| @echo " make clean - Clean build artifacts" | ||
|
|
||
| sync: | ||
| uv sync --all-extras | ||
|
|
||
| lint: | ||
| uv run ruff check src/ tests/ | ||
|
|
||
| format: | ||
| uv run ruff format src/ tests/ | ||
|
|
||
| format-check: | ||
| uv run ruff format --check src/ tests/ | ||
|
|
||
| check: lint format-check | ||
|
|
||
| test: | ||
| uv run pytest | ||
|
|
||
| all: check test | ||
|
|
||
| clean: | ||
| rm -rf build/ | ||
| rm -rf dist/ | ||
| rm -rf *.egg-info | ||
| rm -rf .venv/ | ||
| find . -type d -name __pycache__ -exec rm -rf {} + | ||
| find . -type f -name "*.pyc" -delete |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| # OpenHouse DataLoader | ||
|
|
||
| A Python library for distributed data loading of OpenHouse tables | ||
|
|
||
| ## Quickstart | ||
|
|
||
| ```python | ||
| from openhouse.dataloader import OpenHouseDataLoader | ||
|
|
||
| loader = OpenHouseDataLoader("my_database", "my_table") | ||
|
|
||
| for split in loader: | ||
| # Get table properties | ||
| split.table_properties | ||
|
|
||
| # Load data | ||
| for batch in split: | ||
| process(batch) | ||
| ``` | ||
|
|
||
| ## Development | ||
|
|
||
| ```bash | ||
| # Set up environment | ||
| make sync | ||
|
|
||
| # Run tests | ||
| make test | ||
|
|
||
| # See all available commands | ||
| make help | ||
| ``` |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| [project] | ||
| name = "openhouse-dataloader" | ||
| version = "0.0.1" | ||
| description = "A Python library for distributed data loading of OpenHouse tables" | ||
| readme = "README.md" | ||
| requires-python = ">=3.12" | ||
| dependencies = ["datafusion==51.0.0", "pyiceberg==0.10.0"] | ||
|
|
||
| [project.optional-dependencies] | ||
| dev = ["ruff>=0.9.0", "pytest>=8.0.0"] | ||
|
|
||
| [tool.ruff] | ||
| line-length = 120 | ||
| target-version = "py312" | ||
|
|
||
| [tool.ruff.lint] | ||
| select = ["E", "F", "I", "UP", "B", "SIM"] | ||
|
|
||
| [tool.ruff.format] | ||
| quote-style = "double" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| __path__ = __import__("pkgutil").extend_path(__path__, __name__) |
6 changes: 6 additions & 0 deletions
6
integrations/python/dataloader/src/openhouse/dataloader/__init__.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| from importlib.metadata import version | ||
|
|
||
| from openhouse.dataloader.data_loader import DataLoaderContext, OpenHouseDataLoader | ||
|
|
||
| __version__ = version("openhouse-dataloader") | ||
| __all__ = ["OpenHouseDataLoader", "DataLoaderContext"] |
57 changes: 57 additions & 0 deletions
57
integrations/python/dataloader/src/openhouse/dataloader/data_loader.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| from collections.abc import Iterable, Mapping, Sequence | ||
| from dataclasses import dataclass | ||
|
|
||
| from openhouse.dataloader.data_loader_split import DataLoaderSplit | ||
| from openhouse.dataloader.table_identifier import TableIdentifier | ||
| from openhouse.dataloader.table_transformer import TableTransformer | ||
| from openhouse.dataloader.udf_registry import UDFRegistry | ||
|
|
||
|
|
||
| @dataclass | ||
| class DataLoaderContext: | ||
| """Context and customization for the DataLoader. | ||
|
|
||
| Provides execution context (e.g. tenant, environment) and optional customizations | ||
| like table transformations applied before loading data. | ||
|
|
||
| Args: | ||
| execution_context: Dictionary of execution context information (e.g. tenant, environment) | ||
| table_transformer: Transformation to apply to the table before loading (e.g. column masking) | ||
| udf_registry: UDFs required for the table transformation | ||
| """ | ||
|
|
||
| execution_context: Mapping[str, str] | None = None | ||
| table_transformer: TableTransformer | None = None | ||
| udf_registry: UDFRegistry | None = None | ||
|
|
||
|
|
||
| class OpenHouseDataLoader: | ||
robreeves marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| """An API for distributed data loading of OpenHouse tables""" | ||
|
|
||
| def __init__( | ||
| self, | ||
| database: str, | ||
| table: str, | ||
| branch: str | None = None, | ||
| columns: Sequence[str] | None = None, | ||
| context: DataLoaderContext | None = None, | ||
| ): | ||
robreeves marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| """ | ||
| Args: | ||
| database: Database name | ||
| table: Table name | ||
| branch: Optional branch name | ||
| columns: Column names to load, or None to load all columns | ||
| context: Data loader context | ||
| """ | ||
| self._table = TableIdentifier(database, table, branch) | ||
| self._columns = columns | ||
| self._context = context or DataLoaderContext() | ||
|
|
||
| def __iter__(self) -> Iterable[DataLoaderSplit]: | ||
| """Iterate over data splits for distributed data loading of the table. | ||
|
|
||
| Returns: | ||
| Iterable of DataLoaderSplit, each containing table_properties | ||
| """ | ||
| raise NotImplementedError | ||
37 changes: 37 additions & 0 deletions
37
integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| from collections.abc import Iterator, Mapping | ||
|
|
||
| from datafusion.plan import LogicalPlan | ||
| from pyarrow import RecordBatch | ||
| from pyiceberg.io import FileScanTask | ||
|
|
||
| from openhouse.dataloader.udf_registry import UDFRegistry | ||
|
|
||
|
|
||
| class DataLoaderSplit: | ||
| """A single data split""" | ||
|
|
||
| def __init__( | ||
| self, | ||
| plan: LogicalPlan, | ||
| file_scan_task: FileScanTask, | ||
| udf_registry: UDFRegistry, | ||
| table_properties: Mapping[str, str], | ||
| ): | ||
| self._plan = plan | ||
| self._file_scan_task = file_scan_task | ||
robreeves marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| self._udf_registry = udf_registry | ||
| self._table_properties = table_properties | ||
|
|
||
| @property | ||
| def table_properties(self) -> Mapping[str, str]: | ||
| """Properties of the table being loaded""" | ||
| return self._table_properties | ||
|
|
||
| def __iter__(self) -> Iterator[RecordBatch]: | ||
| """Loads the split data after applying, including applying a prerequisite | ||
| table transformation if provided | ||
|
|
||
| Returns: | ||
| An iterator for batches of data in the split | ||
| """ | ||
| raise NotImplementedError | ||
21 changes: 21 additions & 0 deletions
21
integrations/python/dataloader/src/openhouse/dataloader/table_identifier.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| from dataclasses import dataclass | ||
|
|
||
|
|
||
| @dataclass | ||
| class TableIdentifier: | ||
| """Identifier for a table in OpenHouse | ||
|
|
||
| Args: | ||
| database: Database name | ||
| table: Table name | ||
| branch: Optional branch name | ||
| """ | ||
|
|
||
| database: str | ||
| table: str | ||
| branch: str | None = None | ||
|
|
||
| def __str__(self) -> str: | ||
| """Return the fully qualified table name.""" | ||
| base = f"{self.database}.{self.table}" | ||
robreeves marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| return f"{base}.{self.branch}" if self.branch else base | ||
29 changes: 29 additions & 0 deletions
29
integrations/python/dataloader/src/openhouse/dataloader/table_transformer.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| from abc import ABC, abstractmethod | ||
| from collections.abc import Mapping | ||
|
|
||
| from datafusion.context import SessionContext | ||
| from datafusion.dataframe import DataFrame | ||
|
|
||
| from openhouse.dataloader.table_identifier import TableIdentifier | ||
|
|
||
|
|
||
| class TableTransformer(ABC): | ||
| """Interface for applying additional transformation logic to the data | ||
| being loaded (e.g. column masking, row filtering) | ||
| """ | ||
|
|
||
| @abstractmethod | ||
| def transform( | ||
| self, session_context: SessionContext, table: TableIdentifier, context: Mapping[str, str] | ||
| ) -> DataFrame | None: | ||
| """Applies transformation logic to the base table that is being loaded. | ||
|
|
||
| Args: | ||
| table: Identifier for the table | ||
| context: Dictionary of context information (e.g. tenant, environment, etc.) | ||
|
|
||
| Returns: | ||
| The DataFrame representing the transformation. This is expected to read from the exact | ||
| base table identifier passed in as input. If no transformation is required, None is returned. | ||
| """ | ||
| pass |
16 changes: 16 additions & 0 deletions
16
integrations/python/dataloader/src/openhouse/dataloader/udf_registry.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| from abc import ABC, abstractmethod | ||
|
|
||
| from datafusion.context import SessionContext | ||
|
|
||
|
|
||
| class UDFRegistry(ABC): | ||
| """Used to register DataFusion UDFs""" | ||
|
|
||
| @abstractmethod | ||
| def register_udfs(self, session_context: SessionContext) -> None: | ||
| """Registers UDFs with DataFusion | ||
|
|
||
| Args: | ||
| session_context: The session context to register the UDFs in | ||
| """ | ||
| pass |
Empty file.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| def test_data_loader(): | ||
| """Test placeholder until real tests are added""" | ||
| pass |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.