Add distributed data loader project and core interfaces #440
cbb330 merged 11 commits into linkedin:main
Conversation
sumedhsakdeo left a comment:
Thanks Rob. Nice PR! This is really coming together. I left some comments; please let me know what you think.
Review threads (resolved) on:
- integrations/python/dataloader/src/openhouse/dataloader/data_loader.py
- integrations/python/dataloader/src/openhouse/dataloader/data_loader_split.py
- integrations/python/dataloader/src/openhouse/dataloader/data_loader_splits.py
- integrations/python/dataloader/src/openhouse/dataloader/table_identifier.py
- integrations/python/dataloader/src/openhouse/dataloader/table_transformer.py
Co-authored-by: Sumedh Sakdeo <[email protected]>
Changed context parameter type from dict[str, str] to Mapping[str, str] in data_loader.py and table_transformer.py. This signals that the functions will not mutate the context passed by callers. Also made context a required parameter in create_splits(). Co-Authored-By: Claude Opus 4.5 <[email protected]>
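The `Mapping[str, str]` change above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and the `"cluster"`/`"holdem"` keys are hypothetical.

```python
from collections.abc import Mapping
from types import MappingProxyType

def describe_context(context: Mapping[str, str]) -> str:
    # Accepting Mapping rather than dict signals read-only use:
    # the function reads keys but never mutates the caller's context.
    return context.get("cluster", "default")

# Callers can pass a plain dict or an immutable proxy interchangeably.
frozen = MappingProxyType({"cluster": "holdem"})
result = describe_context(frozen)
```

Because `Mapping` has no `__setitem__`, a type checker will flag any accidental mutation of `context` inside the function.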
Makes DataLoaderSplit directly iterable, which is the standard Python pattern for iterables. Removes __call__ as it's not idiomatic for iteration. Co-Authored-By: Claude Opus 4.5 <[email protected]>
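The iteration change can be sketched like this; the class below is a hypothetical stand-in for the real split, not the PR's implementation.

```python
from collections.abc import Iterator

class IterableSplit:
    # Minimal sketch: iterating the split yields record batches,
    # replacing the earlier __call__()-based protocol.
    def __init__(self, batches: list):
        self._batches = batches

    def __iter__(self) -> Iterator:
        yield from self._batches

split = IterableSplit([{"a": 1}, {"a": 2}])
batches = list(split)  # the standard `for batch in split:` loop also works
```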
Column order matters in SQL (SELECT a, b != SELECT b, a). Using Sequence instead of set preserves ordering and gives callers flexibility in what collection type they pass. Co-Authored-By: Claude Opus 4.5 <[email protected]>
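A small sketch of why `Sequence` matters here (the `build_select` helper is hypothetical, for illustration only):

```python
from collections.abc import Sequence

def build_select(columns: Sequence[str], table: str) -> str:
    # A Sequence preserves the caller's column order, so the generated
    # projection is deterministic; a set would not guarantee this.
    return f"SELECT {', '.join(columns)} FROM {table}"

sql = build_select(["b", "a"], "db.tbl")
```

Lists, tuples, and other ordered collections all satisfy `Sequence`, so callers keep flexibility without losing ordering guarantees.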
Provides flexibility in what collection type can be passed while still guaranteeing iteration support. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace pip install uv with official astral-sh/setup-uv@v7 action
- Enable caching for faster CI runs
- Use explicit Python 3.12 instead of 3.x
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- columns defaults to None (SELECT * behavior)
- context defaults to None (empty context)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Delete DataLoaderSplits class (unnecessary abstraction)
- Add table_properties to DataLoaderSplit for driver/executor access
- Return Iterable[DataLoaderSplit] directly from create_splits
- Simpler API: splits = loader.create_splits(...); split.table_properties
Co-Authored-By: Claude Opus 4.5 <[email protected]>
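The simplified shape described above might look roughly like this; this is a sketch assuming the names from the commit message, not the PR's actual code, and the `"format"` property is invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DataLoaderSplit:
    # table_properties travels with each split so both the driver and
    # the executors can read it without re-loading table metadata.
    split_id: int
    table_properties: dict = field(default_factory=dict)

def create_splits(num_splits: int, table_properties: dict):
    # Returns a plain iterable of splits; no wrapper class needed.
    return [DataLoaderSplit(i, table_properties) for i in range(num_splits)]

splits = create_splits(2, {"format": "iceberg"})
formats = [s.table_properties["format"] for s in splits]
```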
I can't tell if the interfaces are under-specified, or good but lacking implementation details, so I suspect there are more attributes/methods that could be added to these interfaces. For example, these specific implementations are missing and hard to derive from the interfaces:
I think it would be helpful to have a concrete implementation of pyiceberg, e.g.

```python
# Minimal viable implementation to validate the design
class PyIcebergDataLoader(OpenHouseDataLoader):
    def create_splits(self, table, columns, context):
        # 1. Load table from PyIceberg catalog
        # 2. Get FileScanTasks via table.scan()
        # 3. Apply transformer to get LogicalPlan
        # 4. Return DataLoaderSplits
        ...

class PyIcebergSplit(DataLoaderSplit):
    def __call__(self):
        # 1. Create SessionContext
        # 2. Register UDFs
        # 3. Convert FileScanTask -> RecordBatchReader
        # 4. Register as table
        # 5. Execute plan, yield batches
        ...
```

perhaps in tests or a poc implementation module
The goal of this PR is to define the public interfaces, not all internal classes; to me that falls into the implementation category. I disagree that we should include everything, including an MVP implementation, in a single PR. The PR already touches 17 files and has 50+ comments; adding implementation would balloon it even more. We can iterate as needed in future PRs. Loading the table is an implementation detail the consumer should not be aware of in most cases.
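As a rough sketch of the interface-only surface being discussed (names taken from the PR summary; the method signatures are assumptions, not the PR's actual definitions):

```python
from abc import ABC, abstractmethod

class TableTransformer(ABC):
    # Hypothetical sketch of the extension point: rewrite a table's
    # logical plan (filters, projections, UDF columns) before splits
    # are produced. Implementations live in future PRs.
    @abstractmethod
    def transform(self, plan):
        ...

class IdentityTransformer(TableTransformer):
    def transform(self, plan):
        return plan  # no-op: pass the plan through unchanged

plan = {"table": "db.tbl"}
out = IdentityTransformer().transform(plan)
```

Keeping only the abstract surface in this PR lets concrete backends (e.g. a PyIceberg loader) be validated separately without freezing internal details.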
- Add DataLoaderContext dataclass to bundle execution context and customizations
- Replace create_splits() with __iter__() for more Pythonic iteration
- Accept database/table/branch as string parameters instead of TableIdentifier
- Make TableIdentifier internal (not exported in __all__)
- Update README quickstart to reflect new API
Co-Authored-By: Claude Opus 4.5 <[email protected]>
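The reworked API described above might look roughly like this. This is a sketch assuming the names in the commit message; the constructor signature and `options` field are illustrative, not the PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

@dataclass
class DataLoaderContext:
    # Bundles execution context and customizations into one object.
    options: dict = field(default_factory=dict)

class OpenHouseDataLoader:
    def __init__(self, database: str, table: str, branch: str = "main",
                 context: Optional[DataLoaderContext] = None):
        # Plain strings identify the table; TableIdentifier stays internal.
        self.database = database
        self.table = table
        self.branch = branch
        self.context = context or DataLoaderContext()

    def __iter__(self) -> Iterator:
        # A real implementation would plan and yield splits here.
        yield from ()

loader = OpenHouseDataLoader("db", "tbl", context=DataLoaderContext({"k": "v"}))
splits = list(loader)
```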
sumedhsakdeo left a comment:
Good to go from my standpoint. I expect we might iterate on internal integration points as we develop more. But ok with crossing that bridge in follow up PRs.
Public APIs look great.
ShreyeshArangath left a comment:
LGTM, thanks for working on this!
ShreyeshArangath left a comment:
One small discussion topic: should we bump the version to 0.6?
Summary
This is the initial commit for a Python data loader library for distributed loading of OpenHouse tables. This PR establishes the project structure, core interfaces, and CI integration.
Key Components
- OpenHouseDataLoader - Main API that creates distributable splits for parallel table loading
- TableIdentifier - Identifies tables by database, name, and optional branch
- DataLoaderSplits / DataLoaderSplit - Iterable splits that can be distributed across workers
- TableTransformer / UDFRegistry - Extension points for table transformations and UDFs

Project Setup
- uv for dependency management
- sync, check, test, all targets
- build-run-tests.yml CI workflow

Not included
Changes
Testing Done
I tested by running `make -C integrations/python/dataloader all`. This PR is project setup and interfaces, so no new functionality needs to be tested in this PR.

Additional Information