
Add distributed data loader project and core interfaces #440

Merged
cbb330 merged 11 commits into linkedin:main from robreeves:dataloader_project_skeleton on Feb 4, 2026

Conversation

@robreeves commented Jan 28, 2026

Summary

This is the initial commit for a Python data loader library for distributed loading of OpenHouse tables. This PR establishes the project structure, core interfaces, and CI integration.

Key Components

  • OpenHouseDataLoader - Main API that creates distributable splits for parallel table loading
  • TableIdentifier - Identifies tables by database, name, and optional branch
  • DataLoaderSplits / DataLoaderSplit - Iterable splits that can be distributed across workers
  • TableTransformer / UDFRegistry - Extension points for table transformations and UDFs
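
For orientation, here is a minimal sketch of how these pieces might fit together. The signatures are illustrative, stitched from this summary and the review commits further down, not copied from the merged code; TableTransformer and UDFRegistry are omitted.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator, Mapping, Sequence
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TableIdentifier:
    """Identifies a table by database, name, and optional branch."""

    database: str
    table: str
    branch: str | None = None


class DataLoaderSplit(ABC):
    """A unit of work that one worker can load independently of the others."""

    @abstractmethod
    def __iter__(self) -> Iterator[Any]:
        """Yield record batches for this split."""


class OpenHouseDataLoader(ABC):
    """Creates distributable splits for parallel table loading."""

    @abstractmethod
    def create_splits(
        self,
        table: TableIdentifier,
        columns: Sequence[str] | None = None,  # None means all columns
        context: Mapping[str, str] | None = None,  # read-only, never mutated
    ) -> Iterable[DataLoaderSplit]:
        """Plan the read and return splits to distribute across workers."""
```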

Project Setup

  • Python 3.12+ with uv for dependency management
  • Ruff for linting and formatting
  • Makefile with sync, check, test, all targets
  • Integrated into build-run-tests.yml CI workflow

Not included

  • Publishing the new Python package to PyPI; that will happen in a later PR.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

I tested by running make -C integrations/python/dataloader all. This PR is project setup and interfaces so no new functionality needs to be tested in this PR.

```
uv run ruff check src/ tests/
All checks passed!
uv run ruff format --check src/ tests/
10 files already formatted
uv run pytest
=========================== test session starts ===========================
platform darwin -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/roreeves/li/openhouse_oss/integrations/python/dataloader
configfile: pyproject.toml
collected 1 item

tests/test_data_loader.py .                                          [100%]

============================ 1 passed in 0.01s ============================
```

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@robreeves changed the title from "[WIP] Data loader interfaces" to "[WIP] Add distributed data loader project and core interfaces" on Jan 29, 2026
@robreeves force-pushed the dataloader_project_skeleton branch from 0f2cd6c to 41a0559 on January 30, 2026 00:14
@robreeves marked this pull request as ready for review January 30, 2026 00:17
@robreeves changed the title from "[WIP] Add distributed data loader project and core interfaces" to "Add distributed data loader project and core interfaces" on Jan 30, 2026

@sumedhsakdeo left a comment

Thanks Rob. Nice PR! This is really coming together. I left some comments; please let me know what you think.

robreeves and others added 8 commits February 2, 2026 13:32
Changed context parameter type from dict[str, str] to Mapping[str, str]
in data_loader.py and table_transformer.py. This signals that the
functions will not mutate the context passed by callers.

Also made context a required parameter in create_splits().

Co-Authored-By: Claude Opus 4.5 <[email protected]>
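
As a toy illustration (not code from this PR; the function name is hypothetical), typing the parameter as Mapping is what lets type checkers enforce the no-mutation contract:

```python
from collections.abc import Mapping


def create_splits_ctx(context: Mapping[str, str]) -> list[str]:
    # Mapping has no __setitem__, so a line like
    #   context["key"] = "value"
    # would be rejected by type checkers, documenting that the
    # caller's dict is left untouched.
    return [f"{key}={value}" for key, value in context.items()]


print(create_splits_ctx({"branch": "main"}))  # a plain dict still works
```
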
Makes DataLoaderSplit directly iterable, which is the standard Python
pattern for iterables. Removes __call__ as it's not idiomatic for iteration.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Column order matters in SQL (SELECT a, b != SELECT b, a). Using Sequence
instead of set preserves ordering and gives callers flexibility in what
collection type they pass.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Provides flexibility in what collection type can be passed while still
guaranteeing iteration support.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace pip install uv with official astral-sh/setup-uv@v7 action
- Enable caching for faster CI runs
- Use explicit Python 3.12 instead of 3.x

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- columns defaults to None (SELECT * behavior)
- context defaults to None (empty context)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Delete DataLoaderSplits class (unnecessary abstraction)
- Add table_properties to DataLoaderSplit for driver/executor access
- Return Iterable[DataLoaderSplit] directly from create_splits
- Simpler API: splits = loader.create_splits(...); split.table_properties

Co-Authored-By: Claude Opus 4.5 <[email protected]>
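
A runnable toy sketch of the simplified flow this commit describes; all names here (FakeSplit and so on) are stand-ins, not the library's classes:

```python
from collections.abc import Iterator


class FakeSplit:
    """Stand-in for DataLoaderSplit: directly iterable, with table_properties."""

    def __init__(self, rows: list[int], table_properties: dict[str, str]):
        self.rows = rows
        self.table_properties = table_properties

    def __iter__(self) -> Iterator[int]:
        return iter(self.rows)


def create_splits(num_splits: int) -> list[FakeSplit]:
    # No DataLoaderSplits wrapper: the splits come back directly.
    props = {"format": "iceberg"}
    return [FakeSplit([i * 2, i * 2 + 1], props) for i in range(num_splits)]


for split in create_splits(2):
    print(split.table_properties)  # available on the driver and executors
    for row in split:  # each split is directly iterable
        print(row)
```
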
@cbb330 commented Feb 3, 2026

I can't tell whether the interfaces are under-specified or sound but lacking implementation details, so I suspect there are more attributes/methods that could be added to these interfaces.

For example, these specific implementations are missing and hard to derive from the interfaces:

  1. loading a table from the catalog
  2. applying transform and using read splits

I think it would be helpful to have a concrete implementation using PyIceberg, e.g.:

```python
# Minimal viable implementation to validate the design
class PyIcebergDataLoader(OpenHouseDataLoader):
    def create_splits(self, table, columns, context):
        # 1. Load table from PyIceberg catalog
        # 2. Get FileScanTasks via table.scan()
        # 3. Apply transformer to get LogicalPlan
        # 4. Return DataLoaderSplits
        ...


class PyIcebergSplit(DataLoaderSplit):
    def __call__(self):
        # 1. Create SessionContext
        # 2. Register UDFs
        # 3. Convert FileScanTask → RecordBatchReader
        # 4. Register as table
        # 5. Execute plan, yield batches
        ...
```

Perhaps this could go in tests or a PoC implementation module.

@robreeves replied:

> I can't tell if the interfaces are under-specified or good but lacking implementation details. […] perhaps in tests or a poc implementation module

The goal of this PR is to define the public interfaces, not all of the internal classes; to me that falls into the implementation category. I disagree that we should include everything, including an MVP implementation, in a single PR. The PR already touches 17 files and has 50+ comments, and adding an implementation would balloon it even more. We can iterate as needed in future PRs. Loading the table is an implementation detail that the consumer should not be aware of in most cases.

- Add DataLoaderContext dataclass to bundle execution context and customizations
- Replace create_splits() with __iter__() for more Pythonic iteration
- Accept database/table/branch as string parameters instead of TableIdentifier
- Make TableIdentifier internal (not exported in __all__)
- Update README quickstart to reflect new API

Co-Authored-By: Claude Opus 4.5 <[email protected]>
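
A toy sketch of the reworked shape described in this commit; everything beyond the names in the commit message (for example the options field and ToyDataLoader) is an assumption, not the merged API:

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class DataLoaderContext:
    """Bundles execution context and customizations (per the commit above)."""

    options: dict[str, str] = field(default_factory=dict)  # field name assumed


class ToyDataLoader:
    """Stand-in loader: takes plain strings and is itself iterable."""

    def __init__(
        self,
        database: str,
        table: str,
        branch: str | None = None,
        context: DataLoaderContext | None = None,
    ):
        self.database, self.table, self.branch = database, table, branch
        self.context = context or DataLoaderContext()

    def __iter__(self) -> Iterator[str]:
        # A real loader would plan splits here; yield placeholders instead.
        yield f"{self.database}.{self.table}-split-0"
        yield f"{self.database}.{self.table}-split-1"


for split in ToyDataLoader("db", "events"):
    print(split)
```
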
@sumedhsakdeo left a comment

Good to go from my standpoint. I expect we might iterate on internal integration points as we develop more. But ok with crossing that bridge in follow up PRs.

Public APIs look great.

@ShreyeshArangath left a comment

LGTM, thanks for working on this!

@ShreyeshArangath left a comment

One small discussion topic: should we bump the version to 0.6?

@cbb330 merged commit 581b704 into linkedin:main on Feb 4, 2026 (1 check passed)