
Add distributed data loader project and core interfaces #440

Merged
cbb330 merged 11 commits into linkedin:main from robreeves:dataloader_project_skeleton on Feb 4, 2026

Conversation

@robreeves commented Jan 28, 2026

Summary

This is the initial commit for a Python data loader library for distributed loading of OpenHouse tables. This PR establishes the project structure, core interfaces, and CI integration.

Key Components

  • OpenHouseDataLoader - Main API that creates distributable splits for parallel table loading
  • TableIdentifier - Identifies tables by database, name, and optional branch
  • DataLoaderSplits / DataLoaderSplit - Iterable splits that can be distributed across workers
  • TableTransformer / UDFRegistry - Extension points for table transformations and UDFs
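
For orientation, here is a minimal sketch of how these pieces might fit together. The signatures are illustrative, stitched from this summary and the review commits further down, not copied from the merged code; TableTransformer and UDFRegistry are omitted.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator, Mapping, Sequence
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TableIdentifier:
    """Identifies a table by database, name, and optional branch."""

    database: str
    table: str
    branch: str | None = None


class DataLoaderSplit(ABC):
    """A unit of work that one worker can load independently of the others."""

    @abstractmethod
    def __iter__(self) -> Iterator[Any]:
        """Yield record batches for this split."""


class OpenHouseDataLoader(ABC):
    """Creates distributable splits for parallel table loading."""

    @abstractmethod
    def create_splits(
        self,
        table: TableIdentifier,
        columns: Sequence[str] | None = None,  # None means all columns
        context: Mapping[str, str] | None = None,  # read-only, never mutated
    ) -> Iterable[DataLoaderSplit]:
        """Plan the read and return splits to distribute across workers."""
```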

Project Setup

  • Python 3.12+ with uv for dependency management
  • Ruff for linting and formatting
  • Makefile with sync, check, test, all targets
  • Integrated into build-run-tests.yml CI workflow

Not included

  • Publishing the new Python package to PyPI; that will happen in a later PR.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

I tested by running make -C integrations/python/dataloader all. This PR is project setup and interfaces so no new functionality needs to be tested in this PR.

```
uv run ruff check src/ tests/
All checks passed!
uv run ruff format --check src/ tests/
10 files already formatted
uv run pytest
=========================== test session starts ===========================
platform darwin -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/roreeves/li/openhouse_oss/integrations/python/dataloader
configfile: pyproject.toml
collected 1 item

tests/test_data_loader.py .                                          [100%]

============================ 1 passed in 0.01s ============================
```

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@robreeves changed the title from "[WIP] Data loader interfaces" to "[WIP] Add distributed data loader project and core interfaces" on Jan 29, 2026
@robreeves force-pushed the dataloader_project_skeleton branch from 0f2cd6c to 41a0559 on January 30, 2026 00:14
@robreeves marked this pull request as ready for review January 30, 2026 00:17
@robreeves changed the title from "[WIP] Add distributed data loader project and core interfaces" to "Add distributed data loader project and core interfaces" on Jan 30, 2026

@sumedhsakdeo left a comment

Thanks Rob. Nice PR! This is really coming together. I left some comments; please let me know what you think.

robreeves and others added 8 commits February 2, 2026 13:32
Changed context parameter type from dict[str, str] to Mapping[str, str]
in data_loader.py and table_transformer.py. This signals that the
functions will not mutate the context passed by callers.

Also made context a required parameter in create_splits().

Co-Authored-By: Claude Opus 4.5 <[email protected]>
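
As a toy illustration (not code from this PR; the function name is hypothetical), typing the parameter as Mapping is what lets type checkers enforce the no-mutation contract:

```python
from collections.abc import Mapping


def create_splits_ctx(context: Mapping[str, str]) -> list[str]:
    # Mapping has no __setitem__, so a line like
    #   context["key"] = "value"
    # would be rejected by type checkers, documenting that the
    # caller's dict is left untouched.
    return [f"{key}={value}" for key, value in context.items()]


print(create_splits_ctx({"branch": "main"}))  # a plain dict still works
```
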
Makes DataLoaderSplit directly iterable, which is the standard Python
pattern for iterables. Removes __call__ as it's not idiomatic for iteration.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Column order matters in SQL (SELECT a, b != SELECT b, a). Using Sequence
instead of set preserves ordering and gives callers flexibility in what
collection type they pass.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Provides flexibility in what collection type can be passed while still
guaranteeing iteration support.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace pip install uv with official astral-sh/setup-uv@v7 action
- Enable caching for faster CI runs
- Use explicit Python 3.12 instead of 3.x

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- columns defaults to None (SELECT * behavior)
- context defaults to None (empty context)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Delete DataLoaderSplits class (unnecessary abstraction)
- Add table_properties to DataLoaderSplit for driver/executor access
- Return Iterable[DataLoaderSplit] directly from create_splits
- Simpler API: splits = loader.create_splits(...); split.table_properties

Co-Authored-By: Claude Opus 4.5 <[email protected]>
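
A runnable toy sketch of the simplified flow this commit describes; all names here (FakeSplit and so on) are stand-ins, not the library's classes:

```python
from collections.abc import Iterator


class FakeSplit:
    """Stand-in for DataLoaderSplit: directly iterable, with table_properties."""

    def __init__(self, rows: list[int], table_properties: dict[str, str]):
        self.rows = rows
        self.table_properties = table_properties

    def __iter__(self) -> Iterator[int]:
        return iter(self.rows)


def create_splits(num_splits: int) -> list[FakeSplit]:
    # No DataLoaderSplits wrapper: the splits come back directly.
    props = {"format": "iceberg"}
    return [FakeSplit([i * 2, i * 2 + 1], props) for i in range(num_splits)]


for split in create_splits(2):
    print(split.table_properties)  # available on the driver and executors
    for row in split:  # each split is directly iterable
        print(row)
```
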
@cbb330 commented Feb 3, 2026

I can't tell whether the interfaces are under-specified or sound but lacking implementation details, so I suspect there are more attributes/methods that could be added to these interfaces.

For example, these specific implementations are missing and hard to derive from the interfaces:

  1. loading a table from the catalog
  2. applying transform and using read splits

I think it would be helpful to have a concrete implementation using PyIceberg, e.g.:

```python
# Minimal viable implementation to validate the design
class PyIcebergDataLoader(OpenHouseDataLoader):
    def create_splits(self, table, columns, context):
        # 1. Load table from PyIceberg catalog
        # 2. Get FileScanTasks via table.scan()
        # 3. Apply transformer to get LogicalPlan
        # 4. Return DataLoaderSplits
        ...


class PyIcebergSplit(DataLoaderSplit):
    def __call__(self):
        # 1. Create SessionContext
        # 2. Register UDFs
        # 3. Convert FileScanTask → RecordBatchReader
        # 4. Register as table
        # 5. Execute plan, yield batches
        ...
```

Perhaps this could go in tests or a PoC implementation module.

@robreeves replied:

> I can't tell if the interfaces are under-specified or good but lacking implementation details. […] perhaps in tests or a poc implementation module

The goal of this PR is to define the public interfaces, not all of the internal classes; to me that falls into the implementation category. I disagree that we should include everything, including an MVP implementation, in a single PR. The PR already touches 17 files and has 50+ comments, and adding an implementation would balloon it even more. We can iterate as needed in future PRs. Loading the table is an implementation detail that the consumer should not be aware of in most cases.

- Add DataLoaderContext dataclass to bundle execution context and customizations
- Replace create_splits() with __iter__() for more Pythonic iteration
- Accept database/table/branch as string parameters instead of TableIdentifier
- Make TableIdentifier internal (not exported in __all__)
- Update README quickstart to reflect new API

Co-Authored-By: Claude Opus 4.5 <[email protected]>
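
A toy sketch of the reworked shape described in this commit; everything beyond the names in the commit message (for example the options field and ToyDataLoader) is an assumption, not the merged API:

```python
from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class DataLoaderContext:
    """Bundles execution context and customizations (per the commit above)."""

    options: dict[str, str] = field(default_factory=dict)  # field name assumed


class ToyDataLoader:
    """Stand-in loader: takes plain strings and is itself iterable."""

    def __init__(
        self,
        database: str,
        table: str,
        branch: str | None = None,
        context: DataLoaderContext | None = None,
    ):
        self.database, self.table, self.branch = database, table, branch
        self.context = context or DataLoaderContext()

    def __iter__(self) -> Iterator[str]:
        # A real loader would plan splits here; yield placeholders instead.
        yield f"{self.database}.{self.table}-split-0"
        yield f"{self.database}.{self.table}-split-1"


for split in ToyDataLoader("db", "events"):
    print(split)
```
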
@sumedhsakdeo left a comment

Good to go from my standpoint. I expect we might iterate on internal integration points as we develop more. But ok with crossing that bridge in follow up PRs.

Public APIs look great.

@ShreyeshArangath left a comment

LGTM, thanks for working on this!

@ShreyeshArangath left a comment

One small discussion topic: should we bump the version to 0.6?

@cbb330 merged commit 581b704 into linkedin:main on Feb 4, 2026 (1 check passed)