Merge long running feature branch epic-TIMX-515 #167

Merged
ghukill merged 42 commits into main from epic-TIMX-515
Aug 15, 2025
Conversation


@ghukill ghukill commented Aug 15, 2025

Purpose and background context

This PR merges the long running feature branch epic-TIMX-515 into main. Once merged, we will tag a v3.0 release.

This is safe to do anytime: TIMDEX components that utilize this library, such as Transmog, the pipeline lambdas, and TIM, would all need a) an explicit `make update` run, and b) a tagged release for production (not coincidentally, all work planned in TIMX-537).

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES and NO: without updated dependencies and a tagged release, there are no real changes in the TIMDEX ecosystem. But Transmog, the pipeline lambdas, and TIM all have some minor adjustments to align with the v3.0 release and will pick up the new version then.

What are the relevant tickets?

ghukill added 30 commits August 1, 2025 15:28
Why these changes are being introduced:

With the addition of dataset/metadata assets in S3, we will
need to perform actions like downloading the static DB file,
uploading a new one, and deleting append deltas.

How this addresses that need:

Creates new utility class S3Client that performs these
actions.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
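
For illustration, a minimal sketch of what an S3 utility class like this might look like using boto3; the method names and constructor signature here are assumptions, not the library's actual API:

```python
import boto3


class S3Client:
    """Hypothetical utility wrapper around boto3 for dataset/metadata assets in S3."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def download_file(self, key: str, local_path: str) -> None:
        # e.g. fetch the static metadata DB file for local use
        self.client.download_file(self.bucket, key, local_path)

    def upload_file(self, local_path: str, key: str) -> None:
        # e.g. publish a newly built static metadata DB file
        self.client.upload_file(local_path, self.bucket, key)

    def delete_object(self, key: str) -> None:
        # e.g. remove an append delta once it has been merged
        self.client.delete_object(Bucket=self.bucket, Key=key)
```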
Why these changes are being introduced:

The current overarching work is to support the creation and
reading of a static metadata database file and append deltas.  To
get there, very little of the original TIMDEXDatasetMetadata class
is needed or wanted.

This commit begins the process of rebuilding TIMDEXDatasetMetadata,
oriented around managing a static metadata database file, and providing
a readonly projection over that and append delta parquet files.

How this addresses that need:

TIMDEXDatasetMetadata is almost completely rebuilt, with the first
functionality being the creation of the static metadata file by
scanning the ETL records.  Then, the ability to remotely attach in
readonly mode to this metadata database file for reading.

Note: these changes are breaking.  TIMDEXDataset cannot provide
"current" records and many unit tests are broken.  This will be
addressed in future commits as we build this class back up
with new functionality.

Side effects of this change:
* TIMDEXDataset cannot provide current records
* Unit tests are either temporarily skipped or failing

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
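
As a rough sketch of the read-only remote attachment described above (the S3 path, database alias, and table name are illustrative assumptions):

```python
# Illustrative only; assumes S3 credentials/secrets are already configured.
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
# attach the static metadata database file remotely, in read-only mode
conn.execute(
    "ATTACH 's3://example-bucket/dataset/metadata/metadata.duckdb' "
    "AS static_metadata (READ_ONLY);"
)
print(conn.execute("SELECT count(*) FROM static_metadata.records").fetchone())
```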
Why these changes are being introduced:

While the TIMDEXDatasetMetadata class is rebuilt, TIMDEXDataset
itself can no longer provide "current" records from the dataset
as it has no metadata to work with.

This is temporary until TIMDEXDatasetMetadata is rebuilt, and
TIMDEXDataset gets new functionality based on *that* new
metadata.

How this addresses that need:
* Any reference to "current records" is removed

Side effects of this change:
* TIMDEXDataset cannot provide current records

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
Why these changes are being introduced:

This is a small change now that will lead to a larger change later.
The TIMDEX dataset is getting more structure, and this means we will
want to initialize a TIMDEXDataset instance with the root of the dataset,
but then internally there will be more opinionated structure about where
files are read from and written to.

How this addresses that need:

A new property 'data_records_root' is added to TIMDEXDataset that
mirrors similar properties in TIMDEXDatasetMetadata.  This informs any
operations that need to read or write ETL records where precisely they
are in the dataset.

At this time only .write() utilizes it, but in a future ticket the
load method will be heavily reworked (if not outright removed) and
this property will be fully integrated.

This is needed now to continue updates to TIMDEXDatasetMetadata
for TIMX-530.

Side effects of this change:
* Initialization of TIMDEXDataset should provide the true dataset root,
not point to /data/records.  The pipeline lambda currently does this,
but will be updated in TIMX-531.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
* https://mitlibraries.atlassian.net/browse/TIMX-531
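
A minimal sketch of the new property, assuming ETL records live under a data/records path below the dataset root (the exact layout is inferred from the side-effects note above):

```python
import os


class TIMDEXDataset:
    def __init__(self, location: str):
        # 'location' is the true dataset root, not .../data/records
        self.location = location

    @property
    def data_records_root(self) -> str:
        """Where ETL record parquet files are read from and written to."""
        return os.path.join(self.location, "data", "records")
```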
Why these changes are being introduced:

With the big changes to TIMDEXDatasetMetadata comes the need to virtually
rewrite the test suite for that class.

The changes to TIMDEXDatasetMetadata also influence tests for TIMDEXDataset,
both how it is loaded and how 'current' record reading is tested.

How this addresses that need:

This begins with some basic tests around the loading, creating, and attaching
of a static database file for TIMDEXDatasetMetadata.  Future tests will
more fully exercise the final views and tables created.

This commit also *temporarily* skips a bunch of tests for TIMDEXDataset
that will not pass until the ability to limit to 'current' records
is reinstated with the updated TIMDEXDatasetMetadata.

Side effects of this change:
* Test suite passes, but multiple tests are temporarily skipped.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
…data-db-file

TIMX 530 - rebuild TIMDEXDatasetMetadata with static database file approach
Why these changes are being introduced:

With the new metadata approach, an important component is
"append deltas".  These are standalone parquet files that contain
metadata about the records added to the ETL records data parquet
files.

These eventually are merged into the main static metadata file,
but until then are needed for metadata queries.

How this addresses that need:

During TIMDEXDataset.write(), after each ETL parquet file is
written we lean on new method TIMDEXDatasetMetadata.write_append_delta_duckdb()
to read metadata from that file and write a new append delta parquet
file.

This is performed entirely in a DuckDB context, allowing for simple
column selection.

Side effects of this change:
* During dataset write, append deltas will be created.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-527
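
A hedged sketch of what writing an append delta in a DuckDB context might look like; the metadata column names and file paths here are assumptions:

```python
import duckdb


def write_append_delta(etl_parquet_path: str, append_delta_path: str) -> None:
    """Select metadata columns from a just-written ETL parquet file and
    write them out as a standalone append delta parquet file."""
    conn = duckdb.connect()
    conn.execute(
        f"""
        COPY (
            SELECT timdex_record_id, source, run_date, run_type, run_id, action
            FROM read_parquet('{etl_parquet_path}')
        ) TO '{append_delta_path}' (FORMAT PARQUET);
        """
    )
```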
Why these changes are being introduced:

Much of the refactor work has been building to provide metadata
views for all records and the current version of a given TIMDEX
record, views we had previously but calculated on demand each time.

How this addresses that need:

When setting up the DuckDB context for TIMDEXDatasetMetadata,
we create views that build from a) the static metadata database
file and b) the append deltas, providing a projection over all
metadata records.

Two primary views are added:
'records': all records in the ETL parquet dataset
'current_records': filter to the most recent version of any
timdex_record_id from 'records'

These views will provide the metadata for future work that
(re)implements filtering to current records during read.

Side effects of this change:
* Views are created on TIMDEXDatasetMetadata initialization

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
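
The two views might be shaped roughly like the following; the attached database alias, append delta path, and the column used to order versions are assumptions:

```python
import duckdb

conn = duckdb.connect()  # assumes the static metadata DB is already attached
conn.execute(
    """
    CREATE OR REPLACE VIEW records AS
    SELECT * FROM static_metadata.records                   -- static DB file
    UNION ALL
    SELECT * FROM read_parquet('append_deltas/*.parquet');  -- append deltas
    """
)
conn.execute(
    """
    CREATE OR REPLACE VIEW current_records AS
    SELECT * FROM records
    QUALIFY row_number() OVER (
        PARTITION BY timdex_record_id ORDER BY run_timestamp DESC
    ) = 1;
    """
)
```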
Why these changes are being introduced:

The test suite was built piecemeal as the library grew,
and over time the fixture names were becoming clunky and
confusing.

How this addresses that need:

Rename, simplify, and reorganize test fixtures.  This
requires coordinated changes in tests, nearly entirely
just pointing at new fixture names.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
Why these changes are being introduced:

As the TIMDEXDatasetMetadata becomes more integrated, there is
less need to be explicit about how we load the pyarrow dataset.

Formerly, the method .load() needed to be called manually and
supported options like 'current_records' or 'include_parquet_files'.
This also reflected a time when 'TIMDEXDataset.load()' implied that
"loading" concerned only the pyarrow dataset.  With the introduction of
metadata, it is better to be specific that we are loading a pyarrow
dataset, which is only one of many assets associated with a
TIMDEXDataset instance.

How this addresses that need:

Renames .load() to .load_pyarrow_dataset() to be explicit about
what is happening.

We no longer store the pyarrow dataset filesystem or paths on self,
as they are only used briefly during this dataset load.  We can get
them anytime via .dataset.

Most importantly, we limit the root 'location' used to initialize a
TIMDEXDataset instance to a string only: the root of the dataset.
Now that we no longer allow a list of strings at that level, we can trust
that self.location is a string and the root of the TIMDEX dataset.

Side effects of this change:
* TIMDEXDataset and TIMDEXDatasetMetadata can only be initialized
with a string, which is the root of the TIMDEX dataset.  From there,
both know where their assets can be found.
* You cannot "pre-filter" the pyarrow dataset when loading, which had
confusing overlap with the read methods; the read methods themselves
may change somewhat dramatically now that we have metadata to use.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-533
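
Roughly, and assuming the filesystem is resolved from the location string and records live under data/records, the renamed loader might look like this:

```python
import pyarrow.dataset as ds
from pyarrow import fs


class TIMDEXDataset:
    def __init__(self, location: str):
        self.location = location  # string only: the root of the TIMDEX dataset
        self.dataset = None

    def load_pyarrow_dataset(self) -> ds.Dataset:
        """Explicitly load the pyarrow dataset of ETL record parquet files."""
        filesystem, path = fs.FileSystem.from_uri(self.location)
        self.dataset = ds.dataset(
            f"{path}/data/records", format="parquet", filesystem=filesystem
        )
        return self.dataset
```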
TIMX 533 - Load pyarrow dataset on TIMDEXDataset init
Why these changes are being introduced:

TIMDEXDatasetMetadata (TDM) has a DuckDB context for metadata
attachments and views.  TIMDEXDataset (TD) will need one for
DuckDB queries that return actual ETL data, not just metadata.

How this addresses that need:
* TD reuses the TDM.conn DuckDB connection and builds upon it
* TDM DuckDB connection builds metadata related views in a
"metadata" schema
* TD will build views in a "data" schema

Side effects of this change:
* All TDM metadata views are under a "metadata" schema

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
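
Minimal sketch of the shared-connection arrangement, assuming TDM owns the DuckDB connection and each class builds its views under its own schema (constructor shapes simplified for illustration):

```python
import duckdb


class TIMDEXDatasetMetadata:
    def __init__(self) -> None:
        self.conn = duckdb.connect()
        # metadata-related views live under a dedicated "metadata" schema
        self.conn.execute("CREATE SCHEMA IF NOT EXISTS metadata;")


class TIMDEXDataset:
    def __init__(self, metadata: TIMDEXDatasetMetadata) -> None:
        # reuse TDM's DuckDB connection rather than opening a second one
        self.conn = metadata.conn
        # data-related views will live under a separate "data" schema
        self.conn.execute('CREATE SCHEMA IF NOT EXISTS "data";')
```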
Why these changes are being introduced:

This commit is a culmination of work to elevate metadata about
ETL records to the point it can be used to improve the speed
and efficiency of data queries.

While the signatures of the read methods remain mostly the same,
they now expose a 'where' clause that accepts raw SQL to
filter the results, allowing for more advanced querying beyond
the simple key/value DatasetFilters.

Additionally, and equally important, data retrieval now comes
directly from DuckDB instead of lower-level pyarrow dataset reads.
Overall complexity remains about the same, but we have shifted focus
to DuckDB table and view preparation and SQL construction, which
also pays dividends in other contexts.

It is anticipated this will set us up well for other data we may
add to the TIMDEX dataset, e.g. vector embeddings or fulltext,
which we may want to query and retrieve.

How this addresses that need:

As before, all read methods eventually call
TIMDEXDataset.read_batches_iter(), which now performs a two-part
process: first quickly querying metadata records, then using that
information to prune the heavier data retrieval.

SQLAlchemy is used to model DuckDB tables and views so that
we can preserve the simpler key/value DatasetFilters, e.g.
source='libguides' or run_type='daily', which will likely
represent the majority of public API needs, by converting those
key/value pairs into a SQL WHERE clause programmatically.  This is
done without the need for complex string interpolation and
escaping.

The overall input and output signatures are largely the same, but
the underlying approach to querying the ETL parquet records now
utilizes DuckDB much more heavily, while also providing a SQL
'escape hatch' if the keyword filters don't suffice.

Side effects of this change:
* None!  Transmog and TIM can call TDA in the same way as before.
The underlying approach is different, but the signatures are
mostly the same.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
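
A sketch of converting the key/value DatasetFilters into a WHERE clause with SQLAlchemy; the table model, column names, and helper function below are assumptions (the duckdb_engine dialect added in a later commit would be used when executing against DuckDB):

```python
from sqlalchemy import Column, MetaData, String, Table, select

metadata_obj = MetaData()
current_records = Table(
    "current_records",
    metadata_obj,
    Column("timdex_record_id", String),
    Column("source", String),
    Column("run_type", String),
    schema="metadata",
)


def build_metadata_query(**filters: str):
    """Turn simple key/value filters into a SELECT with a programmatic WHERE."""
    stmt = select(current_records.c.timdex_record_id)
    for key, value in filters.items():
        stmt = stmt.where(current_records.c[key] == value)
    return stmt


# e.g. the filters mentioned above, with no manual string interpolation:
stmt = build_metadata_query(source="libguides", run_type="daily")
print(stmt.compile(compile_kwargs={"literal_binds": True}))
```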
Why these changes are being introduced:

During the refactor to use dataset metadata for querying,
we had to temporarily skip tests that tested for dataset
filtering and current records limiting.  With the SQL backed
querying now in place, these tests can be reinstated.

Note that a future commit will likely *add* a couple more
tests for the new, optional 'WHERE' clause functionality.

How this addresses that need:
* No tests are skipped.
* Dataset filtering tests still remain, but the key/value
filters are just handled differently under the hood.
* Tests for current records no longer use
.load(current_records=True) but instead utilize the
DuckDB table via table='current_records' within a
read method.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
…hods

TIMX 529 - read methods utilize SQL + metadata context
Why these changes are being introduced:

Until we update the build approach, possibly when we migrate to uv,
we need to add any dependencies to *both* Pipfile (for local dev)
and pyproject.toml (for building by other applications).

How this addresses that need:
* Adds duckdb_engine and sqlalchemy to pyproject.toml

Relevant ticket(s):
* None
Why these changes are being introduced:

It sounds like the best option for ECS tasks is using 'instance'
as the provider chain type, whereas for local dev and/or lambdas
it might be 'sso' or 'env'.  Not having 'instance' appears to cause
failures in the ECS task.

How this addresses that need:

By omitting the 'chain' option entirely from DuckDB secret creation
we allow the default provider chain to take effect.  Given our fairly
normal usage of DuckDB and S3, this is probably the best approach.

Side effects of this change:
* DuckDB to S3 connections work in ECS

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-540
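
For reference, a sketch of the secret creation with the CHAIN option omitted (the secret name is an assumption; the aws and httpfs extensions are assumed available):

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL aws; LOAD aws; INSTALL httpfs; LOAD httpfs;")
conn.execute(
    """
    CREATE OR REPLACE SECRET timdex_s3 (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN
        -- no CHAIN 'instance'/'sso'/'env': let the default provider chain apply
    );
    """
)
```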
Why these changes are being introduced:

In the AWS Lambda context, the HOME env var is empty string ''.
DuckDB has a canned error response for this, suggesting to,
"Specify a home directory using the SET
home_directory='/path/to/dir' option".

How this addresses that need:

If HOME is unset or empty string, set an explicit secret and extension
directory at `/tmp/.duckdb/*` locations.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-541
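
A minimal sketch of the workaround, using the `/tmp/.duckdb/*` locations from the commit message (the surrounding connection setup is an assumption):

```python
import os

import duckdb

conn = duckdb.connect()
if not os.environ.get("HOME"):
    # AWS Lambda: HOME may be unset or '', which trips DuckDB's defaults
    conn.execute("SET secret_directory='/tmp/.duckdb/secrets';")
    conn.execute("SET extension_directory='/tmp/.duckdb/extensions';")
```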
…ection

TIMX 540 - adjust AWS credential chain for ECS context
…tion-lambda-context

TIMX 541 - handle missing HOME env in AWS Lambda
Why these changes are being introduced:

The shift to elevating dataset metadata to an expected and required asset,
and utilizing DuckDB contexts for metadata and data retrieval, is a
sizable and breaking change.  This feels like it warrants a major
version bump.

How this addresses that need:
* Bumps internal library version to 3.0

Side effects of this change:
* `make update` commands from TIMDEX components will pick up this new version

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-537
@ghukill ghukill requested a review from a team August 15, 2025 13:57

@ehanson8 ehanson8 left a comment


Go forth and merge the behemoth

Why these changes are being introduced:
* The method for merging append deltas into the static metadata database file
needs the filenames of append deltas to easily identify which files
to delete (once merged). Prior to this change, the append deltas view
only had a 'filename' column, which referred to the path or S3 URI for
the TIMDEXDataset parquet file.

How this addresses that need:
* Set filename='append_delta_filename' when creating metadata.append_deltas view
* Explicitly select metadata column names when creating metadata.records view

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
Why these changes are being introduced:
* TDA requires a method to support regular merging of append
deltas into the static metadata.duckdb database file as new rows,
and then deleting the append deltas once merged.

How this addresses that need:
* Add 'merge_append_deltas' method
* Add unit test

Side effects of this change:
* It's worth noting that this method, when run, will delete the append
delta parquet files that existed in the directory at the time of
execution.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
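
Hedged sketch of what such a merge might involve, assuming local paths, a 'records' table in the static database, and matching column order; the actual method likely works against S3 via the utilities above:

```python
import glob
import os

import duckdb


def merge_append_deltas(metadata_db_path: str, append_deltas_glob: str) -> None:
    """Merge append delta rows into the static metadata DB, then delete them."""
    delta_files = sorted(glob.glob(append_deltas_glob))
    if not delta_files:
        return
    files_sql = ", ".join(f"'{path}'" for path in delta_files)
    conn = duckdb.connect(metadata_db_path)
    conn.execute(f"INSERT INTO records SELECT * FROM read_parquet([{files_sql}]);")
    conn.close()
    # only the deltas present at execution time are deleted (see side effects)
    for path in delta_files:
        os.remove(path)
```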
@coveralls

Pull Request Test Coverage Report for Build 16995910964

Details

  • 363 of 395 (91.9%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-3.0%) to 93.071%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| timdex_dataset_api/dataset.py | 94 | 97 | 96.91% |
| timdex_dataset_api/config.py | 3 | 8 | 37.5% |
| timdex_dataset_api/utils.py | 94 | 105 | 89.52% |
| timdex_dataset_api/metadata.py | 171 | 184 | 92.93% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| timdex_dataset_api/exceptions.py | 1 | 0.0% |

Totals Coverage Status

  • Change from base Build 16269635068: -3.0%
  • Covered Lines: 497
  • Relevant Lines: 534

💛 - Coveralls

@ghukill ghukill merged commit b7c3350 into main Aug 15, 2025
2 checks passed