Merge long running feature branch epic-TIMX-515 #167

Merged
ghukill merged 42 commits into main from epic-TIMX-515
Aug 15, 2025
Conversation


@ghukill ghukill commented Aug 15, 2025

Purpose and background context

This PR merges the long running feature branch epic-TIMX-515 into main. Once merged, we will tag a v3.0 release.

This is safe to do anytime: TIMDEX components that utilize this library, such as Transmog, the pipeline lambdas, and TIM, would all need a) an explicit `make update` run, and b) a tagged release for production (not coincidentally, all work planned in TIMX-537).

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES and NO: without updated dependencies and a tagged release, there are no real changes in the TIMDEX ecosystem. But Transmog, the pipeline lambdas, and TIM all have some minor adjustments to align with the v3.0 release and will pick up the new version then.

What are the relevant tickets?

ghukill added 30 commits August 1, 2025 15:28
Why these changes are being introduced:

With the addition of dataset/metadata assets in S3, we will
need to perform actions like downloading the static DB file,
uploading a new one, and deleting append deltas.

How this addresses that need:

Creates new utility class S3Client that performs these
actions.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
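
For illustration, a minimal sketch of what an S3 utility class like this might look like using boto3; the method names and constructor signature here are assumptions, not the library's actual API:

```python
import boto3


class S3Client:
    """Hypothetical utility wrapper around boto3 for dataset/metadata assets in S3."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def download_file(self, key: str, local_path: str) -> None:
        # e.g. fetch the static metadata DB file for local use
        self.client.download_file(self.bucket, key, local_path)

    def upload_file(self, local_path: str, key: str) -> None:
        # e.g. publish a newly built static metadata DB file
        self.client.upload_file(local_path, self.bucket, key)

    def delete_object(self, key: str) -> None:
        # e.g. remove an append delta once it has been merged
        self.client.delete_object(Bucket=self.bucket, Key=key)
```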
Why these changes are being introduced:

The current overarching work is to support the creation and
reading of a static metadata database file and append deltas.  To
get there, very little of the original TIMDEXDatasetMetadata class
is needed or wanted.

This commit begins the process of rebuilding TIMDEXDatasetMetadata,
oriented around managing a static metadata database file, and providing
a readonly projection over that and append delta parquet files.

How this addresses that need:

TIMDEXDatasetMetadata is almost completely rebuilt, with the first
functionality being the creation of the static metadata file by
scanning the ETL records.  Then, the ability to remotely attach in
readonly mode to this metadata database file for reading.

Note: these changes are breaking.  TIMDEXDataset cannot provide
"current" records and many unit tests are broken.  This will be
addressed in future commits as we build this class back up
with new functionality.

Side effects of this change:
* TIMDEXDataset cannot provide current records
* Unit tests are either temporarily skipped or failing

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
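
As a rough sketch of the read-only remote attachment described above (the S3 path, database alias, and table name are illustrative assumptions):

```python
# Illustrative only; assumes S3 credentials/secrets are already configured.
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
# attach the static metadata database file remotely, in read-only mode
conn.execute(
    "ATTACH 's3://example-bucket/dataset/metadata/metadata.duckdb' "
    "AS static_metadata (READ_ONLY);"
)
print(conn.execute("SELECT count(*) FROM static_metadata.records").fetchone())
```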
Why these changes are being introduced:

While the TIMDEXDatasetMetadata class is rebuilt, TIMDEXDataset
itself can no longer provide "current" records from the dataset
as it has no metadata to work with.

This is temporary until TIMDEXDatasetMetadata is rebuilt, and
TIMDEXDataset gets new functionality based on *that* new
metadata.

How this addresses that need:
* Any reference to "current records" is removed

Side effects of this change:
* TIMDEXDataset cannot provide current records

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
Why these changes are being introduced:

This is a small change now that will lead to a larger change later.
The TIMDEX dataset is getting more structure, and this means we will
want to initialize a TIMDEXDataset instance with the root of the dataset,
but then internally there will be more opinionated structure about where
files are read from and written to.

How this addresses that need:

A new property 'data_records_root' is added to TIMDEXDataset that
mirrors similar properties in TIMDEXDatasetMetadata.  This informs any
operations that need to read or write ETL records where precisely they
are in the dataset.

At this time only .write() utilizes it, but in a future ticket the
load method will be heavily reworked (if not outright removed) and
this property will be fully integrated.

This is needed now to continue updates to TIMDEXDatasetMetadata
for TIMX-530.

Side effects of this change:
* Initialization of TIMDEXDataset should provide the true dataset root,
not point to /data/records.  The pipeline lambda currently does this,
but will be updated in TIMX-531.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
* https://mitlibraries.atlassian.net/browse/TIMX-531
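
A minimal sketch of the new property, assuming ETL records live under a data/records path below the dataset root (the exact layout is inferred from the side-effects note above):

```python
import os


class TIMDEXDataset:
    def __init__(self, location: str):
        # 'location' is the true dataset root, not .../data/records
        self.location = location

    @property
    def data_records_root(self) -> str:
        """Where ETL record parquet files are read from and written to."""
        return os.path.join(self.location, "data", "records")
```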
Why these changes are being introduced:

With the big changes to TIMDEXDatasetMetadata comes the need to virtually
rewrite the test suite for that class.

The changes to TIMDEXDatasetMetadata also influence tests for TIMDEXDataset,
both how it is loaded and how 'current' record reading is tested.

How this addresses that need:

This begins with some basic tests around the loading, creating, and attaching
of a static database file for TIMDEXDatasetMetadata.  Future tests will
more fully exercise the final views and tables created.

This commit also *temporarily* skips a bunch of tests for TIMDEXDataset
that will not pass until the ability to limit to 'current' records
is reinstated with the updated TIMDEXDatasetMetadata.

Side effects of this change:
* Test suite passes, but multiple tests are temporarily skipped.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
…data-db-file

TIMX 530 - rebuild TIMDEXDatasetMetadata with static database file approach
Why these changes are being introduced:

With the new metadata approach, an important component is
"append deltas".  These are standalone parquet files that contain
metadata about the records added to the ETL records data parquet
files.

These eventually are merged into the main static metadata file,
but until then are needed for metadata queries.

How this addresses that need:

During TIMDEXDataset.write(), after each ETL parquet file is
written we lean on new method TIMDEXDatasetMetadata.write_append_delta_duckdb()
to read metadata from that file and write a new append delta parquet
file.

This is performed entirely in a DuckDB context, allowing for simple
column selection.

Side effects of this change:
* During dataset write, append deltas will be created.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-527
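
A hedged sketch of what writing an append delta in a DuckDB context might look like; the metadata column names and file paths here are assumptions:

```python
import duckdb


def write_append_delta(etl_parquet_path: str, append_delta_path: str) -> None:
    """Select metadata columns from a just-written ETL parquet file and
    write them out as a standalone append delta parquet file."""
    conn = duckdb.connect()
    conn.execute(
        f"""
        COPY (
            SELECT timdex_record_id, source, run_date, run_type, run_id, action
            FROM read_parquet('{etl_parquet_path}')
        ) TO '{append_delta_path}' (FORMAT PARQUET);
        """
    )
```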
Why these changes are being introduced:

Much of the refactor work has been building to provide metadata
views for all records and the current version of a given TIMDEX
record, views we had previously but calculated on demand each time.

How this addresses that need:

When setting up the DuckDB context for TIMDEXDatasetMetadata,
we create views that build from a) the static metadata database
file and b) the append deltas, providing a projection over all
metadata records.

Two primary views are added:
'records': all records in the ETL parquet dataset
'current_records': filter to the most recent version of any
timdex_record_id from 'records'

These views will provide the metadata for future work that
(re)implements filtering to current records during read.

Side effects of this change:
* Views are created on TIMDEXDatasetMetadata initialization

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
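
The two views might be shaped roughly like the following; the attached database alias, append delta path, and the column used to order versions are assumptions:

```python
import duckdb

conn = duckdb.connect()  # assumes the static metadata DB is already attached
conn.execute(
    """
    CREATE OR REPLACE VIEW records AS
    SELECT * FROM static_metadata.records                   -- static DB file
    UNION ALL
    SELECT * FROM read_parquet('append_deltas/*.parquet');  -- append deltas
    """
)
conn.execute(
    """
    CREATE OR REPLACE VIEW current_records AS
    SELECT * FROM records
    QUALIFY row_number() OVER (
        PARTITION BY timdex_record_id ORDER BY run_timestamp DESC
    ) = 1;
    """
)
```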
Why these changes are being introduced:

The test suite was built piecemeal as the library grew,
and over time the fixture names were becoming clunky and
confusing.

How this addresses that need:

Rename, simplify, and reorganize test fixtures.  This
requires coordinated changes in tests, nearly entirely
just pointing at new fixture names.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
Why these changes are being introduced:

As the TIMDEXDatasetMetadata becomes more integrated, there is
less need to be explicit about how we load the pyarrow dataset.

Formerly, the method .load() needed to be called manually and
supported options like 'current_records' or 'include_parquet_files'.
This also reflected a time when 'TIMDEXDataset.load()' implied that
"loading" concerned only the pyarrow dataset.  With the introduction of
metadata, it is better to be specific that we are loading a pyarrow
dataset, which is only one of many assets associated with a
TIMDEXDataset instance.

How this addresses that need:

Renames .load() to .load_pyarrow_dataset() to be explicit about
what is happening.

We no longer store the pyarrow dataset filesystem or paths on self,
as they are only used briefly during this dataset load.  We can get
them anytime via .dataset.

Most importantly, we limit the root 'location' used to initialize a
TIMDEXDataset instance to a string only: the root of the dataset.
Now that we no longer allow a list of strings at that level, we can trust
that self.location is a string and the root of the TIMDEX dataset.

Side effects of this change:
* TIMDEXDataset and TIMDEXDatasetMetadata can only be initialized
with a string, which is the root of the TIMDEX dataset.  From there,
both know where their assets can be found.
* You cannot "pre-filter" the pyarrow dataset when loading, which had
confusing overlap with the read methods; the read methods themselves
may change somewhat dramatically now that we have metadata to use.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-533
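
Roughly, and assuming the filesystem is resolved from the location string and records live under data/records, the renamed loader might look like this:

```python
import pyarrow.dataset as ds
from pyarrow import fs


class TIMDEXDataset:
    def __init__(self, location: str):
        self.location = location  # string only: the root of the TIMDEX dataset
        self.dataset = None

    def load_pyarrow_dataset(self) -> ds.Dataset:
        """Explicitly load the pyarrow dataset of ETL record parquet files."""
        filesystem, path = fs.FileSystem.from_uri(self.location)
        self.dataset = ds.dataset(
            f"{path}/data/records", format="parquet", filesystem=filesystem
        )
        return self.dataset
```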
TIMX 533 - Load pyarrow dataset on TIMDEXDataset init
Why these changes are being introduced:

TIMDEXDatasetMetadata (TDM) has a DuckDB context for metadata
attachments and views.  TIMDEXDataset (TD) will need one for
DuckDB queries that return actual ETL data, not just metadata.

How this addresses that need:
* TD reuses the TDM.conn DuckDB connection and builds upon it
* TDM DuckDB connection builds metadata related views in a
"metadata" schema
* TD will build views in a "data" schema

Side effects of this change:
* All TDM metadata views are under a "metadata" schema

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
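
Minimal sketch of the shared-connection arrangement, assuming TDM owns the DuckDB connection and each class builds its views under its own schema (constructor shapes simplified for illustration):

```python
import duckdb


class TIMDEXDatasetMetadata:
    def __init__(self) -> None:
        self.conn = duckdb.connect()
        # metadata-related views live under a dedicated "metadata" schema
        self.conn.execute("CREATE SCHEMA IF NOT EXISTS metadata;")


class TIMDEXDataset:
    def __init__(self, metadata: TIMDEXDatasetMetadata) -> None:
        # reuse TDM's DuckDB connection rather than opening a second one
        self.conn = metadata.conn
        # data-related views will live under a separate "data" schema
        self.conn.execute('CREATE SCHEMA IF NOT EXISTS "data";')
```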
Why these changes are being introduced:

This commit is a culmination of work to elevate metadata about
ETL records to the point it can be used to improve the speed
and efficiency of data queries.

While the signatures of the read methods remain mostly the same,
they now expose a 'where' clause that accepts raw SQL to
filter the results, allowing for more advanced querying beyond
the simple key/value DatasetFilters.

Additionally, and equally important, data retrieval now comes
directly from DuckDB instead of lower-level pyarrow dataset reads.
Overall complexity remains about the same, but we have shifted focus
to DuckDB table and view preparation and SQL construction, which
also pays dividends in other contexts.

It is anticipated this will set us up well for other data we may
add to the TIMDEX dataset, e.g. vector embeddings or fulltext,
which we may want to query and retrieve.

How this addresses that need:

As before, all read methods eventually call
TIMDEXDataset.read_batches_iter(), which now performs a two-part
process: first quickly querying metadata records, then using that
information to prune the heavier data retrieval.

SQLAlchemy is used to model DuckDB tables and views so that
we can preserve the simpler key/value DatasetFilters, e.g.
source='libguides' or run_type='daily', which will likely
represent the majority of public API needs, by converting those
key/value pairs into a SQL WHERE clause programmatically.  This is
done without the need for complex string interpolation and
escaping.

The overall input and output signatures are largely the same, but
the underlying approach to querying the ETL parquet records now
utilizes DuckDB much more heavily, while also providing a SQL
'escape hatch' if the keyword filters don't suffice.

Side effects of this change:
* None!  Transmog and TIM can call TDA in the same way as before.
The underlying approach is different, but the signatures are
mostly the same.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
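
A sketch of converting the key/value DatasetFilters into a WHERE clause with SQLAlchemy; the table model, column names, and helper function below are assumptions (the duckdb_engine dialect added in a later commit would be used when executing against DuckDB):

```python
from sqlalchemy import Column, MetaData, String, Table, select

metadata_obj = MetaData()
current_records = Table(
    "current_records",
    metadata_obj,
    Column("timdex_record_id", String),
    Column("source", String),
    Column("run_type", String),
    schema="metadata",
)


def build_metadata_query(**filters: str):
    """Turn simple key/value filters into a SELECT with a programmatic WHERE."""
    stmt = select(current_records.c.timdex_record_id)
    for key, value in filters.items():
        stmt = stmt.where(current_records.c[key] == value)
    return stmt


# e.g. the filters mentioned above, with no manual string interpolation:
stmt = build_metadata_query(source="libguides", run_type="daily")
print(stmt.compile(compile_kwargs={"literal_binds": True}))
```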
Why these changes are being introduced:

During the refactor to use dataset metadata for querying,
we had to temporarily skip tests that tested for dataset
filtering and current records limiting.  With the SQL backed
querying now in place, these tests can be reinstated.

Note that a future commit will likely *add* a couple more
tests for the new, optional 'WHERE' clause functionality.

How this addresses that need:
* No tests are skipped.
* Dataset filtering tests still remain, but the key/value
filters are just handled differently under the hood.
* Tests for current records no longer use
.load(current_records=True) but instead utilize the
DuckDB table via table='current_records' within a
read method.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
…hods

TIMX 529 - read methods utilize SQL + metadata context
Why these changes are being introduced:

Until we update the build approach, possibly when we migrate to uv,
we need to add any dependencies to *both* Pipfile (for local dev)
and pyproject.toml (for building by other applications).

How this addresses that need:
* Adds duckdb_engine and sqlalchemy to pyproject.toml

Relevant ticket(s):
* None
Why these changes are being introduced:

It sounds like the best option for ECS tasks is using 'instance'
as the provider chain type, whereas for local dev and/or lambdas
it might be 'sso' or 'env'.  Not having 'instance' appears to cause
failures in the ECS task.

How this addresses that need:

By omitting the 'chain' option entirely from DuckDB secret creation
we allow the default provider chain to take effect.  Given our fairly
normal usage of DuckDB and S3, this is probably the best approach.

Side effects of this change:
* DuckDB to S3 connections work in ECS

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-540
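
For reference, a sketch of the secret creation with the CHAIN option omitted (the secret name is an assumption; the aws and httpfs extensions are assumed available):

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL aws; LOAD aws; INSTALL httpfs; LOAD httpfs;")
conn.execute(
    """
    CREATE OR REPLACE SECRET timdex_s3 (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN
        -- no CHAIN 'instance'/'sso'/'env': let the default provider chain apply
    );
    """
)
```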
Why these changes are being introduced:

In the AWS Lambda context, the HOME env var is empty string ''.
DuckDB has a canned error response for this, suggesting to,
"Specify a home directory using the SET
home_directory='/path/to/dir' option".

How this addresses that need:

If HOME is unset or empty string, set an explicit secret and extension
directory at `/tmp/.duckdb/*` locations.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-541
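
A minimal sketch of the workaround, using the `/tmp/.duckdb/*` locations from the commit message (the surrounding connection setup is an assumption):

```python
import os

import duckdb

conn = duckdb.connect()
if not os.environ.get("HOME"):
    # AWS Lambda: HOME may be unset or '', which trips DuckDB's defaults
    conn.execute("SET secret_directory='/tmp/.duckdb/secrets';")
    conn.execute("SET extension_directory='/tmp/.duckdb/extensions';")
```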
…ection

TIMX 540 - adjust AWS credential chain for ECS context
…tion-lambda-context

TIMX 541 - handle missing HOME env in AWS Lambda
Why these changes are being introduced:

The shift to elevating dataset metadata to an expected and required asset,
and utilizing DuckDB contexts for metadata and data retrieval, is a
sizable and breaking change.  This feels like it warrants a major
version bump.

How this addresses that need:
* Bumps internal library version to 3.0

Side effects of this change:
* `make update` commands from TIMDEX components will pick up this new version

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-537
@ghukill ghukill requested a review from a team August 15, 2025 13:57

@ehanson8 ehanson8 left a comment


Go forth and merge the behemoth

Why these changes are being introduced:
* The method for merging append deltas into the static metadata database file
needs the filenames of append deltas to easily identify which files
to delete (once merged). Prior to this change, the append deltas view
only had a 'filename' column, which referred to the path or S3 URI for
the TIMDEXDataset parquet file.

How this addresses that need:
* Set filename='append_delta_filename' when creating metadata.append_deltas view
* Explicitly select metadata column names when creating metadata.records view

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
Why these changes are being introduced:
* TDA requires a method to support regular merging of append
deltas into the static metadata.duckdb database file as new rows,
and then deleting the append deltas once merged.

How this addresses that need:
* Add 'merge_append_deltas' method
* Add unit test

Side effects of this change:
* It's worth noting that this method, when run, will delete the append
delta parquet files that existed in the directory at the time of
execution.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
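
Hedged sketch of what such a merge might involve, assuming local paths, a 'records' table in the static database, and matching column order; the actual method likely works against S3 via the utilities above:

```python
import glob
import os

import duckdb


def merge_append_deltas(metadata_db_path: str, append_deltas_glob: str) -> None:
    """Merge append delta rows into the static metadata DB, then delete them."""
    delta_files = sorted(glob.glob(append_deltas_glob))
    if not delta_files:
        return
    files_sql = ", ".join(f"'{path}'" for path in delta_files)
    conn = duckdb.connect(metadata_db_path)
    conn.execute(f"INSERT INTO records SELECT * FROM read_parquet([{files_sql}]);")
    conn.close()
    # only the deltas present at execution time are deleted (see side effects)
    for path in delta_files:
        os.remove(path)
```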
@coveralls

Pull Request Test Coverage Report for Build 16995910964

Details

  • 363 of 395 (91.9%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-3.0%) to 93.071%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| timdex_dataset_api/dataset.py | 94 | 97 | 96.91% |
| timdex_dataset_api/config.py | 3 | 8 | 37.5% |
| timdex_dataset_api/utils.py | 94 | 105 | 89.52% |
| timdex_dataset_api/metadata.py | 171 | 184 | 92.93% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| timdex_dataset_api/exceptions.py | 1 | 0.0% |

Totals Coverage Status

  • Change from base Build 16269635068: -3.0%
  • Covered Lines: 497
  • Relevant Lines: 534

💛 - Coveralls

@ghukill ghukill merged commit b7c3350 into main Aug 15, 2025
2 checks passed