TIMX 543 - Optimize current records metadata queries and data retrieval #168

Merged
ghukill merged 7 commits into main from TIMX-543-cr-optimize-v2
Aug 26, 2025
Conversation

ghukill (Contributor) commented Aug 25, 2025

Purpose and background context

This PR makes some improvements to metadata queries and data retrieval for "current" records, recalling that a "current" record is the most recent version of a timdex_record_id in the dataset.

While working on a Marimo notebook that was making quite a few queries against metadata.current_records, there was often an 8-10 second delay before results started streaming in. This was mostly due to how the TIMDEXDatasetMetadata class constructed DuckDB tables and views for current records.

The most impactful change is converting the DuckDB asset metadata.current_records from a view into a temporary table. This is implemented by a rewrite of TIMDEXDatasetMetadata._create_current_records_view().

By materializing the "current" records metadata to a temporary table -- which actively loads into memory, but will spill to disk if memory constraints are reached -- we get orders of magnitude faster current records queries.
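To make the idea concrete, here is a minimal, self-contained sketch of the view-to-temp-table pattern in DuckDB's Python API. The table and column names (records_metadata, run_timestamp) are illustrative assumptions, not the actual TIMDEXDatasetMetadata SQL; only the rn = 1 window pattern is visible in the diff excerpt later in this PR:

import duckdb

conn = duckdb.connect()  # in-memory database

# illustrative stand-in for the dataset's metadata rows, one row per
# version of a timdex_record_id (names here are assumptions)
conn.execute("""
    create table records_metadata as
    select * from (values
        ('rec-1', timestamp '2025-01-01', 'index'),
        ('rec-1', timestamp '2025-06-01', 'index'),
        ('rec-2', timestamp '2025-03-01', 'delete')
    ) as t(timdex_record_id, run_timestamp, action);
""")

# materialize the most recent version of each timdex_record_id as a
# temp table: DuckDB holds it in memory and spills to disk if needed
conn.execute("""
    create temp table current_records as
    select * from (
        select *,
               row_number() over (
                   partition by timdex_record_id
                   order by run_timestamp desc
               ) as rn
        from records_metadata
    ) where rn = 1;
""")

print(conn.query("select timdex_record_id, action from current_records").to_df())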

We may need to revisit this approach if we find ourselves with 25-50 million "current" records, but that is likely a long way off in the TIMDEX ecosystem.

NOTE: it feels worth mentioning one of the strengths of the current DuckDB context approach. This entire change was 100% DuckDB SQL. Essentially no moving parts in Python code or data retrieval were needed; this was just some refactoring of SQL tables, views, and approaches. While I'm confident this is an improvement to the current "current" records approach, it does not preclude future improvements and changes.

The remaining commits do a little cleanup of method naming and add a LIMIT clause when reading records, but they are pretty minor in comparison.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS TimdexManagers credentials in terminal and set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset_scratch/prod-clone

2- Open an IPython shell

pipenv run ipython

3- Load a TIMDEXDataset instance:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

Note that the initial load is a bit slower, but still quite reasonable for connecting to a remote data lake architecture. We may want to explore the ability to load a TIMDEXDataset instance without some of these read capabilities, e.g. when only performing writes (e.g. Transmogrifier), but that feels like an optimization for a later time.

4- Perform a metadata query that formerly took anywhere from 15-20 seconds:

df = td.conn.query("""select * from metadata.current_records where source='alma' and action='index';""").to_df()
print(len(df))
#  3960267

5- Retrieve data that relies on current records (can ctrl + c to kill early):

for _ in td.read_batches_iter(table='current_records', columns=['timdex_record_id','transformed_record'], source='alma', action='index'):
    pass
# ...
# DEBUG:timdex_dataset_api.dataset:Metadata query identified 3960267 rows, across 213 parquet files, elapsed: 4.23s
# ...
# ...

The thing to note here is that the metadata query is quick, about 4.23 seconds. With that complete, we can begin yielding data much sooner.

6- Lastly, the 'ole needle-in-a-haystack query to demonstrate the real benefit of this change:

df = td.read_dataframe(table='current_records', source='libguides', action='index')
# ...
# DEBUG:timdex_dataset_api.dataset:Metadata query identified 372 rows, across 64 parquet files, elapsed: 0.04s
# DEBUG:timdex_dataset_api.dataset:read_batches_iter batch 1, yielded: 372 @ 68 records/second, total yielded: 372
# DEBUG:timdex_dataset_api.dataset:read_batches_iter() elapsed: 5.48s

Two things to note here:

  • a very quick metadata query, around 0.04s
  • a very quick data retrieval, even across 64 distinct parquet files

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: faster current records querying and data retrieval!

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

The current_records view has always been a performance and resource
bottleneck.  Moving to metadata and SQL has helped, but there was a little
kink left in relation to using that metadata for data retrieval.

We often would materialize a query to a pandas dataframe to use
to drive data retrieval.  In that moment, we do not benefit from having
current_records be a view, when we're going to materialize the data
anyhow.

How this addresses that need:

Utilizing a DuckDB temp table, we take a small performance hit on
TIMDEXDatasetMetadata load, but then have near instant current_records
queries thereafter.

Additionally, we remove ordering in the metadata query for data
retrieval and perform this in-memory with the pandas dataframe.  Often
this may be quite small, but even if large, it's more efficient here
and already in python memory.

This will also set the stage for performing just-in-time metadata
queries as chunks before data retrieval, versus pulling all metadata
rows in one query and then chunking that in memory.

Side effects of this change:
* Quicker metadata queries, with a small performance hit on load.  Appears
similarly memory intensive.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

The former TIMDEXDatasetMetadata method name recreate_static_database_file()
was too narrowly focused.  This method is responsible for rebuilding the
entire dataset metadata structure.

How this addresses that need:
* Renames method
* Updates refresh() methods on both TIMDEXDataset and TIMDEXDatasetMetadata
to be more fully inclusive (see the sketch below)
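For illustration only, a hedged sketch (assuming the refresh() methods take no arguments):

# rebuild the entire dataset metadata structure, not just the static
# database file (the no-arg call is an assumption)
td.refresh()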

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

Sometimes it can be helpful to limit the results from a read method.

How this addresses that need:
Adds an optional limit= arg to all read methods, which is passed along
to the metadata query.  By limiting the metadata results, we limit the
data records retrieved (see the example below).
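For example, a minimal sketch reusing the td instance from the review steps above (the limit value is arbitrary):

# cap the metadata query at 100 rows, which in turn caps the data
# records retrieved
df = td.read_dataframe(
    table="current_records",
    source="alma",
    action="index",
    limit=100,
)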

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
ghukill marked this pull request as ready for review August 25, 2025 14:00
ghukill requested a review from a team August 25, 2025 14:00
- meta_query = self.metadata.build_meta_query(table, where, **filters)
+ meta_query = self.metadata.build_meta_query(table, limit, where, **filters)
  meta_df = self.metadata.conn.query(meta_query).to_df()
  meta_df = meta_df.sort_values(by=["filename", "run_record_offset"])
ghukill (Contributor, Author) commented Aug 25, 2025

This was another fairly big improvement, and an interesting one.

Instead of sorting the metadata results in DuckDB via ORDER BY, we complete the query unsorted but then do that here in python + pandas.

I anticipate a time in the not-too-distant future where we take another pass at this code to perform our metadata queries in batches. If we do that, maybe using LIMIT + OFFSET, then we might need to move the sort back into DuckDB via an ORDER BY.

But for now, this is a clear performance gain with minimal change, just changing where we do the sort.

where rn = 1;

-- create view in metadata schema
create or replace view metadata.current_records as
ghukill (Contributor, Author)

While we do now create a temporary table temp.main.current_records, we still maintain this metadata.current_records view as our common entry point.
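A minimal sketch of that layering in an in-memory DuckDB (illustrative only; the actual SQL in TIMDEXDatasetMetadata may differ):

import duckdb

conn = duckdb.connect()
conn.execute("create schema metadata;")

# stand-in for the materialized current records temp table
conn.execute("""
    create temp table current_records as
    select 'rec-1' as timdex_record_id;
""")

# the temp table lives in the temp.main catalog; the view keeps
# metadata.current_records as the stable entry point
conn.execute("""
    create or replace view metadata.current_records as
    select * from temp.main.current_records;
""")

print(conn.query("select * from metadata.current_records").to_df())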

jonavellecuerdo self-assigned this Aug 25, 2025

Why these changes are being introduced:

The ordering of metadata records by filename + run_record_offset was moved
into the python pandas context for a performance boost, but it was not ideal
from the POV of keeping the majority of our logic in SQL.

Upon learning that we could use `hash(filename)` to still order the filenames
but with a dramatic speed and memory improvement, it makes sense to move this
back into the SQL context.

How this addresses that need:
* Moves metadata query ordering back to SQL instead of python pandas context

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
# NOTE: we use a hash of the filename for ordering for a dramatic speedup, where
# we don't really care about the exact order, just that they are ordered
stmt = stmt.order_by(
    func.hash(sa_table.c.filename),
ghukill (Contributor, Author)

Updating the ordering clause to use the SQL hash(filename) dramatically speeds it up. We don't actually care about the true order of the filename column, we just want the final result ordered in some consistent way.
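A toy example of the idea in plain DuckDB SQL (table and values are made up; the run_record_offset tiebreak is an assumption based on the pandas sort shown earlier):

import duckdb

conn = duckdb.connect()
conn.execute("""
    create table t as
    select * from (values
        ('s3://bucket/b.parquet', 1),
        ('s3://bucket/a.parquet', 2),
        ('s3://bucket/a.parquet', 1)
    ) as v(filename, run_record_offset);
""")

# hash(filename) clusters rows from the same file together; comparing
# integers is much cheaper than ordering long filename strings
print(conn.query("""
    select * from t
    order by hash(filename), run_record_offset;
""").to_df())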

ghukill (Contributor, Author) commented Aug 26, 2025

Just a heads up that I think TIMDEXDataset.read_batches_iter() and TIMDEXDatasetMetadata.build_meta_query() may both get a fairly big update in a coming PR.

While this change is good and an improvement in this context -- and overall, pretty minor -- it's probably not worth too much attention given its short lifespan. FWIW, we just continue to lean into this hash(...) approach more.

As a spoiler: the upcoming change will move away from reading all metadata results into memory as a dataframe before the data pull, and instead move into a pattern like the following (see the sketch after this list):

  1. perform a metadata query chunk
  2. use chunk for a data chunk pull
  3. repeat...
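Purely illustrative, that loop might look something like the sketch below. LIMIT/OFFSET paging is just one possible mechanism (per the earlier review comment, it would require a stable ORDER BY in SQL), and the chunk size is arbitrary:

# hypothetical chunked metadata-driven retrieval; assumes a loaded
# TIMDEXDataset `td` as in the review steps above
chunk_size = 100_000
offset = 0
while True:
    meta_df = td.conn.query(
        f"""
        select * from metadata.current_records
        order by hash(filename), run_record_offset
        limit {chunk_size} offset {offset};
        """
    ).to_df()
    if meta_df.empty:
        break
    # ... use this metadata chunk to drive a data chunk pull ...
    offset += chunk_size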

@coveralls

coveralls commented Aug 25, 2025

Pull Request Test Coverage Report for Build 17241896338

Details

  • 15 of 16 (93.75%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.08%) to 92.989%

Changes Missing Coverage:

  File                           Covered Lines  Changed/Added Lines  %
  timdex_dataset_api/dataset.py  4              5                    80.0%

Totals:

  Change from base Build 16996041048: -0.08%
  Covered Lines: 504
  Relevant Lines: 542

💛 - Coveralls

ehanson8 self-assigned this Aug 26, 2025
ehanson8 left a comment

Fast is the word of this Tuesday morning! Works as expected; I also ran the same ipython commands to confirm the significant performance increase. Excellent update!

jonavellecuerdo (Contributor) left a comment

Looking good! Just had a few clarifying questions. 🤔

ghukill merged commit 7337f31 into main Aug 26, 2025
2 checks passed