TIMX 543 - Optimize current records metadata queries and data retrieval #168

Merged
ghukill merged 7 commits into main from TIMX-543-cr-optimize-v2
Aug 26, 2025
Conversation

ghukill (Contributor) commented Aug 25, 2025

Purpose and background context

This PR makes some improvements to metadata queries and data retrieval for "current" records, recalling that a "current" record is the most recent version of a timdex_record_id in the dataset.

While working on a Marimo notebook that was making quite a few queries against metadata.current_records, there was often an 8-10 second delay before results started streaming in. This was mostly due to how the TIMDEXDatasetMetadata class constructed DuckDB tables and views for current records.

The most impactful change is converting the DuckDB asset metadata.current_records from a view into a temporary table. This is implemented by a rewrite of TIMDEXDatasetMetadata._create_current_records_view().

By materializing the "current" records metadata to a temporary table -- which actively loads into memory, but will spill to disk if memory constraints are reached -- we get orders of magnitude faster current records queries.
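To make the idea concrete, here is a minimal, self-contained sketch of the view-to-temp-table pattern in DuckDB's Python API. The table and column names (records_metadata, run_timestamp) are illustrative assumptions, not the actual TIMDEXDatasetMetadata SQL; only the rn = 1 window pattern is visible in the diff excerpt later in this PR:

import duckdb

conn = duckdb.connect()  # in-memory database

# illustrative stand-in for the dataset's metadata rows, one row per
# version of a timdex_record_id (names here are assumptions)
conn.execute("""
    create table records_metadata as
    select * from (values
        ('rec-1', timestamp '2025-01-01', 'index'),
        ('rec-1', timestamp '2025-06-01', 'index'),
        ('rec-2', timestamp '2025-03-01', 'delete')
    ) as t(timdex_record_id, run_timestamp, action);
""")

# materialize the most recent version of each timdex_record_id as a
# temp table: DuckDB holds it in memory and spills to disk if needed
conn.execute("""
    create temp table current_records as
    select * from (
        select *,
               row_number() over (
                   partition by timdex_record_id
                   order by run_timestamp desc
               ) as rn
        from records_metadata
    ) where rn = 1;
""")

print(conn.query("select timdex_record_id, action from current_records").to_df())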

We may need to revisit this approach if we find ourselves with 25-50 million "current" records, but that is likely a long way off in the TIMDEX ecosystem.

NOTE: it feels worth mentioning one of the strengths of the current DuckDB context approach. This entire change was 100% DuckDB SQL. Essentially no moving parts in Python code or data retrieval were needed; this was just some refactoring of SQL tables, views, and approaches. While I'm confident this is an improvement to the current "current" records approach, it does not preclude future improvements and changes.

The remaining commits do a little cleanup of method naming and add a LIMIT clause when reading records, but they are pretty minor in comparison.

How can a reviewer manually see the effects of these changes?

1- Set Dev1 AWS TimdexManagers credentials in terminal and set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset_scratch/prod-clone

2- Open an IPython shell

pipenv run ipython

3- Load a TIMDEXDataset instance:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

Note that the initial load is a bit slower, but still quite reasonable for connecting to a remote data lake architecture. We may want to explore the ability to load a TIMDEXDataset instance without some of these read capabilities, e.g. when only performing writes (e.g. Transmogrifier), but that feels like an optimization for a later time.

4- Perform a metadata query that formerly took anywhere from 15-20 seconds:

df = td.conn.query("""select * from metadata.current_records where source='alma' and action='index';""").to_df()
print(len(df))
#  3960267

5- Retrieve data that relies on current records (can ctrl + c to kill early):

for _ in td.read_batches_iter(table='current_records', columns=['timdex_record_id','transformed_record'], source='alma', action='index'):
    pass
# ...
# DEBUG:timdex_dataset_api.dataset:Metadata query identified 3960267 rows, across 213 parquet files, elapsed: 4.23s
# ...
# ...

The thing to note here is that the metadata query is quick, about 4.23 seconds. With that complete, we can begin yielding data much sooner.

6- Lastly, the 'ole needle-in-a-haystack query to demonstrate the real benefit of this change:

df = td.read_dataframe(table='current_records', source='libguides', action='index')
# ...
# DEBUG:timdex_dataset_api.dataset:Metadata query identified 372 rows, across 64 parquet files, elapsed: 0.04s
# DEBUG:timdex_dataset_api.dataset:read_batches_iter batch 1, yielded: 372 @ 68 records/second, total yielded: 372
# DEBUG:timdex_dataset_api.dataset:read_batches_iter() elapsed: 5.48s

Two things to note here:

  • a very quick metadata query, around 0.04s
  • a very quick data retrieval, even across 64 distinct parquet files

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: faster current records querying and data retrieval!

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

The current_records view has always been a performance and resource
bottleneck.  Moving to metadata and SQL has helped, but there was a little
kink left in relation to using that metadata for data retrieval.

We often would materialize a query to a pandas dataframe to use
to drive data retrieval.  In that moment, we do not benefit from having
current_records be a view, when we're going to materialize the data
anyhow.

How this addresses that need:

Utilizing a DuckDB temp table, we take a small performance hit on
TIMDEXDatasetMetadata load, but then have near instant current_records
queries thereafter.

Additionally, we remove ordering in the metadata query for data
retrieval and perform this in-memory with the pandas dataframe.  Often
this may be quite small, but even if large, it's more efficient here
and already in python memory.

This will also set the stage for performing just-in-time metadata
queries as chunks before data retrieval, versus pulling all metadata
rows in one query and then chunking that in memory.

Side effects of this change:
* Quicker metadata queries, with a small performance hit on load.  Appears
similarly memory intensive.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

The former TIMDEXDatasetMetadata method name recreate_static_database_file()
was too narrowly focused.  This method is responsible for rebuilding the
entire dataset metadata structure.

How this addresses that need:
* Renames method
* Updates refresh() methods on both TIMDEXDataset and TIMDEXDatasetMetadata
to be more fully inclusive (see the sketch below)
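For illustration only, a hedged sketch (assuming the refresh() methods take no arguments):

# rebuild the entire dataset metadata structure, not just the static
# database file (the no-arg call is an assumption)
td.refresh()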

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543

Why these changes are being introduced:

Sometimes it can be helpful to limit the results from a read method.

How this addresses that need:
Adds an optional limit= arg to all read methods, which is passed along
to the metadata query.  By limiting the metadata results, we limit the
data records retrieved (see the example below).
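For example, a minimal sketch reusing the td instance from the review steps above (the limit value is arbitrary):

# cap the metadata query at 100 rows, which in turn caps the data
# records retrieved
df = td.read_dataframe(
    table="current_records",
    source="alma",
    action="index",
    limit=100,
)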

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
ghukill marked this pull request as ready for review August 25, 2025 14:00
ghukill requested a review from a team August 25, 2025 14:00
- meta_query = self.metadata.build_meta_query(table, where, **filters)
+ meta_query = self.metadata.build_meta_query(table, limit, where, **filters)
  meta_df = self.metadata.conn.query(meta_query).to_df()
  meta_df = meta_df.sort_values(by=["filename", "run_record_offset"])
ghukill (Contributor, Author) commented Aug 25, 2025

This was another fairly big improvement, and an interesting one.

Instead of sorting the metadata results in DuckDB via ORDER BY, we complete the query unsorted but then do that here in python + pandas.

I anticipate a time in the not-too-distant future where we take another pass at this code to perform our metadata queries in batches. If we do that, maybe using LIMIT + OFFSET, then we might need to move the sort back into DuckDB via an ORDER BY.

But for now, this is a clear performance gain with minimal change, just changing where we do the sort.

where rn = 1;

-- create view in metadata schema
create or replace view metadata.current_records as
ghukill (Contributor, Author)

While we do now create a temporary table temp.main.current_records, we still maintain this metadata.current_records view as our common entry point.
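A minimal sketch of that layering in an in-memory DuckDB (illustrative only; the actual SQL in TIMDEXDatasetMetadata may differ):

import duckdb

conn = duckdb.connect()
conn.execute("create schema metadata;")

# stand-in for the materialized current records temp table
conn.execute("""
    create temp table current_records as
    select 'rec-1' as timdex_record_id;
""")

# the temp table lives in the temp.main catalog; the view keeps
# metadata.current_records as the stable entry point
conn.execute("""
    create or replace view metadata.current_records as
    select * from temp.main.current_records;
""")

print(conn.query("select * from metadata.current_records").to_df())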

jonavellecuerdo self-assigned this Aug 25, 2025

Why these changes are being introduced:

The ordering of metadata records by filename + run_record_offset was moved
into the python pandas context for a performance boost, but it was not ideal
from the POV of keeping the majority of our logic in SQL.

Upon learning that we could use `hash(filename)` to still order the filenames
but with a dramatic speed and memory improvement, it makes sense to move this
back into the SQL context.

How this addresses that need:
* Moves metadata query ordering back to SQL instead of python pandas context

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
# NOTE: we use a hash of the filename for ordering for a dramatic speedup, where
# we don't really care about the exact order, just that they are ordered
stmt = stmt.order_by(
    func.hash(sa_table.c.filename),
ghukill (Contributor, Author)

Updating the ordering clause to use the SQL hash(filename) dramatically speeds it up. We don't actually care about the true order of the filename column, we just want the final result ordered in some consistent way.
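A toy example of the idea in plain DuckDB SQL (table and values are made up; the run_record_offset tiebreak is an assumption based on the pandas sort shown earlier):

import duckdb

conn = duckdb.connect()
conn.execute("""
    create table t as
    select * from (values
        ('s3://bucket/b.parquet', 1),
        ('s3://bucket/a.parquet', 2),
        ('s3://bucket/a.parquet', 1)
    ) as v(filename, run_record_offset);
""")

# hash(filename) clusters rows from the same file together; comparing
# integers is much cheaper than ordering long filename strings
print(conn.query("""
    select * from t
    order by hash(filename), run_record_offset;
""").to_df())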

ghukill (Contributor, Author) commented Aug 26, 2025

Just a heads up that I think TIMDEXDataset.read_batches_iter() and TIMDEXDatasetMetadata.build_meta_query() may both get a fairly big update in a coming PR.

While this change is good and an improvement in this context -- and overall, pretty minor -- it's probably not worth too much attention given its short lifespan. FWIW, we just continue to lean into this hash(...) approach more.

As a spoiler: the upcoming change will move away from reading all metadata results into memory as a dataframe before the data pull, and instead move into a pattern like the following (see the sketch after this list):

  1. perform a metadata query chunk
  2. use chunk for a data chunk pull
  3. repeat...
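Purely illustrative, that loop might look something like the sketch below. LIMIT/OFFSET paging is just one possible mechanism (per the earlier review comment, it would require a stable ORDER BY in SQL), and the chunk size is arbitrary:

# hypothetical chunked metadata-driven retrieval; assumes a loaded
# TIMDEXDataset `td` as in the review steps above
chunk_size = 100_000
offset = 0
while True:
    meta_df = td.conn.query(
        f"""
        select * from metadata.current_records
        order by hash(filename), run_record_offset
        limit {chunk_size} offset {offset};
        """
    ).to_df()
    if meta_df.empty:
        break
    # ... use this metadata chunk to drive a data chunk pull ...
    offset += chunk_size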

@coveralls

coveralls commented Aug 25, 2025

Pull Request Test Coverage Report for Build 17241896338

Details

  • 15 of 16 (93.75%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.08%) to 92.989%

Changes Missing Coverage:

  File                           Covered Lines  Changed/Added Lines  %
  timdex_dataset_api/dataset.py  4              5                    80.0%

Totals:

  Change from base Build 16996041048: -0.08%
  Covered Lines: 504
  Relevant Lines: 542

💛 - Coveralls

ehanson8 self-assigned this Aug 26, 2025
ehanson8 left a comment

Fast is the word of this Tuesday morning! Works as expected; I also ran the same ipython commands to confirm the significant performance increase. Excellent update!

jonavellecuerdo (Contributor) left a comment

Looking good! Just had a few clarifying questions. 🤔

ghukill merged commit 7337f31 into main Aug 26, 2025
2 checks passed