TIMX 543 - keyset pagination for read methods#169
Merged
Conversation
Why these changes are being introduced:

For all read methods, the former approach was to perform a metadata query and store the entire results in memory, then loop through chunks of that metadata and build SQL queries to perform data retrieval. Even for metadata queries that may bring back 3-4 million results, this worked, but there is an upper limit. Ideally, we would perform all of our queries -- metadata and data -- in chunks to ease memory pressure. And in some cases, this can increase performance.

How this addresses that need:

This reworks the base `read_batches_iter()` method to perform smaller, chunked metadata queries. To paginate the results, instead of using the slow `LIMIT` / `OFFSET` approach, we use keyset pagination, which means we can look for values greater than a tuple of ordered values. This is often the preferred way to perform paginated querying when you have nicely ordered columns. In support of this, we also begin hashing the `filename` and `run_id` columns for ordering, providing almost an order of magnitude speedup. The performance penalty for creating the hash is offset by the speedup of ordering integers versus very long strings. The net effect is no changes to the input/output signatures of the read methods, but improved memory usage and performance.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
Pull Request Test Coverage Report for Build 17333521896
💛 - Coveralls
ghukill commented Sep 2, 2025
Comment on lines +493 to +499
```python
# update keyset value using the last row from this chunk
last_row = meta_chunk_df.iloc[-1]
keyset_value = (
    int(last_row.filename_hash),
    int(last_row.run_id_hash),
    int(last_row.run_record_offset),
)
```
This is kind of the key to the new keyset pagination.
We have ordered the columns filename_hash, run_id_hash, run_record_offset for the batch, so by taking the last row from the batch, we can perform our next query where (tuple of values) > (last batch tuple of values).
Because we convert the batch results to a pandas dataframe, getting this last row from the batch is trivial.
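To make the mechanics concrete, here is a minimal sketch of how the keyset tuple could drive the next metadata query; the table name and exact SQL are illustrative assumptions, not the PR's actual query text.

```python
# Sketch only: illustrates keyset pagination via a row-value comparison, not the
# PR's exact implementation. DuckDB supports comparisons like
# (a, b, c) > (1, 2, 3), so each chunk picks up exactly where the last one ended.
def build_meta_chunk_sql(
    keyset_value: tuple[int, int, int] | None, chunk_size: int
) -> str:
    keyset_clause = ""
    if keyset_value is not None:
        keyset_clause = (
            "where (filename_hash, run_id_hash, run_record_offset) "
            f"> {keyset_value}"
        )
    return f"""
        select filename, filename_hash, run_id, run_id_hash, run_record_offset
        from records  -- placeholder table name
        {keyset_clause}
        order by filename_hash, run_id_hash, run_record_offset
        limit {chunk_size}
    """
```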
ghukill commented Sep 2, 2025
Comment on lines +513 to +531
```python
# build list of explicit parquet files to read from
filenames = list(meta_chunk_df["filename"].unique())
if self.location_scheme == "s3":
    filenames = [
        f"s3://{f.removeprefix('s3://')}" for f in filenames  # type: ignore[union-attr]
    ]
parquet_list_sql = "[" + ",".join(f"'{f}'" for f in filenames) + "]"

# build run_record_offset WHERE clause to leverage row group pruning
rro_values = meta_chunk_df["run_record_offset"].unique()
rro_values.sort()
if len(rro_values) <= 1_000:  # noqa: PLR2004
    rro_clause = (
        f"and run_record_offset in ({','.join(str(rro) for rro in rro_values)})"
    )
else:
    rro_clause = (
        f"and run_record_offset between {rro_values[0]} and {rro_values[-1]}"
    )
```
Very little meaningful change here. This is primarily code from sub-methods that were removed, where the indirection of jumping between them felt worse than just having the code here.
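For context on how these pieces might come together downstream, a rough sketch follows; the placeholder paths, the `where true` scaffold, and the final query text are illustrative and not the PR's actual data query.

```python
# Placeholder values standing in for what the block above builds.
parquet_list_sql = (
    "['s3://bucket/dataset/file-1.parquet','s3://bucket/dataset/file-2.parquet']"
)
rro_clause = "and run_record_offset between 0 and 999"

# Sketch of a data query (not the PR's actual SQL): the explicit file list lets
# DuckDB skip parquet files this chunk doesn't touch, and the run_record_offset
# clause lets it prune row groups within the files it does read.
data_sql = f"""
    select *
    from read_parquet({parquet_list_sql})
    where true
    {rro_clause}
"""
print(data_sql)
```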
Purpose and background context
For all read methods, the former approach was to perform a metadata query and store the entire results in memory, then loop through chunks of that metadata and build SQL queries to perform data retrieval. Even for metadata queries that may bring back 3-4 million results, this worked, but there is an upper limit.
Ideally, we would perform all of our queries -- metadata and data -- in chunks to ease memory pressure. And in some cases, this can increase performance.
This reworks the base
read_batches_iter()method to perform smaller, chunked metadata queries. To paginate the results, instead of using the slowLIMIT/OFFSETapproach, we use keyset pagination, which means we can look for values greater than a tuple of values that are ordered. This is often the preferred way to perform paginated querying when you have nicely ordered columns.In support of this, we ask the method
_iter_meta_chunks()to do substantially more work, but less work elsewhere.In support of this, we also begin hashing the
filenameandrun_idcolumns for ordering, providing almost an order magnitude speedup. The performance penalty for creating the hash is offset by the speedup of ordering integers versus very long strings.The net effect is no changes to the input/ouput signatures of the read methods, but improved memory usage and performance.
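As a rough illustration of the hashing idea (not the PR's actual implementation, which may hash differently), DuckDB's built-in hash() maps the long string columns to fixed-width integers that sort and compare cheaply, which is what makes the ordered keyset triple fast to paginate over. The table and toy data below are fabricated for the example.

```python
import duckdb

con = duckdb.connect()

# Toy metadata standing in for the real dataset; paths and run ids are fabricated.
con.execute(
    """
    create table records as
    select
        'long/path/to/records-' || i::varchar || '.parquet' as filename,
        'run-' || (i % 5)::varchar as run_id,
        i as run_record_offset
    from range(10) t(i)
    """
)

# hash() maps long strings to fixed-width integers; ordering on the integer
# triple is what the keyset tuple comparison paginates over.
first_page = con.execute(
    """
    select
        filename,
        run_id,
        run_record_offset,
        hash(filename) as filename_hash,
        hash(run_id) as run_id_hash
    from records
    order by filename_hash, run_id_hash, run_record_offset
    limit 5
    """
).df()
print(first_page)
```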
How can a reviewer manually see the effects of these changes?
The following will perform some individual record retrieval and bulk record retrieval, demonstrating that read operations are essentially as quick, and sometimes quicker, than the previous single metadata pull. It's difficult to compare directly in this PR without branch hopping, but lots of testing has been done to confirm it's as quick but much more memory efficient.
1- Set Dev1 AWS `TimdexManagers` credentials in terminal and set env vars:

2- Open an IPython shell

3- Load a `TIMDEXDataset` instance:

4- Perform a "needle in a haystack" query that finds all versions of three records with lots of versions:

Note that the logging has changed slightly, given what we know and when we know it, but this particular query is retrieving 114 records from 114 parquet files! That is good and fast projection.

5- Perform some deep pagination, getting 500k `timdex_record_id`'s for current `alma` records:

6- Increase the metadata <--> data join size and repeat the last deep pagination:
Note the speed increase! There is also a memory increase, but it's not substantial. Roughly speaking, between the "join" size of metadata <--> data and the "read chunk" size (how many records we stream back from DuckDB), we have quite a bit of control over the data flow, which will serve us well going forward. Current default configurations are slightly conservative, but more than sufficient for our purposes. A hypothetical sketch of this kind of chunked read follows below.
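The sketch below only illustrates the shape of the chunked reads exercised in the steps above; the import path, constructor argument, and filter keyword are assumptions and may not match the real timdex-dataset-api signatures. Only `read_batches_iter()` is named in this PR.

```python
# Hypothetical usage sketch -- the names below are assumptions, not the
# library's confirmed API: import path, constructor argument, and filter
# kwargs may differ in timdex-dataset-api.
from timdex_dataset_api import TIMDEXDataset  # assumed import path

td = TIMDEXDataset("s3://example-bucket/dataset")  # placeholder dataset location

total_records = 0
for batch in td.read_batches_iter(source="alma"):  # assumed filter kwarg
    # batches are streamed back from DuckDB in chunks rather than materialized
    # all at once; num_rows assumes a pyarrow RecordBatch
    total_records += batch.num_rows
print(f"retrieved {total_records} records")
```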
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?