TIMX 543 - Optimize current records metadata queries and data retrieval #168
Conversation
Why these changes are being introduced: The current_records view has always been a performance and resource bottleneck. Moving to metadata and SQL has helped, but a little kink remained in how that metadata is used for data retrieval. We often materialize a query to a pandas dataframe to drive data retrieval, and in that moment we do not benefit from current_records being a view, since we're going to materialize the data anyhow.

How this addresses that need: Utilizing a DuckDB temp table, we take a small performance hit on TIMDEXDatasetMetadata load, but then have near instant current_records queries thereafter. Additionally, we remove ordering from the metadata query for data retrieval and perform it in-memory with the pandas dataframe. Often this data may be quite small, but even when large, sorting is more efficient here, already in python memory. This also sets the stage for performing just-in-time metadata queries as chunks before data retrieval, versus pulling all metadata rows in one query and then chunking that in memory.

Side effects of this change:
* Quicker metadata queries, small performance hit on load. Appears similarly memory intensive.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
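As an illustration of the temp table approach, here is a minimal sketch using `duckdb` directly. The table and column names are stand-ins, not the actual TIMDEX schema; only the pattern is what matters:

```python
import duckdb

conn = duckdb.connect()

# Hypothetical stand-in for the real metadata table(s); columns here are
# illustrative, not the actual TIMDEX schema.
conn.execute(
    """
    create table records (
        timdex_record_id varchar,
        run_date date,
        filename varchar,
        run_record_offset integer
    );
    """
)

# Materialize "current" records once at load time into a DuckDB temp table:
# a small one-time cost, then near instant current_records queries after.
# Temp tables are held in memory but spill to disk under memory pressure.
conn.execute(
    """
    create or replace temp table current_records as
    select * from (
        select
            *,
            row_number() over (
                partition by timdex_record_id
                order by run_date desc
            ) as rn
        from records
    )
    where rn = 1;
    """
)
```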
Why these changes are being introduced: The former TIMDEXDatasetMetadata method name recreate_static_database_file() was too narrowly focused. This method is responsible for rebuilding the entire dataset metadata structure.

How this addresses that need:
* Renames method
* Updates refresh() methods on both TIMDEXDataset and TIMDEXDatasetMetadata to be more fully inclusive

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
Why these changes are being introduced: Sometimes it can be helpful to limit the results from a read method.

How this addresses that need: Adds an optional limit= arg to all read methods, which is passed along to the metadata query. By limiting the metadata results, we limit the data records retrieved.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
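For example, given a loaded `TIMDEXDataset` instance `td` (hypothetical usage; `read_batches_iter` and `limit=` come from this PR, the rest is illustrative):

```python
# limit=100 caps the metadata query results, which in turn caps the
# number of data records retrieved
for batch in td.read_batches_iter(limit=100):
    print(batch.num_rows)  # assuming pyarrow RecordBatches are yielded
```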
timdex_dataset_api/dataset.py
Outdated
```diff
- meta_query = self.metadata.build_meta_query(table, where, **filters)
+ meta_query = self.metadata.build_meta_query(table, limit, where, **filters)
  meta_df = self.metadata.conn.query(meta_query).to_df()
+ meta_df = meta_df.sort_values(by=["filename", "run_record_offset"])
```
This was another fairly big improvement, and an interesting one.
Instead of sorting the metadata results in DuckDB via ORDER BY, we run the query unsorted and then sort here in python + pandas.
I anticipate a time in the not-too-distant future where we take another pass at this code to perform our metadata queries in batches. If we do that, maybe using LIMIT + OFFSET, then we might need to move the sort back into DuckDB via an ORDER BY.
But for now, this is a clear performance gain from a minimal change: we just change where we do the sort.
```sql
where rn = 1;

-- create view in metadata schema
create or replace view metadata.current_records as
```
While we do now create a temporary table temp.main.current_records, we still maintain this metadata.current_records view as our common entrypoint of use.
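In effect, the wiring is (a simplified sketch, reusing the hypothetical `conn` from the example above):

```python
# metadata.current_records stays the common entrypoint: a thin view over the
# materialized temp table, so downstream queries are unchanged
conn.execute("create schema if not exists metadata;")
conn.execute(
    """
    create or replace view metadata.current_records as
    select * from temp.main.current_records;
    """
)
```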
Why these changes are being introduced: The ordering of metadata records by filename + run_record_offset was moved into the python pandas context for a performance boost, but it was not ideal from the POV of keeping the majority of our logic in SQL. Upon learning that we could use `hash(filename)` to still order the filenames but with a dramatic speed and memory improvement, it makes sense to move this back into the SQL context.

How this addresses that need:
* Moves metadata query ordering back to SQL instead of python pandas context

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-543
```python
# NOTE: we use a hash of the filename for ordering for a dramatic speedup, where
# we don't really care about the exact order, just that they are ordered
stmt = stmt.order_by(
    func.hash(sa_table.c.filename),
```
This update to the ordering to use the SQL hash(filename) dramatically speeds up this ordering clause. We don't actually care about the true order of the filename column; we just want the rows ordered in some consistent way in the final result.
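In raw SQL terms, the ordering is roughly (a sketch; the exact order keys in the full statement may differ):

```python
# hash(filename) is a DuckDB scalar function; ordering by the hash keeps rows
# from the same file contiguous without a full lexicographic string sort
conn.execute(
    """
    select * from metadata.current_records
    order by hash(filename), run_record_offset;
    """
)
```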
Just a heads up that I think TIMDEXDataset.read_batches_iter() and TIMDEXDatasetMetadata.build_meta_query() may both get a fairly big update in a coming PR.
While this change is good and an improvement in this context -- and overall, pretty minor -- it's probably not worth too much attention given its short lifespan. FWIW, we just continue to lean into this hash(...) approach more.
As a spoiler: the upcoming change will move away from reading all metadata results into memory as a dataframe before the data pull, and instead move into a pattern like the following (a rough sketch appears after this list):
- perform a metadata query chunk
- use that chunk to drive a data pull
- repeat...
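A rough sketch of that chunked pattern (illustrative only; the function, chunk size, and LIMIT/OFFSET mechanics are assumptions, not this PR's implementation):

```python
def iter_metadata_chunks(md, chunk_size=100_000):
    """Yield metadata dataframes chunk-by-chunk, rather than materializing
    all metadata rows up front; each chunk then drives one data pull."""
    offset = 0
    while True:
        # just-in-time metadata query for a single chunk; a stable ORDER BY
        # is needed so LIMIT/OFFSET windows don't shift between queries
        chunk_df = md.conn.query(
            f"""
            select * from metadata.current_records
            order by hash(filename), run_record_offset
            limit {chunk_size} offset {offset};
            """
        ).to_df()
        if chunk_df.empty:
            return
        yield chunk_df  # caller uses this chunk to retrieve the actual records
        offset += chunk_size
```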
Pull Request Test Coverage Report for Build 17241896338

💛 - Coveralls
ehanson8 left a comment:
Fast is the word of this Tuesday morning! Works as expected; I also ran the same ipython commands to confirm the significant performance increase. Excellent update!
jonavellecuerdo left a comment:
Looking good! Just had a few clarifying questions. 🤔
Purpose and background context
This PR makes some improvements to metadata queries and data retrieval for "current" records, recalling that a "current" record is the most recent version of a `timdex_record_id` in the dataset.

While working on a Marimo notebook that was making quite a few queries against `metadata.current_records`, there was often an 8-10 second delay before results started streaming in. This was mostly due to how the `TIMDEXDatasetMetadata` class constructed DuckDB tables and views for current records.

The most impactful change is the decision to change the DuckDB asset `metadata.current_records` from a view into a temporary table. This is implemented by a rewrite of `TIMDEXDatasetMetadata._create_current_records_view()`. By materializing the "current" records metadata to a temporary table -- which actively loads into memory, but will spill to disk if memory constraints are reached -- we get orders of magnitude faster current records queries.

We may need to revisit this approach if we find ourselves with 25-50 million "current" records, but that is likely a long ways off in the TIMDEX ecosystem.

NOTE: it feels worth mentioning one of the strengths of the current DuckDB context approach. This entire change was 100% DuckDB SQL. There are essentially no moving parts in python code or data retrieval needed; this was just some refactoring of SQL tables, views, and approaches. While I'm confident this is an improvement to the current "current" records approach, it does not prohibit us from future improvements and changes.

The remaining commits do a little cleanup of method naming, and add a `LIMIT` clause when reading records, but they are pretty minor in comparison.

How can a reviewer manually see the effects of these changes?
1- Set Dev1 AWS `TimdexManagers` credentials in terminal and set env vars:

2- Open an Ipython shell

3- Load a `TIMDEXDataset` instance:

Note that the initial load is a bit slower, but still quite reasonable for connecting to a remote data lake architecture. We may want to explore the ability to load a `TIMDEXDataset` instance without some of these read capabilities, e.g. when only performing writes (e.g. Transmogrifier), but that feels like an optimization for a later time.
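A hypothetical shape for step 3 (the original shell snippet is not reproduced here; the constructor arguments and `load()` call are assumptions, not verified API):

```python
from timdex_dataset_api import TIMDEXDataset

# hypothetical: point at the remote data lake; actual location elided
td = TIMDEXDataset("s3://<dataset-location>")
td.load()  # assumption: an explicit load step builds the DuckDB metadata context
```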
TIMDEXDatasetinstance without some of these read capabilities, e.g. when only performing writes (e.g. Transmogrifier), but that feels like an optimiziation for a later time.4- Perform a metadatda query that formerly took anywhere from 15-20 seconds:
5- Retrieve data that relies on current records (can `ctrl + c` to kill early):
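Something along these lines (hypothetical usage; `read_batches_iter` is named in this PR, the rest is illustrative):

```python
# stream record batches driven by current records metadata;
# safe to ctrl + c once batches begin yielding
for batch in td.read_batches_iter():
    print(batch.num_rows)  # assuming pyarrow RecordBatches are yielded
```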
ctrl + cto kill early):The thing to note here is the metadata query is quick, about 4.23 seconds. With that quick and complete, we can move into yielding data quicker.
6- Lastly, the 'ole needle-in-a-haystack query to demonstrate the real benefit of these changes:
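For example (hypothetical; record id is a placeholder):

```python
# a single current record among millions: formerly slow against the old view,
# near instant against the materialized temp table
td.metadata.conn.query(
    """
    select * from metadata.current_records
    where timdex_record_id = '<some-record-id>';
    """
).to_df()
```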
Two things to note here:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES: faster current records querying and data retrieval!
What are the relevant tickets?