Skip to content

Commit 51a35ac

Browse files
committed
Move read ordering back to SQL query
Why these changes are being introduced: The ordering of metadata records by filename + run_record_offset was moved into the python pandas context for a performance boost, but it was not ideal from the POV of keeping the majority of our logic in SQL. Upon learning that we could use `hash(filename)` to still order the filenames but with a dramatic speed and memory improvement, it makes sense to move this back into the SQL context. How this addresses that need: * Moves metadata query ordering back to SQL instead of python pandas context Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-543
1 parent 14c2b1b commit 51a35ac

File tree

2 files changed

+9
-2
lines changed

2 files changed

+9
-2
lines changed

timdex_dataset_api/dataset.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -387,7 +387,6 @@ def read_batches_iter(
387387
metadata_time = time.perf_counter()
388388
meta_query = self.metadata.build_meta_query(table, limit, where, **filters)
389389
meta_df = self.metadata.conn.query(meta_query).to_df()
390-
meta_df = meta_df.sort_values(by=["filename", "run_record_offset"])
391390
logger.debug(
392391
f"Metadata query identified {len(meta_df)} rows, "
393392
f"across {len(meta_df.filename.unique())} parquet files, "

timdex_dataset_api/metadata.py

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
import duckdb
1313
from duckdb import DuckDBPyConnection
1414
from duckdb_engine import Dialect as DuckDBDialect
15-
from sqlalchemy import Table, and_, select, text
15+
from sqlalchemy import Table, and_, func, select, text
1616

1717
from timdex_dataset_api.config import configure_logger
1818
from timdex_dataset_api.utils import (
@@ -642,6 +642,14 @@ def build_meta_query(
642642
if combined is not None:
643643
stmt = stmt.where(combined)
644644

645+
# order by filename + run_record_offset
646+
# NOTE: we use a hash of the filename for ordering for a dramatic speedup, where
647+
# we don't really care about the exact order, just that they are ordered
648+
stmt = stmt.order_by(
649+
func.hash(sa_table.c.filename),
650+
sa_table.c.run_record_offset,
651+
)
652+
645653
# apply limit if present
646654
if limit:
647655
stmt = stmt.limit(limit)

0 commit comments

Comments
 (0)