Merge long running feature branch epic-TIMX-515#167
Merged
Conversation
Why these changes are being introduced:

With the addition of dataset/metadata assets in S3, we will need to perform actions like downloading the static DB file, uploading a new one, and deleting append deltas.

How this addresses that need:

Creates new utility class S3Client that performs these actions.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
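For illustration only, a minimal sketch of what such a utility class might look like, assuming boto3; method and parameter names here are hypothetical and may differ from the actual S3Client:

```python
import boto3


class S3Client:
    """Sketch of a thin wrapper around boto3 for dataset/metadata assets in S3."""

    def __init__(self) -> None:
        self.client = boto3.client("s3")

    def download_file(self, bucket: str, key: str, local_path: str) -> None:
        # e.g. pull the static metadata DB file down before working with it
        self.client.download_file(bucket, key, local_path)

    def upload_file(self, local_path: str, bucket: str, key: str) -> None:
        # e.g. publish a newly rebuilt static metadata DB file
        self.client.upload_file(local_path, bucket, key)

    def delete_object(self, bucket: str, key: str) -> None:
        # e.g. remove an append delta once it has been merged
        self.client.delete_object(Bucket=bucket, Key=key)
```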
…lient Timx 530 prep work and s3 client
Why these changes are being introduced:

The current overarching work is to support the creation and reading of a static metadata database file and append deltas. To get there, very little of the original TIMDEXDatasetMetadata class is needed or wanted. This commit begins the process of rebuilding TIMDEXDatasetMetadata, oriented around managing a static metadata database file, and providing a readonly projection over that and append delta parquet files.

How this addresses that need:

TIMDEXDatasetMetadata is almost completely rebuilt, with the first functionality being the creation of the static metadata file by scanning the ETL records, followed by the ability to remotely attach in readonly mode to this metadata database file for reading.

Note: these changes are breaking. TIMDEXDataset cannot provide "current" records and many unit tests are broken. This will be addressed in future commits as we build this class back up with new functionality.

Side effects of this change:
* TIMDEXDataset cannot provide current records
* Unit tests are either temporarily skipped or failing

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
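A rough sketch of the readonly attach described here; the S3 URI, secret setup, and database alias are illustrative, not the library's actual configuration:

```python
import duckdb

conn = duckdb.connect()  # in-memory connection
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("INSTALL aws; LOAD aws;")
conn.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")

# attach the static metadata database file in read-only mode
conn.execute(
    "ATTACH 's3://example-bucket/dataset/metadata/metadata.duckdb' "
    "AS metadata_db (READ_ONLY);"
)
```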
Why these changes are being introduced:

While the TIMDEXDatasetMetadata class is rebuilt, TIMDEXDataset itself can no longer provide "current" records from the dataset as it has no metadata to work with. This is temporary until TIMDEXDatasetMetadata is rebuilt and TIMDEXDataset gets new functionality based on *that* new metadata.

How this addresses that need:
* Any reference to "current records" is removed

Side effects of this change:
* TIMDEXDataset cannot provide current records

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
Why these changes are being introduced:

This is a small change now that will lead to a larger change later. The TIMDEX dataset is getting more structure, which means we will want to initialize a TIMDEXDataset instance with the root of the dataset, while internally being more opinionated about where files are read from and written to.

How this addresses that need:

A new property 'data_records_root' is added to TIMDEXDataset that mirrors similar properties in TIMDEXDatasetMetadata. This informs any operations that need to read or write ETL records where precisely they are in the dataset. At this time only .write() utilizes it, but in a future ticket the load method will be heavily reworked (if not outright removed) and this property will be fully integrated. This is needed now to continue updates to TIMDEXDatasetMetadata for TIMX-530.

Side effects of this change:
* Initialization of TIMDEXDataset should provide the true dataset root, not point to /data/records. The pipeline lambda currently does this, but will be updated in TIMX-531.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
* https://mitlibraries.atlassian.net/browse/TIMX-531
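Purely illustrative, a derived-path property along these lines; the data/records subpath comes from the commit description, and the class body here is a sketch rather than the actual implementation:

```python
class TIMDEXDatasetSketch:
    def __init__(self, location: str) -> None:
        self.location = location  # root of the TIMDEX dataset (local path or S3 URI)

    @property
    def data_records_root(self) -> str:
        # ETL record parquet files live under a fixed subpath of the dataset root
        return f"{self.location.rstrip('/')}/data/records"
```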
Why these changes are being introduced:

With the big changes to TIMDEXDatasetMetadata comes the need to virtually rewrite the test suite for that class. The changes to TIMDEXDatasetMetadata are also influencing tests for TIMDEXDataset, both how it's loaded and how 'current' record reading is tested.

How this addresses that need:

This begins with some basic tests around the loading, creating, and attaching of a static database file for TIMDEXDatasetMetadata. Future tests will more fully exercise the final views and tables created. This commit also *temporarily* skips a number of tests for TIMDEXDataset that will not pass until the ability to limit to 'current' records is reinstated with the updated TIMDEXDatasetMetadata.

Side effects of this change:
* Test suite passes, but multiple tests are temporarily skipped.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
…data-db-file TIMX 530 - rebuild TIMDEXDatasetMetadata with static database file approach
Why these changes are being introduced:

With the new metadata approach, an important component is "append deltas". These are standalone parquet files that contain metadata about the records added to the ETL records data parquet files. They are eventually merged into the main static metadata file, but until then are needed for metadata queries.

How this addresses that need:

During TIMDEXDataset.write(), after each ETL parquet file is written we lean on the new method TIMDEXDatasetMetadata.write_append_delta_duckdb() to read metadata from that file and write a new append delta parquet file. This is performed entirely in a DuckDB context, allowing for simple column selection.

Side effects of this change:
* During dataset write, append deltas will be created.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-527
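A sketch of that DuckDB-driven flow; the metadata column list, paths, and function name are assumptions and may differ from the real write_append_delta_duckdb():

```python
import duckdb


def write_append_delta(
    conn: duckdb.DuckDBPyConnection, etl_parquet_path: str, delta_path: str
) -> None:
    """Select metadata columns from a just-written ETL parquet file and copy
    them to a standalone append delta parquet file."""
    conn.execute(
        f"""
        COPY (
            -- column list is illustrative; filename=true records the ETL file path
            SELECT timdex_record_id, source, run_type, run_timestamp, filename
            FROM read_parquet('{etl_parquet_path}', filename = true)
        ) TO '{delta_path}' (FORMAT PARQUET);
        """
    )
```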
TIMX 527 - write append deltas
Why these changes are being introduced:

Much of the refactor work has been building to provide metadata views for all records and the current version of a given TIMDEX record, views we had previously but calculated on demand each time.

How this addresses that need:

When setting up the DuckDB context for TIMDEXDatasetMetadata, we create views that build from a) the static metadata database file and b) the append deltas, providing a projection over all metadata records. Two primary views are added:

* 'records': all records in the ETL parquet dataset
* 'current_records': filter to the most recent version of any timdex_record_id from 'records'

These views will provide the metadata for future work that (re)implements filtering to current records during read.

Side effects of this change:
* Views are created on TIMDEXDatasetMetadata initialization

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
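A simplified, self-contained sketch of the two views; the run_timestamp column and the union of static rows with append delta rows are assumptions, and both sources are faked with plain tables so the example runs standalone:

```python
import duckdb

conn = duckdb.connect()

# stand-ins for the static metadata database and the append delta parquet files
conn.execute("CREATE TABLE static_records (timdex_record_id TEXT, run_timestamp TIMESTAMP);")
conn.execute("CREATE TABLE append_deltas (timdex_record_id TEXT, run_timestamp TIMESTAMP);")

# 'records': all metadata records, static file plus append deltas
conn.execute(
    """
    CREATE VIEW records AS
    SELECT * FROM static_records
    UNION ALL
    SELECT * FROM append_deltas;
    """
)

# 'current_records': only the most recent version of each timdex_record_id
conn.execute(
    """
    CREATE VIEW current_records AS
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *, row_number() OVER (
            PARTITION BY timdex_record_id ORDER BY run_timestamp DESC
        ) AS rn
        FROM records
    )
    WHERE rn = 1;
    """
)
```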
Why these changes are being introduced:

The test suite was built piecemeal as the library grew, and over time the fixture names were becoming clunky and confusing.

How this addresses that need:

Rename, simplify, and reorganize test fixtures. This requires coordinated changes in tests, nearly entirely just pointing at new fixture names.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
TIMX 526 - projected views
Why these changes are being introduced:

As the TIMDEXDatasetMetadata becomes more integrated, there is less need to be explicit about how we load the pyarrow dataset. Formerly, the method .load() needed to be called manually and supported options like 'current_records' or 'include_parquet_files'. This also reflected a time when 'TIMDEXDataset.load()' suggested that "loading" meant the pyarrow dataset only. With the introduction of metadata, it is better to be specific that we are loading a pyarrow dataset, which is only one of many assets associated with a TIMDEXDataset instance.

How this addresses that need:

Renames .load() to .load_pyarrow_dataset() to be explicit about what is happening. We no longer store the pyarrow dataset filesystem or paths on self, as they are only used briefly during this dataset load; we can get them anytime via .dataset. Most importantly, we limit the root 'location' used to initialize a TIMDEXDataset instance to a string only: the root of the dataset. Now that we don't allow a list of strings at that level, we can trust self.location to be a string and the root of the TIMDEX dataset.

Side effects of this change:
* TIMDEXDataset and TIMDEXDatasetMetadata can only be initialized with a string, which is the root of the TIMDEX dataset. From there, both know where their assets can be found.
* You cannot "pre-filter" the pyarrow dataset when loading, which had confusing overlap with the read methods; the read methods themselves may change somewhat dramatically now that we have metadata to use.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-533
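A loose sketch of loading the pyarrow dataset from the dataset root; the data/records subpath and the standalone function shape are illustrative, not the actual method body:

```python
import pyarrow.dataset as ds


def load_pyarrow_dataset(location: str) -> ds.Dataset:
    # location is the dataset root; ETL record parquet files live beneath it
    return ds.dataset(f"{location.rstrip('/')}/data/records", format="parquet")
```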
TIMX 533 - Load pyarrow dataset on TIMDEXDataset init
Why these changes are being introduced:

TIMDEXDatasetMetadata (TDM) has a DuckDB context for metadata attachments and views. TIMDEXDataset (TD) will need one for DuckDB queries that return actual ETL data, not just metadata.

How this addresses that need:
* TD reuses the TDM.conn DuckDB connection and builds upon it
* TDM DuckDB connection builds metadata related views in a "metadata" schema
* TD will build views in a "data" schema

Side effects of this change:
* All TDM metadata views are under a "metadata" schema

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
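A sketch of the shared-connection, two-schema layout described above; only the schema names come from the commit, the rest is illustrative:

```python
import duckdb

conn = duckdb.connect()

# TIMDEXDatasetMetadata builds its metadata views here...
conn.execute("CREATE SCHEMA IF NOT EXISTS metadata;")

# ...and TIMDEXDataset reuses the same connection for its data views
conn.execute("CREATE SCHEMA IF NOT EXISTS data;")
```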
Why these changes are being introduced:

This commit is a culmination of work to elevate metadata about ETL records to the point it can be used to improve the speed and efficiency of data queries. While the signature of the read methods remains mostly the same, it exposes a 'where' clause that accepts raw SQL to filter the results, allowing for more advanced querying beyond the simple key/value DatasetFilters. Additionally, and equally important, data retrieval now comes directly from DuckDB instead of lower level pyarrow dataset reads. Overall complexity remains about the same, but we have shifted focus to DuckDB table and view preparation and SQL construction, which also pays dividends in other contexts. It is anticipated this will set us up well for other data we may add to the TIMDEX dataset, e.g. vector embeddings or fulltext, which we may want to query and retrieve.

How this addresses that need:

As before, all read methods eventually call TIMDEXDataset.read_batches_iter(), which now performs a two-part process of first quickly querying metadata records, then using that information to prune the heavier data retrieval. SQLAlchemy is used to model DuckDB tables and views so that we can preserve the simpler key/value DatasetFilters, e.g. source='libguides' or run_type='daily', which will likely represent the majority of the public API needs, by converting those key/value pairs into a SQL WHERE clause programmatically. This is done without the need for complex string interpolation and escaping. The overall input and output signatures are largely the same, but the underlying approach to querying the ETL parquet records now utilizes DuckDB much more heavily, while also providing a SQL 'escape hatch' if the keyword filters don't suffice.

Side effects of this change:
* None! Transmog and TIM can call TDA in the same way as before. The underlying approach is different, but the signatures are mostly the same.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
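A sketch of turning simple key/value filters into a SQL WHERE clause with SQLAlchemy, avoiding manual string interpolation and escaping; the table and column definitions here are assumptions, not the library's actual models:

```python
import sqlalchemy as sa

metadata_obj = sa.MetaData()
records = sa.Table(
    "records",
    metadata_obj,
    sa.Column("timdex_record_id", sa.String),
    sa.Column("source", sa.String),
    sa.Column("run_type", sa.String),
    schema="metadata",
)


def filters_to_where(**filters: str) -> str:
    """Render key/value filters as a WHERE clause with literal values bound."""
    clause = sa.and_(*(records.c[key] == value for key, value in filters.items()))
    return str(clause.compile(compile_kwargs={"literal_binds": True}))


print(filters_to_where(source="libguides", run_type="daily"))
# roughly: metadata.records.source = 'libguides' AND metadata.records.run_type = 'daily'
```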
Why these changes are being introduced:

During the refactor to use dataset metadata for querying, we had to temporarily skip tests that tested for dataset filtering and current records limiting. With the SQL backed querying now in place, these tests can be reinstated. Note that a future commit will likely *add* a couple more tests for the new, optional 'WHERE' clause functionality.

How this addresses that need:
* No tests are skipped.
* Dataset filtering tests still remain, but the key/value filters are just handled differently under the hood.
* Tests for current records no longer use .load(current_records=True) but instead utilize the DuckDB table via table='current_records' within a read method.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
…hods TIMX 529 - read methods utilize SQL + metadata context
Why these changes are being introduced:

Until we update the build approach, possibly when we migrate to uv, we need to add any dependencies to *both* Pipfile (for local dev) and pyproject.toml (for building by other applications).

How this addresses that need:
* Adds duckdb_engine and sqlalchemy to pyproject.toml

Relevant ticket(s):
* None
TIMX 515 - dependency hotfix
Why these changes are being introduced:

It sounds like the best option for ECS tasks is using 'instance' as the provider chain type, whereas for local dev and/or lambdas it might be 'sso' or 'env'. Not having 'instance' appears to cause failures in the ECS task.

How this addresses that need:

By omitting the 'chain' option entirely from DuckDB secret creation we allow the default provider chain to take effect. Given our fairly normal usage of DuckDB and S3, this is probably the best approach.

Side effects of this change:
* DuckDB to S3 connections work in ECS

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-540
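A sketch of S3 secret creation that omits the 'chain' option so DuckDB's default AWS credential provider chain applies across ECS, Lambda, and local SSO/env setups; the secret name is illustrative:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("INSTALL aws; LOAD aws;")
conn.execute(
    """
    CREATE OR REPLACE SECRET timdex_s3 (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN
        -- no CHAIN 'instance'/'sso'/'env' specified; the default chain resolves credentials
    );
    """
)
```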
Why these changes are being introduced:

In the AWS Lambda context, the HOME env var is an empty string ''. DuckDB has a canned error response for this, suggesting to "Specify a home directory using the SET home_directory='/path/to/dir' option".

How this addresses that need:

If HOME is unset or an empty string, set an explicit secret and extension directory at `/tmp/.duckdb/*` locations.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-541
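A sketch of the Lambda-safe directory setup described above; the /tmp/.duckdb/* paths follow the commit message, while setting home_directory to /tmp is an assumption based on DuckDB's suggested fix:

```python
import os

import duckdb

conn = duckdb.connect()
if not os.environ.get("HOME"):
    # HOME is unset or empty (e.g. AWS Lambda): point DuckDB at writable /tmp locations
    conn.execute("SET home_directory = '/tmp';")
    conn.execute("SET secret_directory = '/tmp/.duckdb/secrets';")
    conn.execute("SET extension_directory = '/tmp/.duckdb/extensions';")
```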
…ection TIMX 540 - adjust AWS credential chain for ECS context
…tion-lambda-context TIMX 541 - handle missing HOME env in AWS Lambda
Why these changes are being introduced:

The shift to elevating dataset metadata as an expected and required asset, and utilizing DuckDB contexts for metadata and data retrieval, is a sizable and breaking change. This feels like it warrants a major version bump.

How this addresses that need:
* Bumps internal library version to 3.0

Side effects of this change:
* `make update` commands from TIMDEX components will pick up this new version

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-537
…ion-3 Bump version to v3.0
Why these changes are being introduced:
* The method for merging append deltas into the static metadata database file needs the filenames of append deltas to easily identify which files to delete (once merged). Prior to this change, the append deltas view only had a 'filename' column, which referred to the path or S3 URI for the TIMDEXDataset parquet file.

How this addresses that need:
* Set filename='append_delta_filename' when creating metadata.append_deltas view
* Explicitly select metadata column names when creating metadata.records view

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
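A sketch of the renamed filename column on the append_deltas view; the glob path is illustrative, and the string-valued filename parameter (naming the added column 'append_delta_filename' so it does not clash with the existing 'filename' metadata column that points at the ETL parquet file) follows the commit message:

```python
import duckdb

conn = duckdb.connect()
conn.execute("CREATE SCHEMA IF NOT EXISTS metadata;")
conn.execute(
    """
    CREATE VIEW metadata.append_deltas AS
    SELECT *
    FROM read_parquet(
        's3://example-bucket/dataset/metadata/append_deltas/*.parquet',
        filename = 'append_delta_filename'
    );
    """
)
```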
Why these changes are being introduced:
* TDA requires a method to support regularly merging append deltas into the static metadata.duckdb database file as new rows, and then deleting those append deltas once merged.

How this addresses that need:
* Add 'merge_append_deltas' method
* Add unit test

Side effects of this change:
* It's worth noting that this method, when run, will delete the append delta parquet files that existed in the directory at the time of execution.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-528
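A sketch of a merge-then-delete flow roughly as described above; it assumes the static database file is attached writable as metadata_db with a records table, and that append deltas are exposed via a metadata.append_deltas view with an append_delta_filename column. Table and view names are assumptions, not the actual merge_append_deltas implementation:

```python
import duckdb


def merge_append_deltas(conn: duckdb.DuckDBPyConnection) -> list[str]:
    """Merge append delta rows into the static metadata database, returning the
    delta files that existed at execution time so the caller can delete them."""
    # capture which delta files exist right now, so only those are deleted later
    delta_files = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT append_delta_filename FROM metadata.append_deltas;"
        ).fetchall()
    ]
    # append the delta rows into the static metadata database file
    conn.execute(
        """
        INSERT INTO metadata_db.records
        SELECT * EXCLUDE (append_delta_filename) FROM metadata.append_deltas;
        """
    )
    return delta_files  # caller deletes these parquet files (e.g. via S3Client)
```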
Purpose and background context
This PR merges the long running feature branch epic-TIMX-515 into main. Once merged, we will tag a v3.0 release.

This is safe to do anytime: TIMDEX components like Transmog, pipeline lambdas, and TIM which utilize this library all would need a) an explicit make update run, and b) a tagged release for production (not coincidentally, all work planned in TIMX-537).

Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES and NO: without updated dependencies and a tagged release, no real changes in the TIMDEX ecosystem. But Transmog, pipeline lambdas, and TIM all have some minor adjustments to align with the v3.0 release and will pick up the new version then.

What are the relevant tickets?