Commit 0235289

Refactor DatasetRecord to use attrs
Why these changes are being introduced:
* Reworking the dataset partitions to use the [year, month, day] of the 'run_date' means that parquet files for different 'source' runs on the same 'run_date' get written to the same partition directory. Therefore, it is crucial that the timdex_dataset_api.write method retrieves the correct partition columns from the (batches) of DatasetRecord objects. The DatasetRecord class has been refactored to adhere to the following criteria:
    1. When writing to the dataset, and therefore serializing DatasetRecord objects, year, month, day should be derived from the run_date and should not be modifiable
    2. If possible, avoid parsing a datetime string 3 times for each partition column

How this addresses that need:
* Refactor DatasetRecord to use attrs
* Define custom strict_date_parse converter method for 'run_date' field
* Simplify serialization method to rely on converter for 'run_date' error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
1 parent 0849ee3 commit 0235289

5 files changed: +303 -307 lines changed

Pipfile

Lines changed: 4 additions & 4 deletions
@@ -4,6 +4,7 @@ verify_ssl = true
 name = "pypi"
 
 [packages]
+attrs = "*"
 boto3 = "*"
 duckdb = "*"
 pandas = "*"
@@ -14,15 +15,14 @@ black = "*"
 boto3-stubs = {version = "*", extras = ["s3"]}
 coveralls = "*"
 ipython = "*"
+moto = "*"
 mypy = "*"
+pandas-stubs = "*"
 pre-commit = "*"
+pytest-mock = "*"
 pyarrow-stubs = "*"
 pytest = "*"
 ruff = "*"
 setuptools = "*"
-pandas-stubs = "*"
-moto = "*"
-pytest-mock = "*"
-
 [requires]
 python_version = "3.12"