Commit 0235289
Refactor DatasetRecord to use attrs
Why these changes are being introduced:
* Reworking the dataset partitions to use the [year, month, day]
of the 'run_date' means that parquet files for different 'source' runs
on the same 'run_date' get written to the same partition directory.
Therefore, it is crucial that the timdex_dataset_api.write method
retrieves the correct partition columns from each batch of DatasetRecord
objects. The DatasetRecord class has been refactored to adhere
to the following criteria:
1. When writing to the dataset, and therefore serializing DatasetRecord objects,
year, month, day should be derived from the run_date and should not be modifiable
2. If possible, avoid parsing the datetime string three times, once for each partition column
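
To illustrate the concern above, here is a minimal sketch of how partition columns derived from 'run_date' send different 'source' runs to the same directory. The `partition_dir` helper and the `year=/month=/day=` directory layout are assumptions made for illustration, not code from this commit.

```python
from datetime import datetime


def partition_dir(run_date: str) -> str:
    """Build a partition path from a 'YYYY-MM-DD' run_date (hypothetical helper)."""
    # Parse the string once, then format all three partition columns from it.
    parsed = datetime.strptime(run_date, "%Y-%m-%d").date()
    return f"year={parsed:%Y}/month={parsed:%m}/day={parsed:%d}"


# Two runs for different sources on the same run_date share one directory,
# because 'source' is not part of the partition path.
first_source_dir = partition_dir("2024-12-01")
second_source_dir = partition_dir("2024-12-01")
```

Because the path depends only on the date, any record whose year, month, or day disagrees with its 'run_date' would land in the wrong directory, which is why the criteria call for deriving those values rather than accepting them as input.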
How this addresses that need:
* Refactor DatasetRecord to use attrs
* Define custom strict_date_parse converter method for 'run_date' field
* Simplify serialization method to rely on converter for 'run_date'
error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency
Side effects of this change:
* None
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-4321
Parent commit: 0849ee3
5 files changed (+303, -307) in tests and timdex_dataset_api