Commit 785d3a3

Update README

1 parent 3f90fc6 commit 785d3a3

1 file changed: +116 −2 lines changed

README.md

Lines changed: 116 additions & 2 deletions
@@ -1,5 +1,7 @@

# timdex-dataset-api

Python library for interacting with a TIMDEX parquet dataset located remotely or in S3. This library is often abbreviated as "TDA".

## Development

@@ -9,6 +11,13 @@

- To run unit tests: `make test`
- To lint the repo: `make lint`

The library version number is set in [`timdex_dataset_api/__init__.py`](timdex_dataset_api/__init__.py), e.g.:

```python
__version__ = "2.1.0"
```

Bumping the version number when making changes to the library ensures that applications which install it will pick up the new version the next time _their_ dependencies are updated.

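A version bump matters because dependency resolvers compare version numbers to decide whether an upgrade is available. A minimal, purely illustrative sketch (not part of TDA) of how semantic version strings compare:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a semantic version string like "2.1.0" into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# a bumped version compares greater, which is what prompts dependency
# resolvers to pick up the new release
assert parse_version("2.1.0") > parse_version("2.0.9")
assert parse_version("2.1.0") == (2, 1, 0)
```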
## Installation

This library is designed to be utilized by other projects, and can therefore be added as a dependency directly from the GitHub repository.

@@ -30,11 +39,116 @@

### Required

None at this time.

### Optional

```shell
TDA_LOG_LEVEL=# log level for timdex-dataset-api, accepts [DEBUG, INFO, WARNING, ERROR], default INFO
WARNING_ONLY_LOGGERS=# comma-separated list of logger names to set as WARNING only, e.g. 'botocore,charset_normalizer,smart_open'
```
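As an illustration of what the `WARNING_ONLY_LOGGERS` convention implies, noisy third-party loggers can be quieted so only WARNING and above are emitted. This is a sketch of the pattern, not TDA's actual implementation:

```python
import logging
import os

# assume the variable is set as in the shell block above
os.environ.setdefault("WARNING_ONLY_LOGGERS", "botocore,charset_normalizer,smart_open")

# raise each named logger's threshold so only WARNING and above are emitted
for name in os.environ["WARNING_ONLY_LOGGERS"].split(","):
    logging.getLogger(name).setLevel(logging.WARNING)

assert logging.getLogger("botocore").getEffectiveLevel() == logging.WARNING
```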

## Usage

Currently, the most common use cases are:

* **Transmogrifier**: uses TDA to **write** to the parquet dataset
* **TIMDEX-Index-Manager (TIM)**: uses TDA to **read** from the parquet dataset

Beyond those two ETL run use cases, others are emerging where this library proves helpful:

* yielding only the current version of all records in the dataset, useful for quickly re-indexing to OpenSearch
* high-throughput (time) and memory-safe (space) access to the dataset for analysis

For both reading and writing, the following env vars are recommended:

```shell
TDA_LOG_LEVEL=INFO
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3
```

### Reading Data

First, import the library:

```python
from timdex_dataset_api import TIMDEXDataset
```

Load a dataset instance:

```python
# dataset in S3
timdex_dataset = TIMDEXDataset("s3://my-bucket/path/to/dataset")

# or, a local dataset (e.g. for testing or development)
timdex_dataset = TIMDEXDataset("/path/to/dataset")

# load the dataset, which discovers all parquet files
timdex_dataset.load()

# or, load the dataset but ensure that only current records are ever yielded
timdex_dataset.load(current_records=True)
```

All read methods on `TIMDEXDataset` accept the same group of filters, which are defined in `timdex_dataset_api.dataset.DatasetFilters`. Examples are shown below.

```python
# read a single record, no filtering
single_record_dict = next(timdex_dataset.read_dicts_iter())


# get batches of records, filtering to a particular run
for batch in timdex_dataset.read_batches_iter(
    source="alma",
    run_date="2025-06-01",
    run_id="abc123",
):
    ...  # do something with each pyarrow record batch


# use convenience method to yield only transformed records
# NOTE: this is what TIM uses for indexing to OpenSearch for a given ETL run
for transformed_record in timdex_dataset.read_transformed_records_iter(
    source="aspace",
    run_date="2025-06-01",
    run_id="ghi789",
):
    ...  # do something with each transformed record dictionary


# load all records for a given run into a pandas dataframe
# NOTE: this can be expensive memory-wise if the run is large
run_df = timdex_dataset.read_dataframe(
    source="dspace",
    run_date="2025-06-01",
    run_id="def456",
)
```

### Writing Data

At this time, the only application that writes to the ETL parquet dataset is Transmogrifier.

To write records to the dataset, you must prepare an iterator of `timdex_dataset_api.record.DatasetRecord` instances. Here is some pseudocode for how a dataset write can work:

```python
from collections.abc import Iterator

from timdex_dataset_api import DatasetRecord, TIMDEXDataset


# there are different ways to achieve this; we just need some kind of
# iterator (e.g. list, generator, etc.) of DatasetRecords for writing
def records_to_write_iter() -> Iterator[DatasetRecord]:
    records = [...]
    for record in records:
        yield DatasetRecord(
            timdex_record_id=...,
            source_record=...,
            transformed_record=...,
            source=...,
            run_date=...,
            run_type=...,
            run_timestamp=...,
            action=...,
            run_record_offset=...,
        )

records_iter = records_to_write_iter()

# finally, perform the write, relying on the library to handle efficient batching
timdex_dataset = TIMDEXDataset("/path/to/dataset")
timdex_dataset.write(records_iter=records_iter)
```
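The generator pattern above can be exercised end-to-end with a stand-in record type (a plain dataclass here, hypothetical and not the real `DatasetRecord`):

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass
class StubRecord:
    """Hypothetical stand-in for timdex_dataset_api.record.DatasetRecord."""
    timdex_record_id: str
    source: str


def stub_records_iter() -> Iterator[StubRecord]:
    # yield records one at a time so a writer can batch them efficiently
    for i in range(3):
        yield StubRecord(timdex_record_id=f"rec-{i}", source="alma")


records = list(stub_records_iter())
assert len(records) == 3
assert records[0].timdex_record_id == "rec-0"
```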
