Commit b5279be

Update README files for multimedia RAG and preference alignment with GCP authentication and dataset download instructions
1 parent 483d285 commit b5279be

File tree

2 files changed: +91 −32 lines

implementations/multimedia_rag/README.md

Lines changed: 60 additions & 19 deletions
````diff
@@ -51,13 +51,13 @@ This is required for embedding video/audio segments.
 mkdir -p data
 ```
 
-### VQA JSON Files (Place in `data/`)
+### VQA JSON Files
 
-Download:
+These are included in the GCP download below for convenience — same files as the SONIC-O1 HuggingFace dataset. They define the multiple-choice video QA tasks.
 
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json)
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json)
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json)
+If you prefer to download them individually:
 
-These define the multiple-choice video QA tasks.
+- [02_Job_Interviews.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json)
+- [04_Customer_Service_Interactions.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json)
+- [01_Patient-Doctor_Consultations.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json)
 
````

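For scripted downloads, the HuggingFace `blob` page URLs above can be rewritten into direct-download `resolve` URLs. A minimal sketch (the `fetch_vqa_jsons` helper is hypothetical, not part of the repo; only the `/blob/` to `/resolve/` rewrite is standard HuggingFace convention):

```python
import urllib.request
from pathlib import Path

VQA_BLOB_URLS = [
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json",
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json",
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json",
]

def to_resolve_url(blob_url: str) -> str:
    # HuggingFace serves raw file bytes when /blob/ is swapped for /resolve/.
    return blob_url.replace("/blob/", "/resolve/", 1)

def fetch_vqa_jsons(dest: str = "data") -> None:
    # Hypothetical helper: download each VQA JSON into dest/.
    Path(dest).mkdir(parents=True, exist_ok=True)
    for url in VQA_BLOB_URLS:
        target = Path(dest) / url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(to_resolve_url(url), target)
```

Calling `fetch_vqa_jsons()` from the repo root would place the three files in `data/`, matching the layout the notebooks expect.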
````diff
@@ -64,8 +64,21 @@
 ### Video / Audio / Caption Files
 
-Download from:
+The media dataset is hosted in a GCP bucket.
 
+#### 1) Authenticate with GCP
+
+```bash
+gcloud auth login
+gcloud auth application-default login
+# When prompted, enter the email you used to log into the coder platform.
+# A browser window will open for Google sign-in. After signing in, you will receive a code.
+# Copy that code back into the terminal to complete authentication.
+gcloud config set account YOUR_EMAIL
 ```
-<GOOGLE_DRIVE_LINK_PLACEHOLDER>
+
+Verify active account:
+
+```bash
+gcloud auth list
 ```
 
````

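The authentication steps added above can also be checked programmatically before starting a long download. A minimal sketch (the `active_gcloud_account` helper is hypothetical and assumes `gcloud` is on `PATH`; `--filter` and `--format` are standard `gcloud` flags):

```python
import subprocess
from typing import Optional

def parse_account(raw: str) -> Optional[str]:
    # First non-empty line of `gcloud auth list --format=value(account)` output.
    for line in raw.splitlines():
        if line.strip():
            return line.strip()
    return None

def active_gcloud_account() -> Optional[str]:
    # Return the active gcloud account, or None if nothing is configured.
    result = subprocess.run(
        ["gcloud", "auth", "list", "--filter=status:ACTIVE", "--format=value(account)"],
        capture_output=True, text=True, check=False,
    )
    return parse_account(result.stdout)
```

A notebook could then fail fast with `assert active_gcloud_account(), "run gcloud auth login first"`.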
````diff
@@ -72,14 +85,47 @@
-Extract contents into:
+#### 2) Download Dataset
 
 ```bash
-data/
+cd implementations/multimedia_rag
+gcloud storage cp gs://interp-bootcamp-data/multimedia_rag/data.zip .
+unzip data.zip
 ```
 
-The expected structure includes:
+Files are placed correctly after extraction — no manual reorganisation needed.
+
+#### 3) Cleanup temporary files
 
-* video
-* audio
-* caption
+```bash
+rm -rf __MACOSX data.zip data/.DS_Store
+```
+
+The zip contains everything needed to run the notebooks:
+
+```
+data/
+├── Customer_Service_Interactions/
+│   ├── audio/                    # base audio files
+│   ├── video/                    # base video files
+│   ├── caption/                  # base caption files
+│   ├── process-audio/            # pre-generated, can be regenerated
+│   ├── process-video/            # pre-generated, can be regenerated
+│   ├── segment-audio_30s/        # pre-generated, can be regenerated
+│   ├── segment-video_30s/        # pre-generated, can be regenerated
+│   ├── segment-caption_30s/      # pre-generated, can be regenerated
+│   ├── audio_embeddings.pt       # pre-generated, can be regenerated
+│   ├── video_embeddings.pt       # pre-generated, can be regenerated
+│   └── caption_embeddings.pt     # pre-generated, can be regenerated
+├── Job_Interviews/               # same structure as above
+├── Patient-Doctor_Consultations/ # same structure as above
+├── global_embeddings/            # pre-generated, can be regenerated
+├── Customer_Service_Interactions.json
+├── Customer_Service_Interactions_filtered.json  # pre-generated, can be regenerated
+├── Job_Interviews.json
+├── Job_Interviews_filtered.json                 # pre-generated, can be regenerated
+├── Patient-Doctor_Consultations.json
+└── Patient-Doctor_Consultations_filtered.json   # pre-generated, can be regenerated
+```
+
+Pre-generated files (`process-*`, `segment-*`, `*.pt` embeddings, `global_embeddings/`, `*_filtered.json`) are included to save time, but can all be reproduced by running the notebooks from scratch.
 
 ---
 
````

(Note: the cleanup command uses `rm -rf` because `__MACOSX` is a directory and `rm -f` alone cannot remove it.)

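As a sanity check after extraction, the base layout shown in the tree above can be verified with a short script (the `check_data_layout` helper is hypothetical; the topic and subfolder names are taken from the tree, and pre-generated artifacts are deliberately not checked since the notebooks can regenerate them):

```python
from pathlib import Path
from typing import List

TOPICS = ["Customer_Service_Interactions", "Job_Interviews", "Patient-Doctor_Consultations"]
EXPECTED_SUBDIRS = ["audio", "video", "caption"]

def check_data_layout(root: str = "data") -> List[str]:
    # Return a list of missing base directories; empty means the layout is in place.
    missing = []
    for topic in TOPICS:
        for sub in EXPECTED_SUBDIRS:
            p = Path(root) / topic / sub
            if not p.is_dir():
                missing.append(str(p))
    return missing
```

Usage: `missing = check_data_layout(); assert not missing, missing` before running the notebooks.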
````diff
@@ -122,11 +168,6 @@ This installs everything needed for both the Video RAG (ImageBind embedding + re
 jupyter lab
 ```
 
-Choose:
-
-* **Ref5 (Video RAG)** → retrieval pipeline
-* **Ref5 (Video QA)** → multimodal QA evaluation
-
 ---
 
 ## Implementation Overview
````

implementations/preference_alignment/README.md

Lines changed: 31 additions & 13 deletions
````diff
@@ -76,28 +76,46 @@ This directly optimizes the model to prefer correct judgments while maintaining
 
 The filtered `.parquet` files are not included in this repository.
 
-Please follow one of the options below to obtain the dataset.
+### Download Pre-Filtered Dataset (Recommended)
 
----
+The filtered dataset used in this implementation is hosted in a GCP bucket.
 
-## Download Pre-Filtered Dataset (Recommended)
+#### 1) Authenticate with GCP
 
-The filtered dataset used in this implementation is hosted in a GCP bucket.
+```bash
+gcloud auth login
+gcloud auth application-default login
+# When prompted, enter the email you used to log into the coder platform.
+# A browser window will open for Google sign-in. After signing in, you will receive a code.
+# Copy that code back into the terminal to complete authentication.
+gcloud config set account YOUR_EMAIL
+```
 
-Download the `.parquet` files using:
+Verify active account:
 
 ```bash
-gsutil cp gs://<bucket-name>/reference_implementation_4/*.parquet .
+gcloud auth list
 ```
 
-***Do not download the ```train_raw.parquet```, use the ```train_sponsor_filtered.parquet``` for data_sky or ```train_singleturn_sponsor_filtered.parquet``` for data_hh_rlhf***
+#### 2) Download Dataset
+
+```bash
+cd implementations/preference_alignment
+gcloud storage cp gs://interp-bootcamp-data/preference-alignment/data.zip .
+unzip data.zip
+```
+
+Files are placed correctly after extraction — no manual reorganisation needed.
+
+#### 3) Cleanup temporary files
+
+```bash
+rm -rf __MACOSX data.zip data/.DS_Store
+```
 
-After downloading, place the ```.parquet``` file inside one of the following folders (create the folder if it does not exist):
-```data_sky/``` or
-```data_hh_rlhf/```
-Then proceed with:
+> **Note:** Use `train_sponsor_filtered.parquet` (for `data_sky`) and `train_singleturn_sponsor_filtered.parquet` (for `data_hh_rlhf`).
 
-```01_dataset_construction.ipynb```
+Then proceed with `01_dataset_construction.ipynb`.
 
 ## Using Your Own Dataset
 
````

(Note: as in the multimedia RAG README, the cleanup command uses `rm -rf` because `__MACOSX` is a directory.)

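The note in the hunk above pairs each data folder with a specific filtered file. That pairing can be encoded as a small guard so a notebook never loads `train_raw.parquet` by accident (the `filtered_parquet_path` helper is hypothetical; the filenames are the ones named in the note):

```python
from pathlib import Path

FILTERED_PARQUET = {
    "data_sky": "train_sponsor_filtered.parquet",
    "data_hh_rlhf": "train_singleturn_sponsor_filtered.parquet",
}

def filtered_parquet_path(data_dir: str) -> Path:
    # Resolve the expected filtered parquet for a data folder; reject unknown folders.
    name = Path(data_dir).name
    if name not in FILTERED_PARQUET:
        raise ValueError(f"unknown data dir: {name!r}; expected one of {sorted(FILTERED_PARQUET)}")
    return Path(data_dir) / FILTERED_PARQUET[name]
```

For example, `filtered_parquet_path("data_sky")` resolves to `data_sky/train_sponsor_filtered.parquet`.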
````diff
@@ -196,7 +214,7 @@ source .venv/bin/activate
 `flash-attn` requires CUDA headers and `setuptools` at compile time and cannot be installed via `uv sync`. After activating the venv, install it manually:
 
 ```bash
-pip install flash-attn==2.7.3 --no-build-isolation
+uv pip install flash-attn==2.7.3 --no-build-isolation
 ```
 
 > **Note:** This step requires a GPU node with CUDA available. Skip it if you are running on a CPU-only machine.
````
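Since the flash-attn step may be skipped on CPU-only machines, downstream code can branch on availability rather than assume the import succeeds. A sketch (the helper name is hypothetical; the `attn_implementation` strings follow the HuggingFace `transformers` convention):

```python
import importlib.util

def flash_attn_available() -> bool:
    # True if the flash_attn package is importable in the current environment.
    return importlib.util.find_spec("flash_attn") is not None

# Pick an attention backend accordingly (transformers-style identifiers).
attn_implementation = "flash_attention_2" if flash_attn_available() else "eager"
```

A model load could then pass `attn_implementation=attn_implementation`, falling back gracefully on CPU-only nodes.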
