Commit b5279be

Update README files for multimedia RAG and preference alignment with GCP authentication and dataset download instructions
1 parent 483d285 commit b5279be

File tree

2 files changed: +91 −32 lines

implementations/multimedia_rag/README.md

Lines changed: 60 additions & 19 deletions
````diff
@@ -51,13 +51,13 @@ This is required for embedding video/audio segments.
 mkdir -p data
 ```
 
-### VQA JSON Files (Place in `data/`)
+### VQA JSON Files
 
-Download:
+These are included in the GCP download below for convenience — same files as the SONIC-O1 HuggingFace dataset. They define the multiple-choice video QA tasks.
 
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json)
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json)
-* [https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json)
+If you prefer to download them individually:
 
-These define the multiple-choice video QA tasks.
+- [02_Job_Interviews.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json)
+- [04_Customer_Service_Interactions.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json)
+- [01_Patient-Doctor_Consultations.json](https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json)
 
````

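For scripted downloads, the HuggingFace `blob` page URLs above can be rewritten into direct-download `resolve` URLs. A minimal sketch (the `fetch_vqa_jsons` helper is hypothetical, not part of the repo; only the `/blob/` to `/resolve/` rewrite is standard HuggingFace convention):

```python
import urllib.request
from pathlib import Path

VQA_BLOB_URLS = [
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/02_Job_Interviews.json",
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/04_Customer_Service_Interactions.json",
    "https://huggingface.co/datasets/vector-institute/sonic-o1/blob/main/vqa/task2_mcq/01_Patient-Doctor_Consultations.json",
]

def to_resolve_url(blob_url: str) -> str:
    # HuggingFace serves raw file bytes when /blob/ is swapped for /resolve/.
    return blob_url.replace("/blob/", "/resolve/", 1)

def fetch_vqa_jsons(dest: str = "data") -> None:
    # Hypothetical helper: download each VQA JSON into dest/.
    Path(dest).mkdir(parents=True, exist_ok=True)
    for url in VQA_BLOB_URLS:
        target = Path(dest) / url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(to_resolve_url(url), target)
```

Calling `fetch_vqa_jsons()` from the repo root would place the three files in `data/`, matching the layout the notebooks expect.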
````diff
@@ -64,8 +64,21 @@
 ### Video / Audio / Caption Files
 
-Download from:
+The media dataset is hosted in a GCP bucket.
 
+#### 1) Authenticate with GCP
+
+```bash
+gcloud auth login
+gcloud auth application-default login
+# When prompted, enter the email you used to log into the coder platform.
+# A browser window will open for Google sign-in. After signing in, you will receive a code.
+# Copy that code back into the terminal to complete authentication.
+gcloud config set account YOUR_EMAIL
 ```
-<GOOGLE_DRIVE_LINK_PLACEHOLDER>
+
+Verify active account:
+
+```bash
+gcloud auth list
 ```
 
````

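The authentication steps added above can also be checked programmatically before starting a long download. A minimal sketch (the `active_gcloud_account` helper is hypothetical and assumes `gcloud` is on `PATH`; `--filter` and `--format` are standard `gcloud` flags):

```python
import subprocess
from typing import Optional

def parse_account(raw: str) -> Optional[str]:
    # First non-empty line of `gcloud auth list --format=value(account)` output.
    for line in raw.splitlines():
        if line.strip():
            return line.strip()
    return None

def active_gcloud_account() -> Optional[str]:
    # Return the active gcloud account, or None if nothing is configured.
    result = subprocess.run(
        ["gcloud", "auth", "list", "--filter=status:ACTIVE", "--format=value(account)"],
        capture_output=True, text=True, check=False,
    )
    return parse_account(result.stdout)
```

A notebook could then fail fast with `assert active_gcloud_account(), "run gcloud auth login first"`.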
````diff
@@ -72,14 +85,47 @@
-Extract contents into:
+#### 2) Download Dataset
 
 ```bash
-data/
+cd implementations/multimedia_rag
+gcloud storage cp gs://interp-bootcamp-data/multimedia_rag/data.zip .
+unzip data.zip
 ```
 
-The expected structure includes:
+Files are placed correctly after extraction — no manual reorganisation needed.
+
+#### 3) Cleanup temporary files
 
-* video
-* audio
-* caption
+```bash
+rm -rf __MACOSX data.zip data/.DS_Store
+```
+
+The zip contains everything needed to run the notebooks:
+
+```
+data/
+├── Customer_Service_Interactions/
+│   ├── audio/                    # base audio files
+│   ├── video/                    # base video files
+│   ├── caption/                  # base caption files
+│   ├── process-audio/            # pre-generated, can be regenerated
+│   ├── process-video/            # pre-generated, can be regenerated
+│   ├── segment-audio_30s/        # pre-generated, can be regenerated
+│   ├── segment-video_30s/        # pre-generated, can be regenerated
+│   ├── segment-caption_30s/      # pre-generated, can be regenerated
+│   ├── audio_embeddings.pt       # pre-generated, can be regenerated
+│   ├── video_embeddings.pt       # pre-generated, can be regenerated
+│   └── caption_embeddings.pt     # pre-generated, can be regenerated
+├── Job_Interviews/               # same structure as above
+├── Patient-Doctor_Consultations/ # same structure as above
+├── global_embeddings/            # pre-generated, can be regenerated
+├── Customer_Service_Interactions.json
+├── Customer_Service_Interactions_filtered.json  # pre-generated, can be regenerated
+├── Job_Interviews.json
+├── Job_Interviews_filtered.json                 # pre-generated, can be regenerated
+├── Patient-Doctor_Consultations.json
+└── Patient-Doctor_Consultations_filtered.json   # pre-generated, can be regenerated
+```
+
+Pre-generated files (`process-*`, `segment-*`, `*.pt` embeddings, `global_embeddings/`, `*_filtered.json`) are included to save time, but can all be reproduced by running the notebooks from scratch.
 
 ---
 
````

(Note: the cleanup command uses `rm -rf` because `__MACOSX` is a directory and `rm -f` alone cannot remove it.)

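As a sanity check after extraction, the base layout shown in the tree above can be verified with a short script (the `check_data_layout` helper is hypothetical; the topic and subfolder names are taken from the tree, and pre-generated artifacts are deliberately not checked since the notebooks can regenerate them):

```python
from pathlib import Path
from typing import List

TOPICS = ["Customer_Service_Interactions", "Job_Interviews", "Patient-Doctor_Consultations"]
EXPECTED_SUBDIRS = ["audio", "video", "caption"]

def check_data_layout(root: str = "data") -> List[str]:
    # Return a list of missing base directories; empty means the layout is in place.
    missing = []
    for topic in TOPICS:
        for sub in EXPECTED_SUBDIRS:
            p = Path(root) / topic / sub
            if not p.is_dir():
                missing.append(str(p))
    return missing
```

Usage: `missing = check_data_layout(); assert not missing, missing` before running the notebooks.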
````diff
@@ -122,11 +168,6 @@ This installs everything needed for both the Video RAG (ImageBind embedding + re
 jupyter lab
 ```
 
-Choose:
-
-* **Ref5 (Video RAG)** → retrieval pipeline
-* **Ref5 (Video QA)** → multimodal QA evaluation
-
 ---
 
 ## Implementation Overview
````

implementations/preference_alignment/README.md

Lines changed: 31 additions & 13 deletions
````diff
@@ -76,28 +76,46 @@ This directly optimizes the model to prefer correct judgments while maintaining
 
 The filtered `.parquet` files are not included in this repository.
 
-Please follow one of the options below to obtain the dataset.
+### Download Pre-Filtered Dataset (Recommended)
 
----
+The filtered dataset used in this implementation is hosted in a GCP bucket.
 
-## Download Pre-Filtered Dataset (Recommended)
+#### 1) Authenticate with GCP
 
-The filtered dataset used in this implementation is hosted in a GCP bucket.
+```bash
+gcloud auth login
+gcloud auth application-default login
+# When prompted, enter the email you used to log into the coder platform.
+# A browser window will open for Google sign-in. After signing in, you will receive a code.
+# Copy that code back into the terminal to complete authentication.
+gcloud config set account YOUR_EMAIL
+```
 
-Download the `.parquet` files using:
+Verify active account:
 
 ```bash
-gsutil cp gs://<bucket-name>/reference_implementation_4/*.parquet .
+gcloud auth list
 ```
 
-***Do not download the ```train_raw.parquet```, use the ```train_sponsor_filtered.parquet``` for data_sky or ```train_singleturn_sponsor_filtered.parquet``` for data_hh_rlhf***
+#### 2) Download Dataset
+
+```bash
+cd implementations/preference_alignment
+gcloud storage cp gs://interp-bootcamp-data/preference-alignment/data.zip .
+unzip data.zip
+```
+
+Files are placed correctly after extraction — no manual reorganisation needed.
+
+#### 3) Cleanup temporary files
+
+```bash
+rm -rf __MACOSX data.zip data/.DS_Store
+```
 
-After downloading, place the ```.parquet``` file inside one of the following folders (create the folder if it does not exist):
-```data_sky/``` or
-```data_hh_rlhf/```
-Then proceed with:
+> **Note:** Use `train_sponsor_filtered.parquet` (for `data_sky`) and `train_singleturn_sponsor_filtered.parquet` (for `data_hh_rlhf`).
 
-```01_dataset_construction.ipynb```
+Then proceed with `01_dataset_construction.ipynb`.
 
 ## Using Your Own Dataset
 
````

(Note: as in the multimedia RAG README, the cleanup command uses `rm -rf` because `__MACOSX` is a directory.)

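The note in the hunk above pairs each data folder with a specific filtered file. That pairing can be encoded as a small guard so a notebook never loads `train_raw.parquet` by accident (the `filtered_parquet_path` helper is hypothetical; the filenames are the ones named in the note):

```python
from pathlib import Path

FILTERED_PARQUET = {
    "data_sky": "train_sponsor_filtered.parquet",
    "data_hh_rlhf": "train_singleturn_sponsor_filtered.parquet",
}

def filtered_parquet_path(data_dir: str) -> Path:
    # Resolve the expected filtered parquet for a data folder; reject unknown folders.
    name = Path(data_dir).name
    if name not in FILTERED_PARQUET:
        raise ValueError(f"unknown data dir: {name!r}; expected one of {sorted(FILTERED_PARQUET)}")
    return Path(data_dir) / FILTERED_PARQUET[name]
```

For example, `filtered_parquet_path("data_sky")` resolves to `data_sky/train_sponsor_filtered.parquet`.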
````diff
@@ -196,7 +214,7 @@ source .venv/bin/activate
 `flash-attn` requires CUDA headers and `setuptools` at compile time and cannot be installed via `uv sync`. After activating the venv, install it manually:
 
 ```bash
-pip install flash-attn==2.7.3 --no-build-isolation
+uv pip install flash-attn==2.7.3 --no-build-isolation
 ```
 
 > **Note:** This step requires a GPU node with CUDA available. Skip it if you are running on a CPU-only machine.
````
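Since the flash-attn step may be skipped on CPU-only machines, downstream code can branch on availability rather than assume the import succeeds. A sketch (the helper name is hypothetical; the `attn_implementation` strings follow the HuggingFace `transformers` convention):

```python
import importlib.util

def flash_attn_available() -> bool:
    # True if the flash_attn package is importable in the current environment.
    return importlib.util.find_spec("flash_attn") is not None

# Pick an attention backend accordingly (transformers-style identifiers).
attn_implementation = "flash_attention_2" if flash_attn_available() else "eager"
```

A model load could then pass `attn_implementation=attn_implementation`, falling back gracefully on CPU-only nodes.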
