
Commit 10d35bf

babblebey and Copilot authored
feat(dev): incremental update mechanism for vector store (#283)
## Description

This pull request introduces an incremental update system for the ✨jAI vector store, allowing dictionary changes to be efficiently propagated to Qdrant without a full reseed. The update process is automated via a new GitHub Actions workflow and can be triggered manually or on production deployments tied to specific PR labels. The documentation and scripts have been refactored for clarity and maintainability, and the seeding process now attaches metadata for future incremental updates.

**Incremental Vector Store Update System**

* Added a new GitHub Actions workflow (`.github/workflows/update-vector-store.yml`) to automate incremental updates to the Qdrant vector store. It triggers on production deployments or manual dispatch, gates on PR labels, detects dictionary file changes, and runs the update script only when necessary.
* Introduced `dev/update-vector-store.js`, a script that upserts or deletes only the changed dictionary words in Qdrant. It uses CLI arguments for slugs, fetches live API data, deletes old chunks by `metadata.slug`, splits content, and updates the vector store with robust error handling.
* Updated `package.json` to add `update:jai` and `update:jai:ci` npm scripts for local and CI/CD usage of the incremental update script.

**Seeding and Documentation Improvements**

* Refactored the seeding script (`dev/seed-vector-store.js`) to create LangChain `Document` objects directly from API data, attach `metadata.slug` for all words, and remove file system dependencies, ensuring compatibility with incremental updates.
* Overhauled `dev/README.md` to document the new incremental update workflow, CLI usage, error handling, and example outputs. The seed and update processes are clearly differentiated, and the vector store requirements for incremental updates are explained.

## Related Issue

Fixes #196

## Screenshots/Screencasts

NA

## Notes to Reviewer

Add the new env entries to GitHub secrets and variables under Actions:
- `OPENAI_API_KEY` - secret
- `OPENAI_EMBEDDINGS_MODEL` - variable

---------

Co-authored-by: Copilot <[email protected]>
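The `--upsert`/`--delete` CLI surface described above could be parsed along these lines. This is a hypothetical sketch, not the actual `dev/update-vector-store.js` code, and the helper name `parseSlugArgs` is invented for illustration:

```javascript
// Hypothetical sketch of the --upsert/--delete CLI parsing described in this PR;
// the real dev/update-vector-store.js may differ in structure and error messages.
function parseSlugArgs(argv) {
  const result = { upsert: [], delete: [] };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === "--upsert" || flag === "--delete") {
      const value = argv[i + 1];
      if (!value || value.startsWith("--")) {
        throw new Error(`Missing value for ${flag}`);
      }
      // Comma-separated slugs, e.g. "api,closure"
      result[flag.slice(2)] = value
        .split(",")
        .map((s) => s.trim())
        .filter(Boolean);
      i++; // skip the consumed value
    } else {
      throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return result;
}

// Example: npm run update:jai -- --upsert api,closure --delete old-term
const args = parseSlugArgs(["--upsert", "api,closure", "--delete", "old-term"]);
console.log(args); // { upsert: [ 'api', 'closure' ], delete: [ 'old-term' ] }
```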
1 parent 4548f87 commit 10d35bf

File tree

5 files changed

+533
-35
lines changed

.github/workflows/update-vector-store.yml

Lines changed: 170 additions & 0 deletions

```yaml
name: Update Vector Store (Qdrant)

on:
  deployment_status:

  # Allow manual triggering with custom slugs
  workflow_dispatch:
    inputs:
      upsert_slugs:
        description: "Comma-separated slugs to upsert (e.g. api,closure)"
        required: false
      delete_slugs:
        description: "Comma-separated slugs to delete (e.g. old-term)"
        required: false

jobs:
  update-vector-store:
    runs-on: ubuntu-latest

    # Only run on successful production deployments (or manual trigger)
    if: >
      github.event_name == 'workflow_dispatch' ||
      (
        github.event.deployment_status.state == 'success' &&
        github.event.deployment.environment == 'Production'
      )

    steps:
      # ── Gate: Check that the merged PR has a dictionary label ──────────
      - name: Check PR labels
        if: github.event_name != 'workflow_dispatch'
        id: pr-check
        uses: actions/github-script@v7
        with:
          script: |
            const sha = context.payload.deployment.sha;

            // Find PRs associated with this deployment commit
            const { data: prs } = await github.rest.repos.listPullRequestsAssociatedWithCommit({
              owner: context.repo.owner,
              repo: context.repo.repo,
              commit_sha: sha,
            });

            // Find the merged PR targeting main
            const mergedPR = prs.find(pr => pr.merged_at && pr.base.ref === 'main');

            if (!mergedPR) {
              core.info('No merged PR found for this deployment. Skipping.');
              core.setOutput('should_continue', 'false');
              return;
            }

            const labels = mergedPR.labels.map(l => l.name);
            core.info(`PR #${mergedPR.number}: ${mergedPR.title}`);
            core.info(`Labels: ${labels.join(', ')}`);

            const requiredLabels = ['📖edit-word', '📖new-word'];
            const hasRequiredLabel = labels.some(l => requiredLabels.includes(l));

            if (!hasRequiredLabel) {
              core.info(`PR does not have required labels (${requiredLabels.join(', ')}). Skipping.`);
              core.setOutput('should_continue', 'false');
              return;
            }

            core.info('✅ PR has required label. Proceeding with update.');
            core.setOutput('should_continue', 'true');

      - name: Skip — PR lacks required labels
        if: github.event_name != 'workflow_dispatch' && steps.pr-check.outputs.should_continue != 'true'
        run: |
          echo "⏭️ Skipping: deployment is not from a 📖new-word or 📖edit-word PR."

      # ── Checkout & detect changed dictionary files ─────────────────────
      - name: Checkout repository
        if: github.event_name == 'workflow_dispatch' || steps.pr-check.outputs.should_continue == 'true'
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.deployment.sha || github.sha }}
          fetch-depth: 2

      - name: Detect changed dictionary files
        if: github.event_name == 'workflow_dispatch' || steps.pr-check.outputs.should_continue == 'true'
        id: detect
        run: |
          # For manual triggers, use the provided inputs directly
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            UPSERT="${{ github.event.inputs.upsert_slugs }}"
            DELETE="${{ github.event.inputs.delete_slugs }}"

            if [ -z "$UPSERT" ] && [ -z "$DELETE" ]; then
              echo "has_changes=false" >> "$GITHUB_OUTPUT"
              echo "No slugs provided for manual trigger."
            else
              echo "upsert=$UPSERT" >> "$GITHUB_OUTPUT"
              echo "delete=$DELETE" >> "$GITHUB_OUTPUT"
              echo "has_changes=true" >> "$GITHUB_OUTPUT"
            fi
            exit 0
          fi

          # For deployment triggers, diff against the parent commit
          echo "Detecting dictionary file changes..."
          UPSERT_SLUGS=""
          DELETE_SLUGS=""

          while IFS=$'\t' read -r status file; do
            if [[ "$file" == src/content/dictionary/*.mdx ]]; then
              slug=$(basename "$file" .mdx)

              if [[ "$status" == "D" ]]; then
                DELETE_SLUGS="${DELETE_SLUGS:+$DELETE_SLUGS,}$slug"
              else
                UPSERT_SLUGS="${UPSERT_SLUGS:+$UPSERT_SLUGS,}$slug"
              fi
            fi
          done < <(git diff --name-status HEAD~1 -- src/content/dictionary/)

          if [ -z "$UPSERT_SLUGS" ] && [ -z "$DELETE_SLUGS" ]; then
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
            echo "No dictionary file changes detected. Skipping update."
          else
            echo "upsert=$UPSERT_SLUGS" >> "$GITHUB_OUTPUT"
            echo "delete=$DELETE_SLUGS" >> "$GITHUB_OUTPUT"
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
            echo "Upsert slugs: $UPSERT_SLUGS"
            echo "Delete slugs: $DELETE_SLUGS"
          fi

      - name: Skip — no dictionary changes
        if: steps.detect.outputs.has_changes != 'true' && (github.event_name == 'workflow_dispatch' || steps.pr-check.outputs.should_continue == 'true')
        run: echo "⏭️ No dictionary changes to process. Skipping."

      # ── Run the incremental update ─────────────────────────────────────
      - name: Setup Node.js
        if: steps.detect.outputs.has_changes == 'true'
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      - name: Install dependencies
        if: steps.detect.outputs.has_changes == 'true'
        run: npm ci

      - name: Update vector store
        if: steps.detect.outputs.has_changes == 'true'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OPENAI_EMBEDDINGS_MODEL: ${{ vars.OPENAI_EMBEDDINGS_MODEL }}
          QDRANT_URL: ${{ secrets.QDRANT_URL }}
          QDRANT_API_KEY: ${{ secrets.QDRANT_API_KEY }}
        run: |
          ARGS=""

          if [ -n "${{ steps.detect.outputs.upsert }}" ]; then
            ARGS="$ARGS --upsert ${{ steps.detect.outputs.upsert }}"
          fi

          if [ -n "${{ steps.detect.outputs.delete }}" ]; then
            ARGS="$ARGS --delete ${{ steps.detect.outputs.delete }}"
          fi

          echo "Running: npm run update:jai:ci -- $ARGS"
          npm run update:jai:ci -- $ARGS

      - name: Update successful
        if: steps.detect.outputs.has_changes == 'true'
        run: echo "✅ Vector store update completed successfully"
```
dev/README.md

Lines changed: 128 additions & 12 deletions
```diff
@@ -40,7 +40,6 @@ Before running this script, ensure you have:
 - All dependencies installed (`npm ci`)
 - `OPENAI_API_KEY`, `QDRANT_URL` and `QDRANT_API_KEY` environment variables properly configured in your `.env` file
 - Network access to fetch from jargons.dev API
-- Sufficient disk space for temporary dictionary file
 
 ### Usage
 
```
```diff
@@ -53,19 +52,17 @@ npm run seed:jai
 The script performs these steps to prepare ✨jAI's knowledge base:
 
 1. **Data Fetching**: Downloads the complete dictionary from `https://jargons.dev/api/v1/browse`
-2. **File Processing**: Saves data locally and loads it using LangChain's JSONLoader
-3. **Document Splitting**: Breaks content into optimally-sized chunks (1000 chars with 200 overlap)
+2. **Document Creation**: Creates LangChain `Document` objects directly from the API response, attaching `slug` metadata to each word for future incremental updates
+3. **Document Splitting**: Breaks content into optimally-sized chunks (1000 chars with 200 overlap), preserving the slug metadata on every chunk
 4. **Vector Store Population**: Adds processed documents to ✨jAI's vector store in batches of 100
-5. **Cleanup**: Removes temporary files and provides completion summary
 
 ### Technical Implementation
 
 The script leverages several key technologies:
 
-- **LangChain JSONLoader**: Extracts title and content fields from dictionary entries
+- **LangChain Document**: Creates documents directly from API data with `metadata.slug` for traceability
 - **RecursiveCharacterTextSplitter**: Intelligently splits text while preserving context
 - **Batch Processing**: Prevents memory issues and provides progress feedback
-- **File System Operations**: Handles temporary file creation and cleanup
 
 ### Configuration Options
 
```
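The 1000-char / 200-overlap splitting mentioned in the seed steps works on a sliding window: each chunk is at most 1000 characters and overlaps the previous one by roughly 200. A naive fixed-size analog of that arithmetic (LangChain's `RecursiveCharacterTextSplitter` is smarter, preferring to break on paragraph and sentence separators, but the window math is the same idea):

```javascript
// Naive fixed-window analog of the 1000-char / 200-overlap splitting used by
// the seed script. The real RecursiveCharacterTextSplitter breaks on natural
// separators before falling back to hard cuts; this sketch only shows the
// window arithmetic.
function naiveSplit(text, chunkSize = 1000, chunkOverlap = 200) {
  const chunks = [];
  const step = chunkSize - chunkOverlap; // advance 800 chars per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = naiveSplit("x".repeat(2500));
console.log(chunks.length); // 3
console.log(chunks.map((c) => c.length).join(",")); // 1000,1000,900
```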
````diff
@@ -85,25 +82,144 @@ Key parameters that can be adjusted:
 
 The script includes robust error handling for:
 - Network connectivity issues during API calls
-- File system errors during temporary file operations
 - Vector store connection problems
 - Memory management during large batch processing
 
 ### Example Output
 
 ```
-Saved the dictionary file to /path/to/dev/dictionary.json
-Loaded 500 documents
-Split 1250 documents
+Fetched 500 words from the API
+Created 500 documents
+Split into 1250 chunks
 Added batch 1 of 13 (100 documents) to the vector store
 Added batch 2 of 13 (100 documents) to the vector store
 ...
 Added 1250 splits to the vector store
-Cleaned up the dictionary file at /path/to/dev/dictionary.json
 ```
 
 Once completed, ✨jAI will have access to the processed dictionary content and can provide intelligent responses about software engineering terms.
 
+> **Note:** After running a full seed, all vector points will include `metadata.slug`, which is required for incremental updates via the [Update Vector Store Script](#update-vector-store-script) to work correctly.
+
+## Update Vector Store Script
+
+This script performs **incremental updates** to ✨jAI's vector store when dictionary words are added, modified, or removed. Instead of re-seeding the entire collection, it targets only the changed words — making it fast and efficient for CI/CD use after new words are merged.
+
+### When to Use
+
+This script is primarily run automatically via the **Update Vector Store** GitHub Actions workflow when a new word PR is merged and the Vercel production deployment succeeds. You can also run it manually when you need to:
+- Add or update specific words in the vector store
+- Remove deleted words from the vector store
+- Fix vector store entries for particular terms
+
+### Prerequisites
+
+Before running this script, ensure you have:
+- All dependencies installed (`npm ci`)
+- `OPENAI_API_KEY`, `OPENAI_EMBEDDINGS_MODEL`, `QDRANT_URL` and `QDRANT_API_KEY` environment variables properly configured in your `.env` file
+- Network access to fetch from the jargons.dev production API
+- The vector store has been initially seeded with `metadata.slug` on all points (via `npm run seed:jai`)
+
+### Usage
+
+**Local Development:**
+```bash
+npm run update:jai -- --upsert slug1,slug2 --delete slug3,slug4
+```
+
+**CI/CD (without .env file):**
+```bash
+npm run update:jai:ci -- --upsert slug1,slug2 --delete slug3
+```
+
+### Flags
+
+- `--upsert <slugs>` — Comma-separated slugs of words to add or update. For each slug, the script deletes any existing chunks in Qdrant (by `metadata.slug` filter), fetches the latest content from the production API, splits it into chunks, and adds them to the vector store.
+- `--delete <slugs>` — Comma-separated slugs of words to remove. Deletes all chunks matching the slug from Qdrant.
+
+Both flags are optional, but at least one must be provided for the script to do anything.
+
+### How It Works
+
+The script performs these steps for each word:
+
+**For upserts (add/update):**
+1. **Delete Old Chunks**: Removes existing vector points matching `metadata.slug` via a Qdrant filter
+2. **Fetch Latest Content**: Downloads the word from `https://jargons.dev/api/v1/browse/{slug}`
+3. **Create Document**: Builds a LangChain `Document` with `metadata.slug` for traceability
+4. **Split into Chunks**: Breaks content into optimally-sized chunks (1000 chars with 200 overlap)
+5. **Add to Vector Store**: Upserts the new chunks into Qdrant
+
+**For deletes:**
+1. **Delete Chunks**: Removes all vector points matching `metadata.slug` via a Qdrant filter
+
+### Technical Implementation
+
+The script leverages several key technologies:
+
+- **LangChain Document**: Creates documents with `metadata.slug` for targeted updates
+- **Qdrant Filter-based Deletion**: Uses `vectorStore.delete({ filter })` with a `metadata.slug` match condition to precisely target existing chunks for a word
+- **RecursiveCharacterTextSplitter**: Same chunking config as the seed script (1000/200) for consistency
+- **Production API**: Fetches from the deployed site to ensure the vector store matches the live content
+
+### Configuration Options
+
+Required environment variables:
+
+- **QDRANT_URL**: Your Qdrant cluster endpoint (e.g., `https://your-cluster.gcp.cloud.qdrant.io`)
+- **QDRANT_API_KEY**: Your Qdrant cluster API key for authentication
+- **OPENAI_API_KEY**: Your OpenAI API Key for generating embeddings
+- **OPENAI_EMBEDDINGS_MODEL**: The embeddings model to use (e.g., `text-embedding-3-small`)
+
+### Automated via GitHub Actions
+
+The **Update Vector Store** workflow (`.github/workflows/update-vector-store.yml`) runs this script automatically:
+
+- **Trigger**: Fires on `deployment_status` events — specifically when Vercel reports a successful **Production** deployment
+- **PR Label Gate**: Uses the GitHub API to find the merged PR associated with the deployment commit and checks for the `📖new-word` or `📖edit-word` labels. Deployments from PRs without these labels are skipped early (before any Node.js setup or dependency installation)
+- **Change Detection**: Diffs `HEAD~1` to identify added, modified, or deleted `.mdx` files in `src/content/dictionary/`
+- **Skip Logic**: Exits early if no dictionary files were changed in the commit
+- **Manual Trigger**: Can also be run manually from the GitHub Actions tab with custom `upsert_slugs` and `delete_slugs` inputs (bypasses the label check)
+- **Required Secrets**: `OPENAI_API_KEY`, `QDRANT_URL`, `QDRANT_API_KEY`
+- **Required Variables**: `OPENAI_EMBEDDINGS_MODEL`
+
+### Error Handling
+
+The script includes robust error handling for:
+- Unknown flags or flags missing required values (prints an error with usage instructions and exits with code 1)
+- No slugs provided (prints usage and exits gracefully with code 0)
+- Words not found on the production API (404 — warns and continues with remaining slugs)
+- Network connectivity issues
+- Vector store connection and deletion failures
+- Per-word error isolation (one failing slug doesn't block the others)
+- Non-zero exit code if any operation fails
+
+### Example Output
+
+```
+🚀 Starting incremental vector store update...
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+📝 Words to upsert: api, closure
+🗑️ Words to delete: old-term
+
+🔄 Processing upsert for "api"...
+Deleting old chunks for "api"...
+Split into 3 chunk(s).
+✅ Upserted "api" (3 chunks)
+
+🔄 Processing upsert for "closure"...
+Deleting old chunks for "closure"...
+Split into 2 chunk(s).
+✅ Upserted "closure" (2 chunks)
+
+🗑️ Deleting "old-term" from vector store...
+✅ Deleted "old-term"
+
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+✨ Done! Upserted: 2, Deleted: 1, Failed: 0
+🎉 Vector store update completed successfully!
+```
+
 ## Vector Store Cluster Ping Script
 
 This script performs a lightweight health check on the Vector Store (Qdrant) cluster to keep it active and prevent automatic deletion due to inactivity. It's designed to be run both locally for testing and automatically via GitHub Actions.
````
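The "batch 1 of 13" lines in the seed example output come from uploading chunks to the vector store 100 at a time. The batching arithmetic can be sketched as follows (hypothetical helper name; the real script's loop may be structured differently):

```javascript
// Split an array of chunks into upload batches of `size`. The seed script adds
// documents to the vector store 100 at a time, so 1250 chunks yield 13 batches
// (twelve full batches of 100 plus a final batch of 50).
function toBatches(items, size = 100) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const batches = toBatches(new Array(1250).fill(null));
console.log(batches.length); // 13
console.log(batches[12].length); // 50
```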
```diff
@@ -164,7 +280,7 @@ Required environment variables:
 ### Automated Scheduling
 
 The script is automatically run via GitHub Actions:
-- **Schedule**: Every Sunday at 2 AM UTC
+- **Schedule**: Every Sunday and Wednesday at midnight UTC
 - **Manual Trigger**: Can be run manually from GitHub Actions tab
 - **Purpose**: Prevents cluster deletion due to inactivity
 
```
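The filter-based deletion that the update script relies on (per the README's Technical Implementation notes) uses Qdrant's standard match-condition shape. A sketch of building that filter; it assumes chunk metadata is stored under a `metadata.slug` payload key, as the seed script does, and the helper name `slugFilter` is invented for illustration:

```javascript
// Build the Qdrant filter used to target every chunk belonging to one word.
// Assumes chunks were stored with the slug under the `metadata.slug` payload
// key; `slugFilter` is an illustrative helper name, not from the actual script.
function slugFilter(slug) {
  return {
    must: [
      {
        key: "metadata.slug",
        match: { value: slug },
      },
    ],
  };
}

// Used roughly like: await vectorStore.delete({ filter: slugFilter("api") });
console.log(JSON.stringify(slugFilter("api")));
// {"must":[{"key":"metadata.slug","match":{"value":"api"}}]}
```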