Commit 0ae5b70: init (0 parents)

File tree: 16 files changed, +1299 −0 lines

.dockerignore

Lines changed: 6 additions & 0 deletions
```
node_modules
dist
*.log
.git
.gitignore
README.md
```

.github/workflows/ci.yml

Lines changed: 30 additions & 0 deletions
```yaml
name: CI

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  typecheck:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
        with:
          bun-version: latest

      - name: Install dependencies
        run: bun install

      - name: Type check
        run: bun run tsc --noEmit

      - name: Build
        run: bun build src/index.ts --outdir dist
```

.github/workflows/docker.yml

Lines changed: 60 additions & 0 deletions
```yaml
name: Build and Push Docker Image

on:
  push:
    branches:
      - main
    tags:
      - 'v*'
  pull_request:
    branches:
      - main

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata for Docker
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix=
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

.gitignore

Lines changed: 5 additions & 0 deletions
```
node_modules
dist
bun.lockb
*.log
.env
```

Dockerfile

Lines changed: 18 additions & 0 deletions
```dockerfile
# Lightweight Bun runtime image
FROM oven/bun:1-alpine

WORKDIR /app

# Copy package files
COPY package.json bun.lockb* ./

# Install dependencies
RUN bun install --frozen-lockfile --production

# Copy source code
COPY src ./src
COPY tsconfig.json ./

# Default command expects config.json to be mounted
ENTRYPOINT ["bun", "run", "src/index.ts"]
CMD ["config.json"]
```

LICENSE

Lines changed: 21 additions & 0 deletions
```
MIT License

Copyright (c) 2024 Health Samurai

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

README.md

Lines changed: 236 additions & 0 deletions
# Meilisearch Docs Scraper

A fast, lightweight documentation scraper for [Meilisearch](https://www.meilisearch.com/) built with [Bun](https://bun.sh/).
## Why not use the official docs-scraper?

The official [meilisearch/docs-scraper](https://github.com/meilisearch/docs-scraper) is a great tool, but it has some limitations:

|                   | docs-scraper                      | meilisearch-docs-scraper      |
| ----------------- | --------------------------------- | ----------------------------- |
| Runtime           | Python + Scrapy + Chromium        | Bun (single binary)           |
| Docker image      | ~1GB                              | **~150MB**                    |
| JS rendering      | Yes (Chromium)                    | No                            |
| Speed (500 pages) | ~3-5 min                          | **~1 min**                    |
| Zero-downtime     | No (overwrites index)             | **Yes (atomic index swap)**   |
| Deps count        | ~50+ (Chromium, OpenSSL, libxml2) | **3 (linkedom, meilisearch)** |
**Use this scraper if:**

- Your documentation is server-side rendered (SSR)
- You don't need JavaScript rendering
- You want a smaller, faster Docker image
- You want to minimize security vulnerabilities
- You need zero-downtime reindexing (atomic index swap)

**Use the official docs-scraper if:**

- Your documentation requires JavaScript to render (SPA)
- You need advanced authentication (Cloudflare, IAP, Keycloak)
## Features

- **100% compatible** with docs-scraper config format
- **Multi-config support** — index multiple sites in one run
- **Sitemap-based** URL discovery
- **CSS selector-based** content extraction
- **Hierarchical heading structure** (lvl0-lvl6) with anchor links
- **Zero-downtime reindexing** with atomic index swapping
- **Concurrent scraping** (10 pages in parallel)
- **Batch indexing** (100 documents per batch)
- **Stop URLs** filtering
- **Proper error handling** with task waiting (no `sleep()` hacks)
- **Automatic cleanup** of failed previous runs
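The batch indexing and task waiting listed above can be sketched in TypeScript. This is a hedged illustration, not the project's actual source: `chunk` and `indexInBatches` are hypothetical names, and the real scraper would pass a callback wrapping the Meilisearch client's document-add and task-wait calls.

```typescript
// Sketch of batch indexing with task waiting (hypothetical helpers,
// not this repository's actual code).

type Doc = { objectID: string; content: string };

// Split an array into fixed-size batches (100 in this project).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Enqueue each batch and await its completion instead of sleeping.
// `addBatch` would wrap the client's add-documents call plus task waiting.
async function indexInBatches(
  docs: Doc[],
  addBatch: (batch: Doc[]) => Promise<void>,
): Promise<number> {
  const batches = chunk(docs, 100);
  for (const batch of batches) {
    await batch.length && (await addBatch(batch));
  }
  return batches.length;
}
```

Waiting on each batch's task is what lets the scraper surface indexing errors immediately rather than papering over them with `sleep()`.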
## Quick Start

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://host.docker.internal:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -v $(pwd)/config.json:/app/config.json \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
```
## Usage

### Docker (recommended)

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://meilisearch:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -e INDEX_NAME=docs \
  -v $(pwd)/config.json:/app/config.json \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
```
### Environment Variables

| Variable               | Required | Description                     |
| ---------------------- | -------- | ------------------------------- |
| `MEILISEARCH_HOST_URL` | Yes      | Meilisearch server URL          |
| `MEILISEARCH_API_KEY`  | Yes      | Meilisearch API key             |
| `INDEX_NAME`           | No       | Override index name from config |
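Required variables like these are typically validated once at startup. A minimal sketch, assuming a hypothetical `readEnv` helper (not the project's actual code):

```typescript
// Hypothetical startup validation for the environment variables above.
interface ScraperEnv {
  hostUrl: string;
  apiKey: string;
  indexName?: string; // optional index-name override
}

function readEnv(env: Record<string, string | undefined>): ScraperEnv {
  const hostUrl = env.MEILISEARCH_HOST_URL;
  const apiKey = env.MEILISEARCH_API_KEY;
  if (!hostUrl) throw new Error("MEILISEARCH_HOST_URL is required");
  if (!apiKey) throw new Error("MEILISEARCH_API_KEY is required");
  return { hostUrl, apiKey, indexName: env.INDEX_NAME };
}
```

Failing fast here keeps a misconfigured CronJob run from silently scraping and discarding results.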
### Kubernetes CronJob

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: meilisearch-reindex
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scraper
              image: ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
              env:
                - name: MEILISEARCH_HOST_URL
                  value: "http://meilisearch:7700"
                - name: MEILISEARCH_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: meilisearch-secret
                      key: api-key
                - name: INDEX_NAME
                  value: "docs"
              volumeMounts:
                - name: config
                  mountPath: /app/config.json
                  subPath: config.json
          volumes:
            - name: config
              configMap:
                name: scraper-config
```
### Multiple Configs (Single Job)

Index multiple sites in one run — useful for reducing k8s jobs:

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://meilisearch:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -v $(pwd)/configs:/configs \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest \
  /configs/docs.json /configs/fhirbase.json /configs/auditbox.json
```

Each config creates its own index (from `index_uid` in config).
### Run from Source

```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash

# Clone and install
git clone https://github.com/HealthSamurai/meilisearch-docs-scraper.git
cd meilisearch-docs-scraper
bun install

# Single config
MEILISEARCH_HOST_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=your-api-key \
bun run src/index.ts config.json

# Multiple configs
MEILISEARCH_HOST_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=your-api-key \
bun run src/index.ts docs.json fhirbase.json auditbox.json
```
## Configuration

Uses the same config format as the official docs-scraper:

```json
{
  "index_uid": "docs",
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "start_urls": ["https://example.com/docs/"],
  "stop_urls": ["https://example.com/docs/deprecated"],
  "selectors": {
    "lvl0": {
      "selector": "nav li:last-child",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "article h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "text": "article p, article li, article td"
  },
  "custom_settings": {
    "searchableAttributes": [
      "hierarchy_lvl1",
      "hierarchy_lvl2",
      "hierarchy_lvl3",
      "content"
    ],
    "rankingRules": [
      "attribute",
      "words",
      "typo",
      "proximity",
      "sort",
      "exactness"
    ]
  }
}
```
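To illustrate how lvl0-lvl6 selector matches turn into docs-scraper-style records, here is a hedged sketch. `buildRecords` and `Match` are hypothetical names; the real extraction walks the DOM via linkedom using the CSS selectors above, but the folding logic it needs looks roughly like this:

```typescript
// Hypothetical sketch: fold an ordered stream of matched elements into
// docs-scraper-style records carrying the current heading path.
type Match = { level: number; text: string }; // 0-6 for lvlN matches, 7 for "text"

type DocRecord = {
  hierarchy: (string | null)[]; // hierarchy_lvl0 .. hierarchy_lvl6
  content: string | null;
};

function buildRecords(matches: Match[], defaultLvl0 = "Documentation"): DocRecord[] {
  // lvl0 often comes from a global selector with a default value.
  const hierarchy: (string | null)[] = [defaultLvl0, null, null, null, null, null, null];
  const records: DocRecord[] = [];
  for (const m of matches) {
    if (m.level <= 6) {
      hierarchy[m.level] = m.text;
      // A new heading invalidates everything nested deeper under the old one.
      for (let i = m.level + 1; i <= 6; i++) hierarchy[i] = null;
      records.push({ hierarchy: [...hierarchy], content: null });
    } else {
      // Body text inherits the heading path it appears under.
      records.push({ hierarchy: [...hierarchy], content: m.text });
    }
  }
  return records;
}
```

This heading-path-per-record shape is what makes hierarchical search results (section breadcrumbs) possible on the Meilisearch side.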
## How It Works

1. **Fetch sitemap** — Parses sitemap.xml to get all documentation URLs
2. **Filter URLs** — Excludes URLs matching `stop_urls` patterns
3. **Scrape pages** — Fetches pages concurrently (10 at a time) and extracts content
4. **Build hierarchy** — Tracks heading levels (h1→h6) with anchor links for deep linking
5. **Index to temp** — Creates `{index}_temp` and pushes documents in batches of 100
6. **Atomic swap** — Swaps temp with production index (zero downtime)
7. **Cleanup** — Deletes the old temp index
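Step 3's bounded concurrency can be sketched as a small worker pool. `mapConcurrent` is a hypothetical helper, not the project's actual code: a few workers pull from a shared cursor so at most `limit` fetches are in flight, while output order matches input order.

```typescript
// Hypothetical worker-pool sketch for step 3: process items with at most
// `limit` tasks in flight, preserving input order in the output array.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

With `limit` set to 10 and `fn` doing the page fetch and extraction, this gives the "10 pages in parallel" behavior without flooding the target site.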
### Zero-Downtime Index Swap

```
                 ┌─────────────┐
Scrape pages ───►│  docs_temp  │ (new data)
                 └──────┬──────┘
                        │ swap
                 ┌──────▼──────┐
Search works ───►│    docs     │ (now has new data)
                 └─────────────┘
                        │ delete
                 ┌──────▼──────┐
                 │  docs_temp  │ ✗ deleted
                 └─────────────┘
```

Search remains available throughout the entire reindex process.
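The three-step flow above can be sketched against a narrow client interface. This is an illustration, not the project's source: `IndexClient` mirrors an assumed subset of the Meilisearch client API (Meilisearch does expose an index-swap endpoint, but the exact calls here are stand-ins).

```typescript
// Sketch of the zero-downtime flow: write to a temp index, atomically
// swap it with production, then drop the temp (which now holds old data).
// IndexClient is a hypothetical subset of a Meilisearch-style client.
interface IndexClient {
  addDocuments(index: string, docs: object[]): Promise<void>;
  swapIndexes(a: string, b: string): Promise<void>;
  deleteIndex(index: string): Promise<void>;
}

async function reindexAtomically(
  client: IndexClient,
  index: string,
  docs: object[],
): Promise<void> {
  const temp = `${index}_temp`;
  await client.addDocuments(temp, docs); // 1. build the new index off to the side
  await client.swapIndexes(temp, index); // 2. atomic swap: search never sees a gap
  await client.deleteIndex(temp);        // 3. temp now holds the old data; drop it
}
```

Because searches always hit the production index name, readers see either the complete old dataset or the complete new one, never a half-built index.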
## Development

```bash
# Run locally
bun run src/index.ts config.json

# Type check
bun run tsc --noEmit

# Build
bun build src/index.ts --outdir dist
```
## License

MIT
