# Meilisearch Docs Scraper

A fast, lightweight documentation scraper for [Meilisearch](https://www.meilisearch.com/) built with [Bun](https://bun.sh/).

## Why not use the official docs-scraper?

The official [meilisearch/docs-scraper](https://github.com/meilisearch/docs-scraper) is a great tool, but it has some limitations:

|                   | docs-scraper                      | meilisearch-docs-scraper      |
| ----------------- | --------------------------------- | ----------------------------- |
| Runtime           | Python + Scrapy + Chromium        | Bun (single binary)           |
| Docker image      | ~1GB                              | **~150MB**                    |
| JS rendering      | Yes (Chromium)                    | No                            |
| Speed (500 pages) | ~3-5 min                          | **~1 min**                    |
| Zero-downtime     | No (overwrites index)             | **Yes (atomic index swap)**   |
| Deps count        | ~50+ (Chromium, OpenSSL, libxml2) | **3 (linkedom, meilisearch)** |

**Use this scraper if:**

- Your documentation is server-side rendered (SSR)
- You don't need JavaScript rendering
- You want a smaller, faster Docker image
- You want to minimize security vulnerabilities
- You need zero-downtime reindexing (atomic index swap)

**Use the official docs-scraper if:**

- Your documentation requires JavaScript to render (SPA)
- You need advanced authentication (Cloudflare, IAP, Keycloak)

## Features

- **100% compatible** with the docs-scraper config format
- **Multi-config support** — index multiple sites in one run
- **Sitemap-based** URL discovery
- **CSS selector-based** content extraction
- **Hierarchical heading structure** (lvl0-lvl6) with anchor links
- **Zero-downtime reindexing** with atomic index swapping
- **Concurrent scraping** (10 pages in parallel)
- **Batch indexing** (100 documents per batch)
- **Stop URLs** filtering
- **Proper error handling** with task waiting (no `sleep()` hacks)
- **Automatic cleanup** of failed previous runs
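The hierarchical records these features describe can be sketched in TypeScript. Field names mirror the `hierarchy_lvl*` attributes used in the config format; the exact record shape this scraper emits is an assumption, and `makeRecord` is a hypothetical helper:

```typescript
// Sketch of a docs-scraper-style search record (shape is an assumption).
type DocRecord = {
  url: string;
  anchor: string | null;
  content: string;
  hierarchy_lvl0: string | null;
  hierarchy_lvl1: string | null;
  hierarchy_lvl2: string | null;
  hierarchy_lvl3: string | null;
};

// Hypothetical helper: build one record from the current heading trail
// ([lvl0, lvl1, lvl2, lvl3]) plus the text content under the last heading.
function makeRecord(
  url: string,
  trail: (string | null)[],
  content: string,
  anchor: string | null = null,
): DocRecord {
  return {
    url: anchor ? `${url}#${anchor}` : url, // anchor enables deep linking
    anchor,
    content,
    hierarchy_lvl0: trail[0] ?? null,
    hierarchy_lvl1: trail[1] ?? null,
    hierarchy_lvl2: trail[2] ?? null,
    hierarchy_lvl3: trail[3] ?? null,
  };
}
```

Keeping the full heading trail on every record is what lets search results show "Documentation > Install > Docker"-style breadcrumbs.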
## Quick Start

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://host.docker.internal:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -v $(pwd)/config.json:/app/config.json \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
```

## Usage

### Docker (recommended)

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://meilisearch:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -e INDEX_NAME=docs \
  -v $(pwd)/config.json:/app/config.json \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
```

### Environment Variables

| Variable               | Required | Description                     |
| ---------------------- | -------- | ------------------------------- |
| `MEILISEARCH_HOST_URL` | Yes      | Meilisearch server URL          |
| `MEILISEARCH_API_KEY`  | Yes      | Meilisearch API key             |
| `INDEX_NAME`           | No       | Override index name from config |
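The `INDEX_NAME` override can be illustrated with a small helper (hypothetical name; the scraper's actual resolution logic may differ):

```typescript
// Hypothetical helper: a set, non-empty INDEX_NAME env var overrides the
// index_uid from the config file; otherwise the config value wins.
function resolveIndexName(
  env: Record<string, string | undefined>,
  configIndexUid: string,
): string {
  const override = env.INDEX_NAME?.trim();
  return override ? override : configIndexUid;
}
```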
### Kubernetes CronJob

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: meilisearch-reindex
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scraper
              image: ghcr.io/healthsamurai/meilisearch-docs-scraper:latest
              env:
                - name: MEILISEARCH_HOST_URL
                  value: "http://meilisearch:7700"
                - name: MEILISEARCH_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: meilisearch-secret
                      key: api-key
                - name: INDEX_NAME
                  value: "docs"
              volumeMounts:
                - name: config
                  mountPath: /app/config.json
                  subPath: config.json
          volumes:
            - name: config
              configMap:
                name: scraper-config
```

### Multiple Configs (Single Job)

Index multiple sites in one run — useful for reducing the number of Kubernetes jobs:

```bash
docker run --rm \
  -e MEILISEARCH_HOST_URL=http://meilisearch:7700 \
  -e MEILISEARCH_API_KEY=your-api-key \
  -v $(pwd)/configs:/configs \
  ghcr.io/healthsamurai/meilisearch-docs-scraper:latest \
  /configs/docs.json /configs/fhirbase.json /configs/auditbox.json
```

Each config creates its own index (from the `index_uid` in that config).

### Run from Source

```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash

# Clone and install
git clone https://github.com/HealthSamurai/meilisearch-docs-scraper.git
cd meilisearch-docs-scraper
bun install

# Single config
MEILISEARCH_HOST_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=your-api-key \
bun run src/index.ts config.json

# Multiple configs
MEILISEARCH_HOST_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=your-api-key \
bun run src/index.ts docs.json fhirbase.json auditbox.json
```

## Configuration

Uses the same config format as the official docs-scraper:

```json
{
  "index_uid": "docs",
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "start_urls": ["https://example.com/docs/"],
  "stop_urls": ["https://example.com/docs/deprecated"],
  "selectors": {
    "lvl0": {
      "selector": "nav li:last-child",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "article h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "text": "article p, article li, article td"
  },
  "custom_settings": {
    "searchableAttributes": [
      "hierarchy_lvl1",
      "hierarchy_lvl2",
      "hierarchy_lvl3",
      "content"
    ],
    "rankingRules": [
      "attribute",
      "words",
      "typo",
      "proximity",
      "sort",
      "exactness"
    ]
  }
}
```
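The config format above maps to roughly this TypeScript shape. Field names come from the example; which fields are optional is an assumption:

```typescript
// Rough shape of the docs-scraper config format. A selector is either a
// plain CSS selector string or an object with extra options (as lvl0 shows).
type Selector =
  | string
  | { selector: string; global?: boolean; default_value?: string };

interface ScraperConfig {
  index_uid: string;
  sitemap_urls: string[];
  start_urls?: string[];
  stop_urls?: string[];
  selectors: Record<string, Selector>;
  custom_settings?: Record<string, unknown>;
}

// Minimal config instance using the fields from the example above.
const example: ScraperConfig = {
  index_uid: "docs",
  sitemap_urls: ["https://example.com/sitemap.xml"],
  selectors: { lvl1: "article h1", text: "article p, article li" },
};
```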

## How It Works

1. **Fetch sitemap** — Parses sitemap.xml to get all documentation URLs
2. **Filter URLs** — Excludes URLs matching `stop_urls` patterns
3. **Scrape pages** — Fetches pages concurrently (10 at a time) and extracts content
4. **Build hierarchy** — Tracks heading levels (h1→h6) with anchor links for deep linking
5. **Index to temp** — Creates `{index}_temp` and pushes documents in batches of 100
6. **Atomic swap** — Swaps the temp index with the production index (zero downtime)
7. **Cleanup** — Deletes the old temp index
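Steps 2 and 5 can be sketched as two small pure helpers. The names are hypothetical, and treating `stop_urls` entries as plain prefixes is an assumption (the official scraper also accepts pattern-like values):

```typescript
// Step 2: drop every URL that starts with any stop_urls entry.
// Prefix matching is an assumption about how stop_urls are interpreted.
function filterUrls(urls: string[], stopUrls: string[]): string[] {
  return urls.filter((url) => !stopUrls.some((stop) => url.startsWith(stop)));
}

// Step 5: split scraped documents into batches of 100 for indexing.
function toBatches<T>(items: T[], size = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```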

### Zero-Downtime Index Swap

```
                  ┌─────────────┐
Scrape pages ────►│  docs_temp  │  (new data)
                  └──────┬──────┘
                         │ swap
                  ┌──────▼──────┐
Search works ────►│    docs     │  (now has new data)
                  └─────────────┘
                         │ delete
                  ┌──────▼──────┐
                  │  docs_temp  │  ✗ deleted
                  └─────────────┘
```

Search remains available throughout the entire reindex process.
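The swap itself maps onto Meilisearch's swap-indexes operation. A minimal sketch of building its payload, with the client calls left as comments since they need a live server (and assume the `meilisearch` JS client):

```typescript
// Build the payload for Meilisearch's POST /swap-indexes operation:
// atomically exchange the temp index with the production one.
function swapPayload(indexUid: string): { indexes: [string, string] }[] {
  return [{ indexes: [`${indexUid}_temp`, indexUid] }];
}

// Against a live server the flow would look roughly like:
//   const task = await client.swapIndexes(swapPayload("docs"));
//   await client.waitForTask(task.taskUid); // no sleep() — wait on the task
//   await client.deleteIndex("docs_temp"); // step 7: cleanup
```

Because the swap is a single enqueued task on the server, searches hit either the old index or the new one, never a half-filled index.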

## Development

```bash
# Run locally
bun run src/index.ts config.json

# Type check
bun run tsc --noEmit

# Build
bun build src/index.ts --outdir dist
```

## License

MIT