A FastAPI service that scrapes a URL into a simple, structured JSON shape. It starts with static HTML (Requests + BeautifulSoup) and falls back to JS rendering (Playwright/Chromium) when the page looks JS-heavy.
```sh
chmod +x run.sh
./run.sh
```

Then open:

- UI: http://localhost:8000/
- Health: http://localhost:8000/healthz
```sh
curl -sS -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/"}' | python -m json.tool
```

Response shape:
- `result.meta`: basic metadata (title/description/language)
- `result.sections[]`: grouped sections with `type`, `label`, `content`, and `rawHtml`
- `result.interactions`: JS-only interaction log (`scrolls`, `clicks`, `pages`)
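To make the shape concrete, here is a sketch of a response for a simple static page. All field values are illustrative, not actual service output:

```python
import json

# Illustrative response for a static page; the values are made up, but the
# key layout mirrors the documented shape above.
example = {
    "result": {
        "meta": {
            "title": "Example Domain",
            "description": None,
            "language": "en",
        },
        "sections": [
            {
                "type": "heading",
                "label": "Example Domain",
                "content": "This domain is for use in illustrative examples.",
                "rawHtml": "<h1>Example Domain</h1>",
            }
        ],
        # Empty interaction log: no JS rendering was needed for this page.
        "interactions": {"scrolls": 0, "clicks": 0, "pages": []},
    }
}

print(json.dumps(example, indent=2))
```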
run.sh:

- Ensures `python3` exists
- Creates `.venv/` (via `python3 -m venv .venv`) if missing
- Activates the venv
- Runs `pip install --upgrade pip`
- Installs dependencies from `requirements.txt`
- Starts the server: `uvicorn app.main:app --host 0.0.0.0 --port 8000`
Important: run.sh does not install Playwright browser binaries.
- Python: works with Python 3.x (tested in this repo with a local venv)
- JS rendering: uses Playwright (Chromium)
One-time setup for JS rendering:

```sh
.venv/bin/python -m playwright install chromium
```

If Chromium launches fail due to missing OS libraries (common on Linux), try:

```sh
.venv/bin/python -m playwright install --with-deps chromium
```

Example URLs for testing:

- https://www.ycombinator.com/ — Y Combinator homepage; tests static content extraction
- https://vercel.com/ — JS-heavy (Next.js); validates Playwright rendering + scroll + link clicks
- https://x.com/ — social media site; tests dynamic content and authentication walls
- https://www.reddit.com/ — link-dense, dynamic content; validates fallback rendering and infinite scroll handling
Known limitations:

- Some sites block plain `requests` without a browser-like `User-Agent` (e.g., Wikipedia returned HTTP 403 in local testing).
- The JS fallback decision is heuristic-based (framework markers like React/Vue/Next plus a crude “visible text” check); it can over- or under-trigger.
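The fallback heuristic can be sketched roughly as follows. This is a minimal stdlib-only approximation, not the repo's actual code; the marker strings and the 200-character threshold are illustrative assumptions:

```python
import re

# Hypothetical framework markers; the real list may differ.
FRAMEWORK_MARKERS = ("__NEXT_DATA__", "data-reactroot", "ng-version", "data-v-app")

def looks_js_heavy(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: framework markers present, or too little visible text."""
    if any(marker in html for marker in FRAMEWORK_MARKERS):
        return True
    # Crude "visible text" check: drop script/style bodies and all tags,
    # then count what remains.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(" ".join(text.split())) < min_visible_chars
```

Both signals are fallible, which is exactly why the decision can over- or under-trigger: a server-rendered React page still carries framework markers, and a sparse but static page can look "JS-heavy" by text volume alone.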
- JS interactions are best-effort: a fixed number of scrolls (default 3) and up to 2 link clicks. Dedicated “Load more” buttons, tab components, and systematic pagination crawling are not implemented.
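The interaction pass could look roughly like the sketch below, written against Playwright's sync API. The function name, scroll distance, and timeouts are illustrative assumptions, not the app's actual implementation:

```python
def interact(page, scrolls: int = 3, max_clicks: int = 2) -> dict:
    """Best-effort interaction pass: fixed scrolls, then up to max_clicks link clicks.

    `page` is a Playwright sync-API Page (duck-typed here for clarity).
    """
    log = {"scrolls": 0, "clicks": 0, "pages": [page.url]}
    for _ in range(scrolls):
        page.mouse.wheel(0, 2000)      # scroll down to trigger lazy-loaded content
        page.wait_for_timeout(500)     # brief settle time after each scroll
        log["scrolls"] += 1
    while log["clicks"] < max_clicks:
        links = page.locator("a[href]")
        if links.count() == 0:
            break                      # nothing left to click
        links.first.click()
        page.wait_for_load_state("domcontentloaded")
        log["clicks"] += 1
        log["pages"].append(page.url)
    return log
```

Re-querying the locator on each click avoids holding stale element handles across navigations, which is why the loop looks up `a[href]` fresh every iteration.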
- Noise filtering is minimal (currently removes `script`, `style`, and `noscript` tags). Cookie banners/overlays aren’t specifically handled. `rawHtml` is truncated per section (currently 3000 chars), and fallback content is truncated more aggressively.
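The service does this filtering with BeautifulSoup; the stdlib-only sketch below approximates the same behavior (drop the three noise tags wholesale, then truncate). The function name and regex approach are illustrative, not the repo's code:

```python
import re

NOISE_TAGS = ("script", "style", "noscript")
RAW_HTML_LIMIT = 3000  # per-section cap, mirroring the current default

def strip_noise(html: str, limit: int = RAW_HTML_LIMIT) -> str:
    """Remove noise tags (including their contents), then truncate to the cap."""
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", "", html, flags=re.S | re.I)
    return html[:limit]
```

Cookie banners and overlays survive this pass untouched, since they are ordinary `div`s rather than any of the three noise tags.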