This document captures lessons learned from attempting to crawl various websites with bot protection.
| Method | Target Sites | Result |
|---|---|---|
| Basic curl with UA | G2, SoftwareAdvice, Indeed, Capterra | 403 Forbidden - Cloudflare block |
| curl with browser UA | Same sites | Still blocked - Cloudflare challenge page |
| Wayback Machine | G2 | Partial success - got HTML shell but JS-rendered content missing |
| Playwright headless | G2, Indeed | Blocked - DataDome/Cloudflare detect headless |
| Playwright with anti-detection | Indeed | Timeout - advanced bot detection |
| Playwright headless | CrunchTime.com (less protected) | SUCCESS - got full content |
| browsh (text browser) | N/A | Available but needs TTY, not suited for scripting |
- JavaScript challenge that requires browser execution
- Cookie verification (cf_clearance)
- IP reputation scoring
- Browser fingerprinting
Counter-measures:
- Use stealth browser mode
- Wayback Machine for heavily protected pages
- Residential proxies (not implemented)
- Advanced fingerprinting beyond basic checks
- Detects headless browsers through multiple signals
- Machine learning-based detection
Counter-measures:
- Very difficult to bypass
- Wayback Machine is often the only option
- Residential proxies with real browser profiles
- IP-based request throttling
- Progressive delays or blocks
Counter-measures:
- Retry with exponential backoff
- Respect rate limits
- Rotate IPs (not implemented)
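
The retry-with-backoff counter-measure above could be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the function names, the 1 s base delay, the 30 s cap, and the choice of 429/503 as retryable statuses are all assumptions, and it presumes Node 18+ for the global `fetch`.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
// The default values are illustrative assumptions.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries on network errors and on HTTP 429/503, waiting longer after each failure.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function fetchWithBackoff(url, { retries = 4, baseMs = 1000, fetchImpl = fetch } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetchImpl(url);
      if (res.status !== 429 && res.status !== 503) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    // Sleep only if another attempt will follow.
    if (attempt < retries) await sleep(backoffDelay(attempt, baseMs));
  }
  throw lastError;
}
```

Adding random jitter to each delay is a common refinement when several workers retry at once; it is left out here to keep the sketch short.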
- Basic but easily bypassed
- Some sites block common bot UAs
Counter-measures:
- Rotate realistic browser User-Agents
- Match UA with other headers (Accept, Accept-Language)
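
One way to keep the User-Agent consistent with the other headers is to rotate whole header profiles rather than UA strings alone. The sketch below assumes two hypothetical profiles; the profile names and the Firefox header values are illustrative, not taken from the crawl runs described here.

```javascript
// Hypothetical header profiles: each bundles a UA with Accept headers
// of the kind that browser family typically sends.
const PROFILES = [
  {
    name: 'chrome-mac',
    headers: {
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
    },
  },
  {
    name: 'firefox-windows',
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.5',
    },
  },
];

// Picks one profile at random; `random` is injectable for deterministic use.
function pickProfile(random = Math.random) {
  return PROFILES[Math.floor(random() * PROFILES.length)];
}

// Returns a copy of the profile's headers so callers can add to it safely.
function headersFor(profile) {
  return { ...profile.headers };
}
```

A request would then use `headersFor(pickProfile())` as its full header set, so a Chrome UA is never paired with Firefox-style Accept values.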
- Checks `navigator.webdriver` property
- Presence indicates automated browser
Counter-measures:
- Override property to undefined
- Use stealth plugins that patch this
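
The property override can be sketched as an init script that runs before any page script. This is a minimal example under assumptions: `patchWebdriver` and `applyWebdriverPatch` are hypothetical names, and `context` is assumed to be a Playwright `BrowserContext`, whose `addInitScript` accepts a script string.

```javascript
// Redefines `webdriver` on the given navigator-like object so reads return undefined.
function patchWebdriver(nav) {
  Object.defineProperty(nav, 'webdriver', {
    get: () => undefined,
    configurable: true,
  });
  return nav;
}

// The same function serialized as a string, to be evaluated inside the page.
const initScript = `(${patchWebdriver.toString()})(navigator);`;

// Registers the patch on a Playwright BrowserContext (assumed to be passed in)
// so it runs in every new page before the site's own scripts.
async function applyWebdriverPatch(context) {
  await context.addInitScript(initScript);
}
```

This covers only the single `navigator.webdriver` check. Detectors that inspect how the property is defined, or that expect the value `false` as reported by a regular Chrome, may still flag the session, which is where the stealth plugins mentioned above come in.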
- Multiple signals: missing plugins, screen dimensions
- Chrome-specific headless signatures
- WebGL/Canvas fingerprinting
Counter-measures:
- Use full browser mode when needed (`--headful`)
- Stealth scripts to fake plugins/properties
- Realistic viewport sizes
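
As a rough sketch of the headed-mode and viewport counter-measures, the options below could be passed to Playwright's `chromium.launch` and `browser.newContext`. The viewport list, the locale, and the helper names are assumptions chosen for illustration; `chromium` is passed in as a parameter so the snippet does not depend on how Playwright is imported.

```javascript
// Common desktop resolutions (illustrative choices, not measured data).
const VIEWPORTS = [
  { width: 1920, height: 1080 },
  { width: 1366, height: 768 },
  { width: 1536, height: 864 },
];

// Launch options: headed mode corresponds to `headless: false`.
function launchOptions({ headed = false } = {}) {
  return { headless: !headed };
}

// Context options with a desktop-sized viewport instead of the library default.
function contextOptions(index = 0) {
  return {
    viewport: VIEWPORTS[index % VIEWPORTS.length],
    locale: 'en-US',
  };
}

// Opens a page using the options above; `chromium` is Playwright's browser type.
async function openPage(chromium, url, { headed = false, viewportIndex = 0 } = {}) {
  const browser = await chromium.launch(launchOptions({ headed }));
  const context = await browser.newContext(contextOptions(viewportIndex));
  const page = await context.newPage();
  await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
  return { browser, page };
}
```

The caller is responsible for `browser.close()` once the page content has been read.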
Modern bot protection looks at many signals including:
- Header order and completeness
- JavaScript execution patterns
- Mouse/keyboard events
- Browser fingerprint consistency
Playwright and Puppeteer leave signatures:
- `navigator.webdriver = true`
- Missing plugins
- Specific viewport characteristics
- Timing differences
Most modern sites load content via AJAX/fetch. Raw HTML often contains:
- Empty containers
- Loading placeholders
- Client-side routing shells
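
A cheap way to notice that a fetched page is only such a shell is to measure how much visible text remains once scripts and markup are stripped. The sketch below is a heuristic under assumptions: the 200-character threshold and the function names are arbitrary choices, and regex-based stripping is only an approximation of real HTML parsing.

```javascript
// Approximates the visible text of an HTML string: drops script/style/noscript
// blocks, strips remaining tags, and collapses whitespace.
function visibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Flags pages whose visible text is shorter than `minChars` (an assumed threshold),
// which suggests the content is rendered client-side.
function looksLikeJsShell(html, minChars = 200) {
  return visibleText(html).length < minChars;
}
```

When `looksLikeJsShell` returns true for a plain HTTP fetch, that is a signal to retry the URL with a real browser that executes JavaScript.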
- CrunchTime.com: Basic protection, easy to crawl
- G2.com: Enterprise-grade protection, very difficult
- No one-size-fits-all solution
- Bypasses all client-side protection
- Historical snapshots available
- Limitation: doesn't capture JS-rendered content
- May have outdated information
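
For the Wayback Machine fallback, the availability endpoint returns JSON describing the closest snapshot. A minimal sketch of building the query and reading that response, assuming Node 18+ for the global `fetch` (helper names are hypothetical):

```javascript
// Builds the availability query; `timestamp` (YYYYMMDD...) is optional.
function availabilityUrl(targetUrl, timestamp) {
  const params = new URLSearchParams({ url: targetUrl });
  if (timestamp) params.set('timestamp', timestamp);
  return `https://archive.org/wayback/available?${params}`;
}

// Extracts the closest snapshot from the API response, or null if none is archived.
function closestSnapshot(response) {
  const closest = response?.archived_snapshots?.closest;
  if (!closest || !closest.available) return null;
  return { url: closest.url, timestamp: closest.timestamp };
}

// Queries the API and returns the snapshot info (performs a network request).
async function findSnapshot(targetUrl, timestamp) {
  const res = await fetch(availabilityUrl(targetUrl, timestamp));
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return closestSnapshot(await res.json());
}
```

Comparing the returned `timestamp` with the current date gives a quick measure of how outdated the archived copy is.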
- Use `sed` or `awk` for pattern matching
- Or install GNU grep via Homebrew
```javascript
const { chromium } = require('playwright');

// Target URL: pass as the first CLI argument, or fall back to a placeholder.
const url = process.argv[2] || 'https://example.com';

(async () => {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so JS-rendered content is present.
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
    // Extra grace period for late-loading content.
    await page.waitForTimeout(3000);
    const content = await page.textContent('body');
    console.log(content);
  } finally {
    // Always release the browser, even if navigation fails or times out.
    await browser.close();
  }
})();
```

```bash
curl -sL \
  -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.9" \
  "https://example.com"
```

```bash
# Check if URL is archived
curl "https://archive.org/wayback/available?url=https://example.com"
# Fetch archived version
curl "https://web.archive.org/web/20231201/https://example.com"
```

- Direct fetch works
- Example: example.com, many corporate sites
- User-Agent check only
- Solution: Use browser-like headers
- Example: crunchtime.com
- JavaScript challenges
- Rate limiting
- Solution: Stealth browser mode
- Example: trustpilot.com
- Cloudflare, DataDome, PerimeterX
- Advanced fingerprinting
- Solution: Wayback Machine, residential proxies
- Example: g2.com, capterra.com, indeed.com
- Requires login
- CAPTCHA challenges
- Solution: Manual intervention, official APIs
- Example: glassdoor.com (for full content)
- Residential proxy integration - Use real residential IPs
- Browser profile persistence - Maintain cookies and fingerprint
- CAPTCHA solving - 2captcha/Anti-Captcha integration
- Request signing - Reverse-engineer site-specific tokens
- Machine learning detection - Behavioral analysis to mimic humans
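
The protection levels described earlier (direct fetch, browser-like headers, stealth browser, Wayback Machine) suggest trying the cheapest method first and escalating only on failure. A minimal sketch of that escalation, where `crawlWithFallback` and the strategy objects are hypothetical names and each `run` function is supplied by the caller:

```javascript
// Tries each strategy in order and returns the first non-empty result.
// `strategies` is an array of { name, run } where run(url) resolves to page content.
async function crawlWithFallback(url, strategies) {
  const failures = [];
  for (const { name, run } of strategies) {
    try {
      const content = await run(url);
      if (content) return { method: name, content, failures };
      failures.push({ method: name, reason: 'empty content' });
    } catch (err) {
      // Record why this method failed, then escalate to the next one.
      failures.push({ method: name, reason: err.message });
    }
  }
  return { method: null, content: null, failures };
}
```

The `failures` list records which method was blocked and why, which is the same kind of information summarized in the results table at the top of this document.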