
Web Crawling Learnings

This document captures lessons learned from attempting to crawl various websites with bot protection.

What We Tried & Results

| Method | Target Sites | Result |
| --- | --- | --- |
| Basic curl (default UA) | G2, SoftwareAdvice, Indeed, Capterra | 403 Forbidden: Cloudflare block |
| curl with browser UA | Same sites | Still blocked: Cloudflare challenge page |
| Wayback Machine | G2 | Partial success: got the HTML shell, but JS-rendered content was missing |
| Playwright headless | G2, Indeed | Blocked: DataDome/Cloudflare detect headless browsers |
| Playwright with anti-detection | Indeed | Timeout: advanced bot detection |
| Playwright headless | CrunchTime.com (less protected) | SUCCESS: got full content |
| browsh (text browser) | N/A | Available but needs a TTY; not suited for scripting |

Bot Detection Methods Encountered

1. Cloudflare

  • JavaScript challenge that requires browser execution
  • Cookie verification (cf_clearance)
  • IP reputation scoring
  • Browser fingerprinting

Counter-measures:

  • Use stealth browser mode
  • Wayback Machine for heavily protected pages
  • Residential proxies (not implemented)
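Before reaching for heavier tools, it helps to detect whether a response is a Cloudflare challenge page rather than real content. The sketch below is our own heuristic, not an official API: the marker strings ("Just a moment", cf-chl script names) are what challenge pages commonly contain today and may change.

```javascript
// Heuristic: does this response look like a Cloudflare challenge page?
// Marker strings are assumptions based on current challenge pages;
// treat a true result as "probably blocked", nothing more.
function looksLikeCloudflareChallenge(status, html) {
  const markers = ['Just a moment', 'cf-chl', '_cf_chl_opt'];
  const blockedStatus = status === 403 || status === 503;
  return blockedStatus && markers.some((m) => html.includes(m));
}

console.log(looksLikeCloudflareChallenge(403, '<title>Just a moment...</title>')); // true
console.log(looksLikeCloudflareChallenge(200, '<h1>Product reviews</h1>'));        // false
```

A true result is the signal to switch strategy (Wayback Machine, stealth mode) instead of retrying blindly.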

2. DataDome

  • Advanced fingerprinting beyond basic checks
  • Detects headless browsers through multiple signals
  • Machine learning-based detection

Counter-measures:

  • Very difficult to bypass
  • Wayback Machine is often the only option
  • Residential proxies with real browser profiles

3. Rate Limiting

  • IP-based request throttling
  • Progressive delays or blocks

Counter-measures:

  • Retry with exponential backoff
  • Respect rate limits
  • Rotate IPs (not implemented)
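The retry advice above can be sketched as a small helper. `retryWithBackoff` is our own name, and a production crawler should also honor a Retry-After header when the server sends one.

```javascript
// Retry an async operation with exponential backoff:
// waits base, 2*base, 4*base, ... between attempts (jitter omitted for brevity).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, maxRetries = 4, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the error
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage: const html = await retryWithBackoff(() => fetchPage(url));
```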

4. User-Agent Checks

  • Basic but easily bypassed
  • Some sites block common bot UAs

Counter-measures:

  • Rotate realistic browser User-Agents
  • Match UA with other headers (Accept, Accept-Language)
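Matching the UA with the other headers might look like the sketch below. The values are a plausible Chrome-on-macOS profile (an assumption, not captured from a live browser); the point is that they must all tell the same story.

```javascript
// A header set consistent with a Chrome 120 / macOS User-Agent.
// Values are illustrative; capture a real browser's headers for serious use.
function chromeLikeHeaders() {
  return {
    'User-Agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
  };
}

// Usage (Node 18+ global fetch): await fetch(url, { headers: chromeLikeHeaders() });
```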

5. WebDriver Detection

  • Checks navigator.webdriver property
  • A true value signals an automated browser

Counter-measures:

  • Override property to undefined
  • Use stealth plugins that patch this
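The override can be expressed as an init script; with Playwright it would be registered via context.addInitScript so it runs before any page script (the function name here is our own).

```javascript
// Shadow the navigator.webdriver getter so page scripts see `undefined`.
// The parameter exists only to make this testable outside a browser;
// in a page it defaults to the real navigator object.
function hideWebdriverFlag(nav = globalThis.navigator) {
  Object.defineProperty(nav, 'webdriver', {
    get: () => undefined,
    configurable: true, // leave room for stealth plugins to re-patch
  });
}

// Playwright usage: await context.addInitScript(hideWebdriverFlag);
```

Note this only changes the value: `'webdriver' in navigator` still returns true, one of the deeper signals stealth plugins also patch.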

6. Headless Detection

  • Multiple signals: missing plugins, screen dimensions
  • Chrome-specific headless signatures
  • WebGL/Canvas fingerprinting

Counter-measures:

  • Use full (non-headless) browser mode when needed (headless: false in Playwright)
  • Stealth scripts to fake plugins/properties
  • Realistic viewport sizes
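These counter-measures translate into Playwright options roughly like this; the specific viewport, locale, and timezone values are assumptions and should match a plausible real machine.

```javascript
// Context options that avoid obvious headless giveaways.
// All values are illustrative placeholders.
function realisticContextOptions() {
  return {
    viewport: { width: 1366, height: 768 }, // common laptop resolution
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
  };
}

// Playwright usage:
//   const browser = await chromium.launch({ headless: false }); // full mode
//   const context = await browser.newContext(realisticContextOptions());
```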

Key Technical Learnings

User-Agent alone is insufficient

Modern bot protection looks at many signals including:

  • Header order and completeness
  • JavaScript execution patterns
  • Mouse/keyboard events
  • Browser fingerprint consistency

Headless browsers are detectable by default

Playwright and Puppeteer leave signatures:

  • navigator.webdriver = true
  • Missing plugins
  • Specific viewport characteristics
  • Timing differences

JavaScript rendering is often required

Most modern sites load content via AJAX/fetch. Raw HTML often contains:

  • Empty containers
  • Loading placeholders
  • Client-side routing shells
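A quick way to decide whether raw HTML is worth keeping or needs a real browser pass is to measure how much visible text survives after stripping markup. This is a rough heuristic of our own; the thresholds and mount-point IDs are guesses.

```javascript
// Heuristic: true if the HTML looks like a client-side app shell
// (almost no visible text, often just a single mount node).
function looksLikeAppShell(html) {
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  const hasMountNode = /<div[^>]+id=["'](root|app|__next)["']/i.test(html);
  return visibleText.length < 200 || (hasMountNode && visibleText.length < 1000);
}
```

If this returns true, fall back to a JS-rendering fetch (see Scripts That Worked) rather than parsing the shell.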

Site protection varies widely

  • CrunchTime.com: Basic protection, easy to crawl
  • G2.com: Enterprise-grade protection, very difficult
  • No one-size-fits-all solution

Wayback Machine is valuable

  • Bypasses all client-side protection
  • Historical snapshots available
  • Limitation: doesn't capture JS-rendered content
  • May have outdated information

macOS grep lacks -P flag

  • Use sed, awk, or grep -E (though -E lacks PCRE features such as lookarounds)
  • Or install GNU grep via Homebrew

Scripts That Worked

Playwright basic fetch (for less-protected sites)

```javascript
const { chromium } = require('playwright');

(async () => {
  const url = process.argv[2]; // pass the target URL as a CLI argument

  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
  await page.waitForTimeout(3000); // let late AJAX-rendered content settle

  const content = await page.textContent('body');
  console.log(content);

  await browser.close();
})();
```

curl with browser headers (for basic sites)

```shell
curl -sL \
  -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.9" \
  "https://example.com"
```

Wayback Machine API

```shell
# Check if URL is archived
curl "https://archive.org/wayback/available?url=https://example.com"

# Fetch archived version
curl "https://web.archive.org/web/20231201/https://example.com"
```
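The availability endpoint returns JSON shaped like `{"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}` (per archive.org's documented API). A small parser sketch:

```javascript
// Extract the closest snapshot URL from a Wayback availability response,
// or null when nothing usable is archived.
function closestSnapshotUrl(apiResponse) {
  const closest = apiResponse && apiResponse.archived_snapshots
    ? apiResponse.archived_snapshots.closest
    : undefined;
  return closest && closest.available ? closest.url : null;
}

// Usage (Node 18+ global fetch):
//   const res = await fetch('https://archive.org/wayback/available?url=' +
//     encodeURIComponent(targetUrl));
//   const snapshot = closestSnapshotUrl(await res.json());
```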

Protection Level Guide

No Protection

  • Direct fetch works
  • Example: example.com, many corporate sites

Low Protection

  • User-Agent check only
  • Solution: Use browser-like headers
  • Example: crunchtime.com

Medium Protection

  • JavaScript challenges
  • Rate limiting
  • Solution: Stealth browser mode
  • Example: trustpilot.com

High Protection

  • Cloudflare, DataDome, PerimeterX
  • Advanced fingerprinting
  • Solution: Wayback Machine, residential proxies
  • Example: g2.com, capterra.com, indeed.com

Very High Protection

  • Requires login
  • CAPTCHA challenges
  • Solution: Manual intervention, official APIs
  • Example: glassdoor.com (for full content)
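As a rough triage, the tiers above can be guessed from a single probe response. The marker strings and status mappings below are our own heuristics, not a standard.

```javascript
// Guess the protection tier from a first response (heuristic only).
function guessProtectionLevel({ status, html = '', requiresLogin = false }) {
  if (requiresLogin || /captcha/i.test(html)) return 'very high';
  if (/datadome|perimeterx|cf-chl|just a moment/i.test(html)) return 'high';
  if (status === 429) return 'medium'; // rate limited
  if (status === 403) return 'low';    // often just a header/UA check
  return 'none';
}
```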

Future Research

  1. Residential proxy integration - Use real residential IPs
  2. Browser profile persistence - Maintain cookies and fingerprint
  3. CAPTCHA solving - 2captcha/Anti-Captcha integration
  4. Request signing - Reverse-engineer site-specific tokens
  5. Machine learning detection - Behavioral analysis to mimic humans