
Web Crawling Learnings

This document captures lessons learned from attempting to crawl various websites with bot protection.

What We Tried & Results

| Method | Target Sites | Result |
| --- | --- | --- |
| Basic curl (default UA) | G2, SoftwareAdvice, Indeed, Capterra | 403 Forbidden: Cloudflare block |
| curl with browser UA | Same sites | Still blocked: Cloudflare challenge page |
| Wayback Machine | G2 | Partial success: got the HTML shell, but JS-rendered content was missing |
| Playwright headless | G2, Indeed | Blocked: DataDome/Cloudflare detect headless browsers |
| Playwright with anti-detection | Indeed | Timeout: advanced bot detection |
| Playwright headless | CrunchTime.com (less protected) | SUCCESS: got full content |
| browsh (text browser) | N/A | Available but needs a TTY; not suited for scripting |

Bot Detection Methods Encountered

1. Cloudflare

  • JavaScript challenge that requires browser execution
  • Cookie verification (cf_clearance)
  • IP reputation scoring
  • Browser fingerprinting

Counter-measures:

  • Use stealth browser mode
  • Wayback Machine for heavily protected pages
  • Residential proxies (not implemented)
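Before reaching for heavier tools, it helps to detect whether a response is a Cloudflare challenge page rather than real content. The sketch below is our own heuristic, not an official API: the marker strings ("Just a moment", cf-chl script names) are what challenge pages commonly contain today and may change.

```javascript
// Heuristic: does this response look like a Cloudflare challenge page?
// Marker strings are assumptions based on current challenge pages;
// treat a true result as "probably blocked", nothing more.
function looksLikeCloudflareChallenge(status, html) {
  const markers = ['Just a moment', 'cf-chl', '_cf_chl_opt'];
  const blockedStatus = status === 403 || status === 503;
  return blockedStatus && markers.some((m) => html.includes(m));
}

console.log(looksLikeCloudflareChallenge(403, '<title>Just a moment...</title>')); // true
console.log(looksLikeCloudflareChallenge(200, '<h1>Product reviews</h1>'));        // false
```

A true result is the signal to switch strategy (Wayback Machine, stealth mode) instead of retrying blindly.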

2. DataDome

  • Advanced fingerprinting beyond basic checks
  • Detects headless browsers through multiple signals
  • Machine learning-based detection

Counter-measures:

  • Very difficult to bypass
  • Wayback Machine is often the only option
  • Residential proxies with real browser profiles

3. Rate Limiting

  • IP-based request throttling
  • Progressive delays or blocks

Counter-measures:

  • Retry with exponential backoff
  • Respect rate limits
  • Rotate IPs (not implemented)
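The retry advice above can be sketched as a small helper. `retryWithBackoff` is our own name, and a production crawler should also honor a Retry-After header when the server sends one.

```javascript
// Retry an async operation with exponential backoff:
// waits base, 2*base, 4*base, ... between attempts (jitter omitted for brevity).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(fn, maxRetries = 4, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the error
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Usage: const html = await retryWithBackoff(() => fetchPage(url));
```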

4. User-Agent Checks

  • Basic but easily bypassed
  • Some sites block common bot UAs

Counter-measures:

  • Rotate realistic browser User-Agents
  • Match UA with other headers (Accept, Accept-Language)
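Matching the UA with the other headers might look like the sketch below. The values are a plausible Chrome-on-macOS profile (an assumption, not captured from a live browser); the point is that they must all tell the same story.

```javascript
// A header set consistent with a Chrome 120 / macOS User-Agent.
// Values are illustrative; capture a real browser's headers for serious use.
function chromeLikeHeaders() {
  return {
    'User-Agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
  };
}

// Usage (Node 18+ global fetch): await fetch(url, { headers: chromeLikeHeaders() });
```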

5. WebDriver Detection

  • Checks navigator.webdriver property
  • A true value signals an automated browser

Counter-measures:

  • Override property to undefined
  • Use stealth plugins that patch this
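The override can be expressed as an init script; with Playwright it would be registered via context.addInitScript so it runs before any page script (the function name here is our own).

```javascript
// Shadow the navigator.webdriver getter so page scripts see `undefined`.
// The parameter exists only to make this testable outside a browser;
// in a page it defaults to the real navigator object.
function hideWebdriverFlag(nav = globalThis.navigator) {
  Object.defineProperty(nav, 'webdriver', {
    get: () => undefined,
    configurable: true, // leave room for stealth plugins to re-patch
  });
}

// Playwright usage: await context.addInitScript(hideWebdriverFlag);
```

Note this only changes the value: `'webdriver' in navigator` still returns true, one of the deeper signals stealth plugins also patch.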

6. Headless Detection

  • Multiple signals: missing plugins, screen dimensions
  • Chrome-specific headless signatures
  • WebGL/Canvas fingerprinting

Counter-measures:

  • Use full (non-headless) browser mode when needed (headless: false in Playwright)
  • Stealth scripts to fake plugins/properties
  • Realistic viewport sizes
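These counter-measures translate into Playwright options roughly like this; the specific viewport, locale, and timezone values are assumptions and should match a plausible real machine.

```javascript
// Context options that avoid obvious headless giveaways.
// All values are illustrative placeholders.
function realisticContextOptions() {
  return {
    viewport: { width: 1366, height: 768 }, // common laptop resolution
    userAgent:
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
  };
}

// Playwright usage:
//   const browser = await chromium.launch({ headless: false }); // full mode
//   const context = await browser.newContext(realisticContextOptions());
```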

Key Technical Learnings

User-Agent alone is insufficient

Modern bot protection looks at many signals including:

  • Header order and completeness
  • JavaScript execution patterns
  • Mouse/keyboard events
  • Browser fingerprint consistency

Headless browsers are detectable by default

Playwright and Puppeteer leave signatures:

  • navigator.webdriver = true
  • Missing plugins
  • Specific viewport characteristics
  • Timing differences

JavaScript rendering is often required

Most modern sites load content via AJAX/fetch. Raw HTML often contains:

  • Empty containers
  • Loading placeholders
  • Client-side routing shells
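A quick way to decide whether raw HTML is worth keeping or needs a real browser pass is to measure how much visible text survives after stripping markup. This is a rough heuristic of our own; the thresholds and mount-point IDs are guesses.

```javascript
// Heuristic: true if the HTML looks like a client-side app shell
// (almost no visible text, often just a single mount node).
function looksLikeAppShell(html) {
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  const hasMountNode = /<div[^>]+id=["'](root|app|__next)["']/i.test(html);
  return visibleText.length < 200 || (hasMountNode && visibleText.length < 1000);
}
```

If this returns true, fall back to a JS-rendering fetch (see Scripts That Worked) rather than parsing the shell.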

Site protection varies widely

  • CrunchTime.com: Basic protection, easy to crawl
  • G2.com: Enterprise-grade protection, very difficult
  • No one-size-fits-all solution

Wayback Machine is valuable

  • Bypasses all client-side protection
  • Historical snapshots available
  • Limitation: doesn't capture JS-rendered content
  • May have outdated information

macOS grep lacks -P flag

  • Use sed, awk, or grep -E (though -E lacks PCRE features such as lookarounds)
  • Or install GNU grep via Homebrew

Scripts That Worked

Playwright basic fetch (for less-protected sites)

```javascript
const { chromium } = require('playwright');

(async () => {
  const url = process.argv[2]; // pass the target URL as a CLI argument

  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
  await page.waitForTimeout(3000); // let late AJAX-rendered content settle

  const content = await page.textContent('body');
  console.log(content);

  await browser.close();
})();
```

curl with browser headers (for basic sites)

```shell
curl -sL \
  -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.9" \
  "https://example.com"
```

Wayback Machine API

```shell
# Check if URL is archived
curl "https://archive.org/wayback/available?url=https://example.com"

# Fetch archived version
curl "https://web.archive.org/web/20231201/https://example.com"
```
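The availability endpoint returns JSON shaped like `{"archived_snapshots": {"closest": {"available": true, "url": ..., "timestamp": ...}}}` (per archive.org's documented API). A small parser sketch:

```javascript
// Extract the closest snapshot URL from a Wayback availability response,
// or null when nothing usable is archived.
function closestSnapshotUrl(apiResponse) {
  const closest = apiResponse && apiResponse.archived_snapshots
    ? apiResponse.archived_snapshots.closest
    : undefined;
  return closest && closest.available ? closest.url : null;
}

// Usage (Node 18+ global fetch):
//   const res = await fetch('https://archive.org/wayback/available?url=' +
//     encodeURIComponent(targetUrl));
//   const snapshot = closestSnapshotUrl(await res.json());
```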

Protection Level Guide

No Protection

  • Direct fetch works
  • Example: example.com, many corporate sites

Low Protection

  • User-Agent check only
  • Solution: Use browser-like headers
  • Example: crunchtime.com

Medium Protection

  • JavaScript challenges
  • Rate limiting
  • Solution: Stealth browser mode
  • Example: trustpilot.com

High Protection

  • Cloudflare, DataDome, PerimeterX
  • Advanced fingerprinting
  • Solution: Wayback Machine, residential proxies
  • Example: g2.com, capterra.com, indeed.com

Very High Protection

  • Requires login
  • CAPTCHA challenges
  • Solution: Manual intervention, official APIs
  • Example: glassdoor.com (for full content)
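As a rough triage, the tiers above can be guessed from a single probe response. The marker strings and status mappings below are our own heuristics, not a standard.

```javascript
// Guess the protection tier from a first response (heuristic only).
function guessProtectionLevel({ status, html = '', requiresLogin = false }) {
  if (requiresLogin || /captcha/i.test(html)) return 'very high';
  if (/datadome|perimeterx|cf-chl|just a moment/i.test(html)) return 'high';
  if (status === 429) return 'medium'; // rate limited
  if (status === 403) return 'low';    // often just a header/UA check
  return 'none';
}
```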

Future Research

  1. Residential proxy integration - Use real residential IPs
  2. Browser profile persistence - Maintain cookies and fingerprint
  3. CAPTCHA solving - 2captcha/Anti-Captcha integration
  4. Request signing - Reverse-engineer site-specific tokens
  5. Machine learning detection - Behavioral analysis to mimic humans