Fix Twitter thread scraping to stop returning empty results #12

Open
blackbuuurn wants to merge 1 commit into nirholas:main from blackbuuurn:xactions-thread-scrape-fix

Conversation

@blackbuuurn

This patch makes scrapeThread robust again by deriving the thread author from the tweet URL and by using more defensive author extraction from the page DOM.
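
As a rough illustration of deriving the thread author from the tweet URL, here is a minimal sketch. It is not the code from this patch: `parseTweetUrl` is a hypothetical helper, and the accepted hosts and the handle pattern (1 to 15 word characters) are assumptions.

```javascript
// Hypothetical helper, not taken from the patch.
// Assumes URLs of the form https://x.com/<handle>/status/<id>.
function parseTweetUrl(tweetUrl) {
  let parsed;
  try {
    parsed = new URL(tweetUrl);
  } catch {
    return null; // not a parseable URL
  }
  // Assumed host list: x.com, twitter.com, and their subdomains.
  if (!/(^|\.)(x|twitter)\.com$/.test(parsed.hostname)) return null;

  const match = parsed.pathname.match(/^\/([A-Za-z0-9_]{1,15})\/status\/(\d+)/);
  if (!match) return null;

  // Handles are compared case-insensitively, so normalize to lower case.
  // The ID stays a string because tweet IDs exceed Number.MAX_SAFE_INTEGER.
  return { mainAuthor: match[1].toLowerCase(), mainTweetId: match[2] };
}
```

Returning `null` instead of throwing lets the caller decide whether to fall back to DOM-based detection.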

What changed:

  • use the tweet URL to identify mainTweetId and mainAuthor
  • extract author handles from multiple DOM shapes instead of relying on one selector
  • keep the main tweet and same-author tweets, then sort chronologically
  • avoid the all-or-nothing empty result when X changes DOM structure slightly
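
The list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: it works on plain records rather than live DOM nodes, and the field names (`statusHref`, `profileHref`, `userNameText`, `time`) as well as both function names are hypothetical.

```javascript
const HANDLE = '[A-Za-z0-9_]{1,15}'; // assumed handle pattern

// Try several sources for the author handle; the first that matches wins.
function handleFromRecord(record) {
  const attempts = [
    // 1. Permalink such as /alice/status/123
    () => (record.statusHref || '').match(new RegExp(`^/(${HANDLE})/status/\\d+`)),
    // 2. Profile link such as /alice
    () => (record.profileHref || '').match(new RegExp(`^/(${HANDLE})/?$`)),
    // 3. Visible text such as "Alice @alice · 2h"
    () => (record.userNameText || '').match(new RegExp(`@(${HANDLE})`)),
  ];
  for (const attempt of attempts) {
    const match = attempt();
    if (match) return match[1].toLowerCase();
  }
  return null; // unknown author: the record is skipped, not the whole thread
}

// Keep the main tweet plus same-author tweets, oldest first.
function selectThread(records, mainTweetId, mainAuthor) {
  return records
    .filter((r) => r.id === mainTweetId || handleFromRecord(r) === mainAuthor)
    .sort((a, b) => (Date.parse(a.time) || 0) - (Date.parse(b.time) || 0));
}
```

Because the main tweet is matched by ID, it survives even when every author lookup fails, which is what prevents the all-or-nothing empty result.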

Validation:

  • local syntax check passed
  • branch pushed: xactions-thread-scrape-fix

blackbuuurn requested a review from nirholas as a code owner April 2, 2026 09:08
@vercel

vercel bot commented Apr 2, 2026

@black7 is attempting to deploy a commit to the kaivocmenirehtacgmailcom's projects Team on Vercel.

A member of the Team first needs to authorize it.

nj-io added a commit to nj-io/XActions that referenced this pull request Apr 5, 2026
Replace DOM-based thread scraping with direct GraphQL API calls.
X doesn't render self-reply threads as article elements in the DOM,
causing empty results — especially for high-engagement tweets.

The new approach:
- Calls TweetDetail GraphQL API from the page context using session cookies
- Gets full_text (no truncation, no "Show more" needed)
- Supports note_tweet for long-form posts
- Filters to self-reply chain only (author replying to themselves)
- Sorts chronologically
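
A minimal sketch of the self-reply filtering and chronological sorting described above, assuming tweets have already been flattened into plain objects. The field names (`id`, `author`, `inReplyToId`, `createdAt`) and the function name `selfReplyChain` are hypothetical and not taken from the commit:

```javascript
// Hypothetical sketch: keep the main tweet plus every reply by the same
// author whose parent is already part of the chain.
function selfReplyChain(tweets, mainTweetId) {
  const main = tweets.find((t) => t.id === mainTweetId);
  if (!main) return [];

  // Oldest first, so a parent is always visited before its replies.
  const ordered = [...tweets].sort(
    (a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt)
  );

  const chainIds = new Set([main.id]);
  const chain = [main];
  for (const tweet of ordered) {
    if (tweet.id === main.id) continue;
    const sameAuthor = tweet.author === main.author;
    if (sameAuthor && chainIds.has(tweet.inReplyToId)) {
      chainIds.add(tweet.id);
      chain.push(tweet);
    }
  }
  return chain; // chronological: main tweet, then its self-replies
}
```

This sketch only walks downward from the main tweet. An author's reply to someone else's reply is excluded, because its parent is not in the chain.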

Also introduces shared helpers for future use by scrapePost:
- fetchTweetDetail() — GraphQL API caller
- parseTweetResult() — rich data extraction (text, media, article,
  card, external URLs, engagement stats)
- parseThreadFromEntries() — thread chain detection
- extractEntries(), unwrapResult(), getScreenName()

Fixes:
- screen_name moved from user.legacy to user.core in X's GraphQL schema
- Self-replies missing from API response for viral tweets (2000+ replies)
  now handled gracefully (returns available tweets)
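
To illustrate the screen_name fix and the full_text / note_tweet preference, here is a hedged sketch. The function names are hypothetical stand-ins rather than the commit's unwrapResult() and getScreenName(), and the nested response paths (`core.user_results.result`, `note_tweet.note_tweet_results.result.text`, a wrapping `tweet` property) are assumptions about the response shape that X may change at any time:

```javascript
// Assumption: some results wrap the actual tweet in a `tweet` property.
function unwrapTweet(result) {
  return result && result.tweet ? result.tweet : result;
}

// Prefer the newer user.core location, fall back to user.legacy.
function readScreenName(tweet) {
  const user = tweet?.core?.user_results?.result;
  return user?.core?.screen_name ?? user?.legacy?.screen_name ?? null;
}

// Prefer the long-form note text, fall back to legacy.full_text.
function readText(tweet) {
  return (
    tweet?.note_tweet?.note_tweet_results?.result?.text ??
    tweet?.legacy?.full_text ??
    ''
  );
}
```

Checking both locations keeps the parser working across either shape instead of failing when a field moves.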

Supersedes nirholas#12 which patches the DOM approach — this replaces it entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants