html-extraction

Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database. (This should only be temporary, to demonstrate how units can work.)

linked-open-data html-extraction

Updated Jul 27, 2020
HTML

jhontron6 / wordpress-bs4-theme-elements-scraper

WordPress BS4 theme extractor

python theme wordpress scraper beautifulsoup bs4 elements html-extraction website-structure-analysis cms-theme-replication ui-component-scraping page-template-reverse-engineering

Updated Dec 5, 2025

romny5 / reasonkit-web

🌐 Build high-performance web sensing and browser automation tools with ReasonKit Web, a Rust-native implementation for efficient solutions.

rust pdf screenshot async chromium tokio web-scraping developer-tools cdp web-automation chrome-devtools-protocol html-extraction headless-browser ai-agent llm-tools agent-tools model-context-protocol

Updated Jan 22, 2026
Rust

zanachka / python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js!

text-extraction html-extraction

Updated May 4, 2025
Python

zanachka / html-text

Extract text from HTML

text-extraction html-extraction

Updated Jan 10, 2026
HTML

zanachka / number-parser

Parse numbers written in natural language

text-extraction html-extraction

Updated Oct 25, 2024
Python

zanachka / price-parser

Extract price amount and currency symbol from a raw text string

text-extraction html-extraction

Updated Oct 6, 2025
Python

9dl / HTML-Dumper

Extracts and saves HTML, CSS, and JavaScript files from a specified URL.

web-scraping html-extraction

Updated Oct 14, 2024
C#

RayenMalouche / MCP-PDF-Extractor-server

A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.

java html pdf parser mcp extractor pdf-extractor html-extraction html-extractor pdf-extraction mcp-server modelcontextprotocol extractor-to-html

Updated Aug 30, 2025
Java

zanachka / jusText

Heuristic based boilerplate removal tool

text-extraction html-extraction

Updated Oct 21, 2020
Python

Here are 19 public repositories matching this topic...

miso-belica / sumy

bookieio / breadability

html-extract / hext

Whomrx666 / Xtract-htmlV2

Whomrx666 / Xtract-html

zanachka / article-extraction-benchmark

zanachka / extruct

reasonkit / reasonkit-web

zanachka / dateparser

shmdoc / unit-parser

jhontron6 / wordpress-bs4-theme-elements-scraper

romny5 / reasonkit-web

zanachka / python-readability

zanachka / html-text

zanachka / number-parser

zanachka / price-parser

9dl / HTML-Dumper

RayenMalouche / MCP-PDF-Extractor-server

zanachka / jusText
