Module for automatic summarization of text documents and HTML pages.
Reworked parsing library from https://www.readability.com/ (https://mercury.postlight.com/ is now the living alternative)
Domain-specific language for extracting structured data from HTML documents
Xtract-htmlV2 is a tool for fetching the HTML code of a website of your choice; it is the successor to the previous version
Xtract-html is a tool for extracting the HTML display code from a website; you can also use it on your own website.
Article extraction benchmark: dataset and evaluation scripts
Extract embedded metadata from HTML markup
High-performance MCP server for browser automation, web capture, and content extraction. Rust-powered CDP client for AI agents.
Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)
WordPress BS4 theme extractor
🌐 Build high-performance web sensing and browser automation tools with ReasonKit Web, a Rust-native implementation for efficient solutions.
Fast Python port of arc90's readability tool, updated to match the latest readability.js!
Parse numbers written in natural language
Extract price amount and currency symbol from a raw text string
Extracts and saves HTML, CSS, and JavaScript files from a specified URL.
A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.
Heuristic-based boilerplate removal tool