Relative API Extractor

Inspired by jobertabma/relative-url-extractor

Overview

endpoint_extractor is a modular, plugin-based tool that extracts API endpoints and path-like strings from arbitrary source code, configuration files, and binary-like formats such as assembly.
It is implemented in Python 3 for readability and maintainability, with a streaming core and an extensible plugin system.

Compared to naive regex approaches, this tool provides:

  • Streaming processing for very large files.
  • Unicode and escape decoding.
  • Smarter Base64 decoding (entropy, readability, and structural checks).
  • Language-aware parsing (Python AST, Java, Smali, Assembly).
  • Config parsing for JSON, YAML, and .env files.
  • Modular plugins for easy extension.

Overview

This program scans arbitrary source code or text files to identify and extract unique API endpoints or path-like strings. It detects strings enclosed in single (') or double (") quotes that begin with a forward slash (/). The tool supports both direct endpoint strings and obfuscated forms that use escape sequences such as hexadecimal (\xNN) and Unicode (\uNNNN) representations.

The output is a list of unique endpoints. Optionally, the tool can also display the full line where each endpoint is found.

This project improves upon naive regex-based approaches by offering:

  • Streaming processing of large files.
  • Unicode and escape decoding.
  • Base64 decoding with entropy and readability heuristics.
  • Language-specific parsers (Python, Java, Smali, Assembly).
  • Config and JSON parsing support.
  • Modular plugin architecture for easy extension.


Features

Core Engine

  • Streaming mode: Processes input line-by-line, without loading entire files into memory.
  • Deduplication: Tracks discovered endpoints to avoid duplicates.
  • Configurable output: Optionally show the original line that produced each endpoint.
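
The streaming and deduplication behaviour described above could be sketched roughly as follows. This is an illustration only: `extract_endpoints` and `iter_unique_endpoints` are hypothetical names, and the simple quoted-string regex stands in for whatever the real plugins do.

```python
import io
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical stand-in for the plugin layer: quoted strings that start with "/".
QUOTED_PATH = re.compile(r"""(["'])(/[^"'\s]*)\1""")


def extract_endpoints(line: str) -> Iterator[str]:
    """Yield every quoted, slash-prefixed string found on one line."""
    for match in QUOTED_PATH.finditer(line):
        yield match.group(2)


def iter_unique_endpoints(lines: Iterable[str]) -> Iterator[Tuple[str, int, str]]:
    """Stream (endpoint, line_number, line) tuples, skipping endpoints already seen."""
    seen = set()
    # Iterating a file object reads one line at a time, so memory use stays
    # proportional to the number of unique endpoints, not to the file size.
    for number, line in enumerate(lines, start=1):
        for endpoint in extract_endpoints(line):
            if endpoint not in seen:
                seen.add(endpoint)
                yield endpoint, number, line.rstrip("\n")


if __name__ == "__main__":
    sample = io.StringIO('const base = "/api";\nfetch("/api");\nconst users = "/users";\n')
    for endpoint, number, line in iter_unique_endpoints(sample):
        print(endpoint)
        print("-" * 48)
        print(f"{number}: {line}")
```

Any iterable of lines works as input, so `sys.stdin` or an open file can be passed in place of the `io.StringIO` sample.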

Escape and Obfuscation Handling

  • Decodes escaped strings (\xNN, \uNNNN, etc.).
  • Detects Base64-like strings and decodes them into endpoints.
  • Uses entropy and printable character ratio checks to minimize false positives.
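
As a rough sketch of the escape decoding step, the function below rewrites literal \xNN and \uNNNN sequences into the characters they encode. The name `decode_escapes` is an assumption made for this example, and the sketch does not handle surrogate pairs or other escape forms.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escapes(text: str) -> str:
    """Replace literal \\xNN and \\uNNNN sequences with the characters they encode."""

    def to_char(match: re.Match) -> str:
        # The captured group holds the hex digits; convert them to a code point.
        return chr(int(match.group(1), 16))

    text = _HEX_ESCAPE.sub(to_char, text)
    return _UNICODE_ESCAPE.sub(to_char, text)


if __name__ == "__main__":
    print(decode_escapes(r'"\x2f\x61\x70\x69"'))
    print(decode_escapes(r'"\u002fusers"'))
```

Decoding before pattern matching means an obfuscated literal is checked in the same way as a plain one.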

Supported Formats and Languages

  • General string literals: Detects quoted strings in any language.
  • JSON/config: Extracts string values that look like endpoints.
  • Python: Resolves concatenations, .format(), and % substitutions.
  • Java: Extracts quoted string literals.
  • Smali (Android bytecode): Extracts const-string instructions.
  • Assembly: Extracts .asciz inline string directives.
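
For the Smali and assembly cases, a line-oriented extraction might look like the sketch below. The patterns and the name `extract_path_literals` are assumptions for illustration; they cover const-string (including the const-string/jumbo form) and .asciz only.

```python
import re
from typing import List

# Smali: const-string vN, "literal"  (the /jumbo suffix is optional)
SMALI_CONST_STRING = re.compile(
    r'const-string(?:/jumbo)?\s+[vp]\d+\s*,\s*"((?:[^"\\]|\\.)*)"'
)
# Assembly: .asciz "literal"
ASM_ASCIZ = re.compile(r'\.asciz\s+"((?:[^"\\]|\\.)*)"')


def extract_path_literals(line: str) -> List[str]:
    """Return string literals on the line that start with a forward slash."""
    found = []
    for pattern in (SMALI_CONST_STRING, ASM_ASCIZ):
        for match in pattern.finditer(line):
            literal = match.group(1)
            if literal.startswith("/"):
                found.append(literal)
    return found


if __name__ == "__main__":
    print(extract_path_literals('const-string v0, "/smali/endpoint"'))
    print(extract_path_literals('.asciz "/asm/path"'))
```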

Supported Plugins

  • Simple string literals: Extracts quoted strings and checks for Base64.

  • JSON plugin: Extracts JSON values that look like endpoints.

  • Config plugin: Extracts from .env, YAML, and multi-line JSON.

  • Python plugin:

    • Resolves concatenations (a + b).
    • Supports .format() and % substitutions.
    • Evaluates f-strings.
    • Maintains a symbol table to resolve variables across lines.
  • Java plugin: Extracts string literals (grammar-based parsing can be enabled with tree-sitter).

  • Smali plugin: Extracts const-string assignments.

  • Assembly plugin: Extracts .asciz inline strings.
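
To illustrate how the Python plugin's static resolution could work, here is a minimal sketch built on the standard ast module. It is not the project's code: `resolve_str` and `find_endpoints` are hypothetical names, only top-level assignments are inspected, and anything that cannot be resolved statically is skipped without being executed.

```python
import ast
from typing import Dict, List, Optional

_FORMAT_ERRORS = (AttributeError, IndexError, KeyError, TypeError, ValueError)


def resolve_str(node: ast.AST, symbols: Dict[str, str]) -> Optional[str]:
    """Return the string an expression evaluates to, or None if it is not statically known."""
    if isinstance(node, ast.Constant):
        return node.value if isinstance(node.value, str) else None
    if isinstance(node, ast.Name):
        return symbols.get(node.id)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):  # a + b
        left, right = resolve_str(node.left, symbols), resolve_str(node.right, symbols)
        return left + right if left is not None and right is not None else None
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mod):  # "..." % args
        template = resolve_str(node.left, symbols)
        args = node.right.elts if isinstance(node.right, ast.Tuple) else [node.right]
        values = [resolve_str(arg, symbols) for arg in args]
        if template is None or None in values:
            return None
        try:
            return template % tuple(values)
        except _FORMAT_ERRORS:
            return None
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
            and node.func.attr == "format" and not node.keywords):  # "...".format(args)
        template = resolve_str(node.func.value, symbols)
        values = [resolve_str(arg, symbols) for arg in node.args]
        if template is None or None in values:
            return None
        try:
            return template.format(*values)
        except _FORMAT_ERRORS:
            return None
    if isinstance(node, ast.JoinedStr):  # f"..."
        parts = []
        for value in node.values:
            if isinstance(value, ast.FormattedValue):
                if value.conversion != -1 or value.format_spec is not None:
                    return None  # conversions and format specs are out of scope here
                value = value.value
            part = resolve_str(value, symbols)
            if part is None:
                return None
            parts.append(part)
        return "".join(parts)
    return None


def find_endpoints(source: str) -> List[str]:
    """Collect slash-prefixed strings assigned at module level, resolving known variables."""
    symbols: Dict[str, str] = {}
    endpoints: List[str] = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue
        value = resolve_str(stmt.value, symbols)
        if value is None:
            continue
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                symbols[target.id] = value  # symbol table for later lines
        if value.startswith("/") and value not in endpoints:
            endpoints.append(value)
    return endpoints
```

Because the tree is only walked and never evaluated, untrusted input is parsed but not run.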

  • Reads input from standard input (stdin).

  • Replaces non-ASCII characters with an underscore (_) for safe processing.

  • Decodes:

    • Hexadecimal escapes (e.g., "\x2f\x61\x70\x69" → /api).
    • Unicode escapes (e.g., "\u002fusers" → /users).
  • Regex-based extraction of string literals containing endpoints.

  • Ensures uniqueness of extracted endpoints.

  • Optional display of the complete line containing the endpoint using a command-line flag.
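
The regex-based extraction can be approximated in Python as shown below. This is a port written for illustration, reusing the character class from the expression quoted in the design notes; `ENDPOINT` and `endpoints_in_line` are hypothetical names.

```python
import re
from typing import List

# An opening quote, a slash-prefixed run of allowed characters,
# then the same quote character again (backreference \1).
ENDPOINT = re.compile(r"""(["'])(/[A-Za-z0-9?/&=#.!:_-]*)\1""")


def endpoints_in_line(line: str) -> List[str]:
    """Return every quoted endpoint on the line, in order of appearance."""
    return [match.group(2) for match in ENDPOINT.finditer(line)]


if __name__ == "__main__":
    print(endpoints_in_line('char *a = "/api/v1/data";'))
```

The backreference makes sure a string opened with a double quote is not closed by a single quote, and vice versa.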

Extensible Plugins

  • Each language or format is handled by a plugin.
  • Plugins can be registered dynamically, making it easy to add new language handlers.


Workflow

endpoint_extractor/
├── README.md
├── main.py               # Entry point
├── core/
│   ├── engine.py         # Core engine (streaming, dedupe, emit)
│   └── utils.py          # Escape decoding, Base64 heuristics, entropy
├── plugins/
│   ├── simple_literal.py # Quoted string & Base64
│   ├── json_plugin.py    # JSON/config
│   ├── python_plugin.py  # Python parser
│   ├── java_plugin.py    # Java
│   ├── smali_plugin.py   # Smali
│   └── asm_plugin.py     # Assembly
└── examples/
    └── sample_input.txt

  1. Input reading: The entire input is read from standard input and stored in memory.
  2. Sanitization: Any non-ASCII characters are replaced with _.
  3. Escape decoding:

    • \xNN sequences are decoded into their ASCII equivalents.
    • \uNNNN sequences are decoded if within the ASCII range; otherwise replaced with _.
  4. Regex scanning: Each line of the input is checked against a POSIX regular expression that identifies quoted strings starting with /.
  5. Uniqueness filtering: Endpoints already printed are skipped.
  6. Output:

    • The endpoint itself is always printed.
    • If the program is invoked with --show-line, the entire source line is also displayed under a separator.

Installation

Clone the repository:

git clone https://github.com/0xme32/endpoint-extractor.git
cd endpoint-extractor

Run with Python 3.9+ (no external dependencies required):

python3 main.py < input_file


Usage

Compilation

gcc -o main main.c -Wall -O2 -std=c11

Basic execution

cat source_file.js | ./main

With line output

cat source_file.py | ./main --show-line

Example

Input

const base = "/api";
const users = base + "/users";
const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login
{"service":"my","url":"/internal/health"}
path = "/v1/{}/info".format("status")
const-string v0, "/smali/endpoint"
.asciz "/asm/path"

Output (with line context)

/api
------------------------------------------------
1: const base = "/api";

/users
------------------------------------------------
2: const users = base + "/users";

/api/v1/login
------------------------------------------------
3: const obf = "L2FwaS92MS9sb2dpbg=="; // Base64 encoded: /api/v1/login

/internal/health
------------------------------------------------
4: {"service":"my","url":"/internal/health"}

/v1/status/info
------------------------------------------------
5: path = "/v1/{}/info".format("status")

/smali/endpoint
------------------------------------------------
6: const-string v0, "/smali/endpoint"

/asm/path
------------------------------------------------
7: .asciz "/asm/path"

Input (C sample)

```c
char *a = "/api/v1/data";
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73"; // "/api/users"
char *c = "\u002fitems";
```

Output (without --show-line)

/api/v1/data
/api/users
/items

Output (with --show-line)

/api/v1/data
------------------------------------------------
char *a = "/api/v1/data";
/api/users
------------------------------------------------
char *b = "\x2f\x61\x70\x69\x2f\x75\x73\x65\x72\x73";
/items
------------------------------------------------
char *c = "\u002fitems";

Design Notes

  • Core manages deduplication and streaming.

  • Plugins handle syntax-specific parsing.

  • Python plugin uses the ast module to safely evaluate partial string expressions.

  • Config plugin supports multiple formats:

    • .env (KEY=VALUE)
    • JSON (multi-line)
    • YAML (if pyyaml is available).
  • Core engine: Manages streaming, deduplication, and output formatting.
  • Plugins: Encapsulate language-specific heuristics, making the system modular.
  • Base64 heuristics:
    • Validates input format.
    • Decodes safely.
    • Requires decoded string to have a high printable ratio.
    • Requires decoded string to contain endpoint-like substrings.
  • Python plugin: Uses ast module for safe parsing and partial evaluation of string expressions.
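
The four Base64 checks listed above could be combined as in the following sketch. The thresholds (a minimum length of 8 and a printable ratio of 0.95), the endpoint pattern, and the name `decode_base64_endpoint` are assumptions chosen for illustration, not values taken from the project.

```python
import base64
import binascii
import re
import string
from typing import Optional

_B64_SHAPE = re.compile(r"[A-Za-z0-9+/]+={0,2}")
_ENDPOINT_LIKE = re.compile(r"/[A-Za-z0-9_.-]+")
# Printable characters without tabs, newlines, and other control whitespace.
_PRINTABLE = set(string.printable) - set("\t\n\r\x0b\x0c")


def decode_base64_endpoint(candidate: str, min_printable: float = 0.95) -> Optional[str]:
    """Return the decoded text if the candidate looks like a Base64-encoded endpoint."""
    # 1. Validate the input format: alphabet, padding, and length.
    if len(candidate) < 8 or len(candidate) % 4 != 0:
        return None
    if not _B64_SHAPE.fullmatch(candidate):
        return None
    # 2. Decode safely: reject malformed Base64 and bytes that are not UTF-8.
    try:
        text = base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, ValueError):
        return None
    if not text:
        return None
    # 3. Require a high ratio of printable characters.
    printable_ratio = sum(ch in _PRINTABLE for ch in text) / len(text)
    if printable_ratio < min_printable:
        return None
    # 4. Require an endpoint-like substring.
    return text if _ENDPOINT_LIKE.search(text) else None


if __name__ == "__main__":
    print(decode_base64_endpoint("L2FwaS92MS9sb2dpbg=="))
```

Ordinary identifiers often happen to be valid Base64, so the later checks are what keep false positives down.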

  • Regex choice: The POSIX regular expression ([^\n]*?([\"'])(/[A-Za-z0-9\?\/&=#\.\!:_-]*)(\2).*) was selected to balance readability and functionality.

  • Memory management: Endpoints are dynamically allocated and tracked in an array, then freed before program termination.

  • Scalability: The program uses fixed-size buffers (MAX_CONTENT, MAX_MATCHES), which can be increased as needed.

  • Limitations:

    • Does not resolve variable concatenation (e.g., "/api" + "/users").
    • Only basic Unicode escapes are supported; characters outside ASCII are replaced with _.
    • Base64 or custom obfuscation methods are not decoded in this version.


License

MIT License.


Author

Made by 0xme