Unauthenticated exponential-complexity DoS via filepath_globpattern on the document-store / RAG REST endpoints #241

@geo-chen

Steps to reproduce

Summary

The Pathway document-store / RAG REST servers accept a filepath_globpattern field in the request body and compile it into a custom globmatch JMESPath expression that is evaluated against every indexed document's path. The glob matcher recurses on two children for each ** segment with no memoization, giving O(2^k) behavior in the number of ** segments. A short, unauthenticated request containing a pattern with many ** segments and a non-matching suffix pins a worker CPU core for minutes or longer per indexed document, denying service. Confirmed against the real matcher: 14 ** segments produce roughly 40 million recursive calls and exceed 30 seconds.

Details

python/pathway/stdlib/ml/classifiers/_knn_lsh.py, _globmatch_impl (around lines 101 to 113): for a ** segment it recurses on two children (consume vs. descend) with no memoization, so matching a pattern with k ** segments against a non-matching path is exponential.
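To make the blow-up concrete, here is a minimal standalone sketch of an unmemoized matcher with the shape described above. It is a reimplementation for illustration, not a copy of the Pathway source: the names `naive_globmatch` and `count_calls`, the call counter, and the placeholder path segments are all assumptions. It only counts recursive calls locally and sends nothing to a server.

```python
import fnmatch


def naive_globmatch(pattern_parts, path_parts, pat_i=0, p_i=0, stats=None):
    """Unmemoized segment-wise matcher; each '**' branches into two calls."""
    if stats is not None:
        stats["calls"] += 1
    if pat_i == len(pattern_parts):
        # Pattern exhausted: match only if the path is exhausted too.
        return p_i == len(path_parts)
    if p_i == len(path_parts):
        return False
    if pattern_parts[pat_i] == "**":
        # Branch 1: '**' consumes one path segment and stays active.
        # Branch 2: '**' matches nothing and the pattern advances.
        # On a non-matching path both branches are always explored.
        return naive_globmatch(
            pattern_parts, path_parts, pat_i, p_i + 1, stats
        ) or naive_globmatch(pattern_parts, path_parts, pat_i + 1, p_i, stats)
    if fnmatch.fnmatch(path_parts[p_i], pattern_parts[pat_i]):
        return naive_globmatch(pattern_parts, path_parts, pat_i + 1, p_i + 1, stats)
    return False


def count_calls(num_stars, path_depth):
    """Count recursive calls for num_stars '**' segments plus a non-matching suffix."""
    pattern = ["**"] * num_stars + ["no-such-file"]  # hypothetical suffix
    path = ["d"] * path_depth  # hypothetical path of path_depth segments
    stats = {"calls": 0}
    matched = naive_globmatch(pattern, path, stats=stats)
    return matched, stats["calls"]


if __name__ == "__main__":
    for k in (2, 4, 6, 8, 10):
        matched, calls = count_calls(k, k)
        print(f"{k} '**' segments, depth-{k} path: matched={matched}, calls={calls}")
```

In this sketch the call count grows with both the number of ** segments and the depth of the path being matched, so deeper indexed paths make the same pattern more expensive.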

The pattern is attacker-controlled and uncapped. filepath_globpattern is a request column in InputsQuerySchema / RetrieveQuerySchema (python/pathway/xpacks/llm/document_store.py ~lines 236 to 262), bound to the unauthenticated REST routes /v1/inputs and /v1/retrieve (document_store.py ~lines 107 to 117) and /v2/answer (question_answering.py ~line 422). _get_jmespath_filter (document_store.py ~lines 34 to 42) builds globmatch('<filepath_globpattern>', path) with only quote sanitization (no length or **-segment cap) and runs it via jmespath.search against every indexed document path. PathwayWebserver (io/http/_server.py) has no authentication.

PoC

Impact

An unauthenticated client can send one small request body and exhaust a server CPU core for anywhere from minutes to effectively indefinitely (the cost is compounded across the indexed corpus); a few such requests deny service to the document-store / RAG endpoints. The default configuration requires no authentication.

Remediation

Cap the number of ** segments (and total length) of filepath_globpattern at the request boundary, and rewrite _globmatch_impl to be non-exponential: memoize on (pattern_index, path_index) (dynamic programming) so matching is polynomial, or compile the glob to a linear-time automaton. Consider authenticating the REST endpoints and rejecting overly complex filter expressions.
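A rough sketch of the first two suggestions, assuming the same segment-wise semantics as the matcher described in Details. The function names and both limits (`MAX_PATTERN_LENGTH`, `MAX_DOUBLE_STAR_SEGMENTS`) are hypothetical and would need tuning; this is not a patch against the actual Pathway code.

```python
import fnmatch
from functools import lru_cache

MAX_PATTERN_LENGTH = 256      # hypothetical limit, not a Pathway setting
MAX_DOUBLE_STAR_SEGMENTS = 4  # hypothetical limit, not a Pathway setting


def validate_globpattern(pattern):
    """Reject over-long patterns and patterns with too many '**' segments."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("glob pattern too long")
    if pattern.split("/").count("**") > MAX_DOUBLE_STAR_SEGMENTS:
        raise ValueError("too many '**' segments in glob pattern")
    return pattern


def memoized_globmatch(pattern, path):
    """Segment-wise glob match, memoized on (pattern_index, path_index)."""
    pattern_parts = tuple(pattern.split("/"))
    path_parts = tuple(path.split("/"))
    pat_n, p_n = len(pattern_parts), len(path_parts)

    @lru_cache(maxsize=None)
    def match(pat_i, p_i):
        if pat_i == pat_n:
            return p_i == p_n
        if p_i == p_n:
            return False
        if pattern_parts[pat_i] == "**":
            # Same two branches as the naive recursion, but each
            # (pat_i, p_i) state is evaluated at most once.
            return match(pat_i, p_i + 1) or match(pat_i + 1, p_i)
        if fnmatch.fnmatch(path_parts[p_i], pattern_parts[pat_i]):
            return match(pat_i + 1, p_i + 1)
        return False

    return match(0, 0)


if __name__ == "__main__":
    # A pattern that is costly for an unmemoized matcher returns quickly here.
    heavy = "/".join(["**"] * 40 + ["no-such-file"])
    print(memoized_globmatch(heavy, "/".join(["d"] * 40)))
    try:
        validate_globpattern(heavy)
    except ValueError as exc:
        print("rejected:", exc)
```

With memoization the number of evaluated states is bounded by (pattern segments + 1) × (path segments + 1), so the cap at the request boundary becomes defense in depth rather than the only safeguard.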

Relevant log output

-

What did you expect to happen?

Version

0.31.1

Docker Versions (if used)

No response

OS

Linux

On which CPU architecture did you run the Pathway Live Data Framework?

None

Labels

bug
