Steps to reproduce
Summary
The Pathway document-store / RAG REST servers accept a filepath_globpattern field in the request body and compile it into a custom globmatch JMESPath expression that is evaluated against every indexed document's path. The glob matcher recurses on two children for each ** segment with no memoization, so matching takes O(2^k) time in the number k of ** segments. A short, unauthenticated request containing a pattern with many ** segments and a non-matching suffix pins a worker CPU core for minutes or longer per indexed document, denying service. Confirmed against the real matcher: 14 ** segments produce roughly 40 million recursive calls and exceed 30 seconds.
Details
python/pathway/stdlib/ml/classifiers/_knn_lsh.py, _globmatch_impl (around lines 101 to 113): for a ** segment it recurses on two children (consume vs. descend) with no memoization, so matching a pattern with k ** segments against a non-matching path is exponential.
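The recursion shape described above can be illustrated with a toy matcher. This is a minimal sketch and not Pathway's `_globmatch_impl`: the function names `naive_globmatch` and `count_calls`, the segment-wise splitting, and the use of `fnmatchcase` for non-`**` segments are all assumptions made for illustration. It only shows how two unmemoized children per `**` segment make the call count grow rapidly when the suffix never matches.

```python
from fnmatch import fnmatchcase


def naive_globmatch(pat, parts, calls):
    """Toy segment-wise glob matcher WITHOUT memoization (illustrative only)."""
    calls[0] += 1  # count every recursive call
    if not pat:
        return not parts
    if pat[0] == "**":
        # Child 1: ** matches zero segments, continue with the rest of the pattern.
        if naive_globmatch(pat[1:], parts, calls):
            return True
        # Child 2: ** consumes one path segment and stays in place.
        return bool(parts) and naive_globmatch(pat, parts[1:], calls)
    if not parts:
        return False
    # Ordinary segment: must match the current path segment.
    return fnmatchcase(parts[0], pat[0]) and naive_globmatch(pat[1:], parts[1:], calls)


def count_calls(k, path_len=10):
    """Count calls for k ** segments followed by a suffix that never matches."""
    calls = [0]
    pat = ["**"] * k + ["nomatch"]   # hypothetical non-matching suffix
    parts = ["a"] * path_len         # hypothetical document path segments
    assert not naive_globmatch(pat, parts, calls)
    return calls[0]


if __name__ == "__main__":
    for k in (2, 4, 6, 8):
        print(k, count_calls(k))
```

In this toy model the same (pattern position, path position) pairs are revisited over and over, which is the redundancy that memoization would remove.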
The pattern is attacker-controlled and uncapped. filepath_globpattern is a request column in InputsQuerySchema / RetrieveQuerySchema (python/pathway/xpacks/llm/document_store.py ~lines 236 to 262), bound to the unauthenticated REST routes /v1/inputs and /v1/retrieve (document_store.py ~lines 107 to 117) and /v2/answer (question_answering.py ~line 422). _get_jmespath_filter (document_store.py ~lines 34 to 42) builds globmatch('<filepath_globpattern>', path) with only quote sanitization (no length or **-segment cap) and runs it via jmespath.search against every indexed document path. PathwayWebserver (io/http/_server.py) has no authentication.
PoC
Impact
An unauthenticated client can send one small request body and keep a server CPU core busy for minutes, or effectively indefinitely, with the cost multiplied across the indexed corpus; a few such requests are enough to deny service to the document-store / RAG endpoints. No authentication is required in the default configuration.
Remediation
Cap the number of ** segments (and total length) of filepath_globpattern at the request boundary, and rewrite _globmatch_impl to be non-exponential: memoize on (pattern_index, path_index) (dynamic programming) so matching is polynomial, or compile the glob to a linear-time automaton. Consider authenticating the REST endpoints and rejecting overly complex filter expressions.
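One possible shape for the two mitigations above, as a sketch under stated assumptions rather than a patch for Pathway: `validate_glob_pattern`, `globmatch_memo`, and the two limits are hypothetical names and values, and the glob semantics (`**` matches zero or more path segments, other segments are matched with `fnmatchcase`) are assumed and may differ from the real matcher.

```python
from fnmatch import fnmatchcase
from functools import lru_cache

MAX_PATTERN_LENGTH = 256      # hypothetical limit, tune for the deployment
MAX_DOUBLE_STAR_SEGMENTS = 4  # hypothetical limit, tune for the deployment


def validate_glob_pattern(pattern: str) -> str:
    """Reject patterns that are too long or contain too many ** segments."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("glob pattern too long")
    if pattern.split("/").count("**") > MAX_DOUBLE_STAR_SEGMENTS:
        raise ValueError("too many ** segments in glob pattern")
    return pattern


def globmatch_memo(pattern: str, path: str) -> bool:
    """Match path against pattern, memoizing on (pattern_index, path_index)."""
    pat = pattern.split("/")
    parts = path.split("/")

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        if i == len(pat):
            return j == len(parts)
        if pat[i] == "**":
            # ** matches zero segments, or consumes one segment and stays.
            return match(i + 1, j) or (j < len(parts) and match(i, j + 1))
        # Ordinary segment: must match the current path segment.
        return (
            j < len(parts)
            and fnmatchcase(parts[j], pat[i])
            and match(i + 1, j + 1)
        )

    return match(0, 0)
```

Because each `(i, j)` pair is evaluated at most once, the work is bounded by the number of pattern segments times the number of path segments, so even a pattern with many `**` segments and a non-matching suffix returns quickly. Validating at the request boundary remains useful as defense in depth, since it rejects abusive patterns before any matching runs.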
Relevant log output
What did you expect to happen?
Version
0.31.1
Docker Versions (if used)
No response
OS
Linux
On which CPU architecture did you run the Pathway Live Data Framework?
None