Skip to content

Parser: fix exponential parse time on nested function-call arguments#2378

Open
altmannmarcelo wants to merge 1 commit into
apache:mainfrom
readysettech:parse-function-args-exponential
Open

Parser: fix exponential parse time on nested function-call arguments#2378
altmannmarcelo wants to merge 1 commit into
apache:mainfrom
readysettech:parse-function-args-exponential

Conversation

@altmannmarcelo

Copy link
Copy Markdown
Contributor

Summary

Parser::parse_function_args runs in O(2^n) time on n-deep nested function-call arguments (e.g. replace(replace(replace(...)))) for any dialect where supports_named_fn_args_with_expr_name() is true — notably PostgreSQL and MSSQL. A 20-level chain takes ~43s; a 30-level chain would take ~12 hours. The fix makes parse time linear.

Root cause

To support expression-named arguments like JSON_OBJECT(a + b : c), parse_function_args speculatively parses the entire argument as an expression to see whether a named-argument operator (=>, :=, :) follows:

let arg = if self.dialect.supports_named_fn_args_with_expr_name() {
    self.maybe_parse(|p| {
        let name = p.parse_expr()?;                         // parse the WHOLE argument
        let operator = p.parse_function_named_arg_operator()?;  // needs =>/:=/:
        let arg = p.parse_wildcard_expr()?.into();
        Ok(FunctionArg::ExprNamed { name, arg, operator })
    })?
} else { ... };
if let Some(arg) = arg { return Ok(arg); }
let wildcard_expr = self.parse_wildcard_expr()?;            // RE-PARSE the same argument

For a plain positional argument that is itself a function call, the speculative parse_expr() consumes the whole nested subtree, finds no named-arg operator, maybe_parse rewinds, and parse_wildcard_expr() parses the identical subtree again. Each nesting level parses its child twice: T(n) = 2*T(n-1)O(2^n).

Introduced in commit 92be237 — "Add support for MSSQL's JSON_ARRAY/JSON_OBJECT expr (#1507)" — which changed named-argument detection from a single-token parse_identifier lookahead to this speculative full-expression parse.

Fix

Parse the leading expression once, then check for a named-argument operator (an O(1) token check). If one follows, it is a named argument; otherwise the same already-parsed expression is the positional argument. The positional tail (* EXCLUDE, expr AS alias) is factored into parse_unnamed_function_arg, shared by both paths. No re-parse; behavior is otherwise unchanged.

Evidence

Pathological input: SELECT replace(replace(... x ..., 'a','b'), 'a','b'), PostgreSQL dialect. Parse time per nesting depth, before vs after the fix (milliseconds):

depth before after
10 48 ms 0.025 ms
20 42,600 ms 0.049 ms
30 ~43,000,000 ms (extrapolated, not run) 0.071 ms

Both columns are parse-only (min of 3 runs). Before: parse time doubles per level — O(2^n) (additional measured points: depth 15 = 1,350 ms, depth 18 = 10,600 ms). After: linear, ~0.0024 ms per level. At depth 20 that is roughly an 870,000x speedup. The depth-30 case was not run before the fix; extrapolating the doubling (42,600 ms x 2^10) puts it near 12 hours.

On dialects with expression-named function arguments
(supports_named_fn_args_with_expr_name, e.g. PostgreSQL and MSSQL),
parse_function_args speculatively parsed the full argument expression to
detect named-argument syntax (name => value), then rewound and re-parsed
it as a positional argument when no operator followed. For deeply nested
function-call arguments such as replace(replace(replace(...))), this
doubled the work at every nesting level, giving O(2^n) parse time.

Measured parse time for an n-deep replace() chain (PostgreSQL) roughly
doubles per level: depth 5 ~0.002s, depth 10 ~0.05s, depth 15 ~1.3s,
depth 20 ~43s. A 30-level chain was not run; extrapolating the same
doubling puts it near 12 hours.

Parse the leading expression once, then check for a named-argument
operator: when one follows it is a named argument, otherwise the same
expression is the positional argument. The shared positional-argument
tail (wildcard options, AS aliasing) moves into parse_unnamed_function_arg.

test_reserved_keywords_for_identifiers previously asserted that
expression-named dialects report 'found: interval' for SELECT
MAX(interval); that message was an artifact of the second parse hitting
the position-keyed failure cache. With a single parse the error matches
every other dialect: 'found: )'. Adds a regression test and a benchmark.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant