fix: OpenCypher aggregation with CASE returns multiple rows (#3858)#3859
fix: OpenCypher aggregation with CASE returns multiple rows (#3858)#3859
Conversation
The expression parser's recursive CASE detection found CaseExpression nodes nested inside function arguments (e.g. sum(CASE...)) and returned the bare CASE, discarding the outer aggregation wrapper. This caused the planner to treat the expression as a non-aggregation grouping key, splitting results into multiple rows instead of one. Add text-length guards to tryParseSpecialFunctions and parseExpressionFromText so a found CASE context is only used when it covers the full expression, matching the pattern already used for reduce and pattern comprehension. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Code ReviewGood fix for the reported issue. The root cause is well identified: Approach consistencyThe length-based guard Minor concern: magic number
|
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 4 |
TIP This summary will be updated as you push new changes. Give us feedback
There was a problem hiding this comment.
Code Review
This pull request addresses a parsing bug (issue #3858) where CASE expressions nested inside aggregations were mis-parsed as the primary expression, leading to incorrect implicit grouping. The fix introduces a length guard in CypherExpressionBuilder and ExpressionTypeDetector to ensure that CASE expressions are only matched when they span the full expression text. Regression tests were added to confirm the fix. Review feedback indicates that the same length guard should be applied to EXISTS and shortestPath expressions to prevent similar mis-parsing issues, and notes a minor efficiency improvement regarding the placement of the node.getText() call.
| final String nodeText = node.getText(); | ||
| final Cypher25Parser.CaseExpressionContext caseCtx = findCaseExpressionRecursive(node); | ||
| if (caseCtx != null) | ||
| if (caseCtx != null && caseCtx.getText().length() >= nodeText.length() - 2) | ||
| return parseCaseExpression(caseCtx); | ||
|
|
||
| final Cypher25Parser.ExtendedCaseExpressionContext extCaseCtx = findExtendedCaseExpressionRecursive(node); | ||
| if (extCaseCtx != null) | ||
| if (extCaseCtx != null && extCaseCtx.getText().length() >= nodeText.length() - 2) | ||
| return parseExtendedCaseExpression(extCaseCtx); |
There was a problem hiding this comment.
The node.getText() call is performed eagerly, which is inefficient if no CASE expression is found in the subtree. More importantly, the recursive search logic used here is also applied to EXISTS (line 258) and shortestPath expressions, but they lack the length guard. This means expressions like count(EXISTS { ... }) or my_func(shortestPath(...)) will still be mis-parsed as just the inner special expression, losing the outer wrapper and potentially causing incorrect query results. You should apply the same guard to all special expressions that use recursive search to ensure consistent and correct parsing.
There was a problem hiding this comment.
Applied the length guard to EXISTS in both tryParseSpecialFunctions and parseExpressionFromText, and to shortestPath in tryParseSpecialFunctions. Also consolidated the exprText/nodeText variable to the top with an inline comment explaining the - 2 tolerance. All 5147 Cypher tests still pass.
| final String exprText = ctx.getText(); | ||
| final Cypher25Parser.CaseExpressionContext caseCtx = builder.findCaseExpressionRecursive(ctx); | ||
| if (caseCtx != null) | ||
| if (caseCtx != null && caseCtx.getText().length() >= exprText.length() - 2) | ||
| return builder.parseCaseExpression(caseCtx); | ||
|
|
||
| final Cypher25Parser.ExtendedCaseExpressionContext extCaseCtx = builder.findExtendedCaseExpressionRecursive(ctx); | ||
| if (extCaseCtx != null) | ||
| if (extCaseCtx != null && extCaseCtx.getText().length() >= exprText.length() - 2) | ||
| return builder.parseExtendedCaseExpression(extCaseCtx); |
There was a problem hiding this comment.
The length guard added here for CASE expressions should also be applied to EXISTS (line 58) and shortestPath (line 75) expressions. These expressions use the same recursive search mechanism and are susceptible to the same mis-parsing bug when nested inside other functions (e.g., count(EXISTS { ... })). Applying the guard consistently across all special functions in tryParseSpecialFunctions will prevent similar issues for those expression types and ensure the parser correctly identifies the top-level expression.
There was a problem hiding this comment.
Fixed. Applied the length guard to EXISTS (line 58) and shortestPath (line 75) in tryParseSpecialFunctions, and moved the exprText variable above all checks with a block comment explaining the rationale and the - 2 tolerance. 5147 Cypher tests pass.
Extend the same text-length guard added for CASE to EXISTS and shortestPath detection in both tryParseSpecialFunctions and parseExpressionFromText. These use the same recursive descent pattern and are susceptible to the same class of mis-parsing when nested inside function arguments. Add inline comments explaining the - 2 tolerance (whitespace that ANTLR's getText() strips). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
test comment - please ignore |
|
test2 - ignore |
|
test3 - ignore |
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewArcadeDB is a Multi-Model DBMS (Database Management System) built for extreme performance. It's a Java-based project that supports multiple data models (Graph, Document, Key/Value, Search Engine, Time Series, Vector Embedding) and query languages (SQL, Cypher, Gremlin, GraphQL, MongoDB Query Language). Response Formatting
Project InstructionsBefore writing any code:
General design principles:
Build and Development CommandsMaven (Java)
Studio Frontend (Node.js)
Server Operations
Distribution BuilderThe modular distribution builder ( Production builds (download from releases): cd package
./arcadedb-builder.sh --version=26.1.0 --modules=gremlin,studioDevelopment builds (use local Maven repository): # Build modules first
mvn clean install -DskipTests
# Create distribution with local modules
cd package
VERSION=$(mvn -f ../pom.xml help:evaluate -Dexpression=project.version -q -DforceStdout)
./arcadedb-builder.sh \
--version=$VERSION \
--modules=console,gremlin,studio \
--local-repo \
--skip-dockerTesting the builder: cd package
./test-builder-local.shTesting Commands
Codebase Navigation MapANTLR Grammars
SQL Engine Key Files
OpenCypher Engine Key Files
Graph Engine
Server / HTTP
Test Locations (by module)
Architecture OverviewCore Modules
Wire Protocol Modules
Key Engine Components
Server Components
Development GuidelinesJava Version
Code Structure
Testing Approach
Database Features to Consider
Common Development TasksAdding New Features
Working with Indexes
Query Development
Server Development
Wire Protocol Module Dependencies
Important Notes
|
1 similar comment
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewArcadeDB is a Multi-Model DBMS (Database Management System) built for extreme performance. It's a Java-based project that supports multiple data models (Graph, Document, Key/Value, Search Engine, Time Series, Vector Embedding) and query languages (SQL, Cypher, Gremlin, GraphQL, MongoDB Query Language). Response Formatting
Project InstructionsBefore writing any code:
General design principles:
Build and Development CommandsMaven (Java)
Studio Frontend (Node.js)
Server Operations
Distribution BuilderThe modular distribution builder ( Production builds (download from releases): cd package
./arcadedb-builder.sh --version=26.1.0 --modules=gremlin,studioDevelopment builds (use local Maven repository): # Build modules first
mvn clean install -DskipTests
# Create distribution with local modules
cd package
VERSION=$(mvn -f ../pom.xml help:evaluate -Dexpression=project.version -q -DforceStdout)
./arcadedb-builder.sh \
--version=$VERSION \
--modules=console,gremlin,studio \
--local-repo \
--skip-dockerTesting the builder: cd package
./test-builder-local.shTesting Commands
Codebase Navigation MapANTLR Grammars
SQL Engine Key Files
OpenCypher Engine Key Files
Graph Engine
Server / HTTP
Test Locations (by module)
Architecture OverviewCore Modules
Wire Protocol Modules
Key Engine Components
Server Components
Development GuidelinesJava Version
Code Structure
Testing Approach
Database Features to Consider
Common Development TasksAdding New Features
Working with Indexes
Query Development
Server Development
Wire Protocol Module Dependencies
Important Notes
|
Code ReviewClean fix for issue 3858. The length-guard approach is consistent with established patterns in ExpressionTypeDetector. Strengths
Concerns1. count(*) is not guarded (minor risk) In tryParseSpecialFunctions, the countStar check has no length guard. If count() ever appears as a sub-expression inside another construct (e.g., coalesce(count(), 0)), this would have the same mis-parsing issue that CASE had. Worth adding a guard for consistency and future safety, even if no current bug exists. 2. The -2 tolerance comment could be more precise The comment says the -2 "allows for whitespace that ANTLR's getText() strips." Since getText() strips whitespace on both sides, both strings are already whitespace-free - so the tolerance is covering something else (likely minor structural differences in how parent nodes represent their text boundary). The comment matches the pattern already in tryParseListPredicates so it is consistent, but a more precise explanation would help future maintainers understand when -2 is the right threshold. 3. Nit - unnecessary boolean cast in tests assertThat((boolean) results.hasNext()).isTrue() - the explicit boolean cast is unnecessary since hasNext() already returns a primitive boolean. Bottom lineThe fix is correct and well-tested. The main suggestion before merge: add the length guard to count(*) for completeness. Everything else is minor. |
Summary
sum(CASE WHEN ... END)as a bareCaseExpressioninstead ofFunctionCallExpression("sum", [CaseExpression]), which caused the planner to treat it as a non-aggregation grouping key and split results into multiple rowstryParseSpecialFunctionsandparseExpressionFromTextso recursive CASE detection only matches when the CASE covers the full expression (same pattern already used for reduce/pattern comprehension)Closes #3858
Test plan
aggregationWithCaseNoImplicitGroupBy- exact reproduction of the reported queryaggregationWithCaseAndGroupByKey- CASE inside aggregation alongside a real grouping key🤖 Generated with Claude Code