fix: OpenCypher aggregation with CASE returns multiple rows (#3858) #3859

Open
robfrank wants to merge 2 commits into main from fix/3858-cypher-aggregation-case-groupby

Conversation

@robfrank
Collaborator

Summary

  • Fix expression parser mis-parsing sum(CASE WHEN ... END) as a bare CaseExpression instead of FunctionCallExpression("sum", [CaseExpression]), which caused the planner to treat it as a non-aggregation grouping key and split results into multiple rows
  • Add text-length guards to tryParseSpecialFunctions and parseExpressionFromText so recursive CASE detection only matches when the CASE covers the full expression (same pattern already used for reduce/pattern comprehension)
  • Add two regression tests: pure aggregation with CASE (single-row result) and mixed aggregation+grouping with CASE

Closes #3858

Test plan

  • New test aggregationWithCaseNoImplicitGroupBy - exact reproduction of the reported query
  • New test aggregationWithCaseAndGroupByKey - CASE inside aggregation alongside a real grouping key
  • All 8 CypherCaseTest tests pass
  • All 5147 Cypher/OpenCypher engine tests pass (0 failures)
  • All 62 aggregation-related tests pass

🤖 Generated with Claude Code

The expression parser's recursive CASE detection found CaseExpression
nodes nested inside function arguments (e.g. sum(CASE...)) and returned
the bare CASE, discarding the outer aggregation wrapper. This caused
the planner to treat the expression as a non-aggregation grouping key,
splitting results into multiple rows instead of one.

Add text-length guards to tryParseSpecialFunctions and
parseExpressionFromText so a found CASE context is only used when it
covers the full expression, matching the pattern already used for
reduce and pattern comprehension.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
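
The failure mode described in this commit message can be reproduced on a toy tree. The following is a minimal sketch, not the ArcadeDB parser: the `Node` record, the rule names (`case`, `functionCall`, `parenthesized`) and the `detect...` helpers are hypothetical stand-ins for the ANTLR contexts, and only the length comparison mirrors the guard quoted in this PR.

```java
import java.util.List;

public class CaseGuardSketch {

  // Hypothetical stand-in for a parse-tree node: leaves carry token text, rules carry children.
  record Node(String kind, String text, List<Node> children) {
    static Node leaf(final String text) {
      return new Node("token", text, List.of());
    }

    static Node rule(final String kind, final Node... children) {
      return new Node(kind, null, List.of(children));
    }

    // Concatenated token text with no whitespace, similar in spirit to getText() on a parse tree.
    String getText() {
      if (children.isEmpty())
        return text;
      final StringBuilder sb = new StringBuilder();
      for (final Node child : children)
        sb.append(child.getText());
      return sb.toString();
    }
  }

  // Depth-first search for the first node of the given kind, at any depth.
  static Node findRecursive(final Node node, final String kind) {
    if (node.kind().equals(kind))
      return node;
    for (final Node child : node.children()) {
      final Node found = findRecursive(child, kind);
      if (found != null)
        return found;
    }
    return null;
  }

  // Unguarded detection: any nested CASE wins, so the outer wrapper is lost.
  static String detectWithoutGuard(final Node expr) {
    return findRecursive(expr, "case") != null ? "case" : expr.kind();
  }

  // Guarded detection: the CASE only wins when its text covers (almost) the whole expression.
  static String detectWithGuard(final Node expr) {
    final Node caseNode = findRecursive(expr, "case");
    if (caseNode != null && caseNode.getText().length() >= expr.getText().length() - 2)
      return "case";
    return expr.kind();
  }

  static Node caseExpr() {
    return Node.rule("case", Node.leaf("CASE"), Node.leaf("WHEN"), Node.leaf("x"), Node.leaf("THEN"),
        Node.leaf("1"), Node.leaf("ELSE"), Node.leaf("0"), Node.leaf("END"));
  }

  static Node sumOfCase() {
    return Node.rule("functionCall", Node.leaf("sum"), Node.leaf("("), caseExpr(), Node.leaf(")"));
  }

  static Node parenthesizedCase() {
    return Node.rule("parenthesized", Node.leaf("("), caseExpr(), Node.leaf(")"));
  }

  public static void main(final String[] args) {
    System.out.println(detectWithoutGuard(sumOfCase()));      // case (wrapper lost)
    System.out.println(detectWithGuard(sumOfCase()));         // functionCall
    System.out.println(detectWithGuard(parenthesizedCase())); // case
  }
}
```

In this sketch the guarded version keeps `sum(...)` as the top-level expression, while a CASE wrapped only in parentheses is still recognized as a CASE.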
@claude

claude bot commented Apr 12, 2026

Code Review

Good fix for the reported issue. The root cause is well identified: findCaseExpressionRecursive would find a CaseExpression nested inside a function call argument and return it as if it were the entire expression, causing the outer aggregation wrapper (sum(...)) to be lost.

Approach consistency

The length-based guard caseCtx.getText().length() >= nodeText.length() - 2 is consistent with the identical pattern already used in tryParseComprehensions and tryParseListPredicates for reduce, allReduce, patternComprehension, listComprehension, and listItemsPredicate. This is the right approach given the existing design.

Minor concern: magic number -2

The - 2 tolerance is unexplained in the new guards, though tryParseListPredicates at least has the inline comment // allow for whitespace. The new guards in ExpressionTypeDetector and CypherExpressionBuilder would benefit from the same:

if (caseCtx != null && caseCtx.getText().length() >= exprText.length() - 2) // allow for whitespace

It is worth documenting why 2 is sufficient. ANTLR's getText() concatenates the text of the tokens in the parse tree, so whitespace that the lexer skips does not appear in it; the only difference comes from delimiters such as parentheses (1 char each), and a parenthesized (CASE...) differs from CASE... by exactly 2. A single-character function name wrapping a CASE gives f(CASE...) vs CASE..., a difference of 3, so the guard correctly rejects it. This reasoning is not obvious from the code alone.
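
The arithmetic behind the tolerance can be checked directly on whitespace-free strings. This is a small sketch under the assumption that both texts contain no whitespace; `coversFullExpression` and `TOLERANCE` are illustrative names, not identifiers from the PR.

```java
public class ToleranceSketch {

  // Number of extra characters the outer text may have (one pair of parentheses).
  static final int TOLERANCE = 2;

  // True when the inner text is at most TOLERANCE characters shorter than the outer text.
  static boolean coversFullExpression(final String innerText, final String outerText) {
    return innerText.length() >= outerText.length() - TOLERANCE;
  }

  public static void main(final String[] args) {
    final String caseText = "CASEWHENxTHEN1ELSE0END";

    System.out.println(coversFullExpression(caseText, caseText));              // true, 0 extra chars
    System.out.println(coversFullExpression(caseText, "(" + caseText + ")"));  // true, 2 extra chars
    System.out.println(coversFullExpression(caseText, "f(" + caseText + ")")); // false, 3 extra chars
    System.out.println(coversFullExpression(caseText, "-" + caseText));        // true, 1 extra char
  }
}
```

The last line shows a limit of a pure length comparison: any wrapper of one or two characters passes the check as well. Whether the grammar can actually produce such a wrapper around these expressions is not established here.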

Potential edge case: EXISTS lacks the same guard

In tryParseSpecialFunctions, EXISTS detection does not have the length guard:

final Cypher25Parser.ExistsExpressionContext existsCtx = builder.findExistsExpressionRecursive(ctx);
if (existsCtx != null)
    return builder.parseExistsExpression(existsCtx);

If EXISTS can appear as a function argument in any valid query pattern (e.g. a user-defined function wrapping it), it would have the same mis-parsing bug. This is likely not a practical issue since EXISTS is typically a top-level predicate, but it is worth noting for completeness or adding a TODO comment.

Tests

The two regression tests (aggregationWithCaseNoImplicitGroupBy and aggregationWithCaseAndGroupByKey) are well written.

Both use the project-standard assertThat(...).isEqualTo(...) style and properly assert there are no extra rows.

One small nit: assertThat((boolean) results.hasNext()).isTrue() could be written as assertThat(results.hasNext()).isTrue() since hasNext() already returns a primitive boolean - the cast is redundant (and existing tests in this file already do the same, so this is low priority).

Summary

The fix is correct, well-tested, and consistent with existing patterns in the parser. The main actionable suggestion is adding // allow for whitespace inline comments to match the style in tryParseListPredicates, and optionally adding a note about the EXISTS case. Otherwise this looks good to merge.

@codacy-production

codacy-production bot commented Apr 12, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 4

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a parsing bug (issue #3858) where CASE expressions nested inside aggregations were mis-parsed as the primary expression, leading to incorrect implicit grouping. The fix introduces a length guard in CypherExpressionBuilder and ExpressionTypeDetector to ensure that CASE expressions are only matched when they span the full expression text. Regression tests were added to confirm the fix. Review feedback indicates that the same length guard should be applied to EXISTS and shortestPath expressions to prevent similar mis-parsing issues, and notes a minor efficiency improvement regarding the placement of the node.getText() call.

Comment on lines +248 to 255
+ final String nodeText = node.getText();
  final Cypher25Parser.CaseExpressionContext caseCtx = findCaseExpressionRecursive(node);
- if (caseCtx != null)
+ if (caseCtx != null && caseCtx.getText().length() >= nodeText.length() - 2)
    return parseCaseExpression(caseCtx);

  final Cypher25Parser.ExtendedCaseExpressionContext extCaseCtx = findExtendedCaseExpressionRecursive(node);
- if (extCaseCtx != null)
+ if (extCaseCtx != null && extCaseCtx.getText().length() >= nodeText.length() - 2)
    return parseExtendedCaseExpression(extCaseCtx);
Contributor

medium

The node.getText() call is performed eagerly, which is inefficient if no CASE expression is found in the subtree. More importantly, the recursive search logic used here is also applied to EXISTS (line 258) and shortestPath expressions, but they lack the length guard. This means expressions like count(EXISTS { ... }) or my_func(shortestPath(...)) will still be mis-parsed as just the inner special expression, losing the outer wrapper and potentially causing incorrect query results. You should apply the same guard to all special expressions that use recursive search to ensure consistent and correct parsing.
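
One way to avoid the eager call mentioned above is to compute the outer text only when a candidate is found, and at most once. The sketch below is an illustration of that idea, not code from the PR: `Lazy`, `computeText` and `coversFullExpression` are hypothetical names, and the counter exists only to make the number of computations visible.

```java
import java.util.function.Supplier;

public class LazyTextSketch {

  // Caches the result of the first get() so the underlying computation runs at most once.
  static final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> source;
    private T value;
    private boolean computed;

    Lazy(final Supplier<T> source) {
      this.source = source;
    }

    @Override
    public T get() {
      if (!computed) {
        value = source.get();
        computed = true;
      }
      return value;
    }
  }

  // Counts how often the (hypothetically expensive) text computation runs.
  static int textComputations = 0;

  static String computeText(final String raw) {
    textComputations++;
    return raw;
  }

  // The outer text is only requested when a candidate exists, thanks to short-circuit evaluation.
  static boolean coversFullExpression(final String candidateText, final Supplier<String> outerText) {
    return candidateText != null && candidateText.length() >= outerText.get().length() - 2;
  }

  public static void main(final String[] args) {
    final Supplier<String> outer = new Lazy<String>(() -> computeText("sum(CASEWHENxTHEN1ELSE0END)"));

    coversFullExpression(null, outer); // no candidate found: text is never computed
    coversFullExpression(null, outer);
    System.out.println(textComputations); // 0

    coversFullExpression("CASEWHENxTHEN1ELSE0END", outer);
    coversFullExpression("CASEWHENxTHEN1ELSE0END", outer);
    System.out.println(textComputations); // 1
  }
}
```

The trade-off is an extra allocation for the supplier on every call, so hoisting a single getText() call above all checks may be the simpler choice when most expressions contain at least one candidate.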

Collaborator Author

Applied the length guard to EXISTS in both tryParseSpecialFunctions and parseExpressionFromText, and to shortestPath in tryParseSpecialFunctions. Also consolidated the exprText/nodeText variable to the top with an inline comment explaining the - 2 tolerance. All 5147 Cypher tests still pass.

Comment on lines 65 to 72
  final String exprText = ctx.getText();
  final Cypher25Parser.CaseExpressionContext caseCtx = builder.findCaseExpressionRecursive(ctx);
- if (caseCtx != null)
+ if (caseCtx != null && caseCtx.getText().length() >= exprText.length() - 2)
    return builder.parseCaseExpression(caseCtx);

  final Cypher25Parser.ExtendedCaseExpressionContext extCaseCtx = builder.findExtendedCaseExpressionRecursive(ctx);
- if (extCaseCtx != null)
+ if (extCaseCtx != null && extCaseCtx.getText().length() >= exprText.length() - 2)
    return builder.parseExtendedCaseExpression(extCaseCtx);
Contributor

medium

The length guard added here for CASE expressions should also be applied to EXISTS (line 58) and shortestPath (line 75) expressions. These expressions use the same recursive search mechanism and are susceptible to the same mis-parsing bug when nested inside other functions (e.g., count(EXISTS { ... })). Applying the guard consistently across all special functions in tryParseSpecialFunctions will prevent similar issues for those expression types and ensure the parser correctly identifies the top-level expression.

Collaborator Author

Fixed. Applied the length guard to EXISTS (line 58) and shortestPath (line 75) in tryParseSpecialFunctions, and moved the exprText variable above all checks with a block comment explaining the rationale and the - 2 tolerance. 5147 Cypher tests pass.

Extend the same text-length guard added for CASE to EXISTS and
shortestPath detection in both tryParseSpecialFunctions and
parseExpressionFromText.  These use the same recursive descent
pattern and are susceptible to the same class of mis-parsing when
nested inside function arguments.

Add inline comments explaining the - 2 tolerance (whitespace that
ANTLR's getText() strips).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@claude

claude bot commented Apr 12, 2026

test comment - please ignore

@claude

claude bot commented Apr 12, 2026

test2 - ignore

@claude

claude bot commented Apr 12, 2026

test3 - ignore

@claude

claude bot commented Apr 12, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

ArcadeDB is a Multi-Model DBMS (Database Management System) built for extreme performance. It's a Java-based project that supports multiple data models (Graph, Document, Key/Value, Search Engine, Time Series, Vector Embedding) and query languages (SQL, Cypher, Gremlin, GraphQL, MongoDB Query Language).

Response Formatting

  • Never use the em dash character (U+2014) in responses. Use a normal dash (-), a comma, or rephrase instead.

Project Instructions

Before writing any code:

  • State how you will verify this change works (e.g., unit tests, integration tests, manual testing)
  • Write the tests first (TDD approach) whenever possible
  • Ensure code adheres to existing coding standards and styles
  • Then implement the code
  • Run verification and iterate until it passes
  • Run all the connected tests that could be affected by the change to ensure nothing is broken (no need to run the whole suite, it would take too long)

General design principles:

  • reuse existing components whenever possible
  • don't use fully qualified names if possible, always import the class and just use the name
  • don't include a new dependency unless it is strictly necessary, and any new dependency MUST be Apache 2.0 compatible:
    • ✅ ALLOWED: Apache 2.0, MIT, BSD (2/3-Clause), EPL 1.0/2.0, UPL 1.0, EDL 1.0, LGPL 2.1+ (for libraries only), CC0/Public Domain
    • ❌ FORBIDDEN: GPL, AGPL, proprietary licenses without explicit permission, SSPL, Commons Clause
    • When adding a dependency, you MUST update ATTRIBUTIONS.md and, if Apache-licensed with a NOTICE file, incorporate required notices into the main NOTICE file
  • for Studio (webapp), limit to jquery and bootstrap 5. If necessary use 3rd party libs, but they must be Apache 2.0 compatible (see allowed licenses above)
  • always bear in mind PERFORMANCE. It must always be your mantra: performance and lightweight on garbage collector. If you can, prefer using arrays of primitives to List of Objects
  • if you need to use JSON, use the class com.arcadedb.serializer.json.JSONObject. Leverage the getter methods that accept the default value as 2nd argument, so you don't need to check whether values are present or non-null, which means less boilerplate code
  • same thing for JSON arrays: use com.arcadedb.serializer.json.JSONArray class
  • code styles:
    • adhere to the existing code
    • if statements with only one child sub-statement don't require a curly brace open/close, keep it simple
    • use the final keyword when possible on variables and parameters
  • all new server-side code must be tested with a test case. Check existing test cases to see the framework and style to use
  • write a regression test
  • after every change in the backend (Java), compile the project and fix all the issues until the compilation passes
  • test all the new and old components you've modified before considering the job finished. Please do not provide something untested
  • always keep in mind speed and security with ArcadeDB, do not introduce security hazards or code that could slow down other parts unless requested/approved
  • do not commit on git, I will do it after a review
  • remove any System.out you used for debug when you have finished
  • For test cases, prefer this syntax: assertThat(property.isMandatory()).isTrue();
  • don't add Claude as author of any source code
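
As a brief illustration of the style points above (final on variables and parameters, no braces around single-statement bodies, primitive arrays instead of lists of boxed objects), here is a minimal sketch. The class and method names are hypothetical and not taken from the ArcadeDB codebase.

```java
public class StyleSketch {

  // Sums the positive entries of a primitive array: no boxing and no List allocation.
  static long sumPositive(final int[] values) {
    long total = 0;
    for (final int value : values)
      if (value > 0)
        total += value;
    return total;
  }

  public static void main(final String[] args) {
    System.out.println(sumPositive(new int[] { 3, -1, 4 })); // 7
  }
}
```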

Build and Development Commands

Maven (Java)

  • Build entire project: mvn clean install
  • Build without tests: mvn clean install -DskipTests
  • Run unit tests: mvn test
  • Run integration tests: mvn test -DskipITs=false
  • Build specific module: cd <module> && mvn clean install

Studio Frontend (Node.js)

  • Build frontend: cd studio && npm run build
  • Development mode: cd studio && npm run dev
  • Security audit: cd studio && npm run security-audit

Server Operations

  • Start server: Use packaged scripts in package/src/main/scripts/server.sh (Unix) or server.bat (Windows)
  • Console: Use package/src/main/scripts/console.sh or console.bat

Distribution Builder

The modular distribution builder (package/arcadedb-builder.sh) creates custom ArcadeDB distributions:

Production builds (download from releases):

cd package
./arcadedb-builder.sh --version=26.1.0 --modules=gremlin,studio

Development builds (use local Maven repository):

# Build modules first
mvn clean install -DskipTests

# Create distribution with local modules
cd package
VERSION=$(mvn -f ../pom.xml help:evaluate -Dexpression=project.version -q -DforceStdout)
./arcadedb-builder.sh \
    --version=$VERSION \
    --modules=console,gremlin,studio \
    --local-repo \
    --skip-docker

Testing the builder:

cd package
./test-builder-local.sh

Testing Commands

  • Run specific test class: mvn test -Dtest=ClassName
  • Run tests with specific pattern: mvn test -Dtest="*Pattern*"
  • Performance tests: Located in src/test/java/performance/ packages

Codebase Navigation Map

ANTLR Grammars

  • engine/src/main/antlr4/com/arcadedb/query/sql/grammar/SQLLexer.g4 — SQL lexer
  • engine/src/main/antlr4/com/arcadedb/query/sql/grammar/SQLParser.g4 — SQL parser
  • engine/src/main/antlr4/com/arcadedb/query/opencypher/grammar/Cypher25Lexer.g4 — Cypher lexer
  • engine/src/main/antlr4/com/arcadedb/query/opencypher/grammar/Cypher25Parser.g4 — Cypher parser

SQL Engine Key Files

  • Parser AST nodes (170+ classes): engine/src/main/java/com/arcadedb/query/sql/parser/
    • SuffixIdentifier.java — property access (e.g., record.field)
    • BaseIdentifier.java, LevelZeroIdentifier.java — identifier resolution
    • Expression.java, BaseExpression.java, MathExpression.java — expression evaluation
    • Projection.java, ProjectionItem.java — SELECT projection handling
    • SelectStatement.java, MatchStatement.java — statement AST roots
    • NestedProjection.java, NestedProjectionItem.java — nested projection (e.g., {*})
    • FunctionCall.java, MethodCall.java — function/method invocation
    • Modifier.java — chained modifiers (array selectors, method calls, suffix identifiers)
    • WhereClause.java, BooleanExpression.java — filter conditions
    • LetClause.java, LetItem.java — LET variable bindings
  • Executor steps (158 classes): engine/src/main/java/com/arcadedb/query/sql/executor/
    • SelectExecutionPlanner.java — main SELECT execution planner
    • ProjectionCalculationStep.java — projection evaluation step
    • LetExpressionStep.java, GlobalLetExpressionStep.java — LET evaluation
    • FetchFromTypeStep.java, FetchFromIndexStep.java — data source steps
    • FilterStep.java, FilterByClustersStep.java — filtering steps
  • SQL methods (50+ classes): engine/src/main/java/com/arcadedb/query/sql/method/
    • string/ — toLowerCase, toUpperCase, trim, split, etc.
    • collection/ — size, keys, values, sort, etc.
    • conversion/ — asInteger, asString, asList, asJSON, etc.
  • SQL functions: engine/src/main/java/com/arcadedb/function/sql/
    • graph/ — out, in, both, outE, inE, bothE, shortestPath, dijkstra, etc.
    • coll/ — difference, intersect, symmetricDifference
    • fulltext/ — search field/index functions

OpenCypher Engine Key Files

  • AST (40+ classes): engine/src/main/java/com/arcadedb/query/opencypher/ast/
    • CypherStatement.java, MatchClause.java, ReturnClause.java, WhereClause.java
    • CreateClause.java, MergeClause.java, DeleteClause.java, SetClause.java
    • PatternElement.java, NodePattern.java, RelationshipPattern.java
  • Executor: engine/src/main/java/com/arcadedb/query/opencypher/executor/
  • Optimizer: engine/src/main/java/com/arcadedb/query/opencypher/optimizer/
  • Planner: engine/src/main/java/com/arcadedb/query/opencypher/planner/
  • Tests: engine/src/test/java/com/arcadedb/query/opencypher/

Graph Engine

  • engine/src/main/java/com/arcadedb/graph/
    • Vertex.java, MutableVertex.java, ImmutableVertex.java — vertex types
    • Edge.java, MutableEdge.java, ImmutableEdge.java — edge types
    • GraphEngine.java — core graph operations
    • EdgeSegment.java, MutableEdgeSegment.java — edge storage segments
    • EdgeLinkedList.java — edge linked list structure
    • EdgeIterator.java, VertexIterator.java — traversal iterators

Server / HTTP

  • HTTP handlers: server/src/main/java/com/arcadedb/server/http/handler/
    • DatabaseAbstractHandler.java — base handler (wraps commands in transactions)
    • PostCommandHandler.java — POST /command endpoint
    • PostQueryHandler.java, GetQueryHandler.java — query endpoints
  • HA: server/src/main/java/com/arcadedb/server/ha/
  • Security: server/src/main/java/com/arcadedb/server/security/

Test Locations (by module)

  • engine/src/test/java/ — 746 test files (SQL, Cypher, graph, storage, schema, indexing)
  • server/src/test/java/ — 114 test files (HTTP API, HA, security)
  • gremlin/src/test/java/ — 29 test files
  • integration/src/test/java/ — 22 test files
  • bolt/src/test/java/ — 10 test files
  • graphql/src/test/java/ — 9 test files

Architecture Overview

Core Modules

  • engine/: Core database engine, storage, indexing, query execution (SQL, OpenCypher, Polyglot)
  • server/: HTTP/REST API, WebSocket support, clustering/HA, MCP server
  • network/: Network communication layer
  • console/: CLI console for interactive database operations
  • studio/: Web-based administration interface (JavaScript/Node.js)
  • metrics/: Server metrics collection and reporting
  • integration/: Integration utilities
  • test-utils/: Shared test utilities

Wire Protocol Modules

  • gremlin/: Apache Tinkerpop Gremlin support
  • graphql/: GraphQL API support
  • mongodbw/: MongoDB wire protocol compatibility
  • redisw/: Redis wire protocol compatibility
  • postgresw/: PostgreSQL wire protocol compatibility
  • bolt/: Neo4j Bolt wire protocol compatibility
  • grpc/: gRPC protocol definitions
  • grpcw/: gRPC wire protocol module
  • grpc-client/: gRPC client library

Key Engine Components

  • Database Management: com.arcadedb.database.* - Database lifecycle, transactions, ACID compliance
  • Storage Engine: com.arcadedb.engine.* - Low-level storage, page management, WAL
  • SQL Query Engine: com.arcadedb.query.sql.* - SQL query parsing, execution planning
  • OpenCypher Engine: com.arcadedb.query.opencypher.* - Native Cypher implementation with ANTLR parser, AST, optimizer (filter pushdown, index selection, expand-into, join ordering), and step-based execution. Has both optimizer and legacy execution paths — changes to clause handling may need updates in multiple paths
  • Polyglot Engine: com.arcadedb.query.polyglot.* - GraalVM-based scripting support
  • Schema Management: com.arcadedb.schema.* - Type definitions, property management
  • Index System: com.arcadedb.index.* - LSM-Tree indexes, full-text, vector indexes
  • Graph Engine: com.arcadedb.graph.* - Vertex/Edge management, graph traversals
  • Serialization: com.arcadedb.serializer.* - Binary serialization, JSON handling

Server Components

  • HTTP API: com.arcadedb.server.http.* - REST endpoints, request handling
  • High Availability: com.arcadedb.server.ha.* - Clustering, replication, leader election
  • Security: com.arcadedb.server.security.* - Authentication, authorization
  • Monitoring: com.arcadedb.server.monitor.* - Metrics, query profiling, health checks
  • MCP: com.arcadedb.server.mcp.* - Model Context Protocol server support

Development Guidelines

Java Version

  • Required: Java 21+ (main branch)
  • Legacy: Java 17 support on java17 branch

Code Structure

  • Uses Maven multi-module project structure
  • Low-level Java optimization for performance ("LLJ: Low Level Java")
  • Minimal garbage collection pressure design
  • Thread-safe implementations throughout

Testing Approach

  • Framework: JUnit 5 (Jupiter) with AssertJ assertions
  • Unit tests in each module's src/test/java
  • Integration tests with IT suffix
  • Performance tests in performance/ packages
  • TestContainers used in e2e/ and load-tests/ modules for containerized testing
  • Separate test databases in databases/ for isolation

Database Features to Consider

  • ACID Transactions: Full transaction support with isolation levels
  • Multi-Model: Single database can store graphs, documents, key/value pairs
  • Query Languages: SQL (OrientDB-compatible), Cypher, Gremlin, MongoDB queries
  • Indexing: LSM-Tree indexes, full-text (Lucene), vector embeddings
  • High Availability: Leader-follower replication, automatic failover
  • Wire Protocols: HTTP/JSON, PostgreSQL, MongoDB, Redis, Neo4j Bolt, gRPC compatibility

Common Development Tasks

Adding New Features

  1. Create tests first (TDD approach)
  2. Implement in appropriate module
  3. Update schema if needed
  4. Add integration tests
  5. Update documentation

Working with Indexes

  • LSM-Tree implementation in com.arcadedb.index.lsm.*
  • Index creation via Schema API
  • Performance testing with large datasets recommended

Query Development

  • SQL parsing in com.arcadedb.query.sql.*
  • SQL execution plans in com.arcadedb.query.sql.executor.*
  • OpenCypher engine in com.arcadedb.query.opencypher.* — has ast/, parser/, executor/, optimizer/, planner/, rewriter/ sub-packages
  • OpenCypher tests in engine/src/test/java/com/arcadedb/query/opencypher/
  • Test with various query patterns and data sizes

Server Development

  • HTTP handlers in com.arcadedb.server.http.handler.*
  • Security integration required for new endpoints
  • WebSocket support for real-time features

Wire Protocol Module Dependencies

  • Standard: All wire protocol modules (gremlin, graphql, mongodbw, redisw, postgresw, bolt, grpcw) must use provided scope for arcadedb-server dependency
  • Rationale: Server remains the assembly point; prevents dependency duplication in distributions
  • Pattern:
    • Main server dependency → scope: provided
    • Server test-jar → scope: test
    • Cross-module test dependencies → scope: test only (e.g., postgresw should not depend on gremlin for compilation)
    • Integration/format handlers → scope: compile only if in src/main/java (e.g., gremlin's GraphML/GraphSON handlers)
  • Enforcement: Code review process ensures:
    • Protocol modules do NOT depend on other protocol modules in compile scope
    • Each protocol module has arcadedb-server in provided scope only (not compile)
    • Only the server assembly (package module) and coverage reporting modules can aggregate protocol modules
  • Example:
    <dependency>
        <groupId>com.arcadedb</groupId>
        <artifactId>arcadedb-server</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.arcadedb</groupId>
        <artifactId>arcadedb-server</artifactId>
        <version>${project.parent.version}</version>
        <scope>test</scope>
        <type>test-jar</type>
    </dependency>

Important Notes

  • Pre-commit hooks: This project uses pre-commit for code quality checks (trailing whitespace, Prettier for Java/XML formatting, etc.)
  • Code formatting: Prettier with requirePragma: true and printWidth: 160 — only formats files with a @format pragma
  • Security: Never log or expose sensitive data (passwords, tokens, etc.)
  • Performance: Always consider memory and CPU impact of changes
  • Compatibility: Maintain backward compatibility for API changes
  • Licensing: All code must comply with Apache 2.0 license
  • Modular Builder: Script to create custom distributions with selected modules (see package/README-BUILDER.md)

    • EdgeLinkedList.java — edge linked list structure
    • EdgeIterator.java, VertexIterator.java — traversal iterators

Server / HTTP

  • HTTP handlers: server/src/main/java/com/arcadedb/server/http/handler/
    • DatabaseAbstractHandler.java — base handler (wraps commands in transactions)
    • PostCommandHandler.java — POST /command endpoint
    • PostQueryHandler.java, GetQueryHandler.java — query endpoints
  • HA: server/src/main/java/com/arcadedb/server/ha/
  • Security: server/src/main/java/com/arcadedb/server/security/

Test Locations (by module)

  • engine/src/test/java/ — 746 test files (SQL, Cypher, graph, storage, schema, indexing)
  • server/src/test/java/ — 114 test files (HTTP API, HA, security)
  • gremlin/src/test/java/ — 29 test files
  • integration/src/test/java/ — 22 test files
  • bolt/src/test/java/ — 10 test files
  • graphql/src/test/java/ — 9 test files

Architecture Overview

Core Modules

  • engine/: Core database engine, storage, indexing, query execution (SQL, OpenCypher, Polyglot)
  • server/: HTTP/REST API, WebSocket support, clustering/HA, MCP server
  • network/: Network communication layer
  • console/: CLI console for interactive database operations
  • studio/: Web-based administration interface (JavaScript/Node.js)
  • metrics/: Server metrics collection and reporting
  • integration/: Integration utilities
  • test-utils/: Shared test utilities

Wire Protocol Modules

  • gremlin/: Apache Tinkerpop Gremlin support
  • graphql/: GraphQL API support
  • mongodbw/: MongoDB wire protocol compatibility
  • redisw/: Redis wire protocol compatibility
  • postgresw/: PostgreSQL wire protocol compatibility
  • bolt/: Neo4j Bolt wire protocol compatibility
  • grpc/: gRPC protocol definitions
  • grpcw/: gRPC wire protocol module
  • grpc-client/: gRPC client library

Key Engine Components

  • Database Management: com.arcadedb.database.* - Database lifecycle, transactions, ACID compliance
  • Storage Engine: com.arcadedb.engine.* - Low-level storage, page management, WAL
  • SQL Query Engine: com.arcadedb.query.sql.* - SQL query parsing, execution planning
  • OpenCypher Engine: com.arcadedb.query.opencypher.* - Native Cypher implementation with ANTLR parser, AST, optimizer (filter pushdown, index selection, expand-into, join ordering), and step-based execution. Has both optimizer and legacy execution paths — changes to clause handling may need updates in multiple paths
  • Polyglot Engine: com.arcadedb.query.polyglot.* - GraalVM-based scripting support
  • Schema Management: com.arcadedb.schema.* - Type definitions, property management
  • Index System: com.arcadedb.index.* - LSM-Tree indexes, full-text, vector indexes
  • Graph Engine: com.arcadedb.graph.* - Vertex/Edge management, graph traversals
  • Serialization: com.arcadedb.serializer.* - Binary serialization, JSON handling

Server Components

  • HTTP API: com.arcadedb.server.http.* - REST endpoints, request handling
  • High Availability: com.arcadedb.server.ha.* - Clustering, replication, leader election
  • Security: com.arcadedb.server.security.* - Authentication, authorization
  • Monitoring: com.arcadedb.server.monitor.* - Metrics, query profiling, health checks
  • MCP: com.arcadedb.server.mcp.* - Model Context Protocol server support

Development Guidelines

Java Version

  • Required: Java 21+ (main branch)
  • Legacy: Java 17 support on java17 branch

Code Structure

  • Uses Maven multi-module project structure
  • Low-level Java optimization for performance ("LLJ: Low Level Java")
  • Minimal garbage collection pressure design
  • Thread-safe implementations throughout

Testing Approach

  • Framework: JUnit 5 (Jupiter) with AssertJ assertions
  • Unit tests in each module's src/test/java
  • Integration tests with IT suffix
  • Performance tests in performance/ packages
  • TestContainers used in e2e/ and load-tests/ modules for containerized testing
  • Separate test databases in databases/ for isolation

Database Features to Consider

  • ACID Transactions: Full transaction support with isolation levels
  • Multi-Model: Single database can store graphs, documents, key/value pairs
  • Query Languages: SQL (OrientDB-compatible), Cypher, Gremlin, MongoDB queries
  • Indexing: LSM-Tree indexes, full-text (Lucene), vector embeddings
  • High Availability: Leader-follower replication, automatic failover
  • Wire Protocols: HTTP/JSON, PostgreSQL, MongoDB, Redis, Neo4j Bolt, gRPC compatibility

Common Development Tasks

Adding New Features

  1. Create tests first (TDD approach)
  2. Implement in appropriate module
  3. Update schema if needed
  4. Add integration tests
  5. Update documentation

Working with Indexes

  • LSM-Tree implementation in com.arcadedb.index.lsm.*
  • Index creation via Schema API
  • Performance testing with large datasets recommended

Query Development

  • SQL parsing in com.arcadedb.query.sql.*
  • SQL execution plans in com.arcadedb.query.sql.executor.*
  • OpenCypher engine in com.arcadedb.query.opencypher.* — has ast/, parser/, executor/, optimizer/, planner/, rewriter/ sub-packages
  • OpenCypher tests in engine/src/test/java/com/arcadedb/query/opencypher/
  • Test with various query patterns and data sizes

Server Development

  • HTTP handlers in com.arcadedb.server.http.handler.*
  • Security integration required for new endpoints
  • WebSocket support for real-time features

Wire Protocol Module Dependencies

  • Standard: All wire protocol modules (gremlin, graphql, mongodbw, redisw, postgresw, bolt, grpcw) must use provided scope for arcadedb-server dependency
  • Rationale: Server remains the assembly point; prevents dependency duplication in distributions
  • Pattern:
    • Main server dependency → scope: provided
    • Server test-jar → scope: test
    • Cross-module test dependencies → scope: test only (e.g., postgresw should not depend on gremlin for compilation)
    • Integration/format handlers → scope: compile only if in src/main/java (e.g., gremlin's GraphML/GraphSON handlers)
  • Enforcement: Code review process ensures:
    • Protocol modules do NOT depend on other protocol modules in compile scope
    • Each protocol module has arcadedb-server in provided scope only (not compile)
    • Only the server assembly (package module) and coverage reporting modules can aggregate protocol modules
  • Example:
    <dependency>
        <groupId>com.arcadedb</groupId>
        <artifactId>arcadedb-server</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.arcadedb</groupId>
        <artifactId>arcadedb-server</artifactId>
        <version>${project.parent.version}</version>
        <scope>test</scope>
        <type>test-jar</type>
    </dependency>

Important Notes

  • Pre-commit hooks: This project uses pre-commit for code quality checks (trailing whitespace, Prettier for Java/XML formatting, etc.)
  • Code formatting: Prettier with requirePragma: true and printWidth: 160 — only formats files with a @format pragma
  • Security: Never log or expose sensitive data (passwords, tokens, etc.)
  • Performance: Always consider memory and CPU impact of changes
  • Compatibility: Maintain backward compatibility for API changes
  • Licensing: All code must comply with Apache 2.0 license
  • Modular Builder: Script to create custom distributions with selected modules (see package/README-BUILDER.md)

@claude

claude bot commented Apr 12, 2026

Code Review

Clean fix for issue #3858. The length-guard approach is consistent with established patterns in ExpressionTypeDetector.

Strengths

  • Follows existing pattern from reduce, patternComprehension, and listItemsPredicate
  • Surgical fix with no unrelated changes
  • Two solid regression tests covering both pure-aggregation and mixed aggregation+grouping cases
  • Code style matches project conventions (single-line if without braces, final locals, AssertJ)

Concerns

1. count(*) is not guarded (minor risk)

In tryParseSpecialFunctions, the countStar check has no length guard. If count(*) ever appears as a sub-expression inside another construct (e.g., coalesce(count(*), 0)), it could be mis-parsed the same way CASE was. Worth adding a guard for consistency and future safety, even if no current bug exists.
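A minimal sketch of the kind of guard suggested here, assuming it compares the whitespace-free text of the matched node with the text of the full expression. The class and method names (LengthGuardSketch, coversFullExpression) and the plain-string inputs are hypothetical stand-ins, not ArcadeDB's actual parser code; only the tolerance of 2 is taken from the PR.

```java
// Hypothetical illustration only; this is not ArcadeDB's ExpressionTypeDetector code.
final class LengthGuardSketch {
  // Tolerance value mentioned in the PR discussion.
  static final int TOLERANCE = 2;

  // Returns true only when the matched sub-expression spans (almost) the whole expression,
  // so a construct nested inside a larger expression is not returned in place of its parent.
  static boolean coversFullExpression(final String candidateText, final String fullText) {
    return candidateText.length() >= fullText.length() - TOLERANCE;
  }

  public static void main(final String[] args) {
    // count(*) is the whole expression: the special-case parse may be used.
    System.out.println(coversFullExpression("count(*)", "count(*)"));
    // count(*) is nested inside coalesce(...): the guard rejects the special-case parse.
    System.out.println(coversFullExpression("count(*)", "coalesce(count(*),0)"));
  }
}
```

With such a guard in place, the nested case would fall through to the general function-call parsing path instead of being reduced to a bare count(*).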

2. The -2 tolerance comment could be more precise

The comment says the -2 "allows for whitespace that ANTLR's getText() strips." Since getText() omits whitespace from both of the texts being compared, both strings are already whitespace-free, so the tolerance must be covering something else (likely minor structural differences in how parent nodes represent their text boundary). The comment matches the one already in tryParseListPredicates, so it is consistent, but a more precise explanation would help future maintainers understand when -2 is the right threshold.
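To make the threshold question concrete, the following sketch models only the length arithmetic of a guard of the form inner.length() >= outer.length() - 2 on whitespace-free strings. The class name, method name, and sample expressions are hypothetical, and whether wrappers like these ever reach the guard in the real parser is not something the PR states.

```java
// Hypothetical model of the length arithmetic only; all names are illustrative.
final class ToleranceSketch {
  static final int TOLERANCE = 2;

  // Accepts the inner text when it is at most TOLERANCE characters shorter than the outer text.
  static boolean accepted(final String innerText, final String outerText) {
    return innerText.length() >= outerText.length() - TOLERANCE;
  }

  public static void main(final String[] args) {
    // Whitespace-free text, as produced when whitespace tokens are not part of the tree.
    final String inner = "CASEWHENn.age>30THEN1ELSE0END";

    // "sum(" and ")" add 5 characters: rejected, so the aggregation wrapper is preserved.
    System.out.println(accepted(inner, "sum(" + inner + ")"));
    // A pair of parentheses adds exactly 2 characters: accepted.
    System.out.println(accepted(inner, "(" + inner + ")"));
    // Identical text adds 0 characters: accepted.
    System.out.println(accepted(inner, inner));
  }
}
```

Under this simplified model, any wrapper of one or two characters passes the guard, which is one reason a precise comment about what the tolerance is meant to absorb would be useful.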

3. Nit - unnecessary boolean cast in tests

assertThat((boolean) results.hasNext()).isTrue() - the explicit boolean cast is unnecessary since hasNext() already returns a primitive boolean.

Bottom line

The fix is correct and well-tested. The main suggestion before merge: add the length guard to count(*) for completeness. Everything else is minor.

Development

Successfully merging this pull request may close these issues.

OpenCypher: Aggregation with CASE statement returns multiple rows (implicit GROUP BY)