-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](search) Add multi-field search support with fields parameter #59845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 31901 ms |
TPC-DS: Total hot run time: 175733 ms |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 31243 ms |
TPC-DS: Total hot run time: 172085 ms |
ClickBench: Total hot run time: 26.74 s |
Add `fields` parameter to the search() function, allowing queries to search
across multiple fields with a single query term. Similar to Elasticsearch's
query_string `fields` parameter.
Usage examples:
- search('hello', '{"fields":["title","content"]}')
-> Equivalent to: (title:hello OR content:hello)
- search('hello world', '{"fields":["title","content"],"default_operator":"and"}')
-> Equivalent to: (title:hello OR content:hello) AND (title:world OR content:world)
- search('a AND b', '{"fields":["title","content"],"mode":"lucene"}')
-> Multi-field with Lucene boolean semantics
Key changes:
- Extended SearchOptions with `fields` array and `isMultiFieldMode()`
- Added common helper methods for DRY compliance (parseWithVisitor, expandItemAcrossFields)
- Added FieldTrackingVisitor interface for polymorphic visitor handling
- Added parseDslMultiFieldMode() and parseDslMultiFieldLuceneMode()
- Added comprehensive unit tests (15+ test cases)
- Added regression tests (18 test cases)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
…cross_fields behavior - Add test data id=9 to verify cross_fields vs best_fields semantics - Add Test 2b: multi_field_multi_term_and_lucene to test default_operator:and with mode:lucene - Add Test 11b: multi_field_cross_fields_verify to explicitly verify cross_fields behavior - Update expected output file with new test case results Co-Authored-By: Claude Opus 4.5 <[email protected]>
0b42651 to
7c0d178
Compare
|
run buildall |
TPC-H: Total hot run time: 31305 ms |
TPC-DS: Total hot run time: 173919 ms |
ClickBench: Total hot run time: 26.8 s |
…_fields/cross_fields modes
Add support for "type" option in multi-field search DSL to control how terms
are matched across fields:
- best_fields (default): All terms must match within the SAME field
Example: "machine learning" -> (title:machine AND title:learning) OR
(content:machine AND content:learning)
- cross_fields: Terms can match across DIFFERENT fields
Example: "machine learning" -> (title:machine OR content:machine) AND
(title:learning OR content:learning)
This aligns with Elasticsearch's default behavior where best_fields is the
default mode. Users can explicitly set type:"cross_fields" when they need
terms to be distributed across multiple fields.
Also includes:
- Input validation in setType() with IllegalArgumentException for invalid values
- Explicit type checking to prevent silent fallback behavior
- Unit tests for type parameter parsing and expansion
- Regression tests for both modes in standard and Lucene modes
Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
run buildall |
TPC-H: Total hot run time: 32084 ms |
TPC-DS: Total hot run time: 173383 ms |
ClickBench: Total hot run time: 26.81 s |
|
run cloud_p0 |
|
run vault_p0 |
|
run external |
zzzxl1993
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
FE Regression Coverage ReportIncrement line coverage |
…sses Make fields private with proper getter/setter methods in QsNode, QsPlan, and QsFieldBinding classes. This follows Java encapsulation best practices and allows for future validation or transformation logic. - Make all fields private with @JsonProperty getters for serialization - Add setters for QsNode.field and QsFieldBinding.fieldName (needed for field name normalization in RewriteSearchToSlots) - Update all usages to use getter/setter methods instead of direct field access - Add Javadoc to QsNode constructors Co-Authored-By: Claude Opus 4.5 <[email protected]>
…dling - Add SearchDslSyntaxException for clearer DSL syntax error messages - Add @nullable annotations to public parseDsl method parameters - Replace deprecated ANTLRInputStream with CharStreams.fromString() - Remove duplicate fields validation in expandMultiFieldDsl methods - Add context (actual fields value) to validation exception messages - Improve error handling with specific catch blocks for different exception types (syntax errors, argument errors, internal errors) Co-Authored-By: Claude Opus 4.5 <[email protected]>
…mize StringBuilder - Add @JsonProperty/@JsonSetter annotations to SearchOptions for declarative JSON mapping - Simplify parseOptions() from 58 lines to 20 lines using JSON_MAPPER.readValue() - Add validate() method for mutual exclusion and range checks - Add minimum_should_match negative value validation - Reuse StringBuilder with setLength(0) instead of creating new instances in tokenizeDsl() Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
run buildall |
TPC-H: Total hot run time: 31480 ms |
TPC-DS: Total hot run time: 174157 ms |
ClickBench: Total hot run time: 26.75 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
eldenmoon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #59394
Problem Summary:
This PR adds
fieldsandtypeparameters to the SEARCH function, allowing queries to search across multiple fields with a single query term. This is similar to Elasticsearch's multi_match query withbest_fieldsandcross_fieldstypes.Multi-Field Search Support
Type Parameter Options
best_fields(default)"hello world"→(title:hello AND title:world) OR (content:hello AND content:world)cross_fields"hello world"→(title:hello OR content:hello) AND (title:world OR content:world)Key features:
typeparameter controls how terms are matched across fieldsbest_fields(default): Finds documents where all terms appear in the same field - ideal for relevance rankingcross_fields: Treats multiple fields as one big field - ideal for name searches across first_name/last_namefieldsanddefault_fieldare mutually exclusiveBehavior examples:
hello["title","content"](title:hello) OR (content:hello)hello world(AND)["title","content"](title:hello AND title:world) OR (content:hello AND content:world)hello world(AND)["title","content"](title:hello OR content:hello) AND (title:world OR content:world)EXACT(foo bar)["title","content"](title:EXACT(foo bar) OR content:EXACT(foo bar))hello AND category:tech["title","content"](title:hello OR content:hello) AND category:techUse case examples:
best_fieldswhen searching product name and description - prefer products where query terms appear togethercross_fieldswhen searching first_name and last_name - "John Smith" should match documents withfirst_name:Johnandlast_name:SmithRelease note
fieldsparameter)typeparameter withbest_fields(default) andcross_fieldsmodesbest_fields: All terms must match within the same field (default, matches Elasticsearch behavior)cross_fields: Terms can match across different fieldsCheck List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)