[feat](tvf) Support INSERT INTO TVF to export query results to local/HDFS/S3 files #60719
morningman merged 12 commits into apache:master
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall

BE UT Coverage Report: Increment line coverage, Increment coverage report

run buildall

Cloud UT Coverage Report: Increment line coverage, Increment coverage report
    _current_written_bytes = _vfile_writer->written_len();

    // Auto-split if max file size is set
    if (_max_file_size_bytes > 0) {

It seems this if check is not needed; the size check is already done in the `_create_new_file_if_exceed_size()` function.
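For illustration, here is a minimal standalone sketch of the pattern this comment points at: when the enable/size guard lives inside the helper, the caller can invoke it unconditionally after every write. The class and helper below are hypothetical (only the member names `_max_file_size_bytes`, `_current_written_bytes`, and `_create_new_file_if_exceed_size` echo the quoted diff); this is not the PR's actual implementation.

```
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical writer, not Doris code: the auto-split guard is inside the
// helper, so the call site needs no `if (_max_file_size_bytes > 0)` of its own.
class SplitWriter {
public:
    explicit SplitWriter(int64_t max_file_size_bytes)
            : _max_file_size_bytes(max_file_size_bytes) {}

    void write(const std::string& row) {
        _current_written_bytes += static_cast<int64_t>(row.size());
        // Called unconditionally after every write.
        _create_new_file_if_exceed_size();
    }

private:
    void _create_new_file_if_exceed_size() {
        // "Auto-split disabled" and "not yet over the limit" are both handled here.
        if (_max_file_size_bytes <= 0 || _current_written_bytes < _max_file_size_bytes) {
            return;
        }
        ++_file_idx;                 // roll over to the next output file
        _current_written_bytes = 0;  // reset the per-file byte counter
        std::cout << "rolled to file index " << _file_idx << "\n";
    }

    int64_t _max_file_size_bytes;
    int64_t _current_written_bytes = 0;
    int _file_idx = 0;
};

int main() {
    SplitWriter w(/*max_file_size_bytes=*/10);
    for (int i = 0; i < 5; ++i) {
        w.write("123456");  // 6 bytes per call, so a roll-over happens every second write
    }
    return 0;
}
```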
    // Set hadoop config for hdfs/s3 (BE uses this for file writer creation)
    if (!tvfName.equals("local")) {
        tSink.setHadoopConfig(backendConnectProps);

`tSink.setProperties(backendConnectProps);` - it seems `properties` also uses `backendConnectProps`?
PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.

BE UT Coverage Report: Increment line coverage, Increment coverage report

TPC-H: Total hot run time: 29189 ms
TPC-DS: Total hot run time: 185185 ms

BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report

FE Regression Coverage Report: Increment line coverage

run buildall
PR approved by at least one committer and no changes requested.

Cloud UT Coverage Report: Increment line coverage, Increment coverage report

TPC-H: Total hot run time: 28616 ms
TPC-DS: Total hot run time: 185494 ms

BE UT Coverage Report: Increment line coverage, Increment coverage report

BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report

FE Regression Coverage Report: Increment line coverage
### What problem does this PR solve?

Related #60768

#### Why do we still need this feature when `OUTFILE` already exists?

`OUTFILE` itself is a MySQL-specific syntax. We should standardize all data access patterns: use `SELECT` for reading and `INSERT` for writing. Since a TVF is treated as a table, it should support being written to via `INSERT`.

From a functionality perspective, `INSERT INTO tvf` is currently similar to `OUTFILE`. However, from the standpoint of conceptual consistency, we need to support `INSERT INTO tvf`.

#### Key changes:

Add support for the INSERT INTO TVF (Table-Valued Function) syntax, allowing users to directly export query results to external file systems (local, HDFS, S3) in CSV, Parquet, and ORC formats.

- FE: Add an ANTLR grammar rule for the INSERT INTO TVF syntax; implement the UnboundTVFTableSink, LogicalTVFTableSink, and PhysicalTVFTableSink plan nodes and the InsertIntoTVFCommand for query planning and execution.
- BE: Add TVFTableSinkOperator for pipeline execution, VTVFTableWriter for async file writing with auto-split support, and VFileFormatTransformerFactory for creating CSV/Parquet/ORC format transformers.
- Support CSV options: column_separator, line_delimiter, compression (gz/zstd/lz4/snappy).
- Support append mode (default) with file-prefix naming ({prefix}{query_id}_{idx}.{ext}); a small naming sketch follows the SQL examples below.
- Add error handling for missing required params, unsupported formats, wildcards in file_path, and delete_existing_files on the local TVF.

Example SQL:

```
-- Export query results to a local BE node as CSV
INSERT INTO local(
    "file_path" = "/tmp/export/basic_csv_",
    "backend_id" = "10001",
    "format" = "csv"
)
SELECT * FROM my_table ORDER BY id;

-- Export as Parquet to HDFS
INSERT INTO hdfs(
    "file_path" = "/tmp/test_insert_into_hdfs_tvf/complex_parquet/data_",
    "format" = "parquet",
    "hadoop.username" = "doris",
    "fs.defaultFS" = "hdfs://127.0.0.1:8020",
    "delete_existing_files" = "true"
)
SELECT * FROM insert_tvf_complex_src ORDER BY c_int;

-- Export ORC to S3
INSERT INTO s3(
    "uri" = "https://bucket/insert_tvf_test/basic_orc/*",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "orc",
    "region" = "region"
)
SELECT c_int, c_varchar, c_string FROM my_table WHERE c_int IS NOT NULL ORDER BY c_int;
```
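To make the append-mode naming rule above concrete, here is a small standalone sketch of how a `{prefix}{query_id}_{idx}.{ext}` output path could be assembled for the supported formats. It is not code from this PR; the function name, the extension mapping, and the example query id are illustrative assumptions.

```
#include <iostream>
#include <map>
#include <string>

// Hypothetical helper illustrating the {prefix}{query_id}_{idx}.{ext} naming
// rule described above; compression suffixes (e.g. .gz for CSV) are ignored here.
static std::string make_output_path(const std::string& prefix,
                                    const std::string& query_id,
                                    int file_idx,
                                    const std::string& format) {
    // Map the TVF "format" property to a file extension. In this sketch an
    // unknown format simply throws; the real feature reports an unsupported-format error.
    static const std::map<std::string, std::string> kExtensions = {
            {"csv", "csv"}, {"parquet", "parquet"}, {"orc", "orc"}};
    return prefix + query_id + "_" + std::to_string(file_idx) + "." + kExtensions.at(format);
}

int main() {
    // Matches the first SQL example: prefix "/tmp/export/basic_csv_", CSV format.
    std::cout << make_output_path("/tmp/export/basic_csv_", "a1b2c3d4", 0, "csv") << "\n";
    std::cout << make_output_path("/tmp/export/basic_csv_", "a1b2c3d4", 1, "csv") << "\n";
    return 0;
}
```

This would print /tmp/export/basic_csv_a1b2c3d4_0.csv and /tmp/export/basic_csv_a1b2c3d4_1.csv, i.e. one file per auto-split under append mode.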
### Release note

None

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merge this PR)