**Is your feature request related to a problem? Please describe.**
We want to provide S3 Tables (Iceberg Parquet) as an additional sink option for Data Prepper pipelines, enabling structured event data to be written directly into managed Apache Iceberg tables in S3.
**Describe the solution you'd like**
Add S3 Tables support to the existing S3 sink via a new `s3_tables` codec.
The current plan is to always provide the schema through the pipeline configuration. Exactly one schema source must be specified:
- `schema.definition` — explicit column definitions with optional table creation and partition spec
- `schema.registry` — reference to an external schema registry (mainly AWS Glue at the moment)
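The "exactly one schema source" rule amounts to a simple config-validation check. A minimal sketch — the function name and error message are hypothetical, not actual Data Prepper code; only the `definition`/`registry` key names come from the proposal:

```python
# Hypothetical validation sketch for the proposed schema block.
# Key names mirror the proposed config; nothing here is real plugin code.
def resolve_schema_source(schema: dict) -> str:
    """Return which schema source is configured, or raise if the config is invalid."""
    present = [key for key in ("definition", "registry") if key in schema]
    if len(present) != 1:
        raise ValueError(
            "exactly one of schema.definition or schema.registry must be specified"
        )
    return present[0]
```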
Proposed pipeline configuration:
```yaml
s3-tables-pipeline:
  source:
    http:
      ssl: false
      port: 2021
      path: "/log/ingest"
  processors:
    # mutate, filter, etc.
  sink:
    - pipeline:
        name: "pipeline_table_bucket"

pipeline_table_bucket:
  source:
    pipeline:
      name: "s3-tables-pipeline"
  processors:
    # mutate, filter, etc.
  sink:
    - s3:
        aws:
          region: "us-east-2"  # required
          sts_role_arn: "arn:aws:iam::123456789012:role/DataPrepperRole"  # optional
        codec:
          s3_tables:
            # required — table location
            table_bucket_arn: "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket"
            namespace: "my_namespace"
            table_name: "my_table"
            # required — schema (exactly one of definition or registry)
            schema:
              # Option A: explicit definition
              definition:
                create_table_if_not_exists: true
                columns:
                  - name: "x"
                    type: "float"
                    required: false
                  - name: "y"
                    type: "float"
                    required: false
                  - name: "level"
                    type: "int"
                    required: true
                    validation: "non-zero"
                  - name: "name"
                    type: "string"
                    required: true
                partition_spec:
                  - column: "timestamp"
                    transform: "day"
              # Option B: schema registry (partitioning defined in Glue)
              # registry:
              #   type: "aws_glue"
              #   registry_name: "my-registry"
              #   schema_name: "my-schema"
            # optional — mismatch handling
            on_type_mismatch: "coerce_or_null"  # coerce_or_null | coerce_or_dlq
            on_missing_required: "dlq"          # dlq
            on_no_match: "dlq"                  # dlq | drop
        # optional — inherited from s3 sink
        threshold:
          event_count: 2000
          maximum_size: "50mb"
          event_collect_timeout: "30s"
        buffer_type: "inmemory"
        max_retries: 3
        dlq:
          s3:
            bucket: "my-dlq-bucket"
            key_path_prefix: "s3tables-failures/"
    - stdout:
```

Mismatch handling at runtime:

| Scenario | Behavior |
|---|---|
| Exact match | Write |
| Extra fields in event | Drop silently |
| Missing optional field | NULL |
| Missing required field | DLQ |
| Wrong type, coercible (e.g., "123" → 123) | Coerce and write |
| Wrong type, not coercible, optional column | NULL |
| Wrong type, not coercible, required column | DLQ |
| No fields match schema | DLQ |
Bad events are handled per-event. One bad event never fails the entire batch.
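The table above can be sketched as a per-field decision function. A minimal illustration assuming `on_type_mismatch: coerce_or_null`; the function name and return shape are hypothetical, not actual plugin code:

```python
# Hypothetical per-field sketch of the mismatch-handling table above,
# assuming on_type_mismatch: coerce_or_null. Not actual Data Prepper code.
from typing import Any, Optional, Tuple

# Column type names from the proposed config mapped to Python types.
_PY_TYPES = {"int": int, "float": float, "string": str}

def resolve_field(value: Any, col_type: str, required: bool) -> Tuple[str, Optional[Any]]:
    """Return ("write", value), ("null", None), or ("dlq", None) for one field."""
    py_type = _PY_TYPES[col_type]
    if value is None:  # field missing from the event
        return ("dlq", None) if required else ("null", None)
    if isinstance(value, py_type) and not isinstance(value, bool):
        return ("write", value)  # exact match
    try:
        return ("write", py_type(value))  # coercible, e.g. "123" -> 123
    except (TypeError, ValueError):
        # not coercible: NULL for optional columns, DLQ for required ones
        return ("dlq", None) if required else ("null", None)
```

Because the decision is made field by field and event by event, one bad event routes to the DLQ (or nulls a column) without failing the rest of the batch.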
**Describe alternatives you've considered**
- **New standalone plugin** — A dedicated `s3_tables` sink plugin separate from the existing S3 sink. Decided against this because much of the logic is shared with the existing S3 sink.
- **Automatic schema inference and evolution** — Having Data Prepper infer the schema from incoming events and automatically evolve the table when new fields appear. Avoiding this for now because, in multi-node deployments, different nodes may see different data shapes, leading to type conflicts and race conditions. A static schema provided via config or registry gives a clear single source of truth.