[Feature] S3 Tables Iceberg Support in S3 Sink #6652

@jmsusanto

Description

Is your feature request related to a problem? Please describe.

We want to provide S3 Tables (Iceberg Parquet) as an additional sink option for Data Prepper pipelines, enabling structured event data to be written directly into managed Apache Iceberg tables in S3.

Describe the solution you'd like

Add S3 Tables support to the existing S3 sink via a new s3_tables codec.

The current plan is for the schema to always be provided through the pipeline configuration. Exactly one schema source must be specified:

  • schema.definition — explicit column definitions with optional table creation and partition spec
  • schema.registry — reference to an external schema registry (mainly AWS Glue at the moment)
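The exactly-one constraint could be validated at pipeline startup. A minimal sketch, assuming a plain dict-shaped config (the function name and shape are illustrative, not the actual plugin API):

```python
def validate_schema_config(schema: dict) -> str:
    """Ensure exactly one of 'definition' or 'registry' is present.

    Returns the chosen source key, or raises if zero or both are given.
    (Hypothetical helper; the real plugin's config validation may differ.)
    """
    sources = [key for key in ("definition", "registry") if key in schema]
    if len(sources) != 1:
        raise ValueError(
            f"schema must specify exactly one of 'definition' or 'registry', "
            f"got {sources or 'none'}"
        )
    return sources[0]
```

Failing fast here keeps a misconfigured pipeline from starting at all, rather than surfacing the problem only once events begin to flow.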

Proposed pipeline configuration:

s3-tables-pipeline:
  source:
    http:
      ssl: false
      port: 2021
      path: "/log/ingest"

  processors:
    # mutate, filter, etc.

  sink:
    - pipeline:
        name: "pipeline_table_bucket"


pipeline_table_bucket:
  source:
    pipeline:
      name: "s3-tables-pipeline"

  processors:
    # mutate, filter, etc.

  sink:
    - s3:
        aws:
          region: "us-east-2"                                              # required
          sts_role_arn: "arn:aws:iam::123456789012:role/DataPrepperRole"   # optional

        codec:
          s3_tables:
            # required — table location
            table_bucket_arn: "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket"
            namespace: "my_namespace"
            table_name: "my_table"

            # required — schema (exactly one of definition or registry)
            schema:

              # Option A: explicit definition
              definition:
                create_table_if_not_exists: true
                columns:
                  - name: "x"
                    type: "float"
                    required: false
                  - name: "y"
                    type: "float"
                    required: false
                  - name: "level"
                    type: "int"
                    required: true
                    validation: "non-zero"
                  - name: "name"
                    type: "string"
                    required: true
                partition_spec:
                  - column: "timestamp"
                    transform: "day"

              # Option B: schema registry (partitioning defined in Glue)
              # registry:
              #   type: "aws_glue"
              #   registry_name: "my-registry"
              #   schema_name: "my-schema"

            # optional — mismatch handling
            on_type_mismatch: "coerce_or_null"            # coerce_or_null | coerce_or_dlq
            on_missing_required: "dlq"                    # dlq
            on_no_match: "dlq"                            # dlq | drop

        # optional — inherited from s3 sink
        threshold:
          event_count: 2000
          maximum_size: "50mb"
          event_collect_timeout: "30s"
        buffer_type: "inmemory"
        max_retries: 3
        dlq:
          s3:
            bucket: "my-dlq-bucket"
            key_path_prefix: "s3tables-failures/"

    - stdout:

Mismatch handling at runtime:

| Scenario | Behavior |
| --- | --- |
| Exact match | Write |
| Extra fields in event | Drop silently |
| Missing optional field | NULL |
| Missing required field | DLQ |
| Wrong type, coercible (e.g., "123" → 123) | Coerce and write |
| Wrong type, not coercible, optional column | NULL |
| Wrong type, not coercible, required column | DLQ |
| No fields match schema | DLQ |

Bad events are handled per-event. One bad event never fails the entire batch.
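The table above could map onto per-event logic roughly like the sketch below. It is purely illustrative: `Column`, `map_event`, and the use of Python types to stand in for Iceberg types are assumptions, not the proposed implementation.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    type: type      # Python type standing in for the Iceberg column type
    required: bool

def map_event(event: dict, columns: list[Column]):
    """Return (row, None) on success, or (None, reason) when the event
    should be routed to the DLQ. Extra event fields are dropped silently
    because only schema columns are ever copied into the row."""
    row = {}
    matched = 0
    for col in columns:
        value = event.get(col.name)
        if value is None:
            if col.required:
                return None, f"missing required field '{col.name}'"
            row[col.name] = None             # missing optional field -> NULL
            continue
        matched += 1
        if isinstance(value, col.type):
            row[col.name] = value            # exact type match -> write
            continue
        try:
            row[col.name] = col.type(value)  # coercible -> coerce and write
        except (TypeError, ValueError):
            if col.required:
                return None, f"uncoercible value for required column '{col.name}'"
            row[col.name] = None             # uncoercible optional -> NULL
    if matched == 0:
        return None, "no fields match schema"
    return row, None
```

Since the function returns a per-event verdict, a batch writer can accumulate the good rows and divert only the `(None, reason)` events to the DLQ, matching the rule that one bad event never fails the entire batch.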

Describe alternatives you've considered

  • New standalone plugin — A dedicated s3_tables sink plugin separate from the existing S3 sink. Decided against this because much of the logic is shared with the existing S3 sink.
  • Automatic schema inference and evolution — Having Data Prepper infer the schema from incoming events and automatically evolve the table when new fields appear. Avoiding this for now because in multi-node deployments, different nodes may see different data shapes, leading to type conflicts and race conditions. A static schema provided via config or registry gives a clear single source of truth.
