[Feature] S3 Tables Iceberg Support in S3 Sink #6652

@jmsusanto

Description

Is your feature request related to a problem? Please describe.

We want to provide S3 Tables (Iceberg Parquet) as an additional sink option for Data Prepper pipelines, enabling structured event data to be written directly into managed Apache Iceberg tables in S3.

Describe the solution you'd like

Add S3 Tables support to the existing S3 sink via a new s3_tables codec.

The current plan is for the schema to always be provided through the pipeline configuration. Exactly one schema source must be specified:

  • schema.definition — explicit column definitions with optional table creation and partition spec
  • schema.registry — reference to an external schema registry (mainly AWS Glue at the moment)
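The exactly-one constraint could be validated at pipeline startup. A minimal sketch, assuming a plain dict-shaped config (the function name and shape are illustrative, not the actual plugin API):

```python
def validate_schema_config(schema: dict) -> str:
    """Ensure exactly one of 'definition' or 'registry' is present.

    Returns the chosen source key, or raises if zero or both are given.
    (Hypothetical helper; the real plugin's config validation may differ.)
    """
    sources = [key for key in ("definition", "registry") if key in schema]
    if len(sources) != 1:
        raise ValueError(
            f"schema must specify exactly one of 'definition' or 'registry', "
            f"got {sources or 'none'}"
        )
    return sources[0]
```

Failing fast here keeps a misconfigured pipeline from starting at all, rather than surfacing the problem only once events begin to flow.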

Proposed pipeline configuration:

s3-tables-pipeline:
  source:
    http:
      ssl: false
      port: 2021
      path: "/log/ingest"

  processors:
    # mutate, filter, etc.

  sink:
    - pipeline:
        name: "pipeline_table_bucket"


pipeline_table_bucket:
  source:
    pipeline:
      name: "s3-tables-pipeline"

  processors:
    # mutate, filter, etc.

  sink:
    - s3:
        aws:
          region: "us-east-2"                                              # required
          sts_role_arn: "arn:aws:iam::123456789012:role/DataPrepperRole"   # optional

        codec:
          s3_tables:
            # required — table location
            table_bucket_arn: "arn:aws:s3tables:us-east-2:123456789012:bucket/my-bucket"
            namespace: "my_namespace"
            table_name: "my_table"

            # required — schema (exactly one of definition or registry)
            schema:

              # Option A: explicit definition
              definition:
                create_table_if_not_exists: true
                columns:
                  - name: "x"
                    type: "float"
                    required: false
                  - name: "y"
                    type: "float"
                    required: false
                  - name: "level"
                    type: "int"
                    required: true
                    validation: "non-zero"
                  - name: "name"
                    type: "string"
                    required: true
                partition_spec:
                  - column: "timestamp"
                    transform: "day"

              # Option B: schema registry (partitioning defined in Glue)
              # registry:
              #   type: "aws_glue"
              #   registry_name: "my-registry"
              #   schema_name: "my-schema"

            # optional — mismatch handling
            on_type_mismatch: "coerce_or_null"            # coerce_or_null | coerce_or_dlq
            on_missing_required: "dlq"                    # dlq
            on_no_match: "dlq"                            # dlq | drop

        # optional — inherited from s3 sink
        threshold:
          event_count: 2000
          maximum_size: "50mb"
          event_collect_timeout: "30s"
        buffer_type: "inmemory"
        max_retries: 3
        dlq:
          s3:
            bucket: "my-dlq-bucket"
            key_path_prefix: "s3tables-failures/"

    - stdout:

Mismatch handling at runtime:

| Scenario | Behavior |
| --- | --- |
| Exact match | Write |
| Extra fields in event | Drop silently |
| Missing optional field | NULL |
| Missing required field | DLQ |
| Wrong type, coercible (e.g., "123" → 123) | Coerce and write |
| Wrong type, not coercible, optional column | NULL |
| Wrong type, not coercible, required column | DLQ |
| No fields match schema | DLQ |

Bad events are handled per-event. One bad event never fails the entire batch.
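The table above could map onto per-event logic roughly like the sketch below. It is purely illustrative: `Column`, `map_event`, and the use of Python types to stand in for Iceberg types are assumptions, not the proposed implementation.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    type: type      # Python type standing in for the Iceberg column type
    required: bool

def map_event(event: dict, columns: list[Column]):
    """Return (row, None) on success, or (None, reason) when the event
    should be routed to the DLQ. Extra event fields are dropped silently
    because only schema columns are ever copied into the row."""
    row = {}
    matched = 0
    for col in columns:
        value = event.get(col.name)
        if value is None:
            if col.required:
                return None, f"missing required field '{col.name}'"
            row[col.name] = None             # missing optional field -> NULL
            continue
        matched += 1
        if isinstance(value, col.type):
            row[col.name] = value            # exact type match -> write
            continue
        try:
            row[col.name] = col.type(value)  # coercible -> coerce and write
        except (TypeError, ValueError):
            if col.required:
                return None, f"uncoercible value for required column '{col.name}'"
            row[col.name] = None             # uncoercible optional -> NULL
    if matched == 0:
        return None, "no fields match schema"
    return row, None
```

Since the function returns a per-event verdict, a batch writer can accumulate the good rows and divert only the `(None, reason)` events to the DLQ, matching the rule that one bad event never fails the entire batch.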

Describe alternatives you've considered

  • New standalone plugin — A dedicated s3_tables sink plugin separate from the existing S3 sink. Decided against this because much of the logic is shared with the existing S3 sink.
  • Automatic schema inference and evolution — Having Data Prepper infer the schema from incoming events and automatically evolve the table when new fields appear. Avoiding this for now because in multi-node deployments, different nodes may see different data shapes, leading to type conflicts and race conditions. A static schema provided via config or registry gives a clear single source of truth.
