
Add S3 ingester for AWS cloud storage document ingestion #3

Closed
rorygraves wants to merge 3 commits into main from feature/s3-ingester

Conversation

@rorygraves
Contributor

Summary

  • Adds S3 document ingestion source for automated document retrieval from AWS S3 buckets
  • Supports multiple authentication methods: explicit credentials, IAM roles, cross-account role assumption
  • Includes pattern-based filtering, prefix filtering, and pagination for large buckets
  • Full documentation and comprehensive test coverage
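
Put together, the source described above might be configured with an application.conf fragment along these lines (key names are illustrative assumptions, not necessarily the PR's actual schema):

```hocon
# Hypothetical S3 source entry; key names are illustrative.
ingestion.sources = [
  {
    type   = "s3"
    bucket = "my-docs-bucket"
    prefix = "reports/"          # optional key-prefix filter
    region = "eu-west-1"
    # Omit credentials to fall back to the default provider chain (IAM role),
    # or set a role ARN for cross-account role assumption:
    # role-arn = "arn:aws:iam::123456789012:role/DocReader"
    include-patterns = ["*.md", "*.txt"]
  }
]
```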

Changes

  • build.sbt: Added AWS SDK v2 dependencies (S3 + STS for role assumption)
  • IngestionConfig.scala: New S3SourceConfig case class with config parsing
  • IngestionService.scala: S3 ingestion implementation with AWS SDK v2
  • IngestionRoutes.scala: Added S3 source to route response mapping
  • application.conf: Example S3 source configuration
  • IngestionConfigSpec.scala: 7 new tests for S3 configuration parsing
  • docs/s3-ingestion.md: Comprehensive S3 ingestion documentation
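
As a rough sketch, the new S3SourceConfig case class could take a shape like the following; the field names and defaults here are assumptions, and the real definition lives in IngestionConfig.scala:

```scala
// Hypothetical sketch of S3SourceConfig; the actual fields and
// defaults are defined in IngestionConfig.scala.
final case class S3SourceConfig(
  bucket: String,
  prefix: Option[String] = None,
  region: String = "us-east-1",
  accessKeyId: Option[String] = None,     // explicit credentials (optional)
  secretAccessKey: Option[String] = None,
  roleArn: Option[String] = None,         // cross-account role assumption
  includePatterns: List[String] = List("*.md", "*.txt")
) {
  // Both halves of an explicit credential pair must be present.
  def hasExplicitCredentials: Boolean =
    accessKeyId.isDefined && secretAccessKey.isDefined
}
```

Modelling credentials as paired `Option`s lets a config parser distinguish the three advertised auth modes: explicit keys, a role ARN, or neither (falling back to the ambient IAM role).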

Test plan

  • S3SourceConfig parsing tests pass
  • All 242 existing tests pass
  • Documentation covers authentication methods, IAM permissions, and examples
  • Manual testing with real S3 bucket (requires AWS credentials)

Implements S3 document source for automated ingestion from AWS S3 buckets:
- S3SourceConfig for bucket, prefix, region, credentials, and pattern filtering
- S3 ingestion logic using AWS SDK v2 with pagination support
- Multiple authentication methods: explicit credentials, IAM roles, role assumption
- Pattern-based file filtering and metadata extraction
- Configuration examples in application.conf
- Comprehensive test suite for config parsing
- Full documentation in docs/s3-ingestion.md
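
The pagination support mentioned above can be sketched independently of the AWS SDK: S3's ListObjectsV2 API returns results in pages linked by a continuation token, so the listing loop keeps fetching until no token remains. Here `fetchPage` is a hypothetical stand-in for the SDK call, not the PR's actual code:

```scala
import scala.annotation.tailrec

object S3Pagination {
  // Continuation-token pagination loop in the style of ListObjectsV2:
  // each call to fetchPage takes the previous token (None on the first
  // call) and returns a batch of keys plus an optional next token.
  def listAllKeys(
    fetchPage: Option[String] => (List[String], Option[String])
  ): List[String] = {
    @tailrec
    def loop(token: Option[String], acc: List[String]): List[String] = {
      val (keys, next) = fetchPage(token)
      val all = acc ++ keys
      next match {
        case Some(_) => loop(next, all) // more pages remain
        case None    => all             // final page reached
      }
    }
    loop(None, Nil)
  }
}
```

With the real SDK, `fetchPage` would wrap `s3Client.listObjectsV2`, threading the response's continuation token back into the next request.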
@codecov

codecov bot commented Jan 8, 2026

Codecov Report

❌ Patch coverage is 17.88618% with 101 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/llm4s/ragbox/ingestion/IngestionService.scala | 0.00% | 95 Missing ⚠️ |
| ...cala/org/llm4s/ragbox/routes/IngestionRoutes.scala | 0.00% | 6 Missing ⚠️ |


The S3 ingestion code requires AWS credentials to test, similar to
other infrastructure code like database ingestion. Configure patch
coverage as informational to prevent blocking PRs while maintaining
project-level coverage requirements.
@rorygraves
Contributor Author

@codex please review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f85fc978e9

Comment on lines +526 to +528
```scala
try {
  val reader  = new BufferedReader(new InputStreamReader(response, StandardCharsets.UTF_8))
  val content = new StringBuilder()
```

P2: Avoid treating binary S3 objects as UTF-8 text

The S3 ingestion path reads every matched object through a UTF-8 BufferedReader and feeds the result to ingestText. Any non-text format the feature explicitly advertises (e.g., the default *.pdf pattern) will therefore be decoded as UTF-8 bytes, producing garbage or empty content, or decode errors, so those documents won't be indexed correctly. This will surface whenever the bucket contains PDFs or other binary formats; consider routing through the same document loaders used by directory ingestion, or restricting patterns to text-only extensions.


S3 ingestion reads files as UTF-8 text, so binary formats like PDF,
DOCX, and images produce garbage content. This change:

- Updates default patterns from ["*.md", "*.txt", "*.pdf"] to
  ["*.md", "*.txt", "*.json", "*.xml", "*.html", "*.csv"]
- Adds clear documentation that S3 only supports text formats
- Adds runtime warning when binary extensions are configured
- Updates all documentation examples to use text-only patterns
- Adds Limitations section to S3 ingestion guide

For binary format support (PDF, DOCX, etc.), users should download
files locally and use directory ingestion with DirectoryLoader.

Fixes review feedback about UTF-8 decoding corrupting binary files.
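
The runtime warning for binary extensions described above could be implemented along these lines (a sketch with illustrative names, not the PR's actual code): scan the configured patterns for extensions known to be binary and report any matches, which the service could then log at startup.

```scala
object BinaryPatternCheck {
  // Extensions the text-only S3 path cannot decode meaningfully.
  val binaryExtensions: Set[String] =
    Set("pdf", "docx", "doc", "png", "jpg", "jpeg", "gif", "zip")

  // Returns the configured glob patterns whose extension is a known
  // binary format, so callers can emit a warning for each one.
  def binaryPatterns(patterns: List[String]): List[String] =
    patterns.filter { p =>
      val ext = p.split('.').lastOption.map(_.toLowerCase).getOrElse("")
      binaryExtensions.contains(ext)
    }
}
```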
@rorygraves rorygraves closed this Jan 11, 2026
@rorygraves rorygraves deleted the feature/s3-ingester branch January 11, 2026 22:50