Add S3 ingester for AWS cloud storage document ingestion #3
rorygraves wants to merge 3 commits into main
Implements S3 document source for automated ingestion from AWS S3 buckets:
- S3SourceConfig for bucket, prefix, region, credentials, and pattern filtering
- S3 ingestion logic using AWS SDK v2 with pagination support
- Multiple authentication methods: explicit credentials, IAM roles, role assumption
- Pattern-based file filtering and metadata extraction
- Configuration examples in application.conf
- Comprehensive test suite for config parsing
- Full documentation in docs/s3-ingestion.md
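The pattern-based filtering mentioned above could look roughly like the following. This is a minimal sketch, not the PR's actual code: `KeyFilter`, `globToRegex`, and `matchesAny` are hypothetical names, and it assumes patterns are simple globs (`*` and `?`) applied to the file-name part of each S3 key.

```scala
import java.util.regex.Pattern

object KeyFilter {
  // Hypothetical helper: translate a simple glob into a regex.
  // Only '*' and '?' are treated as wildcards; everything else is literal.
  private def globToRegex(glob: String): Pattern = {
    val sb = new StringBuilder
    glob.foreach {
      case '*' => sb.append("[^/]*")
      case '?' => sb.append("[^/]")
      case c   => sb.append(Pattern.quote(c.toString))
    }
    Pattern.compile(sb.toString)
  }

  // True when the file-name part of the key matches at least one pattern.
  // Keys ending in '/' (folder markers) have an empty file name and never match.
  def matchesAny(key: String, patterns: Seq[String]): Boolean = {
    val fileName = key.substring(key.lastIndexOf('/') + 1)
    fileName.nonEmpty &&
      patterns.exists(p => globToRegex(p).matcher(fileName).matches())
  }
}
```

Matching on the file name rather than the full key keeps a pattern such as `*.md` independent of the configured prefix; whether the PR matches on the full key instead is not stated here.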
Codecov Report❌ Patch coverage is
The S3 ingestion code requires AWS credentials to test, similar to other infrastructure code like database ingestion. Configure patch coverage as informational to prevent blocking PRs while maintaining project-level coverage requirements.
@codex please review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f85fc978e9
```scala
try {
  val reader = new BufferedReader(new InputStreamReader(response, StandardCharsets.UTF_8))
  val content = new StringBuilder()
```
Avoid treating binary S3 objects as UTF-8 text
The S3 ingestion path reads every matched object via a UTF-8 BufferedReader and then feeds the result to ingestText. This means any non-text formats that the feature explicitly advertises (e.g., the default *.pdf pattern) will have their bytes decoded as UTF-8 text, producing garbage/empty content or decode errors, so those documents won't be correctly indexed. This will surface whenever the bucket contains PDFs or other binary formats; consider routing through the same document loaders used by directory ingestion or restricting patterns to text-only extensions.
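One way to surface this problem instead of silently indexing garbage is to decode strictly and reject input that is not valid UTF-8. The sketch below is an illustration under assumptions, not the PR's code: `StrictText` and `decodeUtf8` are hypothetical names, and it reads the whole stream into memory (`readAllBytes` needs Java 9 or later), which may not suit very large objects.

```scala
import java.io.InputStream
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

object StrictText {
  // Hypothetical helper: decode a stream as UTF-8, failing on malformed bytes
  // rather than substituting replacement characters.
  def decodeUtf8(in: InputStream): Either[String, String] = {
    val bytes = in.readAllBytes() // loads the full object into memory
    val decoder = StandardCharsets.UTF_8
      .newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch {
      case e: CharacterCodingException =>
        Left(s"not valid UTF-8: ${e.getMessage}")
    }
  }
}
```

A caller could skip or log objects that come back as `Left` and pass only `Right` values on to text ingestion. Note that a binary file can occasionally happen to be valid UTF-8, so this check complements extension filtering rather than replacing it.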
S3 ingestion reads files as UTF-8 text, so binary formats like PDF, DOCX, and images produce garbage content. This change:
- Updates default patterns from ["*.md", "*.txt", "*.pdf"] to ["*.md", "*.txt", "*.json", "*.xml", "*.html", "*.csv"]
- Adds clear documentation that S3 only supports text formats
- Adds runtime warning when binary extensions are configured
- Updates all documentation examples to use text-only patterns
- Adds Limitations section to S3 ingestion guide

For binary format support (PDF, DOCX, etc.), users should download files locally and use directory ingestion with DirectoryLoader.

Fixes review feedback about UTF-8 decoding corrupting binary files.
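The runtime warning for binary extensions might be implemented along these lines. This is a hedged sketch: `PatternCheck`, `binaryPatterns`, and the extension list are assumptions for illustration, and the actual set of extensions and the logging call used in the PR are not shown here.

```scala
import java.util.Locale

object PatternCheck {
  // Assumed list of binary extensions; the PR's actual list may differ.
  private val binaryExtensions: Set[String] =
    Set("pdf", "docx", "doc", "png", "jpg", "jpeg", "gif", "zip")

  // Returns the configured patterns whose extension looks binary,
  // so the caller can log a warning for each one.
  def binaryPatterns(patterns: Seq[String]): Seq[String] =
    patterns.filter { p =>
      val dot = p.lastIndexOf('.')
      dot >= 0 &&
        binaryExtensions.contains(p.substring(dot + 1).toLowerCase(Locale.ROOT))
    }
}
```

At startup, a caller could run `PatternCheck.binaryPatterns(config.patterns)` (where `config.patterns` is a hypothetical field) and emit one warning per returned pattern, leaving the configuration itself unchanged.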
Summary
Changes
- S3SourceConfig case class with config parsing

Test plan