[MEDI] Allow Pipeline and Reader to work with any Source#7090
[MEDI] Allow Pipeline and Reader to work with any Source#7090adamsitnik wants to merge 9 commits intodotnet:mainfrom
Conversation
…gestionDocumentReader
…umber of breaking changes
…ut for MarkItDownMcpReader
…ke TSource and transfrom it into TChunk(s)
There was a problem hiding this comment.
Pull request overview
This PR refactors the data ingestion pipeline to be generic over source types, allowing it to work with any source type beyond just FileInfo and Stream. The key change introduces IIngestionDocumentReader<TSource> interface and converts IngestionPipeline to IngestionPipeline<TSource, TChunk>.
Key changes:
- Introduces
IIngestionDocumentReader<TSource>interface to enable generic source type support - Refactors
IngestionPipeline<T>toIngestionPipeline<TSource, TChunk>with two type parameters - Extracts file system-specific operations into
FileSystemIngestionExtensionsclass - Moves media type detection logic to
MediaTypeProviderutility class
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| IIngestionDocumentReader.cs | New generic interface for document readers with any source type |
| IngestionDocumentReader.cs | Updated abstract class to implement new interface and use MediaTypeProvider |
| IngestionPipeline.cs | Converted to generic over source type; file system operations moved to extensions |
| FileSystemIngestionExtensions.cs | New extension class with FileInfo-specific processing methods |
| MediaTypeProvider.cs | New utility class for media type detection, extracted from IngestionDocumentReader |
| MarkdownReader.cs, MarkItDownReader.cs, MarkItDownMcpReader.cs | Updated to make mediaType parameter nullable |
| Test files | Updated to use generic type parameters and new test cases |
Comments suppressed due to low confidence (1)
src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentReader.cs:1
- The
.markdownand.mdextensions mapping is duplicated in bothIngestionDocumentReader.cs(line 41) andMediaTypeProvider.cs(line 41). SinceMediaTypeProvideris now the centralized location for media type mappings and is being linked into this project, this duplication should be removed.
// Licensed to the .NET Foundation under one or more agreements.
src/Libraries/Microsoft.Extensions.DataIngestion/FileSystemIngestionExtensions.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.DataIngestion/FileSystemIngestionExtensions.cs
Outdated
Show resolved
Hide resolved
|
@adamsitnik what's the status of this PR? |
|
hey @adamsitnik any updates on when this will be merged and released? |
This is a breaking change, which is acceptable for preview, but I would like to include it in Preview 2, which I am for to be published in March. @stephentoub Please let me know if you believe it's not worth waiting with Preview 2 to introduce breaking changes. |
|
@eiriktsarpalis please review this PR from API design point of view (#7082 has all the context). If you believe it's Ok, I am going to ask Copilot to re-do it against the https://github.com/dotnet/extensions/tree/data-ingestion-preview2 branch, review it myself (to see how well it did) and then ask you for another review. |
eiriktsarpalis
left a comment
There was a problem hiding this comment.
As discussed offline, I would make the following recommendations:
- Hide the document enumeration/parsing concerns from the pipeline itself by having it accept an
IAsyncEnumerable<IngestionDocument>representing the batch(es) of documents being ingested. The library should be providing helpers that make constructing that IAE easier. The filesystem reader could then be repurposed as a static method returning such an IAE. - Repurpose
IIngestionDocumentReader<TSource>as a non-generic, stream-specific document parsing abstraction, i.e.Task<IngestionDocument> ReadAsync(Stream stream, CancellationToken cancellationToken);that can be used for document sources that read from files (file system, blob storage, etc.) - Consider making the chunk abstraction non-generic, e.g. by replicating the design of the
AIContentabstraction in Microsoft.Extensions.AI. - Consider exploring possible ways in which users could optionally utilize System.Linq.AsyncEnumerable for defining ingestion piplines declaratively.
|
Big thanks for a great suggestion @eiriktsarpalis ! As soon as #7396 is merged, I am going to try this attempt. |
The idea is to introduce a new interface, called
IIngestionDocumentReader, where the generic type parameter specifies the source. Source can be anything (FileInfo, Stream but also int or Guid or custom user type). It's up to the reader to get the document for given source (parse a file or read it from DB or somewhere else).The
IngestionDocumentReaderclass remains, we are still opinionated and believe that readers should implementFileInfoandStreamsupport. It implements the new interface.But we also know that it's not enough for all the scenarios, so Pipline no longer accepts the old reader class, but something that implements the new interface.
Because of that, the pipeline now needs to specify two generic type arguments instead of one. This is a breaking change.
Moreover, pipeline itself no longer defines methods for processing multiple files or directory. This functionality was moved to extension methods.
fixes #7082
Microsoft Reviewers: Open in CodeFlow