Skip to content

[MEDI] Allow Pipeline and Reader to work with any Source#7090

Closed
adamsitnik wants to merge 9 commits intodotnet:mainfrom
adamsitnik:anySource
Closed

[MEDI] Allow Pipeline and Reader to work with any Source#7090
adamsitnik wants to merge 9 commits intodotnet:mainfrom
adamsitnik:anySource

Conversation

@adamsitnik
Copy link
Copy Markdown
Member

@adamsitnik adamsitnik commented Nov 27, 2025

The idea is to introduce a new interface, called IIngestionDocumentReader, where the generic type parameter specifies the source. Source can be anything (FileInfo, Stream but also int or Guid or custom user type). It's up to the reader to get the document for given source (parse a file or read it from DB or somewhere else).

public interface IIngestionDocumentReader<TSource>
{
    Task<IngestionDocument> ReadAsync(TSource source, string identifier, string? mediaType = null, CancellationToken cancellationToken = default);
}

The IngestionDocumentReader class remains, we are still opinionated and believe that readers should implement FileInfo and Stream support. It implements the new interface.
But we also know that it's not enough for all the scenarios, so Pipline no longer accepts the old reader class, but something that implements the new interface.

Because of that, the pipeline now needs to specify two generic type arguments instead of one. This is a breaking change.
Moreover, pipeline itself no longer defines methods for processing multiple files or directory. This functionality was moved to extension methods.

fixes #7082

Microsoft Reviewers: Open in CodeFlow

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the data ingestion pipeline to be generic over source types, allowing it to work with any source type beyond just FileInfo and Stream. The key change introduces IIngestionDocumentReader<TSource> interface and converts IngestionPipeline to IngestionPipeline<TSource, TChunk>.

Key changes:

  • Introduces IIngestionDocumentReader<TSource> interface to enable generic source type support
  • Refactors IngestionPipeline<T> to IngestionPipeline<TSource, TChunk> with two type parameters
  • Extracts file system-specific operations into FileSystemIngestionExtensions class
  • Moves media type detection logic to MediaTypeProvider utility class

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
IIngestionDocumentReader.cs New generic interface for document readers with any source type
IngestionDocumentReader.cs Updated abstract class to implement new interface and use MediaTypeProvider
IngestionPipeline.cs Converted to generic over source type; file system operations moved to extensions
FileSystemIngestionExtensions.cs New extension class with FileInfo-specific processing methods
MediaTypeProvider.cs New utility class for media type detection, extracted from IngestionDocumentReader
MarkdownReader.cs, MarkItDownReader.cs, MarkItDownMcpReader.cs Updated to make mediaType parameter nullable
Test files Updated to use generic type parameters and new test cases
Comments suppressed due to low confidence (1)

src/Libraries/Microsoft.Extensions.DataIngestion.Abstractions/IngestionDocumentReader.cs:1

  • The .markdown and .md extensions mapping is duplicated in both IngestionDocumentReader.cs (line 41) and MediaTypeProvider.cs (line 41). Since MediaTypeProvider is now the centralized location for media type mappings and is being linked into this project, this duplication should be removed.
// Licensed to the .NET Foundation under one or more agreements.

@adamsitnik adamsitnik requested a review from a team as a code owner November 27, 2025 16:37
@stephentoub
Copy link
Copy Markdown
Member

@adamsitnik what's the status of this PR?

@rainerllera
Copy link
Copy Markdown

hey @adamsitnik any updates on when this will be merged and released?

@adamsitnik
Copy link
Copy Markdown
Member Author

hey @adamsitnik any updates on when this will be merged and released?

This is a breaking change, which is acceptable for preview, but I would like to include it in Preview 2, which I am for to be published in March.

@stephentoub Please let me know if you believe it's not worth waiting with Preview 2 to introduce breaking changes.

@adamsitnik
Copy link
Copy Markdown
Member Author

@eiriktsarpalis please review this PR from API design point of view (#7082 has all the context). If you believe it's Ok, I am going to ask Copilot to re-do it against the https://github.com/dotnet/extensions/tree/data-ingestion-preview2 branch, review it myself (to see how well it did) and then ask you for another review.

Copy link
Copy Markdown
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I would make the following recommendations:

  1. Hide the document enumeration/parsing concerns from the pipeline itself by having it accept an IAsyncEnumerable<IngestionDocument> representing the batch(es) of documents being ingested. The library should be providing helpers that make constructing that IAE easier. The filesystem reader could then be repurposed as a static method returning such an IAE.
  2. Repurpose IIngestionDocumentReader<TSource> as a non-generic, stream-specific document parsing abstraction, i.e. Task<IngestionDocument> ReadAsync(Stream stream, CancellationToken cancellationToken); that can be used for document sources that read from files (file system, blob storage, etc.)
  3. Consider making the chunk abstraction non-generic, e.g. by replicating the design of the AIContent abstraction in Microsoft.Extensions.AI.
  4. Consider exploring possible ways in which users could optionally utilize System.Linq.AsyncEnumerable for defining ingestion piplines declaratively.

@adamsitnik adamsitnik closed this Mar 18, 2026
@adamsitnik
Copy link
Copy Markdown
Member Author

Big thanks for a great suggestion @eiriktsarpalis ! As soon as #7396 is merged, I am going to try this attempt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Using IngestionPipeline for content not originating from the file system

5 participants