Explorations for object streams by LaurenzV · Pull Request #79 · typst/pdf-writer

LaurenzV · 2026-05-14T09:12:35Z

Please note that this PR was fully AI-generated and is not intended to be merged like this. For a mergeable version, we obviously want more tests, a cleaner API, and an actual cross-check of the impl against the PDF reference. I'm not asking for a code review here; I just want to have something to get the conversation going and discuss the different trade-offs before committing to a direction.

Goal

Overall, our goal is to be able to leverage object streams in krilla. This is important because, especially for tagged PDFs, file sizes end up ballooning unnecessarily due to the large number of tag nodes usually contained in a PDF.

The hard part

Judging from a few vibe-coded experiments, getting a working version is not that hard; in my opinion, the hard part is finding a good trade-off between two different requirements:

Ensuring the implementation is performant (making use of multi-threading when creating and compressing the object streams), while at the same time making sure that we don't unnecessarily waste memory (by allocating and copying memory through repeated use of APIs like renumber).
Deciding the right split between what should live in pdf-writer and what can instead be moved to krilla. In addition, the new API should ideally remain as compatible as possible with existing pdf-writer APIs. However, as I will show below, I'm afraid we will have to make some compromises here.

Current state

Let me start by outlining the current state in krilla. Overall, thanks to a few refactors I performed, we are already in a much better position to implement object streams than we were 1-2 months ago.

Right now, we still have this single global ChunkContainer struct which basically collects everything we write during the serialization process grouped by a "super category" and then a "sub category". In the end, these categories are then basically copied one after another into the final PDF.

The super category split that I recently introduced is between non-stream chunks, mixed chunks and stream chunks. This means that in the final PDF, all non-stream objects will be grouped together, followed by mixed chunks (which are just from embedded PDFs) and finally all streams. This change already puts us in a very good position to implement object streams, because we conceptually already group everything that can be put in an object stream together (object streams can contain any objects except for other stream objects, which are prohibited).

Then, the sub category for each group basically consists of different "topics". For example, we put all objects that were generated from serializing color spaces into a different bucket than the ones from external graphics state. This is different from what existed two months ago, where each category actually had a Vec<Chunk>, and each object landed in a new chunk. However, this turned out to be a huge waste of memory, which is why each category is now represented by just a single Chunk. In my opinion, that is the better thing to do anyway.

Note there is no reason why we would have to group color space objects separately from annotations, it is purely for aesthetic reasons to make the final PDF more organized. In theory, we could just lump pretty much all non-stream objects into a single chunk. However, it's a nice property and I don't see this as a hindrance to implementing object streams. Therefore, I suggest we keep this grouping for now.
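To make the layout described above concrete, here is a minimal sketch of a container with one byte buffer per (super category, sub category) pair. All names (`SuperCategory`, `SubCategory`, `Container`) are hypothetical and only model the idea; this is not krilla's actual ChunkContainer, and plain byte buffers stand in for real Chunk values.

```rust
use std::collections::BTreeMap;

/// Hypothetical super categories, declared in the order they appear in the final PDF.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SuperCategory {
    NonStream,
    Mixed,
    Stream,
}

/// Hypothetical sub categories ("topics"); the real set is assumed to be larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SubCategory {
    ColorSpaces,
    ExtGStates,
    Annotations,
    Images,
}

/// One buffer per (super, sub) pair instead of a Vec of many tiny chunks.
#[derive(Default)]
struct Container {
    buckets: BTreeMap<(SuperCategory, SubCategory), Vec<u8>>,
}

impl Container {
    /// Appends serialized object bytes to the bucket for the given categories.
    fn push(&mut self, sup: SuperCategory, sub: SubCategory, bytes: &[u8]) {
        self.buckets.entry((sup, sub)).or_default().extend_from_slice(bytes);
    }

    /// Concatenates all buckets. The BTreeMap key order puts non-stream
    /// buckets first, then mixed, then stream buckets.
    fn finish(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for bytes in self.buckets.values() {
            out.extend_from_slice(bytes);
        }
        out
    }
}
```

With this shape, everything eligible for an object stream already sits in the `NonStream` buckets, regardless of the order in which objects were pushed.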

Once all chunks of each group have been created, we get to the final serialization step which actually concatenates them into a single PDF. This basically happens in two steps:

First, we do a single pass over all chunks to determine a new numbering of objects. There once again is no inherent reason why we have to do this, it just has the nice property that in the final PDFs, object numbers will be in ascending order.
Then, using the new number mapping we determined, we more or less just concatenate all chunks into a single PDF.

That's basically it.
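As a rough illustration of the two steps above, the following sketch builds a renumbering map in one pass and then emits the objects under their new numbers. `ChunkRefs` is a hypothetical stand-in that only tracks object numbers; the real pipeline operates on pdf-writer chunks and rewrites references inside the object bodies as well.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a chunk: only the object numbers it defines, in write order.
struct ChunkRefs {
    refs: Vec<i32>,
}

/// Step 1: walk all chunks in output order and hand out ascending numbers starting at 1.
fn build_renumbering(chunks: &[ChunkRefs]) -> HashMap<i32, i32> {
    let mut map = HashMap::new();
    let mut next = 1;
    for chunk in chunks {
        for &old in &chunk.refs {
            map.entry(old).or_insert_with(|| {
                let new = next;
                next += 1;
                new
            });
        }
    }
    map
}

/// Step 2: emit the objects in chunk order under their new numbers.
fn renumbered_order(chunks: &[ChunkRefs], map: &HashMap<i32, i32>) -> Vec<i32> {
    let mut out = Vec::new();
    for chunk in chunks {
        for old in &chunk.refs {
            out.push(map[old]);
        }
    }
    out
}
```

The extra pass in step 1 touches every object ref once, which is the cost the next section proposes to drop.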

Removing the renumbering step

Before talking more about object streams, my first proposal is to remove the renumbering step (i.e. step 1) from the krilla pipeline, except for embedded PDFs via hayro-write, because those always start with Ref 1. There are two reasons for this. 1) It's an unnecessary waste of time because we have to iterate over each object ref to create the mapping. I doubt it's very costly, but still, if it can be avoided, why not! But most importantly, 2) I'm really doubtful about being able to support the renumber_into API with chunks that contain an object stream. The problem is that object streams themselves are just stream objects, with the special property that they contain PDF-relevant data. In particular, the stream data couples the object numbers with the objects themselves. This means that if you have a flate-encoded object stream, there is no way of just renumbering its entries into a new chunk without decoding, renumbering and then encoding again. Therefore, I believe we will have to restrict the renumber_into API to not allow any object streams in the chunk, and panic if it contains one. For the chunk.extend function, however, this is fine, because we can just merge the xref entries as usual.
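The coupling mentioned above comes from the object stream layout in the PDF specification: the stream data starts with pairs of integers (object number, byte offset relative to /First), followed by the objects themselves. The sketch below builds such an uncompressed payload using only the standard library; the function name `object_stream_payload` is made up for illustration and is not a pdf-writer API.

```rust
/// Builds the *uncompressed* payload of an object stream.
/// Returns the payload and the value for the `/First` dictionary entry.
fn object_stream_payload(objects: &[(i32, &[u8])]) -> (Vec<u8>, usize) {
    let mut header = String::new();
    let mut body = Vec::new();
    for (number, data) in objects {
        // Each header pair is "<object number> <byte offset relative to /First>".
        header.push_str(&format!("{} {} ", number, body.len()));
        body.extend_from_slice(data);
        body.push(b'\n');
    }
    // `/First` is the offset of the first object, i.e. the header length.
    let first = header.len();
    let mut payload = header.into_bytes();
    payload.extend_from_slice(&body);
    (payload, first)
}
```

Because the object numbers live inside the payload, they are no longer visible as plain text once the payload has been flate-encoded, so renumbering would require a decode, rewrite and re-encode cycle.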

Therefore, as a first step I would suggest moving krilla away from making use of that API (except for hayro-write PDFs, which are stored separately as mixed chunks) so that we don't face this problem further down the road. Yes, my OCD also doesn't like the fact that numbers aren't consecutive anymore, but I think it's better to be practical here. 😄 Here is the change, and it's not too bad: LaurenzV/krilla@68f7d38

Object stream API

With the above resolved, let's now get to the real meat: how to implement object streams. Fundamentally, we need some API that 1) iterates over the objects in a chunk and segments them into groups of X (e.g. 100) objects (which is necessary because you are not supposed to put too many objects into a single object stream) and 2) does the actual creation of the object stream. 2) obviously needs to live in pdf-writer; the question is where 1) should live. If 1) lives in krilla, then we would basically have to make the indirect_objects iterator public, which is kind of ugly because exposing the raw body like that doesn't feel right.
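Part 1), the segmentation, is conceptually small. A minimal sketch, assuming the objects are already available as a slice (which is exactly the part that would require exposing something like the indirect_objects iterator); the function name `segment` is hypothetical:

```rust
/// Splits a list of objects into groups of at most `max_per_stream` entries.
/// Each group would later become one object stream.
fn segment<T>(objects: &[T], max_per_stream: usize) -> Vec<&[T]> {
    // `chunks` panics on a size of zero, so fail with a clearer message first.
    assert!(max_per_stream > 0, "batch size must be non-zero");
    objects.chunks(max_per_stream).collect()
}
```

For example, 250 objects with a limit of 100 yield groups of 100, 100 and 50 objects.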

Therefore, I think it's better if the chunking API lives in pdf-writer instead. However, this does have the disadvantage that it makes the API pretty ugly. The reason is that in krilla, we have the specific requirement that we want to be able to compress multiple stream objects at the same time using multi-threading, for performance reasons. Because of that, the chunking API in pdf-writer needs to be very generic to allow implementing this kind of behavior. So this is kind of what I ended up with for now. pdf-writer exposes this kind of API to split a Chunk into a range of compressed stream objects. krilla then basically consumes this API to create object streams if requested. Similarly to before, that was completely AI-generated, so please don't interpret this as me wanting it to be the final version; I think the code can be made less verbose. But before I put more time into this, I would like to get the high-level direction settled first.

Curious to hear your thoughts!

reknih · 2026-05-18T15:57:04Z

I have to admit, what would help me most to give more substantial feedback here is to have the new key functions and structs explained, esp. ObjectStreamBuilder. For example I am not entirely sure what its spawn parameter does. Likewise, what is the purpose of ObjectStreamFilter and the closure in ObjectStreamJob::build_with_filter? Does it expect to have a closure with the filtered bytes passed? I ask not because I want to review the AI code but to understand the control flow better.

What I like about the change is that, as far as I can see, the breakage for existing consumers is minimal (or none?).

I also assume that the fairly low-level design would allow us to not put certain sub categories in the stream? Thinking about the Document Information Dictionary and PieceInfo dicts.

LaurenzV · 2026-05-25T06:17:42Z

Fair enough, in that case I should probably spend some more time on turning this API into something that I'd actually consider usable from my side. As mentioned, it's possible that there is some unnecessary complexity.

For example I am not entirely sure what its spawn parameter does.

The spawn parameter is used for letting us launch asynchronous deflate jobs. As mentioned, the workflow is roughly that on the krilla side, we iterate over all chunks that contain non-stream objects. Those objects are then grouped into batches of ~100 objects, and once we have reached the threshold, we need some mechanism so that the object stream we just built can be flate-encoded in the background while we start processing the next batch of 100 objects on the main thread. This is to accommodate our requirement of parallelizing PDF creation as much as possible. I agree that having this generic API in pdf-writer is a bit ugly, but I don't really see another way of abstracting away from that, given that it's a pretty "niche" requirement from our side.
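To illustrate the control flow only, here is a sketch of how a generic spawn callback could decouple batching from encoding. It uses scoped threads and a channel from the standard library. The names `build_streams`, `encode_in_background` and `placeholder_encode` are hypothetical and do not reflect the API in this PR, and `placeholder_encode` is deliberately not real compression; a deflate call would go in its place.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for deflate (assumption: NOT real compression). It uppercases
/// ASCII bytes so the effect of the background job is visible.
fn placeholder_encode(data: Vec<u8>) -> Vec<u8> {
    data.to_ascii_uppercase()
}

/// Library side: batches objects and hands each finished payload to `spawn`.
/// The library does not know or care how the caller schedules the work.
fn build_streams<S>(objects: &[Vec<u8>], batch_size: usize, mut spawn: S)
where
    S: FnMut(usize, Vec<u8>),
{
    assert!(batch_size > 0, "batch size must be non-zero");
    for (index, batch) in objects.chunks(batch_size).enumerate() {
        spawn(index, batch.concat());
    }
}

/// Caller side: encodes every batch on its own thread while batching continues.
fn encode_in_background(objects: &[Vec<u8>], batch_size: usize) -> Vec<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    thread::scope(|scope| {
        build_streams(objects, batch_size, |index, payload| {
            let tx = tx.clone();
            scope.spawn(move || {
                tx.send((index, placeholder_encode(payload))).unwrap();
            });
        });
    });
    // All worker threads have finished here; close the channel and restore order.
    drop(tx);
    let mut results: Vec<(usize, Vec<u8>)> = rx.iter().collect();
    results.sort_by_key(|(index, _)| *index);
    results.into_iter().map(|(_, bytes)| bytes).collect()
}
```

A caller that does not want threads could pass a closure that encodes inline, which is one argument for keeping the callback generic. A real implementation would more likely use a thread pool than one thread per batch.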

Likewise, what is the purpose of ObjectStreamFilter

It's possible that this can partly be merged with our XRefFilter enum. Fundamentally, we need to provide some callback such that the user can control how the object stream is filtered (either not at all or with deflate or whatever). Usually, the user does this by themselves by deflating a stream manually and setting the filters themselves, but since we more or less abstract away the construction of the object stream, in my view we instead need to do this via callbacks, similar to how it's already done for xref streams right now in pdf-writer.
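A callback-based filter could look roughly like the sketch below. Everything here is an assumption for illustration: the thread does not show the shape of ObjectStreamFilter or XRefFilter, so `Filtered`, `no_filter`, `ascii_hex` and `finish_stream` are made-up names. ASCIIHexDecode is used instead of FlateDecode only because it can be written with the standard library alone.

```rust
/// Hypothetical result of a filter callback: the encoded bytes plus the
/// name to write into `/Filter`, if any.
struct Filtered {
    data: Vec<u8>,
    filter_name: Option<&'static str>,
}

/// Leaves the payload untouched.
fn no_filter(data: Vec<u8>) -> Filtered {
    Filtered { data, filter_name: None }
}

/// Encodes the payload as hex digits, terminated by the `>` end-of-data marker.
fn ascii_hex(data: Vec<u8>) -> Filtered {
    let mut out = String::with_capacity(data.len() * 2 + 1);
    for byte in &data {
        out.push_str(&format!("{:02X}", byte));
    }
    out.push('>');
    Filtered { data: out.into_bytes(), filter_name: Some("ASCIIHexDecode") }
}

/// Applies the user-supplied filter and derives the stream dictionary from its result.
fn finish_stream<F>(payload: Vec<u8>, count: usize, first: usize, filter: F) -> (String, Vec<u8>)
where
    F: FnOnce(Vec<u8>) -> Filtered,
{
    let filtered = filter(payload);
    let mut dict = format!(
        "<< /Type /ObjStm /N {} /First {} /Length {}",
        count,
        first,
        filtered.data.len()
    );
    if let Some(name) = filtered.filter_name {
        dict.push_str(&format!(" /Filter /{}", name));
    }
    dict.push_str(" >>");
    (dict, filtered.data)
}
```

The point of the callback is that /Length and /Filter depend on the encoded bytes, so the library has to write the dictionary after the user's filter has run.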

ObjectStreamJob::build_with_filter

Now that I looked at it more carefully, it seems like this isn't actually used in krilla right now. 😅 I think I'll just try to get this into a mergeable state first where I have full understanding and then we can discuss again.

What I like about the change is that, as far as I can see, the breakage for existing consumers is minimal (or none?).

This should be the case, yes. If you don't use object streams, little should change for existing workflows.

I also assume that the fairly low-level design would allow us to not put certain sub categories in the stream? Thinking about the Document Information Dictionary and PieceInfo dicts.

Yes, we basically have full control over what we put in the object streams.

9ae5935

LaurenzV wants to merge 1 commit into typst:main from LaurenzV:object_stream_exploration
