Explorations for object streams #79
|
I have to admit, what would help me most to give more substantial feedback here is to have the new key functions and structs explained, esp. What I like about the change is that, as far as I can see, the breakage for existing consumers is minimal (or none?). I also assume that the fairly low-level design would allow us to not put certain sub categories in the stream? Thinking about the Document Information Dictionary and PieceInfo dicts. |
|
Fair enough, in that case I should probably spend some more time turning this API into something that I'd actually consider usable from my side. As mentioned, it's possible that there is some unnecessary complexity.
The spawn parameter is used to let us launch asynchronous deflate jobs. As mentioned, the workflow is roughly that on the krilla side, we iterate over all chunks that contain non-stream objects. Those objects are then grouped into batches of ~100 objects, and once we have reached the threshold, we need some mechanism so that the object stream we just built can be flate-encoded in the background while we start processing the next batch of 100 objects on the main thread. This is to accommodate our requirement of parallelizing PDF creation as much as possible. I agree that having this generic API in pdf-writer is a bit ugly, but I don't really see another way of abstracting away from that, given that it's a pretty "niche" requirement from our side.
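To make the workflow above a bit more concrete, here is a minimal sketch using only the standard library. It is not the API from this PR: `RawObject`, `CompressedBatch`, `object_stream_payload` and `compress_in_batches` are hypothetical names, `std::thread::scope` stands in for the generic spawn parameter, and the `compress` closure is a placeholder for a real deflate implementation. The payload layout (pairs of object number and offset, followed by the object bodies) follows the general shape of object streams in the PDF specification.

```rust
use std::thread;

/// Hypothetical stand-in for one serialized non-stream object.
pub struct RawObject {
    pub number: u32,
    pub body: Vec<u8>,
}

/// Hypothetical result for one batch: object count, offset of the first body, data.
pub struct CompressedBatch {
    pub n: usize,
    pub first: usize,
    pub data: Vec<u8>,
}

/// Lays out one batch as an uncompressed object stream payload.
/// Returns the byte offset of the first object body and the payload itself.
pub fn object_stream_payload(batch: &[RawObject]) -> (usize, Vec<u8>) {
    let mut header = String::new();
    let mut bodies = Vec::new();
    for obj in batch {
        // Each header entry is "<object number> <offset relative to the first body>".
        header.push_str(&format!("{} {} ", obj.number, bodies.len()));
        bodies.extend_from_slice(&obj.body);
        bodies.push(b'\n');
    }
    let first = header.len();
    let mut payload = header.into_bytes();
    payload.extend_from_slice(&bodies);
    (first, payload)
}

/// Splits `objects` into batches and compresses each batch on a background thread.
/// `batch_size` must be non-zero, because `chunks` panics on a size of zero.
pub fn compress_in_batches<F>(
    objects: &[RawObject],
    batch_size: usize,
    compress: F,
) -> Vec<CompressedBatch>
where
    F: Fn(&[u8]) -> Vec<u8> + Sync,
{
    let compress = &compress;
    thread::scope(|scope| {
        let mut jobs = Vec::new();
        for batch in objects.chunks(batch_size) {
            // The payload is built on the calling thread ...
            let (first, payload) = object_stream_payload(batch);
            // ... while compression of the finished batch runs in the background.
            let job = scope.spawn(move || compress(&payload));
            jobs.push((batch.len(), first, job));
        }
        jobs.into_iter()
            .map(|(n, first, job)| CompressedBatch {
                n,
                first,
                data: job.join().expect("compression job panicked"),
            })
            .collect()
    })
}
```

Note that the object numbers end up inside the payload that gets compressed, so they cannot be changed afterwards without decoding the stream again. A real implementation would likely bound the number of in-flight jobs rather than spawning one thread per batch.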
It's possible that this can partly be merged with our
Now that I looked at it more carefully, it seems like this isn't actually used in krilla right now. 😅 I think I'll just try to get this into a mergeable state first where I have full understanding and then we can discuss again.
This should be the case, yes. If you don't use object streams, little should change for the existing workflows.
Yes, we basically have full control over what we put in the object streams. |
Please note that this PR was fully AI-generated and is not intended to be merged like this. For a mergeable version, we obviously want more tests + a cleaner API interface + actually cross-checking the impl against the PDF reference. Not asking for a code review here, I just want to have something to get the conversation going a bit and discuss the different trade-offs before committing to a direction.
Goal
Overall, our goal is to be able to leverage object streams in krilla. This is important because, especially for tagged PDFs, file sizes end up ballooning unnecessarily due to the large number of tag nodes usually contained in a PDF.
The hard part
Judging from a few vibe-coded experiments, getting a working version is not that hard; in my opinion, the hard part is finding a good trade-off between two different requirements:
Ensuring the implementation is performant (making use of multi-threading when creating and compressing the object streams), while at the same time making sure that we don't unnecessarily waste memory (by allocating and copying memory through repeated use of APIs like renumber).
Deciding the right split between what should live in pdf-writer and what can instead be moved to krilla.
In addition, the new API should ideally remain as compatible as possible with existing pdf-writer APIs. However, as I will show below, I'm afraid we will have to make some compromises here.
Current state
Let me start by outlining the current state in krilla. Overall, thanks to a few refactors I performed, we are already in a much better position to implement object streams than we were 1-2 months ago.
Right now, we still have this single global ChunkContainer struct, which basically collects everything we write during the serialization process, grouped by a "super category" and then a "sub category". In the end, these categories are basically copied one after another into the final PDF.
The super category split that I recently introduced is between non-stream chunks, mixed chunks and stream chunks. This means that in the final PDF, all non-stream objects will be grouped together, followed by mixed chunks (which only come from embedded PDFs) and finally all streams. This change already puts us in a very good position to implement object streams, because we conceptually already group together everything that can be put in an object stream (object streams can contain any objects except for other stream objects, which are prohibited).
Then, the sub category for each group basically consists of different "topics". For example, we put all objects that were generated from serializing color spaces into a different bucket than the ones from external graphics states. This is different from what existed two months ago, where each category actually had a Vec<Chunk>, and each object landed in a new chunk. However, this turned out to be a huge waste of memory, which is why each category is now represented by a single Chunk, which is the better thing to do anyway in my opinion.
Note that there is no reason why we would have to group color space objects separately from annotations; it is purely for aesthetic reasons, to make the final PDF more organized. In theory, we could just lump pretty much all non-stream objects into a single chunk. However, it's a nice property and I don't see it as a hindrance to implementing object streams. Therefore, I suggest we keep this grouping for now.
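As a rough illustration of the layout described above, the following sketch models the super category / sub category split with plain standard-library types. It does not mirror krilla's actual ChunkContainer: `Topic`, `ChunkSketch` and `ContainerSketch` are hypothetical names, the list of topics is made up for the example, and a chunk is reduced to a byte buffer.

```rust
use std::collections::BTreeMap;

/// Hypothetical sub categories ("topics"); the real set is assumed to be larger.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Topic {
    ColorSpaces,
    ExtGStates,
    Annotations,
}

/// Hypothetical stand-in for a chunk: just its serialized bytes.
#[derive(Default)]
pub struct ChunkSketch {
    pub bytes: Vec<u8>,
}

/// Super categories, each holding one chunk per topic (not one chunk per object).
#[derive(Default)]
pub struct ContainerSketch {
    pub non_stream: BTreeMap<Topic, ChunkSketch>,
    pub mixed: Vec<ChunkSketch>,
    pub stream: BTreeMap<Topic, ChunkSketch>,
}

impl ContainerSketch {
    /// Appends a serialized non-stream object to the single chunk of its topic.
    pub fn push_non_stream(&mut self, topic: Topic, object: &[u8]) {
        self.non_stream.entry(topic).or_default().bytes.extend_from_slice(object);
    }

    /// Appends a serialized stream object to the single chunk of its topic.
    pub fn push_stream(&mut self, topic: Topic, object: &[u8]) {
        self.stream.entry(topic).or_default().bytes.extend_from_slice(object);
    }

    /// Copies the categories one after another: non-stream, then mixed, then stream.
    pub fn finish(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for chunk in self.non_stream.values() {
            out.extend_from_slice(&chunk.bytes);
        }
        for chunk in &self.mixed {
            out.extend_from_slice(&chunk.bytes);
        }
        for chunk in self.stream.values() {
            out.extend_from_slice(&chunk.bytes);
        }
        out
    }
}
```

Because all non-stream objects already sit together at the front, they are the natural candidates for being packed into object streams, regardless of which topic they belong to.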
Once all chunks of each group have been created, we get to the final serialization step which actually concatenates them into a single PDF. This basically happens in two steps:
First, we do a single pass over all chunks to determine a new numbering of objects. There once again is no inherent reason why we have to do this, it just has the nice property that in the final PDFs, object numbers will be in ascending order.
Then, using the new number mapping we determined, we more or less just concatenate all chunks into a single PDF.
That's basically it.
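The two steps can be sketched roughly as follows. This is a simplified model rather than krilla's or pdf-writer's code: `Obj`, `build_mapping` and `concatenate` are hypothetical names, and an object is reduced to its number plus the numbers it references.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an indirect object: its number and the refs it contains.
pub struct Obj {
    pub id: u32,
    pub refs: Vec<u32>,
}

/// Step 1: a single pass over all chunks that assigns ascending object numbers.
pub fn build_mapping(chunks: &[Vec<Obj>]) -> HashMap<u32, u32> {
    let mut map = HashMap::new();
    let mut next = 1;
    for obj in chunks.iter().flatten() {
        map.insert(obj.id, next);
        next += 1;
    }
    map
}

/// Step 2: concatenate all chunks, rewriting every number through the mapping.
/// Panics if an object references a number that no chunk defines.
pub fn concatenate(chunks: &[Vec<Obj>], map: &HashMap<u32, u32>) -> String {
    let mut out = String::new();
    for obj in chunks.iter().flatten() {
        let refs: Vec<String> = obj
            .refs
            .iter()
            .map(|r| format!("{} 0 R", map[r]))
            .collect();
        out.push_str(&format!(
            "{} 0 obj\n[{}]\nendobj\n",
            map[&obj.id],
            refs.join(" ")
        ));
    }
    out
}
```

The first pass has to touch every object, and the second pass has to rewrite every reference, which is the extra work that the renumbering step adds on top of plain concatenation.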
Removing the renumbering step
Before talking more about object streams, my first proposal is to remove the renumbering step from the krilla pipeline (i.e. step 1), except for embedded PDFs via hayro-write, because those always start with Ref 1. There are two reasons for this. 1) It's an unnecessary waste of time because we have to iterate over each object ref to create the mapping. I doubt it's very costly, but still, if it can be avoided, why not! But most importantly, 2) I'm really doubtful about being able to support the renumber_into API with chunks that contain an object stream. The problem is that object streams themselves are just stream objects, with the special property that they contain PDF-relevant data. In particular, the streams themselves couple the object numbers with the objects themselves. This means that if you have a flate-encoded object stream, there is no way of just renumbering its entries into a new chunk without decoding, renumbering and then encoding again. Therefore, I believe we will have to restrict the renumber_into API to not allow any object streams in the chunk, and panic if it contains one. However, for the chunk.extend function this is fine, because we can just merge the xref entries as usual.
Therefore, as a first step I would suggest moving krilla away from making use of that API (except for hayro-write PDFs, which are stored separately as mixed chunks) so that we don't face this problem further down the road. Yes, my OCD also doesn't like the fact that numbers aren't consecutive anymore, but I think it's better to be practical here. 😄 Here is the change, and it's not too bad: LaurenzV/krilla@68f7d38
Object stream API
With the above resolved, let's now get to the real meat: how to implement object streams. Fundamentally, we need some API that 1) iterates over the objects in a chunk and segments them into groups of X (e.g. 100) objects (which is necessary because you are not supposed to put too many objects into a single object stream) and 2) does the actual creation of the object stream. 2) obviously needs to live in pdf-writer; the question is where 1) should live. If 1) lives in krilla, then we would basically have to make the indirect_objects iterator public, which is kind of ugly because exposing the raw body like that doesn't feel right.
Therefore, I think it's better if the chunking API lives in pdf-writer instead. However, this does have the disadvantage that it makes the API pretty ugly. The reason is that in krilla, we have the specific requirement that we want to be able to compress multiple stream objects at the same time using multi-threading, for performance reasons. Because of that, the chunking API in pdf-writer needs to be very generic to allow implementing this kind of behavior. So this is kind of what I ended up with now.
pdf-writer exposes this kind of API to split a Chunk into a range of compressed stream objects. krilla then basically consumes this API to create object streams if requested. Similarly to before, this was completely AI-generated, so please don't interpret this as me wanting it to be the final version; I think the code can be made less verbose. But before I put more time into this, I would like to get the high-level direction settled first.
Curious to hear your thoughts!