Add documentation for stage extension API [CTT-830] #2087
k-jamroz wants to merge 8 commits into hazelcast:main from
Conversation
Add a general-purpose mechanism to extend the Pipeline API in the area of
transformations. Sources and sinks can be openly extended, but it is
hard to add custom transformation definitions. The extension mechanism
(`using`) gives the ability to plug in additional APIs in a convenient,
type-safe and fluent way, with IDE support and without overloading the
Stage API with many methods. Stage extensions can reside in any module
and can also be EE-only.
Users would also be able to add their own extensions if they like.
A general (stream & batch, regular & keyed) extension has quite a lot of
boilerplate caused by Java generics and type safety, but it can be
significantly reduced if only some of the cases need to be supported.
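The mechanism can be illustrated with a deliberately simplified, hypothetical sketch (the `Stage`, `step`, and `UppercaseExtension` names below are illustrative stand-ins, not Hazelcast's actual classes): the stage exposes a single generic `using` hook that hands the stage to an extension factory, which returns a fluent extended view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical minimal stage: records step names instead of building a real DAG.
class Stage<T> {
    final List<String> steps = new ArrayList<>();

    // The single generic hook on the stage API: hands this stage to the
    // extension factory, which returns an extended fluent view of it.
    <E> E using(Function<Stage<T>, E> extensionFactory) {
        return extensionFactory.apply(this);
    }

    Stage<T> step(String name) {
        steps.add(name);
        return this;
    }
}

// An extension can live in any module and adds methods without touching Stage.
class UppercaseExtension {
    private final Stage<String> stage;

    private UppercaseExtension(Stage<String> stage) {
        this.stage = stage;
    }

    static Function<Stage<String>, UppercaseExtension> uppercase() {
        return UppercaseExtension::new;
    }

    // Extra method available only through using(...); returns the plain
    // stage so the regular fluent API continues afterwards.
    Stage<String> mapToUpper() {
        return stage.step("mapToUpper");
    }
}

public class UsingPatternSketch {
    public static void main(String[] args) {
        Stage<String> result = new Stage<String>()
                .using(UppercaseExtension.uppercase())
                .mapToUpper()          // extension method
                .step("writeTo");      // back to the regular stage API
        System.out.println(result.steps); // [mapToUpper, writeTo]
    }
}
```

The key point of the design is that `using` is the only method the core stage API needs; all extra methods live on the extension type, so IDE completion shows them only after `using(...)` and the core API stays small.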
Comparison with other already available mechanisms:
- `customTransform` is very low-level, as it works with a `Processor`
implementation.
- `apply` is more geared toward reuse of common pipeline fragments (see
its javadoc). The main drawback is that it requires distinguishing between
stream and batch stages, making the API more complex to use in generic
cases (see `VectorTransforms`: `mapUsingVectorSearch` vs
`mapUsingVectorSearchBatch`, and the same for `PythonTransforms`). Due to
the use of `FunctionEx`, it is not possible to provide a single object
implementing the variants for stream and batch stages. Generic type
inference can also be more challenging. `apply()` is also currently not
available for keyed stages (judging by other deprecations, keyed stages
should only be used for aggregations, not for regular operations).
Compared to `apply`, the `using` approach has two main differences:
1. It uses dedicated interfaces for each stage type, allowing them to be
implemented together by a single class, which simplifies usage for the
user (no need to distinguish stream/batch variants).
2. It can return a custom stage type instead of keeping the same
`StreamStage`/`BatchStage`, only with a different stream item type.
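Difference 1 can be sketched in plain Java (the interface and class names here are hypothetical, not the real Hazelcast interfaces): because the extension contract is one interface per stage kind rather than a `FunctionEx`, a single class can implement both, and one extension object serves stream and batch pipelines alike.

```java
// Hypothetical per-stage-type extension contracts; the real API would be
// generic over stage and item types.
interface StreamStageExt {
    String extendStream(String streamStage);
}

interface BatchStageExt {
    String extendBatch(String batchStage);
}

// One class implements both contracts. With apply(FunctionEx) this would
// require two separate function instances, one per stage kind.
class TagExtension implements StreamStageExt, BatchStageExt {
    public String extendStream(String s) { return s + "+streamExt"; }
    public String extendBatch(String s)  { return s + "+batchExt"; }
}

public class DualInterfaceSketch {
    public static void main(String[] args) {
        TagExtension ext = new TagExtension();
        // The same extension object works for both stage kinds.
        System.out.println(ext.extendStream("src")); // src+streamExt
        System.out.println(ext.extendBatch("src"));  // src+batchExt
    }
}
```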
`mapUsingPutIfAbsent` is implemented as the first extension. Other IMap
methods that can be useful as pipeline transformation steps can be added
in the same way. Usage is very easy, for example:
```java
srcStage.using(IMapExtension.iMapExtension())
// additional methods available here
.mapUsingPutIfAbsent("my-map", e -> e % 10, e -> e + 5, Tuple3::tuple3)
// back to regular stage API, need to invoke `using` again to get extra methods
.writeTo(Sinks.logger());
```
This PR revisits some ideas that were discussed when `apply` was
introduced (hazelcast/hazelcast-jet#1352,
hazelcast/hazelcast-jet#1310).
Documentation PR: hazelcast/hz-docs#2087
GitOrigin-RevId: 7eaad2241c90627a41d8b35d7eb94cf4f2a4bbcf
Co-authored-by: Tomasz Gawęda <[email protected]>
Rob-Hazelcast left a comment:
Thanks for adding this, generally looks good. I've left a few comments in-line.
This will need a final copy edit before publishing. Shout when you're done making changes and I'll do that.
> It is possible to implement custom reusable Pipeline API stage extensions.
> This tutorial is based on `PythonExtension` available in Hazelcast and walks step-by-step
> through how it is created.
> You can check the whole production-ready implementation in the Hazelcast repository.
Can you include the path or a link here?
Yes, after https://github.com/hazelcast/hazelcast-mono/pull/5934 is merged.
> [#_map_transforms]
> == Map Transforms
Why are map transforms listed on this page?
Map transforms are neither sinks nor sources. A map transform is a specific stream transform, so it should be on the Transforms page, shouldn't it?
Same convention as for https://docs.hazelcast.com/hazelcast/5.6/integrate/vector-collection-connector#searching-in-vector-collection

> Map transforms are neither sinks nor sources

A transform that interacts with an external (from the Jet perspective) resource is also part of the connector. Interacting with such resources only from sources and sinks would be too limited in some cases. E.g. you cannot perform two actions in order if you have two sinks.
From the modularity perspective, the "Transforms" page documents predominantly "internal" transforms (map, filter, window, etc.) and the most basic/important connector ones. IMO it cannot document all integrations; that would be too much in one place.
> From the modularity perspective, "Transforms" page documents predominantly "internal" transforms

How does this square with the fact that the page already contains transformations that use an external (from the Jet perspective) source, for example `mapUsingService` or `mapUsingIMap`?
From the customer's perspective, it's all Hazelcast. I don't think customers view some parts as Jet and others as the map, so the map does not feel external to Jet.
My feeling is that it's easier for customers to think of it this way: some steps are just intermediate transformations, and others mark the start or end of the job, connecting it to other Hazelcast structures (that is why it is a "connector").
I strongly disagree with the current splitting approach. However, I don’t want to block the merge of this PR based solely on my personal preference. Maybe we should ask others for their opinion? @TomaszGaweda @dmitriyrazboev
I agree with you on this.
Maybe if it was called "IMap handling" or "IMap support" it would be more in place. However, I am not 100% sure it's a big issue; maybe just rename the file?
It's not about naming; it's about splitting connector methods from transform methods.
My feeling is that a "connector" should represent a source or sink of the pipeline, something that connects the pipeline to other data structures. A transform is an intermediate step, even if it interacts with structures that are "external" from Jet's point of view, like an IMap.
Arguments:
1. All current "connectors" are either sources or sinks.
2. Transforms already include methods that use "external" structures, e.g. `mapUsingIMap`, `mapUsingReplicatedMap`, `mapUsingPython`, etc.
3. From the user's perspective, an IMap is not really an external structure, so calling it a "connector" is confusing.
> Pipeline pipeline = Pipeline.create();
> pipeline.readFrom(TestSources.itemStream(10, (ts, seq) -> String.valueOf(seq)))
>     .withoutTimestamps()
>     .apply(mapUsingPython(new PythonServiceConfig()
Did we ban the previous Python approach? It might be nice to allow customers to choose which API they want to use. It's probably worth adding a new example without deleting the previous one.
The old one is not banned and I did not deprecate it; it is still fine, just less convenient. The new one should be preferred because it is simpler (no need to distinguish stream/batch variants).
> StreamStage<String> sourceStage = sourceStage();
> StreamStage<String> pythonMapped = sourceStage.apply(PythonTransforms.mapUsingPython(
>     new PythonServiceConfig().setHandlerFile("path/to/handler.py")));
> StreamStage<String> pythonMapped = sourceStage.using(PythonExtension.python()).map(
This paragraph is exactly about mapUsingPython. So why do you change the example? Additionally, we still allow customers to use mapUsingPython.
It is true that with the extension the `mapUsingPython` method is not used. I decided to keep the header, which is still logically sensible, even though the underlying syntax is slightly different. IMO the current header communicates the meaning in a very concise way.
BTW, before this change it also was not simply `mapUsingPython`; strictly speaking it was `apply(mapUsingPython)` and not a transform directly in the Pipeline API. I followed this shortcut.
IMO it is better (simpler for the users) to describe one approach instead of two, rather than making users understand and choose between them. An even more radical idea is to have just one approach for achieving a given goal, but I decided not to deprecate the old `PythonTransforms`.
I slightly disagree. There are probably customers who use the old variant and may want to see some information about it.
> StreamStage<String> sourceStage = sourceStage();
> StreamStage<String> pythonMapped = sourceStage.apply(PythonTransforms.mapUsingPython(
> StreamStage<String> pythonMapped = sourceStage.using(PythonExtension.python()).map(
The same comment as the previous one. My point is: do not delete the existing usage options, just add info about the additional one.
Description of the Pipeline API StageExtension API implemented in https://github.com/hazelcast/hazelcast-mono/pull/5687 (API, `mapUsingPutIfAbsent`) and https://github.com/hazelcast/hazelcast-mono/pull/5934/ (Python). Also added documentation for `apply`, which was missing.

TODO: