Energy profiling tools: Add Daemon class for periodic task execution#300
Conversation
b757df7 to
84a0ef5
Compare
4039559 to
10c87de
Compare
2c20e1f to
08e5d24
Compare
| void run(); | ||
| void stop(); | ||
| bool is_running() const { return running_; } | ||
| std::thread& get_thread() { return thread_; } |
There was a problem hiding this comment.
do we actually want this to be modifiable?
JBludau
left a comment
There was a problem hiding this comment.
Description and title could be updated, but the rest is ok for now
|
If we go to c++20 we can use jthread for thread safety, but that is for kokkos 5 |
180df12 to
9458838
Compare
|
|
||
| class Daemon { | ||
| public: | ||
| Daemon(std::function<void()> func, int interval_ms) |
There was a problem hiding this comment.
Should we pass in a std::chrono::duration directly, instead of an int? What about the type int; e.g. what is the default arithmetic type that std::chrono::milliseconds encodes?
| std::thread& get_thread() { return thread_; } | ||
|
|
||
| private: | ||
| std::chrono::milliseconds interval_; |
There was a problem hiding this comment.
Should this be chosen as the smallest possible tick period? Thinking that "milliseconds" may be quite long as a sampling interval.
There was a problem hiding this comment.
hmm ... we could use the smallest to make it as general as possible. Nevertheless, the question that arises is if there is anything useful below the ms threshold. Do you have an example in mind?
There was a problem hiding this comment.
|
|
||
| private: | ||
| std::chrono::milliseconds interval_; | ||
| bool running_{false}; |
There was a problem hiding this comment.
Is there an issue of thread safety?
There was a problem hiding this comment.
yes. As stated in a previous comment, we could just switch to jthread and https://en.cppreference.com/w/cpp/thread/jthread/request_stop.html once we have c++20
| #include <functional> | ||
| #include <thread> | ||
| #include <chrono> |
There was a problem hiding this comment.
Best to order the includes alphabetically.
| // | ||
| //@HEADER | ||
|
|
||
| #pragma once |
There was a problem hiding this comment.
We should probably agree at the level of kokkos-tools on using consistently either preprocessor conditionals or this pragma to protect our includes.
There was a problem hiding this comment.
I agree that consistency would be a good thing to aim for. There are a lot more things in Tools that could be made consistent though.
| } else { | ||
| throw std::runtime_error("Daemon already started"); |
There was a problem hiding this comment.
We may want to think about how to be consistent in handling errors. In your tool, it appears you're sometimes using exceptions, and sometimes using a boolean to handle errors. I'm not sure if there's a policy at the kokkos-tools level to avoid exceptions. Probably best to check e.g. with @JBludau.
There was a problem hiding this comment.
I would even say we don't want to throw when the daemon is already running. This is not an exception but should just do nothing if the daemon is started.
There was a problem hiding this comment.
Yeah, exceptions aren't really used for Kokkos Tools libraries - and I think not throwing here is the right choice.
maartenarnst
left a comment
There was a problem hiding this comment.
This PR would also benefit from tests and an example.
|
Hey @JBludau. I've just gone over the changes. You'll find a few suggestions. Sill thinking that it would be good to add a small test. Perhaps there could be a test with a callback function that just waits for a given duration. The test could use durations that would be typical for the use case of the energy profiler. |
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Co-authored-by: Maarten Arnst <maarten.arnst@ulg.ac.be>
Yeah, I asked @ethan-puyaubreau if he could add his test. |
aprokop
left a comment
There was a problem hiding this comment.
This looks good to me. I only have a couple question.
We need this capability as soon as we can for one of my projects.
|
|
||
| namespace KokkosTools::EnergyProfiler { | ||
| void Daemon::start() { | ||
| if (!running_) { |
There was a problem hiding this comment.
What are the semantics of start() if it's already running? Should it break? Should it be restarted?
There was a problem hiding this comment.
I think it should just do nothing if it is already running. Multiple start calls should be fine since in our current use case the global system time is recorded. If someone wants to use it differently I would opt we add a restart() functionality.
There was a problem hiding this comment.
but as described below ... this is not threadsafe anyway
| } | ||
|
|
||
| void Daemon::stop() { | ||
| if (running_) { |
There was a problem hiding this comment.
What are the semantics here? Should we require that it's been running?
There was a problem hiding this comment.
I mean, it is not threadsafe, so we are limited with the current impl. But if we care we can either go to jthread or at least use atomics for the status
There was a problem hiding this comment.
FWIW, I think C++20 std::jthread is a great idea here. I have a slight inclination to use C++ atomics for now since it seems more straightforward - I am wondering how portable/available jthreads would be on various computing systems currently?
There was a problem hiding this comment.
Currently the tools don't have a minimum requirement for 20, right? Since most of the other tools are not threadsafe, I would say that is a larger discussion.
But if we want it to be std::jthread now, we can update. I am happy for any contribution
Co-authored-by: Andrey Prokopenko <andrey.prok@gmail.com>
|
Hi all, I was wondering about the status on this. I think this is a good PR that would be valuable for a lot of profiling data beyond energy information, which is relevant to #291 - albeit I would say the energy profiling use case is the one that is probably of most value currently to production Kokkos applications. Do you know the rough timeline of this PR by chance? I know that @ethan-puyaubreau is done with his internship, so there may be timing constraints from his end. I think this PR is on the right track overall. |
|
The main open thing is a desire to have an example for the daemon. So far I was not able to come up with a good one. |
Yes, I definitely agree just considering the energy profiling use case already shows an example usage. It would certainly help to have a small example since it can make clear assumptions being made on its use, but I think it doesn't need to be a blocker. I do think some of the error handling and/or tests for corner cases (e.g., interval being longer than time for kernel) that are discussed by @maartenarnst and you in this PR are important to think about. Except for making sure no critical error checks are missing, I think the code logic of this PR makes sense to me. Doing some sort of beta release of this Daemon class, especially so, e.g., @aprokop along with #291 can use this would be good to get early feedback on real-world production use cases, as that could directly provide a lot of insight for improvements. Maybe it could be merged in as a beta feature early next year? Just my thoughts on this now - I will look and see if I can make a few more improvements to the code and error checks in the PR myself till then. |
vlkale
left a comment
There was a problem hiding this comment.
LGTM overall, just a suggestion to consider a default interval
| namespace KokkosTools::EnergyProfiler { | ||
| class Daemon { | ||
| public: | ||
| Daemon(std::function<void()> func, const std::chrono::duration& interval) |
There was a problem hiding this comment.
One quick note is I don't see a default value for the time interval of the Daemon but maybe you could have one. If you do, I think a reasonable one would be 1 second. This would at least makes sense from the point of view of #291. Maybe there are other use cases and situations where you want it to be shorter (or longer).
There was a problem hiding this comment.
Hmm, I think the user explicitly specifying a duration is a good practice here. It is a core setting of the daemon that should be thought about explicitly.
There was a problem hiding this comment.
Ah, got it - yes, I agree. We want this Daemon class to be generic beyond the energy profiling use case and/or LDMS use case, and it should indeed be thought out by users explicitly based on the scenario.
|
@maartenarnst do you want us to investigate the use case of a longer return time than the frequency the daemon is given? If so, do you have a specific case in mind that would need this? |
|
@vlkale I think we can merge this. Maarten did not reply and we still can look at making the daemon able to deal with runtimes longer than its frequency later if we ever have a usecase for this |
@JBludau I am good with merging it. Yes, allowing for runtimes longer than its frequency would be good but it's not needed right now for immediate/priority use cases. |
|
@vlkale could you merge this? |
This PR introduces a generic Daemon class for background task execution in the energy profiler infrastructure:
The daemon system will enable energy providers to perform periodic power sampling at specified intervals, supporting the energy measurement capabilities planned in #301.