Add metrics about the solver-service behaviour #70
moyodiallo wants to merge 5 commits into ocurrent:main
Conversation
@moyodiallo Could you update this to the Eio version of the solver?
tmcgilchrist left a comment
Everything else looks good. @moyodiallo co-ordinate with @mtelvers to deploy it to a solver worker and check the statistics are exported correctly. Do we still have a test solver pool available?
service/pool.ml
```ocaml
let request, set_reply = Eio.Stream.take t.requests in
Atomic.incr t.running;
handle request |> Promise.resolve set_reply;
Atomic.decr t.running;
```
This pattern (incr, handle, decr) feels unsafe in the presence of exceptions from `handle request`. cc @talex5
`handle request` needs to catch those exceptions, otherwise we lose a worker and that would be a bug.
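As a minimal sketch (not the PR's actual code), one way to keep the counter consistent even when `handle` raises is `Fun.protect`; the `run_one` wrapper name is hypothetical:

```ocaml
(* Hypothetical wrapper: the ~finally clause decrements t.running
   even if [handle] raises, so no worker slot is leaked. *)
let run_one t handle =
  let request, set_reply = Eio.Stream.take t.requests in
  Atomic.incr t.running;
  Fun.protect
    ~finally:(fun () -> Atomic.decr t.running)
    (fun () -> handle request |> Promise.resolve set_reply)
```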
Yes, if a worker crashes then the whole service exits, so it's not strictly necessary to handle exceptions here, although it would make the code more robust to future changes. However, I don't think you need this counter. The number of running workers is always `min n_workers n_jobs`, so you can just calculate it as needed.
Yes, you're right, I hadn't understood the `min n_workers n_jobs` point. Thanks.
Oh, we don't have `n_jobs` in the pool. All we have is the waiting jobs (a `jobs Eio.Stream.t`). It wouldn't be precise to treat the waiting jobs as `n_jobs`, e.g. 2 jobs waiting while all 8 workers are busy.
What would be useful to see is:
- n_workers - static capacity of this solver_worker
- queued_requests - requests waiting for a Fiber to run on
- running_workers - Fibers actively running a solve job
That would let us answer questions about the total solver_worker capacity available, the utilisation of that capacity, and its saturation, both overall and per solver_worker.
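A minimal sketch of how those three gauges might be exported with the OCaml prometheus library; the metric names and the `report` function are illustrative assumptions, not the PR's code:

```ocaml
(* Illustrative only: metric names and the reporting call are
   assumptions about how the pool's counts would be surfaced. *)
open Prometheus

let namespace = "solver_service"

let n_workers =
  Gauge.v ~help:"Static capacity of this solver worker"
    ~namespace "workers_total"

let queued_requests =
  Gauge.v ~help:"Requests waiting for a fiber to run on"
    ~namespace "requests_queued"

let running_workers =
  Gauge.v ~help:"Fibers actively running a solve job"
    ~namespace "workers_running"

(* To be called periodically from the main domain. *)
let report ~capacity ~queued ~running =
  Gauge.set n_workers (float_of_int capacity);
  Gauge.set queued_requests (float_of_int queued);
  Gauge.set running_workers (float_of_int running)
```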
> The number of running workers is always `min n_workers n_jobs`, so you can just calculate it as needed.
I thought, from reading the code and https://github.com/ocaml-multicore/eio?search=1#multicore-support, that the Pool pre-forked a number of OS threads (Domains?) which would wake up when work was added to the Eio.Stream. Is that accurate @talex5?
That's right - there is a fixed pool of workers and they will all be busy unless there just aren't any jobs in the queue.
The issue here is that run_worker is running in a worker domain, and so can't (currently) update Prometheus metrics itself (which I guess is why it's updating an atomic instead and waiting for the main domain to report that). But the main domain can work out how many workers are running just by knowing how many jobs are outstanding, so this isn't necessary.
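A one-line sketch of the calculation described above, assuming `outstanding` is a main-domain count of jobs submitted but not yet completed (a name invented here for illustration):

```ocaml
(* Running workers derived on the main domain; no cross-domain
   atomics needed. *)
let running_workers ~n_workers ~outstanding = min n_workers outstanding
```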
That's right, there's a way to do that.
`solver` instead of `worker` makes sense.
Co-authored-by: Mark Elvers <mtelvers@users.noreply.github.com>
The pool stream wasn't accumulating the requests, so it was incorrect to treat its length as the number of waiting requests.
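A hedged sketch of one alternative: track queued requests with an explicit counter around the stream operations. The `submit` and `take` functions and the `t.queued` field are names invented here, assuming a rendezvous-style (capacity-0) stream where `Eio.Stream.length` stays at 0:

```ocaml
(* Hedged sketch: with a rendezvous-style stream, Eio.Stream.length
   doesn't reflect waiting requests, so count them explicitly.
   [t.queued] is an assumed [int Atomic.t] field; the count is
   approximate, which is fine for a metric. *)
let submit t request =
  Atomic.incr t.queued;
  Eio.Stream.add t.requests request

let take t =
  let request = Eio.Stream.take t.requests in
  Atomic.decr t.queued;
  request
```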