Proper and safe evaluation of realtime capability#4132
Conversation
|
I do think it to be very problematic that getting the RT status is so involved. There should be a simple test that does not involve instantiating larger parts of infrastructure. |
|
:-D The CI has no realtime: It is somewhat a chicken and egg problem. Not involving the RT infrastructure needs separate code and this can always not be in sync with RT. Involving RT runs a lot of stuff. Let's think about this. Exactly the issue here: #4129 What do you think about a parameter / pin? So you can use getp / gets to ask for realtime? Or I could add a new command path to halcmd / rtapi_app that does not start the hal if it is not yet running. |
|
Running non-RT is not an error per se (see CI). Therefore, you cannot and must not "simply" or "blindly" force one or the other. There are use-cases where you want to know the RT status and that does not always mean that you will or will not be running either. Finding out what the RT status is or will be must be lightweight and may be different from where you ask. Doing it in a component or from the cmd-line may be different, depending how you ask and with what intention you ask. |
f1d22a7 to
97bb377
Compare
|
Both CI failures are small:
Two runtime things I noticed while reading:
Minor: |
b73daaf to
5dfd303
Compare
|
Thanks. Fixed. One quirk: RTAI in userspace is called LXRT. I should rename this consistently. |
7f6e60d to
fbf2b81
Compare
|
Hmm, time for a break, to many force pushes, sorry. I will continue tomorrow. So you are OK with the general concept? Then I will polish it up, update the doc and do some more testing in different combinations. Open:
Deferred:
|
|
The CI fixes and the LXRT / fallback handling look good now. On naming, with @BsAtHome's #4099 in mind: For the two open how-tos:
|
Well, I don't own it. However, I agree with the argument to leave the get/set moniker to the hal pin/param data access. |
|
The last few commits are still figuring out ways to solve all issues, not ready for code style review yet. @grandixximo Thanks a lot for the hints. Helps a lot not to have to search everything.
|
6477a97 to
af73e08
Compare
|
And of course, after debian package install, the realtime script is not any more in path. And there was also no define for the path where it is. af73e08 adds one. For scripts, |
|
Due to error handling is annoying not knowing when rtapi_app is running or not: It now works exactly the same as RTAI. And it generated a new ToDo: Cleanup the realtime script. It is a mess. There are a lot of RTAI only parts executed in uspace mode. Like loading an empty list of modules and so on. So far, I just added the things I needed. |
|
Two notes after the latest push. Naming: you're right, I'll withdraw my concern. CI / auto-start: the two failures ( Separately: is dropping the auto-start actually needed for the RT-status goal, or is it a cleanup that could be its own PR? Keeping them decoupled would let the getrt/ |
|
If these test fails due to a missing realtime start, they most probably fail also with RTAI, i have to test it. It is not needed in this PR. However, as soon you like to use I will move it in a separate PR as initialy planned, this one gets already big. |
|
Please squash 😁 3 commits is clean |
This helps to check if hal functions can be used or if a component needs to be created first.
|
So, squashed and cleanup. New:
ToDo:
I will look into this. Is there a script to do this? |
|
No dedicated rebase script, it's manual. Run |
…istic Query realtime status with 'realtime verify' (from LinuxCNC#4132) rather than probing the setuid bit. latency-histogram asks the realtime layer directly; latency-test relies on the existing "POSIX non-realtime" note.
…istic Query realtime status with 'realtime verify' (from LinuxCNC#4132) rather than probing the setuid bit. latency-histogram asks the realtime layer directly; latency-test relies on the existing "POSIX non-realtime" note.
|
So, I went trough the full testing matrix. Most works as expected. There is just something to consider: This is strictly speaking still realtime, just a bad one. On the terminal, there is a warning. I could change to: and then check type > REALTIME_TYPE_PREEMPT_DYNAMIC. Or alternatively, use |
|
After some consideration, I think auto selecting unknown or PREEMPT_DYNAMIC is probably a bad idea and can generate a lot of confusion for not to experienced users. Rather select the worst possible option, so it is clear something is wrong, instead of selecting a halve working variant. So with the last commit, LINUXCNC_FORCE_REALTIME=1 is needed to select these options and there is a new type REALTIME_TYPE_UNKNOWN (SCHED_FIFO available but not PREEMT_DYNAMIC) Feedback? |
|
The "pick the worst option so it's obvious something is wrong" instinct is right for backend selection,we shouldn't auto-load a half-working RTAI/Xenomai backend. But this lands on the same problem @BsAtHome flagged earlier, that we "must not blindly force one or the other" and the answer may differ by where you ask. The commit gates one knob too many: it conflates what type do I report with do I grant SCHED_FIFO. Unforced Suggestion: keep using Smaller, not blockers:
|
The stepconf / pncconf should be fixed. Not having a way for "proceed anyway" is not nice. This is also nothing new, behaves like this also in 2.9 with vanilla kernel + setuid active.
Just checked: Using SCHED_FIFO without PREEMT_RT was introduced by #3964. Before, SCHED_OTHER was used when the kernel is not PREEMT_RT or otherwise realtime capable. SCHED_FIFO was used with non-PREEMT_RT kernels only with LINUXCNC_FORCE_REALTIME=1. Also, since #3964 is_rt is broken and always returns false. So it tracks back this behavior how it was before essentially. One other option: If SCHED_FIFO is available but no realtime kernel, I can still use SCHED_FIFO. In this case, I will set REALTIME_TYPE_NONE / is_rt=false.
It was essentially void before. The only case when it returned 1 was when rtapi_is_realtime() failed, which was checked later with
Python hands back: There are many functions doing this, all undocumented. Same for the hal returning -EINVAL. I can document this everywhere where missing.
Ah yes, i forgot to mention this. I believe unloading hal_lib was missed at rtapi_app exit. This is needed to set back to REALTIME_TYPE_UNINITIALIZED when rtapi_app is not running. I hope there are no other side effects, however I did not see one until now.
Thanks, I will fix this. |
…_app Python is_rt / is_sim now use "realtime verify" which uses "rtapi_app check_rt" in uspace / returns true in rtai. If realtime not running: rtapi_app performs the checks an returns the state If realtime running: rtapi_app calls master to perform the check and returns the state
The same checks are performed always the same now. If something is not properly checked, makeApp() fill fail instead of just chosing a different RT implementation by itsself. New function: rtapi_realtime_type_t rtapi_get_realtime_type(void)
91a894b to
ce89d89
Compare
|
About the behaivor with a vanilla kernel, options:
always:
Normal: LINUXCNC_FORCE_REALTIME=1
Normal: LINUXCNC_FORCE_REALTIME=1 I am for 2. Consistent and no change in behaivour compared to before nonroot. Shows issues when the wrong kernel is running.
|
|
Agreed, go with 2. I came in pushing 3, but you've convinced me 2 is the right call for a machine controller. The only one who actually benefits from best-effort SCHED_FIFO on a non-RT kernel is someone running real hardware on the wrong kernel, and that's exactly the case we want to make visibly degraded rather than silently "good enough", since masked latency spikes turn into following errors and lost steps on a real machine. Dev and sim don't need SCHED_FIFO anyway, so 2 costs them nothing, and So: 2 it is. Restores pre-rootless behavior, consistent, and surfaces a wrong-kernel boot instead of papering over it. |
|
One correction to my wording, then the concrete reason for 2. My "3 hides the problem without admitting it" was wrong: per your tables 2 and 3 report identically ( But the "Unexpected realtime delay" warning is gated on the scheduler policy, not on // uspace_posix.cc, RtapiTask::wait(), in the overrun branch:
if (policy == SCHED_FIFO)
unexpected_realtime_delay(task);
That makes the option-3 normal row ( So option-3 normal as drawn would behave like option 1 with That's why 2 is clean: scheduler, overrun warning, and |
rtapi_is_realtime() was always unreliable. hal_get_realtime_type() returns now the true running realtime type through the HAL for user and realtime components. This function is also exposed through python hal. rtapi_app now unloads hal_lib at exit, so everything is cleaned up properly and realtime_type is set back to REALTIME_TYPE_UNINITIALIZED.
REALTIME_TYPE_UNKNOWN / REALTIME_TYPE_PREEMPT_DYNAMIC need LINUXCNC_FORCE_REALTIME=1 to be set. These two options are most likely not desired and should not be selected automatically.
|
Option 3 would need some additional code changes to separate SCHED_FIFO from the warning and the reported RT type. Surely possible but if you agree option 2 is the best, this is already implemented. @BsAtHome Are you also fine with option 2? Done:
Still open:
|
|
|
||
| #define RTAPI_HAL_PRIV /* Use rtapi/hal private functions */ | ||
| #include <rtapi.h> /* RTAPI realtime OS API */ |
There was a problem hiding this comment.
Either you publish enums and functions or you do not.
Everything in rtapi.h hal.h is available and should not be hidden by any define construct. If you need it such construct, then something is wrong.
What is private remains private. What is public is always public.
| #include "config.h" | ||
|
|
There was a problem hiding this comment.
Why do you need this config include?
| //Call realtime verify to gather realtime status | ||
| //Most probably we don't have realtime running yet | ||
| int ret = system(EMC2_REALTIME " verify > /dev/null"); | ||
| int exit_stat = WEXITSTATUS(ret); | ||
| if(exit_stat != 0 && exit_stat != 1){ | ||
| PyErr_Format(PyExc_RuntimeError, "realtime verify failed, system() return value %i / exit %i", ret, exit_stat); |
There was a problem hiding this comment.
Calling system() in the hal module is extremely ugly and a sign that something is amiss.
Intended to be used with:
#4107
Will fix is_sim / is_rt which is broken: #4129
Two new functionality's for two use-cases:
realtime statuscan be used to check if realtime is running.The realtime script is extended with the
verifycommand returning 0 if RT capable, 1 if not.It is intended to use when not running and running.
rtapi_app getrtand returns the stateThere is the new function
hal_get_realtime_type()returning the type of the actually running realtime system trough the hal.is_sim / is_rt use
realtime verifyat init.rtapi_is_realtime()is deprecated: It works only in real time context since #3964 and was never 100% reliable, also according to the doc.