executor: Don't run background tasks if future is ready #2298
Gelbpunkt wants to merge 1 commit into hermit-os:main from
Conversation
Oftentimes, the future is ready immediately, yet we still run the background tasks every time, adding significant overhead that is noticeable when benchmarking an HTTP server. The blocking httpd example gains between 20% and 50% more throughput for me with these changes, while the netbench TCP bandwidth benchmark only slightly regresses, by about 5%. The httpd example ported to axum sees an increase in throughput of 500-600%. Co-authored-by: Louis Vialar <louis.vialar@gmail.com>
Thanks for bringing back this patch! I'm concerned that the executor running far less often may cause other issues. Then again, the
Yeah, I think this is happening in CI. Of course, one of the issues with our executor is that many things are actually blocking, which is bad. Making things properly non-blocking should make polling the executor here a non-issue. On the other hand, I agree that if something breaks due to this PR, that is probably a bug and should be resolved.
Benchmark Results
| Benchmark | Current: e67a10f | Previous: 7c500d8 | Performance Ratio |
|---|---|---|---|
| startup_benchmark Build Time | 90.87 s | 91.25 s | 1.00 ❗ |
| startup_benchmark File Size | 0.81 MB | 0.79 MB | 1.02 ❗ |
| Startup Time - 1 core | 0.96 s (±0.03 s) | 0.93 s (±0.03 s) | 1.03 |
| Startup Time - 2 cores | 0.97 s (±0.03 s) | 0.94 s (±0.03 s) | 1.03 |
| Startup Time - 4 cores | 0.99 s (±0.03 s) | 0.94 s (±0.02 s) | 1.05 ❗ |
| multithreaded_benchmark Build Time | 86.72 s | 88.17 s | 0.98 ❗ |
| multithreaded_benchmark File Size | 0.91 MB | 0.89 MB | 1.02 ❗ |
| Multithreaded Pi Efficiency - 2 Threads | 88.71 % (±7.78 %) | 90.13 % (±10.09 %) | 0.98 |
| Multithreaded Pi Efficiency - 4 Threads | 43.62 % (±3.09 %) | 45.06 % (±4.71 %) | 0.97 |
| Multithreaded Pi Efficiency - 8 Threads | 25.50 % (±1.63 %) | 25.74 % (±2.31 %) | 0.99 |
| micro_benchmarks Build Time | 94.77 s | 93.18 s | 1.02 ❗ |
| micro_benchmarks File Size | 0.92 MB | 0.90 MB | 1.02 ❗ |
| Scheduling time - 1 thread | 72.96 ticks (±3.77 ticks) | 71.27 ticks (±4.26 ticks) | 1.02 |
| Scheduling time - 2 threads | 40.96 ticks (±6.09 ticks) | 39.04 ticks (±5.20 ticks) | 1.05 |
| Micro - Time for syscall (getpid) | 3.09 ticks (±0.25 ticks) | 2.97 ticks (±0.30 ticks) | 1.04 |
| Memcpy speed - (built_in) block size 4096 | 65436.08 MByte/s (±47028.44 MByte/s) | 66371.85 MByte/s (±47142.10 MByte/s) | 0.99 |
| Memcpy speed - (built_in) block size 1048576 | 29657.45 MByte/s (±24588.04 MByte/s) | 29496.16 MByte/s (±24371.57 MByte/s) | 1.01 |
| Memcpy speed - (built_in) block size 16777216 | 27575.23 MByte/s (±23021.49 MByte/s) | 27947.10 MByte/s (±23282.80 MByte/s) | 0.99 |
| Memset speed - (built_in) block size 4096 | 66053.38 MByte/s (±47416.08 MByte/s) | 66660.22 MByte/s (±47324.52 MByte/s) | 0.99 |
| Memset speed - (built_in) block size 1048576 | 30460.31 MByte/s (±25032.45 MByte/s) | 30239.65 MByte/s (±24809.14 MByte/s) | 1.01 |
| Memset speed - (built_in) block size 16777216 | 28336.00 MByte/s (±23452.91 MByte/s) | 28701.07 MByte/s (±23711.18 MByte/s) | 0.99 |
| Memcpy speed - (rust) block size 4096 | 56917.41 MByte/s (±42026.53 MByte/s) | 59143.47 MByte/s (±43480.35 MByte/s) | 0.96 |
| Memcpy speed - (rust) block size 1048576 | 29396.25 MByte/s (±24539.08 MByte/s) | 29331.43 MByte/s (±24300.99 MByte/s) | 1.00 |
| Memcpy speed - (rust) block size 16777216 | 27420.15 MByte/s (±22866.37 MByte/s) | 28046.81 MByte/s (±23386.04 MByte/s) | 0.98 |
| Memset speed - (rust) block size 4096 | 57564.84 MByte/s (±42408.57 MByte/s) | 59676.52 MByte/s (±43806.15 MByte/s) | 0.96 |
| Memset speed - (rust) block size 1048576 | 30204.88 MByte/s (±24996.22 MByte/s) | 30122.22 MByte/s (±24741.83 MByte/s) | 1.00 |
| Memset speed - (rust) block size 16777216 | 28032.05 MByte/s (±23163.63 MByte/s) | 28769.44 MByte/s (±23781.83 MByte/s) | 0.97 |
| alloc_benchmarks Build Time | 93.51 s | 92.51 s | 1.01 ❗ |
| alloc_benchmarks File Size | 0.88 MB | 0.86 MB | 1.02 ❗ |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 11197.85 Ticks (±194.00 Ticks) | 10757.47 Ticks (±159.10 Ticks) | 1.04 ❗ |
| Allocations - Average Allocation time (no fail) | 11197.85 Ticks (±194.00 Ticks) | 10757.47 Ticks (±159.10 Ticks) | 1.04 ❗ |
| Allocations - Average Deallocation time | 1082.14 Ticks (±540.08 Ticks) | 1302.69 Ticks (±790.51 Ticks) | 0.83 |
| mutex_benchmark Build Time | 92.90 s | 93.08 s | 1.00 ❗ |
| mutex_benchmark File Size | 0.91 MB | 0.90 MB | 1.02 ❗ |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 13.10 ns (±0.70 ns) | 13.32 ns (±0.71 ns) | 0.98 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 23.40 ns (±14.93 ns) | 18.58 ns (±6.34 ns) | 1.26 |
This comment was automatically generated by a workflow using github-action-benchmark.
This was suggested by @zyuiop in #2086 (comment).