gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966
gh-146073: Add fitness/exit quality mechanism for JIT trace frontend#147966Fidget-Spinner merged 12 commits intopython:mainfrom
Conversation
Fidget-Spinner
left a comment
There was a problem hiding this comment.
Great work. Thanks for doing this!
|
I also forgot to mention: but we should adjust for the following:
|
markshannon
left a comment
There was a problem hiding this comment.
This looks good, and thanks for doing this.
Overall, the approach looks good.
I originally had in a mind that the fitness would reduce geometrically, not arithmetically, but your arithmetic approach looks easier to reason about and at least as good in terms of trace quality.
We need to reduce the number of configurable parameters to just one or two.
Then we can make sure that fitness and penalties are set such that we are guaranteed not to overflow the trace or optimizer buffers.
I'm not sure that rewinding is worth it. As long as "good" exits have a much higher score than "bad" exits, then we should (almost) always end up at a good exit.
Include/internal/pycore_optimizer.h
Outdated
| /* Default fitness configuration values for trace quality control. | ||
| * These can be overridden via PYTHON_JIT_FITNESS_* environment variables. */ | ||
| #define FITNESS_PER_INSTRUCTION 2 | ||
| #define FITNESS_INITIAL (UOP_MAX_TRACE_LENGTH * FITNESS_PER_INSTRUCTION) |
There was a problem hiding this comment.
I'd stick with 2000. We really don't need traces of over 1000 uops; it is far more important to find a good stopping point than to have very long traces
There was a problem hiding this comment.
@markshannon 1000 uops is only roughly 300 uops after optimization which corresponds to roughly only 60 bytecode instructions. That's too low. A reminder that the trace recording spews out a lot of uops that is only optimized away by the optimizer.
There was a problem hiding this comment.
@cocolato I suspect richards perf regression might be due to too low trace length, I experimented with richards before and it's quite sensitive to how well the optimizer performs
There was a problem hiding this comment.
Yes, and I tried reducing the branch/frame penalty, which also improve richards' performance.
| trace->end -= needs_guard_ip; | ||
|
|
||
| int space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode)); | ||
| if (uop_buffer_remaining_space(trace) < space_needed) { |
There was a problem hiding this comment.
| assert (uop_buffer_remaining_space(trace) > space_needed); |
If we choose the fitness and exit values correctly, we can't run out of space.
There was a problem hiding this comment.
I'm not sure if the current parameters allow us to remove the check here, so I think it's better to keep it for now.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
|
I’ll try to optimize the relevant parameters further, as performance in benchmarks like |
|
It might be worth thinking about a few scenarios to help choose parameters.
|
| static inline int32_t | ||
| compute_frame_penalty(const _PyOptimizationConfig *cfg) | ||
| { | ||
| return (int32_t)cfg->fitness_initial / 10 + 1; | ||
| } |
There was a problem hiding this comment.
The newest result with lower frame penalty:
+----------------------+----------+-----------------------+
| Benchmark | baseline | fitness |
+======================+==========+=======================+
| raytrace | 262 ms | 212 ms: 1.24x faster |
+----------------------+----------+-----------------------+
| pickle_pure_python | 277 us | 254 us: 1.09x faster |
+----------------------+----------+-----------------------+
| go | 83.7 ms | 78.8 ms: 1.06x faster |
+----------------------+----------+-----------------------+
| xml_etree_iterparse | 74.1 ms | 69.9 ms: 1.06x faster |
+----------------------+----------+-----------------------+
| xml_etree_process | 50.4 ms | 48.3 ms: 1.04x faster |
+----------------------+----------+-----------------------+
| xml_etree_generate | 77.3 ms | 74.1 ms: 1.04x faster |
+----------------------+----------+-----------------------+
| xml_etree_parse | 122 ms | 119 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| regex_compile | 103 ms | 99.7 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| deltablue | 2.19 ms | 2.14 ms: 1.03x faster |
+----------------------+----------+-----------------------+
| unpickle_pure_python | 171 us | 167 us: 1.02x faster |
+----------------------+----------+-----------------------+
| regex_effbot | 2.16 ms | 2.12 ms: 1.02x faster |
+----------------------+----------+-----------------------+
| fannkuch | 253 ms | 249 ms: 1.02x faster |
+----------------------+----------+-----------------------+
| json_loads | 19.3 us | 19.1 us: 1.01x faster |
+----------------------+----------+-----------------------+
| json_dumps | 7.60 ms | 7.66 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| pidigits | 136 ms | 138 ms: 1.01x slower |
+----------------------+----------+-----------------------+
| pyflate | 267 ms | 273 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| float | 45.2 ms | 46.2 ms: 1.02x slower |
+----------------------+----------+-----------------------+
| richards | 16.5 ms | 17.3 ms: 1.05x slower |
+----------------------+----------+-----------------------+
| Geometric mean | (ref) | 1.03x faster |
+----------------------+----------+-----------------------+
|
By analyzing specific trace paths in the richards, I found that the frame penalty has the greatest impact on fitness: |
|
Very impressive, on fastmark, I see a roughly 1% speedup on my x86_64 Linux system, and on raytrace specifically it's quite a big bump: Note: my system is kind of old now (i7-12700H) |
Fidget-Spinner
left a comment
There was a problem hiding this comment.
LGTM. just one comment
| // so there's no point continuing the trace. | ||
| DPRINTF(2, "Unsupported: frame depth %d >= MAX_ABSTRACT_FRAME_DEPTH\n", | ||
| ts_depth->frame_depth); | ||
| goto unsupported; |
There was a problem hiding this comment.
This should goto somewhere that we rewind and insert _EXIT_TRACE. The current unsupported instead inserts _DEOPT.
Perhaps try assigning int end_trace_opcode = _DEOPT, and when it hits this branch, write end_trace_opcode = _EXIT_TRACE then assign curr->opcode = end_trace_opcode;.
|
I have made the requested changes; please review again. |
|
Thanks for making the requested changes! @markshannon: please review the changes made to this pull request. |
|
Thank you all for taking the time to review this! Do the current parameters meet our expectations? |
Changes have been addressed. Any minor tweaks can be done as follow-up PRs. Thanks Mark!
|
I see a slight (~8%) slowdown in Richards on my system. However, I think that's acceptable considering most other things sped up. |
Background
See: #146073
Introduced a preliminary fitness/exit quality mechanism for JIT trace frontend has, enabling the tracer to:
ENTER_EXECUTOR) to avoid stopping on instructions that can be optimized