Optimize every case where the allocation can be eliminated: when all real uses are inlined, delaying allocation to the branches where the full object is needed, allowing a stack pointer to be passed to functions that don’t care whether the pointer is to the heap, etc.
That’s actually a very minor change: a single line in codegen and maybe a few more lines in the signal handler, depending on how accurate you want it to be.
Thanks for the elaboration, I really appreciate it! Sorry for pestering you so much, but may I ask for a little more elaboration? All three optimizations you listed are very cool. Delaying allocation to branches that end in throw/unreachable would be especially nice.
So, to check that I understood this right:
(1) “when all real uses are inlined”: I thought we already remove allocations if every use is inlined?
(2) “delay allocation to branches where the full object is needed”: Delaying allocation looks like something that should be done at the level of the SSA-IR optimizer, without touching codegen at all?
(3) “allow passing stack pointer to functions”: Passing stack pointers to non-inlined functions looks like it would need more function attributes (can they leak refs?). Otherwise it seems to belong close to LLVM / in codegen, without touching the SSA-IR optimizer at all, except for potentially generating the “can’t leak references” attribute?
(4) “Collapsing pointer-chains”: That would need an ABI change: implicit (pointer-chain-collapsed) objects would effectively be passed by value, and the callee would need to instantiate a new wrapper object if it wants to pass one on to a @nospecialize function. As far as I understand, such situations are rare in performance-relevant code. But pointer-chain collapse would indeed produce overhead / extra allocations in such a situation. On the other hand, (2) and (3) would alleviate this potential performance impact.
If I understood you right, the last point is the reason you want the “collapse of pointer chains” to be tackled at a later date, even though the necessary changes for (4) are mostly independent of (1)-(3)?
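To make sure we are talking about the same patterns, here is a minimal sketch of what I have in mind for the three cases. The type `Point` and all function names are made up for illustration, and whether any particular Julia version actually removes these allocations is exactly the open question, so I make no claim about that here.

```julia
# Hypothetical example type; mutable, so it is normally heap-allocated.
mutable struct Point
    x::Float64
    y::Float64
end

# Case "all real uses are inlined": `p` is only ever read through field
# accesses that end up inlined into the caller.
@inline norm2(p::Point) = p.x * p.x + p.y * p.y
all_uses_inlined(x, y) = norm2(Point(x, y))

# Case "delay allocation to the branch where the full object is needed":
# the whole object is only required on the branch that ends in `throw`.
function checked_norm2(x, y)
    p = Point(x, y)
    if isnan(x)
        # Only here is `p` needed as a real object (it is printed).
        throw(ArgumentError("bad point: $p"))
    end
    return p.x * p.x + p.y * p.y
end

# Case "pass a stack pointer": the callee is not inlined, but it only reads
# through the reference and never stores or returns it.
@noinline read_only(p::Point) = p.x + p.y
stack_pointer_candidate(x, y) = read_only(Point(x, y))
```

One way to probe what a given build does would be to call each function once to compile it and then compare `@allocated` for the three callers.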
If you mean (4) as an optimization, then it is on the table and can be implemented at any time, with overhead in the type-unstable case, and that’s not what I’m talking about. This thread has been about user-visible changes, though, and those should be made only after the optimization options are exhausted.
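For reference, the type-unstable case mentioned here can be sketched roughly as follows. `ArrayWrapper`, `first_elem`, `describe` and `demo` are hypothetical names, and the comments describe the scenario discussed in this thread rather than the behavior of any specific compiler version.

```julia
# Hypothetical immutable wrapper around a heap reference (a "pointer chain":
# wrapper -> array).
struct ArrayWrapper
    data::Vector{Float64}
end

# Specialized callee: only the wrapped array is really needed, so a collapsed
# wrapper could in principle be passed as just the inner reference.
first_elem(w::ArrayWrapper) = w.data[1]

# Non-specialized callee: it receives an opaque value, so the wrapper has to
# exist as a real object at this call. This is where the extra
# allocation would show up if the wrapper had been collapsed.
@noinline describe(@nospecialize(x)) = typeof(x)

function demo()
    w = ArrayWrapper([1.0, 2.0])
    return first_elem(w), describe(w)
end
```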