Sorry, I have no quantitative comparison for that. It's baked into the design of the register allocator to schedule loads early and to rematerialize constants far ahead of their use. I'm also considering adding an extra scheduling pass for some cases that are currently not covered well.

From experience in tuning the interpreter, instruction scheduling (esp. load scheduling) can easily make a 20-50% difference on some tight loops for a simple in-order architecture. This is much less of an issue with an out-of-order architecture, of course. It's still beneficial to do instruction scheduling for all ARM architectures, because even their out-of-order engines apparently fall far short of the ones on contemporary x86/x64 CPUs.
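
To illustrate why load scheduling matters so much on an in-order core, here is a toy cycle model in Python. It is only a sketch and is not taken from the register allocator or the interpreter: the `Insn` type, the `cycles_in_order` helper, the single-issue pipeline and the latencies (3 cycles for a load, 1 for an ALU op) are all assumptions made up for the example.

```python
from typing import NamedTuple, Tuple

class Insn(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction
    latency: int            # cycles until the result is usable (assumed)

def cycles_in_order(program):
    """Cycle count on a hypothetical single-issue, in-order core.

    An instruction issues one cycle after its predecessor at the
    earliest, and stalls until all of its sources are ready.
    """
    ready = {}    # register -> cycle at which its value is available
    cycle = 0     # earliest cycle the next instruction may issue
    finish = 0
    for insn in program:
        issue = max([cycle] + [ready.get(s, 0) for s in insn.srcs])
        ready[insn.dest] = issue + insn.latency
        cycle = issue + 1
        finish = max(finish, ready[insn.dest])
    return finish

LOAD, ALU = 3, 1  # assumed latencies, not measured on real hardware

# Each load is immediately followed by its use: the core stalls twice.
naive = [
    Insn("a", ("p",), LOAD),
    Insn("x", ("a",), ALU),
    Insn("b", ("q",), LOAD),
    Insn("y", ("b",), ALU),
    Insn("z", ("x", "y"), ALU),
]

# Same instructions with both loads hoisted: the latencies overlap.
scheduled = [
    Insn("a", ("p",), LOAD),
    Insn("b", ("q",), LOAD),
    Insn("x", ("a",), ALU),
    Insn("y", ("b",), ALU),
    Insn("z", ("x", "y"), ALU),
]

if __name__ == "__main__":
    print("naive:    ", cycles_in_order(naive), "cycles")
    print("scheduled:", cycles_in_order(scheduled), "cycles")
```

With these made-up latencies the naive order takes 9 cycles and the hoisted order takes 6. The exact numbers mean nothing for real hardware. The point is only that an in-order core pays the full load latency whenever the use directly follows the load, while an out-of-order core would hide much of it on its own.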