Ok, I made some quick benchmarks, comparing LuaJIT git HEAD and Dalvik 1.2.0 (Android 2.2.1) on the MSM 7201A (528 MHz ARM11 soft-fp). I don't have access to a newer Dalvik VM for this device. The Dalvik JIT compiler is definitely enabled, because performance gets much worse when I disable it.
Here are SciMark scores (HIGHER numbers are better):
ARM11 SciMark SMALL   | FFT    SOR    MC     SPARSE LU
----------------------+---------------------------------
Lua 5.1.4        0.60 | 0.50   0.92   0.36   0.55   0.69
LuaJIT git       4.34*| 2.61*  6.70*  3.91*  3.13   5.36*
Dalvik 1.2.0+JIT 3.35 | 2.35   5.65   1.09   3.39*  4.27
Those are not too relevant, though, since it's a soft-float device. The maximum speedup is limited by the high cost of the soft-float operations (e.g. 62 cycles for a double-precision FP ADD).
And here are some simple integer benchmarks, run time in seconds (LOWER numbers are better):
Note that binary-trees is a GC-intensive benchmark where LuaJIT usually loses against Java VMs, since they have a much better GC. Not so for Dalvik, it seems.
The winner for each benchmark is marked with a '*'. Looks good. ;-)
How does LuaJIT decide to use int vs. float, given that Lua has a single number type with floating point semantics? Do you notice that a variable happens to always contain an int, then compile it as an int with guards for overflow and non-integral division?
LuaJIT/ARM uses the dual-number VM mode, where a number is represented either as a 32 bit integer or as a double. It uses lazy normalization, so conversions happen as needed. This is invisible to the user, but internally there are two different number types.
So there's usually already a strong indicator whether a variable holds an integer or a double, just from looking at the internal type. The interpreter has two paths for all operations on numbers, and the integer path is the fast path. The JIT compiler adds guards to check for the proper types and emits specialized code.
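For example (just a sketch; the internal representation is not observable from Lua code), an integer that overflows the 32 bit range is transparently widened to a double:

  local x = 2147483647   -- 2^31-1, still fits into a 32 bit integer
  x = x + 1              -- overflow guard fails, operation is redone on doubles
  print(x)               --> 2147483648

Plain Lua number semantics are preserved either way.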
Also, the JIT compiler proactively narrows doubles to integers, whenever it's beneficial to do so. The logic is quite involved -- you can take a look at the big comment block in lj_opt_narrow.c.
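A typical case where narrowing pays off (an illustrative sketch, not the exact heuristic) is induction variables and index arithmetic:

  local t = {}
  for i = 1, 1000 do t[i] = i * 0.5 end
  local s = 0
  for i = 1, #t do
    -- the induction variable i and the index are narrowed to integers,
    -- while the accumulator s stays a double
    s = s + t[i]
  end
  print(s)   --> 250250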
This optimization is also active on e.g. x86/x64, where there's only a single underlying number representation (a double). But I should mention that on x86/x64 it doesn't necessarily pay off to perform _all_ operations on integer types, since this would waste the extra FPU registers and the massive extra bandwidth of the FP units in these chips. The branch unit is already quite busy and you're effectively serializing the code with all of those overflow checks.
So the JIT-generated machine code for x86/x64 and for ARM may be quite different for the same inputs. And I'm not talking about the instruction set differences alone.
E.g. compare the output of these two commands:
luajit -jdump -e "local x=0; for i=1,100 do x=x+1 end"
luajit -jdump -e "local x=0.5; for i=1,100 do x=x+1 end"
The generated code for the inner loop on x86/x64 is the same in both cases:
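It looks roughly like this (an illustrative sketch; exact registers, offsets and exit numbers depend on the LuaJIT version and build):

  ->LOOP:
  addsd xmm7, xmm6      ; x = x + 1, carried out on doubles
  add ebp, +0x01        ; the loop counter stays an integer
  cmp ebp, +0x64
  jle ->LOOP

On ARM, the dual-number VM keeps x = 0 as an integer, so the first command instead compiles to an integer add with an overflow guard, roughly:

  ->LOOP:
  mov r10, r4           ; preserve x for side exit 2
  adds r4, r4, #1       ; x = x + 1, sets the overflow flag
  bvs ->2               ; overflow? take exit 2 and redo on doubles
  add r11, r11, #1      ; i = i + 1
  cmp r11, #100
  ble ->LOOP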
Note: the mov r10, r4 is there to preserve the value prior to the potentially overflowing calculation, for use in exit 2. Yes, one could avoid that for an addition by undoing the calculation in the exit. But this won't work in general, e.g. for a multiplication: the wrapped 32 bit result no longer determines the original operand.