Ok, I made some quick benchmarks, comparing LuaJIT git HEAD and Dalvik 1.2.0 (Android 2.2.1) on the MSM 7201A (528 MHz ARM11 soft-fp). I don't have access to a newer Dalvik VM for this device. The Dalvik JIT compiler is definitely enabled, because performance gets much worse when I disable it.
Here are SciMark scores (HIGHER numbers are better):
ARM11 SciMark SMALL   | FFT    SOR    MC     SPARSE LU
----------------------+---------------------------------
Lua 5.1.4        0.60 | 0.50   0.92   0.36   0.55   0.69
LuaJIT git       4.34*| 2.61*  6.70*  3.91*  3.13   5.36*
Dalvik 1.2.0+JIT 3.35 | 2.35   5.65   1.09   3.39*  4.27
Those are not too relevant, though, since it's a soft-float device. The maximum speedup is limited by the high cost of the soft-float operations (e.g. 62 cycles for a double-precision FP ADD).
And here are some simple integer benchmarks, run time in seconds (LOWER numbers are better):
Note that binary-trees is a GC-intensive benchmark where LuaJIT usually loses against Java VMs, since they have a much better GC. Not so for Dalvik, it seems.
The winner for each benchmark is marked with a '*'. Looks good. ;-)
How does LuaJIT decide to use int vs. float, given that Lua has a single number type with floating point semantics? Do you notice that a variable happens to always contain an int, then compile it as an int with guards for overflow and non-integral division?
LuaJIT/ARM uses the dual-number VM mode, where a number is represented either as a 32 bit integer or as a double. It uses lazy normalization, so conversions happen as needed. This is invisible to the user, but internally there are two different number types.
So there's usually already a strong indicator whether a variable holds an integer or a double, just from looking at the internal type. The interpreter has two paths for all operations on numbers, and the integer path is the fast path. The JIT compiler adds guards to check for the proper types and emits specialized code.
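For example (just a sketch; the internal representation is not observable from Lua code), an integer that overflows the 32 bit range is transparently widened to a double:

  local x = 2147483647   -- 2^31-1, still fits into a 32 bit integer
  x = x + 1              -- overflow guard fails, operation is redone on doubles
  print(x)               --> 2147483648

Plain Lua number semantics are preserved either way.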
Also, the JIT compiler proactively narrows doubles to integers, whenever it's beneficial to do so. The logic is quite involved -- you can take a look at the big comment block in lj_opt_narrow.c.
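A typical case where narrowing pays off (an illustrative sketch, not the exact heuristic) is induction variables and index arithmetic:

  local t = {}
  for i = 1, 1000 do t[i] = i * 0.5 end
  local s = 0
  for i = 1, #t do
    -- the induction variable i and the index are narrowed to integers,
    -- while the accumulator s stays a double
    s = s + t[i]
  end
  print(s)   --> 250250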
This optimization is also active on e.g. x86/x64, where there's only a single underlying number representation (a double). But I should mention that on x86/x64 it doesn't necessarily pay off to perform _all_ operations on integer types, since this would waste the extra FPU registers and the massive extra bandwidth of the FP units in these chips. The branch unit is already quite busy and you're effectively serializing the code with all of those overflow checks.
So the JIT-generated machine code for x86/x64 and for ARM may be quite different for the same inputs. And I'm not talking about the instruction set differences alone.
E.g. compare the output of these two commands:
luajit -jdump -e "local x=0; for i=1,100 do x=x+1 end"
luajit -jdump -e "local x=0.5; for i=1,100 do x=x+1 end"
The generated code for the inner loop on x86/x64 is the same in both cases:
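It looks roughly like this (an illustrative sketch; exact registers, offsets and exit numbers depend on the LuaJIT version and build):

  ->LOOP:
  addsd xmm7, xmm6      ; x = x + 1, carried out on doubles
  add ebp, +0x01        ; the loop counter stays an integer
  cmp ebp, +0x64
  jle ->LOOP

On ARM, the dual-number VM keeps x = 0 as an integer, so the first command instead compiles to an integer add with an overflow guard, roughly:

  ->LOOP:
  mov r10, r4           ; preserve x for side exit 2
  adds r4, r4, #1       ; x = x + 1, sets the overflow flag
  bvs ->2               ; overflow? take exit 2 and redo on doubles
  add r11, r11, #1      ; i = i + 1
  cmp r11, #100
  ble ->LOOP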
Note: the mov r10, r4 is there to preserve the value prior to the potentially overflowing calculation, for use in exit 2. Yes, one could avoid that for an addition by undoing the calculation in the exit. But this won't work in general, e.g. for a multiplication: the wrapped 32 bit result no longer determines the original operand.