The insight is that on a modern processor memory bandwidth is scarce, but the core can issue several floating-point ops per clock. So multiplying and adding values that are in L1 or already in registers is cheap. Streaming algorithms that walk large chunks of RAM will be repeatedly memory starved, so anything you can do to effectively increase memory bandwidth is valuable.
Say you have 8-bit integers: store them packed in RAM, then unpack upon reading. So instead of storing an array of int[], you have an internal array of long[] and you read it through a function. Each memory read will suck in 8 values at once.
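A minimal sketch of the long[] packing idea, assuming unsigned 8-bit values (class and method names are mine, not from any library):

```java
// Hypothetical sketch: eight 8-bit values packed into each long,
// unpacked on read with a shift and mask.
public class Packed8 {
    private final long[] mem;

    public Packed8(int n) {
        mem = new long[(n + 7) / 8]; // eight values per long
    }

    public void set(int idx, int value) { // value must fit in 8 bits
        int word = idx >>> 3;             // which long
        int shift = (idx & 7) * 8;        // which byte within it
        mem[word] = (mem[word] & ~(0xFFL << shift))
                  | ((long) (value & 0xFF) << shift);
    }

    public int get(int idx) {
        // Shift the right byte down and mask: no memory traffic beyond mem[word].
        return (int) (mem[idx >>> 3] >>> ((idx & 7) * 8)) & 0xFF;
    }
}
```

The shift and mask are a couple of ALU ops against a value that's already in a register, so the only cost that matters is the single long[] load.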
The same technique works for floats or ints with a small-ish range and limited precision; you can store a scale and offset, then pack on write / unpack on read. It's common to be able to quadruple your effective memory bandwidth, and the read operation -- i.e.
// instead of:
double[] _data;
// accessed as
_data[idx];
// instead you do
getd(idx);
// using the below
byte[] _mem;
float _bias, _scale;
double getd(int idx){
  long res = _mem[idx] & 0xFF; // mask: treat the stored byte as unsigned
  return (res + _bias) * _scale;
}
executes entirely from registers, and is essentially free. The price of all this is you have to process your data on ingestion, but if you run iterative algorithms -- like convex optimizers -- that repeatedly walk your entire dataset, this is often a big win. You can often lose some of the low precision bits on the float or double, but those probably don't matter much anyway.
Like anything else, you'll have to measure.