This feels like the kind of thing that's so obvious it's hard to believe it isn't being pursued... Google could, for example, make a _huge_ splash with the Pixel 10 by presenting this with the option of after-the-fact optical zoom or wide angle shots, or using the multiple lenses for some fancy fusing for additional detail. And to your point, DSLRs have been doing deferred readout in the sense of storing to an on-device cache before writing out to the SD card while waiting for previous frames to complete their write ops... this same sort of concept should be able to apply here.
I don't know much more about the computational photography pipeline, but I imagine there might be some tricky bits around focusing across multiple lenses simultaneously, around managing the slight off-axis offset (though that feels more trivial nowadays), and, as you say, around reading from the sensors into memory, but then also how to practically merge or not-merge the various shots. Google already does this with stacked photos that include, say, a computationally blurred/portrait shot alongside the primary sensor capture before that processing was done, so the bones are there for something similar... but to really take advantage of it would likely require some more work.
But this is all by way of saying, this would be really really cool and would open up a lot of potential opportunities.
The biggest issue with doing this, for most people, is that each of your photos is now 3x the size, so they'd need to spend more on their phones and/or cloud storage.
What we really need is a better way of paring down the N photos you take of a given subject into the one or two ideal (lens, adjustments, crop) tuples.
I’m imagining you open a “photo session” when you open the camera, and all photos taken in that session are grouped together. Later, you can go into each session and some AI or whatever spits out a handful of top edits for you to consider, and you delete the rest.
Use case is taking photos of children or other animals, where you need approximately 50 photos to get one where they're looking at the camera with their eyes open and a smile; today you have to manually perform some atrocious O(N*K) procedure to pick the best K of the N photos.
Even a simple "A vs B" selection process would be an improvement over what most of us spray'n'pray photographers currently do. It's been forever on my list of mobile apps I might want to write (I expect that similar things exist, but I also kind of expect that they're all filled with other features that I really would not want to have).
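The "A vs B" idea is cheap in comparisons, too: a single-elimination pass over the roll finds a winner in N-1 pairwise choices rather than an O(N*K) scan. A minimal sketch (the `prefer` callback and the numeric "sharpness scores" standing in for photos are my own hypothetical placeholders, not anything an existing app does):

```python
import random

def pick_best(photos, prefer):
    """Single-elimination 'A vs B' pass: each round halves the pool,
    so choosing one winner costs N-1 comparisons in total."""
    pool = list(photos)
    while len(pool) > 1:
        survivors = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            survivors.append(a if prefer(a, b) else b)
        if len(pool) % 2:            # odd photo out gets a bye
            survivors.append(pool[-1])
        pool = survivors
    return pool[0]

# Hypothetical stand-in: photos are numeric sharpness scores and the
# user always taps the sharper of the two.
shots = [random.random() for _ in range(50)]
best = pick_best(shots, prefer=lambda a, b: a >= b)
print(best == max(shots))
```

With a consistent preference this finds the global favorite; with a fickle human tapping buttons it still converges on *something* in 49 taps for a 50-shot burst, which is the real selling point. Picking the best K instead of the best 1 would mean re-running the bracket on the losers, still far short of full pairwise comparison.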
What if you make an AVIF image sequence, with the zoomed photo followed by the wide angle photo? Presumably AV1 is smart enough to compress the second based on the first.
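As a rough sanity check of the "compress the second frame against the first" intuition, here's a toy delta sketch, with `zlib` standing in for AV1's (far smarter) inter-frame prediction and a sparse byte-flip standing in for the real geometric difference between a zoomed and a wide frame, which would actually involve scaling and warping:

```python
import random
import zlib

rng = random.Random(0)
zoomed = rng.randbytes(128 * 1024)        # stand-in for the zoomed frame
wide = bytearray(zoomed)
for i in range(0, len(wide), 1000):       # wide frame differs only slightly
    wide[i] ^= 0xFF
wide = bytes(wide)

# Compressing the wide frame by itself vs. compressing only its
# difference from the zoomed frame it largely shares content with.
standalone = len(zlib.compress(wide))
delta = bytes(a ^ b for a, b in zip(wide, zoomed))
predicted = len(zlib.compress(delta))

print(standalone, predicted)
```

The delta stream comes out orders of magnitude smaller, which is the redundancy an AVIF image sequence could in principle exploit; how much AV1 actually recovers across a zoom-level change is an open question, since its motion prediction has to cope with the scale difference between the two lenses.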