You can build image networks (U-Net-like things) by chunking the image into rectangles in 2D (with some convolution steps)... I wonder if there's an image-specific architecture along these lines that could work well?
Perhaps something like this: https://neurips.cc/virtual/2024/poster/94115
I haven't looked up what their actual tokenization strategy is, though, or whether switching to hierarchical (H-Net) chunks would be possible.
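
To make the "chunking rectangles in 2D" idea concrete, here is a rough sketch of one coarsening step: split a grid into fixed, non-overlapping rectangles and collapse each to a single value. This is only an illustration under my own assumptions. The names `chunk_2d` and `mean_pool` are made up, the chunks are a fixed grid rather than anything learned, and mean pooling is just a stand-in for whatever convolution/downsampling step a real network would use. It is not taken from the linked poster or from H-Net.

```python
def chunk_2d(image, patch_h, patch_w):
    """Split a 2D grid (list of rows) into non-overlapping patch_h x patch_w rectangles.

    result[i][j] is the patch whose top-left corner is at
    (i * patch_h, j * patch_w). Assumes the image divides evenly into patches.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        [
            [row[left:left + patch_w] for row in image[top:top + patch_h]]
            for left in range(0, width, patch_w)
        ]
        for top in range(0, height, patch_h)
    ]


def mean_pool(patches):
    """Collapse each patch to its mean, giving one coarser value per rectangle."""
    return [
        [sum(sum(r) for r in patch) / (len(patch) * len(patch[0])) for patch in row]
        for row in patches
    ]


# Toy 4x4 "image" with values 0..15, chunked into 2x2 rectangles.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = chunk_2d(image, 2, 2)
coarse = mean_pool(patches)  # 2x2 grid, one value per rectangle
```

Applying `chunk_2d` again to `coarse` would give the next level of the hierarchy, which is roughly the U-Net-style downsampling path. A hierarchical-chunk variant would presumably need to pick the rectangle boundaries from the data instead of using a fixed grid like this.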