Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings
Modal-mixed chain-of-thought lets a vision-language model (VLM) interleave text with compact latent visual sketches. The approach uses a diffusion-based latent decoder and combined supervised fine-tuning (SFT) and reinforcement learning (RL) training to boost vision-intensive reasoning while adding only modest inference overhead.
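
To make the interleaving idea concrete, here is a minimal sketch of a reasoning chain that mixes text steps with latent steps. Everything in it is an illustrative assumption rather than the paper's implementation: the `<latent>` control token, the `TextStep` and `LatentStep` types, the scripted token list standing in for the model, and `toy_latent_decoder`, which only imitates the iterative-refinement shape of a diffusion decoder by moving noise toward a made-up target.

```python
import random
from dataclasses import dataclass
from typing import List, Union

LATENT_TOKEN = "<latent>"  # hypothetical control token, not taken from the source


@dataclass
class TextStep:
    token: str


@dataclass
class LatentStep:
    vectors: List[List[float]]  # a compact latent "sketch": a few small vectors


def toy_latent_decoder(context: List[str], num_vectors: int = 4, dim: int = 8,
                       steps: int = 10, seed: int = 0) -> List[List[float]]:
    """Toy stand-in for a diffusion-style decoder (assumption, not a real model).

    Starts from Gaussian noise and moves it step by step toward a target that
    depends only on the context length, mimicking iterative denoising.
    """
    rng = random.Random(seed + len(context))
    target = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_vectors)]
    x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_vectors)]
    for t in range(steps, 0, -1):
        # Close 1/t of the remaining gap at each step; the last step lands on target.
        x = [[xi + (ti - xi) / t for xi, ti in zip(x_row, t_row)]
             for x_row, t_row in zip(x, target)]
    return x


def decode_mixed_chain(scripted_tokens: List[str]) -> List[Union[TextStep, LatentStep]]:
    """Build a modal-mixed chain from a scripted token stream.

    Ordinary tokens become text steps; the control token triggers the latent
    decoder, whose output is inserted into the chain in place of text.
    """
    chain: List[Union[TextStep, LatentStep]] = []
    context: List[str] = []
    for tok in scripted_tokens:
        if tok == LATENT_TOKEN:
            chain.append(LatentStep(toy_latent_decoder(context)))
        else:
            chain.append(TextStep(tok))
        context.append(tok)
    return chain


chain = decode_mixed_chain(["The", "shape", LATENT_TOKEN, "is", "a", "square", "."])
kinds = [type(step).__name__ for step in chain]
print(kinds)
```

In this sketch the latent step occupies a single position in the chain but carries only four 8-dimensional vectors, which hints at why such sketches could add little inference overhead compared with generating full images. The sizes are arbitrary choices for the example.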