Reinforcement Learning

Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings

Modal-mixed chain-of-thought lets a vision-language model (VLM) interleave text with compact latent visual “sketches”. The approach combines a diffusion-based latent decoder with supervised fine-tuning (SFT) and reinforcement learning (RL) to boost vision-intensive reasoning while adding only modest inference overhead.