STARS - Segment-level Token Alignment via Rejection Sampling in Large Language Models

(Quamar et al., 2026)

We propose STARS (Segment-level Token Alignment via Rejection Sampling), a decoding method that improves the alignment of large language models (LLMs) with human preferences without costly retraining. Instead of scoring only complete responses or every individual token, STARS splits the output into short, fixed-size segments. A segment is accepted only if its reward meets an adaptive threshold; otherwise it is rejected and resampled, which steers generation toward high-quality, harmless, and preference-aligned responses.
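The accept/reject loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `segment_rejection_decode`, the toy sampler and reward, the `slack`-based threshold rule, the retry budget `max_tries`, and the fallback to the best rejected candidate are all hypothetical choices made for the example. In practice the sampler would be an LLM and the reward would come from a reward model.

```python
import random
from typing import Callable, List

Token = str


def segment_rejection_decode(
    sample_segment: Callable[[List[Token], int], List[Token]],
    reward: Callable[[List[Token]], float],
    prompt: List[Token],
    segment_len: int = 4,
    num_segments: int = 5,
    max_tries: int = 8,
    slack: float = 0.0,
) -> List[Token]:
    """Build a response segment by segment, rejecting low-reward segments.

    ASSUMPTION: the threshold rule (reward of the accepted prefix minus
    `slack`) and the fallback to the best rejected candidate are
    illustrative; the source only says the threshold is adaptive.
    """
    response: List[Token] = []
    current = reward(prompt + response)
    for _ in range(num_segments):
        threshold = current - slack  # hypothetical adaptive threshold
        best_seg: List[Token] = []
        best_score = float("-inf")
        for _ in range(max_tries):
            seg = sample_segment(prompt + response, segment_len)
            score = reward(prompt + response + seg)
            if score > best_score:
                best_seg, best_score = seg, score
            if score >= threshold:
                break  # accept: the segment meets the threshold
        # If every try was rejected, fall back to the best candidate seen.
        response = response + best_seg
        current = best_score
    return response


def make_toy_sampler(rng: random.Random) -> Callable[[List[Token], int], List[Token]]:
    """Toy stand-in for an LLM: draws tokens uniformly from a tiny vocabulary."""
    vocab = ["good", "ok", "bad"]
    return lambda prefix, k: [rng.choice(vocab) for _ in range(k)]


def toy_reward(tokens: List[Token]) -> float:
    """Toy stand-in for a reward model: +1 per 'good', -1 per 'bad'."""
    return float(tokens.count("good") - tokens.count("bad"))


if __name__ == "__main__":
    rng = random.Random(0)
    out = segment_rejection_decode(make_toy_sampler(rng), toy_reward, ["hello"])
    print(out, toy_reward(out))
```

Because a rejected segment costs only `segment_len` tokens to resample, a bad continuation is corrected early rather than after a full response has been generated.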

This approach balances efficiency and quality: it matches or exceeds the alignment performance of resource-heavy strategies such as Best-of-N while using significantly fewer LLM calls. In tests across multiple open-source models with 7B to 13B parameters, STARS improved win rates by 20-25 percentage points over standard decoding, and even enabled smaller aligned models to outperform larger, unaligned ones.
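For contrast, the Best-of-N baseline mentioned above generates N complete responses and keeps the one with the highest reward, so a poor response is only discarded after it has been generated in full. The sketch below is a generic illustration of that baseline; `best_of_n`, `toy_reward`, and the stand-in sampler are hypothetical names, and the code makes no attempt to reproduce the reported results.

```python
from typing import Callable, List

Token = str


def best_of_n(
    sample_response: Callable[[List[Token]], List[Token]],
    reward: Callable[[List[Token]], float],
    prompt: List[Token],
    n: int = 8,
) -> List[Token]:
    """Generate n full responses and return the one with the highest reward."""
    candidates = [sample_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(prompt + resp))


def toy_reward(tokens: List[Token]) -> float:
    """Toy stand-in for a reward model: +1 per 'good', -1 per 'bad'."""
    return float(tokens.count("good") - tokens.count("bad"))


if __name__ == "__main__":
    # Stand-in for an LLM: yields three fixed full responses, one per call.
    fixed = iter([["bad", "ok"], ["good", "good"], ["ok", "ok"]])
    print(best_of_n(lambda prompt: next(fixed), toy_reward, ["question"], n=3))
```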

By focusing on inference-time alignment, STARS offers a practical, scalable, and compute-conscious solution for safer AI deployments.

References

2026

ICML-W 2026

STARS: Segment-level Token Alignment via Rejection Sampling in Large Language Models

M. Atif Quamar*, M. Areeb*, M. Kuznetsov, M. Ozgur Ozmen, and Z. Berkay Celik

ICML 2026 - Structured Probabilistic Inference & Generative Modeling Workshop


Aligning large language models (LLMs) with human values is critical for their safe deployment, but existing methods like fine-tuning are computationally expensive, while inference-time approaches like Best-of-N sampling are inefficient. We propose STARS: Segment-level Token Alignment via Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, powerful and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.