Running large language models efficiently in production demands constant optimization, yet static approaches often fall short. Speculative decoding, which speeds up inference by having a small draft model propose tokens for the large target model to verify, frequently underperforms in deployment because its draft model goes stale as live traffic shifts. Together AI aims to solve this with its new open-source framework, Aurora.
Aurora is built on a reinforcement learning (RL) foundation, enabling it to learn directly from live inference traces and continuously update its draft models without interrupting service. This creates a self-improving flywheel for LLM inference optimization.
A Serve-to-Train Flywheel
The system comprises decoupled Inference and Training Servers. The Inference Server uses a speculative decoding engine to generate token proposals, which are then verified by the target model. Accepted and rejected token results are streamed to a data buffer.
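To make the propose-and-verify loop concrete, here is a minimal greedy-verification sketch in Python. The toy `draft_propose` and `target_next_token` functions are illustrative stand-ins, not Aurora's actual models:

```python
import random

def draft_propose(prefix, k=4):
    # Stand-in for the cheap draft model: propose k speculative tokens.
    return [random.randint(0, 99) for _ in range(k)]

def target_next_token(prefix):
    # Stand-in for the expensive target model's greedy next token.
    return (sum(prefix) * 31) % 100

def verify(prefix, proposals):
    """Greedy speculative verification: accept draft tokens while they match
    the target model; on the first mismatch, keep the target's token instead."""
    ctx, accepted, rejected = list(prefix), [], []
    for tok in proposals:
        truth = target_next_token(ctx)
        if tok == truth:
            accepted.append(tok)           # draft guessed right: target call saved
            ctx.append(tok)
        else:
            rejected.append((tok, truth))  # record both sides of the mismatch
            ctx.append(truth)              # fall back to the target's token
            break
    return ctx, accepted, rejected
```

Both the `accepted` list and the `rejected` pairs are exactly the kind of per-step results the article describes being streamed to the data buffer.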
The Training Server asynchronously fetches this data, updates a copy of the draft model, and hot-swaps improved weights back to the inference server. This process avoids the significant costs and complexities of large-scale offline activation collection pipelines.
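A minimal sketch of that serve-to-train flywheel, with a versioned weight store standing in for hot-swapping. All names, the in-process queue, and the scalar "model" are assumptions for illustration, not Aurora's API:

```python
import queue
import threading

class WeightStore:
    """Versioned draft weights shared by both servers; publishing a new
    version is atomic, so inference never sees a half-updated model."""
    def __init__(self, weights):
        self._lock = threading.Lock()
        self.version = 0
        self._weights = weights

    def publish(self, weights):
        with self._lock:
            self.version += 1
            self._weights = weights

    def latest(self):
        with self._lock:
            return self.version, self._weights

trace_buffer = queue.Queue()  # stands in for the streaming data buffer

def serve(store, n_proposals):
    """Inference side: serve with the latest weights, stream the trace."""
    version, w = store.latest()
    accepted = n_proposals * w            # toy acceptance count under weights w
    trace_buffer.put((n_proposals, accepted))
    return version, accepted

def train_once(store, lr=0.5):
    """Training side: consume one trace, nudge weights, hot-swap them back."""
    n_proposals, accepted = trace_buffer.get()
    _, w = store.latest()
    # Toy update: push the per-proposal acceptance rate toward 1.0.
    store.publish(w + lr * (1.0 - accepted / n_proposals))
```

The key property is that `serve` never blocks on training: it only reads the latest published version, so a slow or failed training step cannot stall inference.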
By framing speculative decoding as an RL problem, Aurora aligns training signals directly with real deployment utility. It learns not just from accepted tokens (imitation loss) but also from rejected proposals (discard sampling), using a specialized Tree Attention mechanism to efficiently process the branching token trees produced during speculation.
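As a rough sketch of how such a combined objective might look. The `beta` weighting and the form of the discard term are assumptions for illustration, not Aurora's published loss:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - z for l in logits]

def combined_loss(accepted, rejected, beta=0.5):
    """accepted/rejected: lists of (draft_logits, token_id) pairs.
    The imitation term raises the draft's log-prob of tokens the target
    accepted; the discard term (an assumed -log(1 - p) penalty) lowers
    the probability of tokens the target rejected."""
    loss = 0.0
    for logits, tok in accepted:
        loss -= log_softmax(logits)[tok]          # imitation on accepted tokens
    for logits, tok in rejected:
        p = math.exp(log_softmax(logits)[tok])
        loss -= beta * math.log1p(-p)             # penalty on rejected tokens
    return loss
```

Minimizing this pulls the draft toward proposals the target will accept while actively steering it away from proposals the target has rejected, which is the extra signal a pure imitation loss discards.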
Adapting to Shifting Demands
Experiments show Aurora's ability to adapt to dynamic traffic patterns. When faced with abrupt shifts across different domains like code generation and finance, the system recovers its performance within approximately 10,000 requests.
Across various batch sizes and models like Qwen3-Coder-Next-FP8, Aurora consistently delivered speedups. It achieved an additional 1.25x speedup over a well-trained static speculator, demonstrating that continuous adaptation compounds benefits. In mixed traffic scenarios, online training from scratch with Aurora even surpassed carefully pretrained static baselines.
This approach challenges the conventional wisdom that extensive offline pretraining is mandatory for effective speculative decoding, offering a more dynamic and cost-efficient solution for LLM inference optimization.
