Adaptive Runtime Optimization for Efficient Large Language Model Serving
Date and time: Tuesday, June 16, 2026, 11:00 - 12:00
Venue: IT/BT, Room 813
Abstract
Efficient large language model (LLM) serving is often framed around token throughput, GPU occupancy, and request latency. These metrics describe how much work the server completes and how quickly requests finish. In modern serving paths, the usefulness of a GPU step also depends on runtime state. A speculative decoding step gains value when verified draft tokens become target-valid output. A streaming decoding step gains value when generated output becomes ready before the consumer needs the next output unit. This thesis studies the gap between computed work and objective-relevant progress that appears when those conditions are uncertain.
This thesis asks how an LLM serving runtime can expose the conditions under which work becomes objective-relevant to its scheduler. The central argument is that the runtime should attach objective-specific utility signals to controllable units of GPU work and allocate computation according to those signals. The thesis pairs speculative verification and streaming scheduling because they expose the same runtime allocation question at complementary control granularities. Within a speculative request, the controllable unit is a draft-token prefix whose verification value depends on acceptance likelihood; across streaming requests, the controllable unit is an in-flight request whose scheduling value depends on unit-deadline risk, where a unit deadline is the time by which the next consumable output unit should be ready. This framing gives a common design pattern for different serving objectives: estimate utility from runtime state, map the estimate to a control action, and preserve the correctness or service constraint of the setting.
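As a rough illustration of the token-level case, the sketch below maps per-token acceptance estimates to a control action (how many draft tokens to verify). It is not the thesis's method: the function names, the `min_marginal_gain` threshold, and the assumption that per-position acceptance estimates can be multiplied as if independent are all hypothetical. It relies only on the standard property of speculative decoding that a draft token can be accepted only if every earlier draft token was accepted.

```python
from itertools import accumulate
from operator import mul


def choose_prefix_length(accept_probs, min_marginal_gain):
    """Return how many draft tokens to send to target-model verification.

    accept_probs[i] is a hypothetical estimate of the probability that draft
    token i is accepted, given that all earlier draft tokens were accepted.
    """
    # survival[k-1] estimates the probability that draft tokens 1..k are all
    # accepted, which is also the expected gain from verifying the k-th token.
    survival = list(accumulate(accept_probs, mul))
    length = 0
    for k, gain in enumerate(survival, start=1):
        if gain < min_marginal_gain:
            break  # survival never increases, so later tokens are worth less
        length = k
    return length


def expected_accepted(accept_probs, length):
    """Expected number of accepted tokens when verifying `length` draft tokens."""
    return sum(accumulate(accept_probs[:length], mul))


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.5, 0.9]  # illustrative values, not measured data
    k = choose_prefix_length(probs, min_marginal_gain=0.5)
    print(k, expected_accepted(probs, k))
```

Whatever prefix length is chosen, the selected tokens would still go through target-model verification, so a heuristic of this kind only changes how much verification work is scheduled, not which outputs are valid.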
I develop this pattern at two granularities of LLM serving control. At the token level, Speculative Verification (SV) targets verification work inside batched speculative decoding. SV uses a companion-model signal, derived from the relationship between draft and companion model distributions, to estimate which draft-token prefixes are likely to be accepted by the target model. It reduces verification work that is expected to contribute little to Accepted Tokens, while every emitted token still passes through target-model verification or direct generation, which preserves target-model semantics. At the request level, ProgressServe targets scheduling and admission decisions in streaming serving. It tracks consumable buffers and unit-deadline risk to decide which requests should run, defer, or be admitted, directing GPU work toward requests where generation is more likely to prevent configured unit-deadline misses. Together, these systems show how the same runtime principle applies when the controllable unit is a draft-token prefix and when it is an in-flight request.
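The request-level idea can be sketched in a similar spirit. The code below ranks in-flight streaming requests by a simple slack estimate and runs the most urgent ones first. It is a minimal sketch under stated assumptions, not ProgressServe's actual policy: `StreamState`, its fields, `deadline_slack`, and `pick_batch` are hypothetical names, and the slack formula assumes a consumer that drains one output unit per fixed interval.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    request_id: str
    buffered_units: int        # output units generated but not yet consumed
    consume_interval_s: float  # consumer needs one unit every this many seconds
    est_unit_time_s: float     # estimated time to generate the next unit


def deadline_slack(state):
    """Time until the buffer runs dry, minus the time to produce the next unit.

    Smaller (or negative) slack means higher unit-deadline risk.
    """
    time_until_empty = state.buffered_units * state.consume_interval_s
    return time_until_empty - state.est_unit_time_s


def pick_batch(states, capacity):
    """Run the `capacity` requests with the least slack and defer the rest."""
    ranked = sorted(states, key=deadline_slack)
    run = [s.request_id for s in ranked[:capacity]]
    defer = [s.request_id for s in ranked[capacity:]]
    return run, defer


if __name__ == "__main__":
    # Illustrative values only.
    states = [
        StreamState("a", buffered_units=0, consume_interval_s=0.2, est_unit_time_s=0.05),
        StreamState("b", buffered_units=10, consume_interval_s=0.2, est_unit_time_s=0.05),
        StreamState("c", buffered_units=2, consume_interval_s=0.2, est_unit_time_s=0.05),
    ]
    print(pick_batch(states, capacity=2))
```

Requests with full buffers can be deferred without the consumer noticing, which is what frees GPU steps for requests closer to a miss. Admission control is not covered by this sketch.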
The evaluation reports each system with the metric matched to its control granularity. In public-model speculative decoding experiments under matched draft-target and batch settings, SV improves goodput over standard speculative decoding, reaching an average 1.4x speedup and up to 1.9x speedup at the largest supported batch size. In profile-based streaming experiments under matched model, workload, and arrival settings, ProgressServe supports 1.99x more admitted in-flight requests on average and up to 3.66x more than the vLLM baseline while keeping observed unit-deadline misses below a conservative 0.5% threshold. These results support the thesis that LLM serving efficiency improves when runtimes make objective-specific utility signals visible at the granularity where GPU work is allocated, from draft-token verification that produces target-valid output to request scheduling that keeps consumable units ready before their deadlines.
