vLLM at a glance: over 50% memory saved with beam search, 3–4× throughput gain, and multi-GPU tensor parallelism.

When you use ChatGPT or similar tools, sometimes you:

Ask for multiple answers to choose from, like "Give me 3 variations",

Want the best possible answer out of many options, like "Translate this correctly",

Or ask repeatedly using the same format, like "Summarize this article, then this one, then that one..."

Sometimes, too many people try to use the AI at the same time, and the computer (especially its GPU memory) can't handle everything at once. When thousands of people use ChatGPT simultaneously, the serving system has to process millions of complex, concurrent requests without crashing or slowing down too much.

Here comes our hero - vLLM.

Let us see how vLLM solves this. But first, let's go through some unique concepts in brief:

Using vLLM for Other Types of Responses (Decoding Scenarios)

What does vLLM do?

Think of vLLM as a very smart waiter at a crowded restaurant. Instead of taking each order separately and running back and forth for every little thing, it batches orders together, shares whatever can be shared, and serves everyone efficiently.

[Diagram] vLLM unifies all request types under one smart memory scheduler: parallel sampling (copy-on-write), beam search (shared beam blocks), shared prefixes (cached once, reused), and mixed requests (block-based memory). PagedAttention, copy-on-write, FCFS scheduling, preemption, swap, and recompute together give efficient GPU utilisation, no crashes, and high throughput.

Let us now dive into a more detailed explanation:

Parallel Sampling - Asking for Multiple Responses at Once

You might ask an LLM: "Give me 3 different versions of this tweet."

Normally, this would take 3× memory, since each response has to be generated and stored separately.

But with vLLM, if all 3 responses start from the same input (prompt), it shares the starting part's memory across all samples. It only creates separate copies when the answers begin to change. This is called copy-on-write – memory is copied only when a change happens.

Think of it like: 3 people writing different endings to the same story – you give them one copy of the start, and they continue individually from there.
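The copy-on-write idea can be sketched with a toy block manager in Python. All names here (`Block`, `BlockManager`, `copies_made`) are invented for this illustration and are not vLLM's actual internals:

```python
class Block:
    """A KV-cache block holding some token data."""
    def __init__(self, data):
        self.data = list(data)
        self.ref_count = 0

class BlockManager:
    def __init__(self):
        self.copies_made = 0

    def share(self, block):
        # Multiple samples reference the same physical block.
        block.ref_count += 1
        return block

    def write(self, block, token):
        # Copy-on-write: duplicate the block only if it is still shared.
        if block.ref_count > 1:
            block.ref_count -= 1
            new_block = Block(block.data)
            new_block.ref_count = 1
            self.copies_made += 1
            block = new_block
        block.data.append(token)
        return block

mgr = BlockManager()
prompt = Block([1, 2, 3])                        # KV block for the shared prompt
samples = [mgr.share(prompt) for _ in range(3)]  # 3 samples, one physical copy

# Each sample writes a different next token; copies happen lazily.
samples = [mgr.write(b, t) for b, t in zip(samples, [10, 20, 30])]
print(mgr.copies_made)  # 2
```

Only two copies are made: the third sample is the last remaining holder of the original block, so it can safely write in place.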

[Diagram] Beam search in vLLM with copy-on-write: beams share the read-only prompt KV cache, memory is copied only on divergence, and low-score beams have their memory freed immediately, saving over 50%.

Suppose you want not just different replies, but the best possible one. Beam search tries many options at each step and keeps only the top few (like a tournament). Normally each "beam" takes its own memory, but vLLM solves this problem.

vLLM's advantage: beams share the KV-cache blocks for their common prefixes, and when a low-scoring beam is pruned, its private blocks are freed immediately, saving over 50% of memory.

You can think of it like following several storylines at once and keeping only the best ones. Remember the possibilities Dr. Strange saw in Endgame, out of which only one worked? Bam!! It is similar to that.
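The pruning part can be sketched in pure Python with made-up token probabilities (this is an illustration of beam search itself, not vLLM's real code). The moment a candidate falls out of the top few is exactly when vLLM frees that beam's private KV blocks:

```python
import heapq
import math

def beam_search(step_logprobs, beam_width=2):
    """Toy beam search: keep the top `beam_width` candidates per step."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for logprobs in step_logprobs:
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in logprobs.items()
        ]
        # Keep only the best beams; in vLLM the pruned candidates'
        # non-shared KV blocks would be freed right here.
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[1])
    return beams

steps = [
    {"the": math.log(0.6), "a": math.log(0.4)},
    {"cat": math.log(0.7), "dog": math.log(0.3)},
]
best = beam_search(steps, beam_width=2)
print(best[0][0])  # ['the', 'cat']
```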

[Diagram] Parallel sampling with copy-on-write: all three outputs share the prompt's KV cache, which is cached once; each output gets its own KV blocks only from the point of divergence.

Shared Prompt Prefix – Using a Common Starting Template

In many AI tasks, the beginning of every prompt is the same. For example:

"Translate English to French: 'apple' => 'pomme'"

Why compute the same thing over and over? vLLM allows this prefix to be computed and cached once, then reused by every request that starts with it; only each request's unique continuation needs fresh memory.
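A hypothetical prefix cache illustrates the idea: identical prefixes map to the same cached blocks, so the expensive KV computation happens only once. The class and field names below are invented for this sketch:

```python
class PrefixCache:
    def __init__(self):
        self.cache = {}          # prefix tokens -> cached block id
        self.next_block = 0
        self.compute_calls = 0   # how many times KV was actually computed

    def get_blocks(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key not in self.cache:
            self.compute_calls += 1       # computed only once per unique prefix
            self.cache[key] = self.next_block
            self.next_block += 1
        return self.cache[key]

cache = PrefixCache()
template = ["Translate", "English", "to", "French", ":"]
a = cache.get_blocks(template)
b = cache.get_blocks(template)   # second request reuses the cached blocks
print(a == b, cache.compute_calls)  # True 1
```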

Mixing Different Requests

Now, let us combine all of these together.

People can send all types of requests—some simple, some complex. Older systems can't batch them well because of the differences.

vLLM solves this by abstracting memory management: to the scheduler, every request, simple or complex, is just a set of memory blocks, so different request types can be batched together.

Key insight: vLLM's block-based memory abstraction means the underlying model doesn't need to know or care whether memory is shared or private — the scheduler handles all of it transparently.
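The block-table indirection behind this can be sketched as follows (illustrative names only, not vLLM's data structures): each sequence sees contiguous logical blocks, while the physical blocks behind them may be shared or private:

```python
# Physical KV blocks, identified by id.
physical = {0: "prompt-KV", 1: "seq-A-KV", 2: "seq-B-KV"}

# Per-sequence block tables: logical index -> physical block id.
block_tables = {
    "seq_A": [0, 1],   # logical block 0 maps to the shared prompt block
    "seq_B": [0, 2],   # same shared prompt block, private continuation
}

def read(seq, logical_block):
    # The model only uses logical indices; sharing is invisible to it.
    return physical[block_tables[seq][logical_block]]

print(read("seq_A", 0) == read("seq_B", 0))  # True: shared prompt block
print(read("seq_A", 1), read("seq_B", 1))
```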

What If Memory Runs Out?

When GPU memory fills up, vLLM chooses which requests to pause or temporarily evict from GPU memory, using the strategies described below.

Special features: vLLM handles grouped requests like beam search together, so they are scheduled or paused as a group, and the swap area size is limited, so it never overwhelms the CPU.

vLLM does the following:

  1. First-Come, First-Served - Like a queue in a bakery, the first person is served first, so it's fair.
  2. Preemption (Temporary Pausing) - If new requests keep coming in and there's no space, vLLM temporarily pauses the newest ones to focus on the older ones.
  3. Two Smart Tricks to Save Work - If it had to stop working on a request, it has two options:
     1. Swapping: Move the work to a slower room (CPU memory) and come back to it later.
     2. Recomputation: Instead of storing everything, it just redoes the work quickly when it's needed again.

Think of it like putting things in a freezer (swap) or just cooking again from a recipe (recompute), depending on what's faster and more efficient at the time.
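The FCFS-plus-preemption behaviour can be sketched with a toy scheduler (greatly simplified and hypothetical; real vLLM tracks blocks, groups, and swap space in far more detail). When free GPU blocks run out, the most recently admitted request is preempted and parked for swap or recompute:

```python
from collections import deque

class Scheduler:
    def __init__(self, gpu_blocks):
        self.free = gpu_blocks
        self.running = []        # (name, blocks) in FCFS admission order
        self.swapped = deque()   # preempted requests awaiting swap-in/recompute

    def admit(self, name, blocks_needed):
        # Preempt newest requests until there is room (older ones keep running).
        while self.free < blocks_needed and self.running:
            victim, victim_blocks = self.running.pop()
            self.free += victim_blocks
            self.swapped.append(victim)   # swap out, or mark for recompute
        if self.free >= blocks_needed:
            self.free -= blocks_needed
            self.running.append((name, blocks_needed))

sched = Scheduler(gpu_blocks=4)
sched.admit("req1", 2)
sched.admit("req2", 2)
sched.admit("req3", 2)   # no space left: req2 (the newest) is preempted
print([r for r, _ in sched.running])  # ['req1', 'req3']
print(list(sched.swapped))            # ['req2']
```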

How vLLM Works Across Multiple GPUs (Distributed)

Big models (like GPT-3 or LLaMA-65B) don't fit into one GPU, so multiple GPUs work as a team. vLLM's design splits the model across GPUs while keeping memory management centralized.

Memory management in vLLM is centralized: one scheduler keeps track of all memory across GPUs. Each GPU receives the input tokens and the memory block map, runs its part of the model, and shares results using fast GPU communication (all-reduce).

GPUs don't need to coordinate memory themselves. They just follow the instructions given by the central brain (scheduler).
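This split of responsibilities might be sketched like so, with plain Python functions standing in for the scheduler, the model shards, and all-reduce (purely illustrative; real vLLM shards the model weights and uses NCCL for communication):

```python
def scheduler_step(requests, blocks_per_req=1):
    # Only the central scheduler assigns memory blocks to requests.
    block_map, next_block = {}, 0
    for req in requests:
        block_map[req] = list(range(next_block, next_block + blocks_per_req))
        next_block += blocks_per_req
    return block_map

def worker_step(shard_output, block_map):
    # Each worker runs its model shard over the same block map and
    # produces a partial result per request; no per-GPU coordination.
    return {req: shard_output for req in block_map}

def all_reduce(partials):
    # Sum partial results from all workers (stand-in for GPU all-reduce).
    return {req: sum(p[req] for p in partials) for req in partials[0]}

block_map = scheduler_step(["req1", "req2"])
partials = [worker_step(out, block_map) for out in (1.0, 2.0)]  # two workers
print(all_reduce(partials))  # {'req1': 3.0, 'req2': 3.0}
```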

Deploying LLMs at scale?

Our team at AIMLverse Lab helps founders and engineering leaders optimize LLM inference costs and throughput. Let's talk.

Contact Us — contact@aimlverse.com
#vLLM #LLMInference #KVCache #BeamSearch #ParallelSampling #GPUMemory #AIInfrastructure #PagedAttention #MultiGPU #CopyOnWrite