Inference & Serving
LLM inference engines, serving frameworks, and optimization tools
1.
KV Cache & Inference Dynamics
WIP
KV cache memory math, PagedAttention, prefill spikes, and eviction policies
2.
Batching & Disaggregated Inference
WIP
The evolution of LLM serving: Continuous Batching (Orca), Chunked Prefill (Sarathi), and Disaggregated Inference (DistServe)
3.
Hardware & Roofline Model
WIP
First principles of LLM inference arithmetic, memory-bandwidth-bound vs compute-bound workloads, and the Roofline model.
4.
Context Scaling & Flash Attention
WIP
Context-length VRAM curves, Flash Attention benefits, and RoPE scaling side effects
5.
Quantization & Offloading
WIP
AWQ vs GPTQ tradeoffs, INT8/INT4 quantization, activation offloading, and mixed-precision decoding
6.
Speculative Decoding
WIP
Accelerating LLM inference using draft models, rejection sampling math, and state-of-the-art methods like EAGLE-3.
7.
llama.cpp
WIP
Port of LLaMA in C/C++ for CPU and mixed CPU/GPU inference
8.
Ollama
WIP
Get up and running with large language models locally
9.
vLLM Anatomy & Production Systems
WIP
Architecture of vLLM, the industry-standard LLM serving engine. Required reading for ML systems interviews.
10.
SGLang & Serving Framework Comparisons
WIP
Structured Generation Language (SGLang), RadixAttention, and comparing vLLM vs TGI vs SGLang.
11.
TensorRT-LLM & Kernel-Level Optimizations
WIP
NVIDIA's framework for high-performance LLM inference, plus kernel fusion and FlashInfer.
12.
AirLLM
WIP
Run huge LLMs on a single consumer GPU
13.
KTransformers
WIP
Flexible framework for running massive LLMs locally via CPU/GPU heterogeneous computing
14.
Interview Prep: Systems Design & Questions
WIP
Cheat sheet for LLM serving interview questions and the 10K RPS system design prompt.