Inference & Serving

1.

KV Cache & Inference Dynamics WIP

KV cache memory math, PagedAttention, prefill spikes, and eviction policies

2.

Batching & Disaggregated Inference WIP

The evolution of LLM serving: Continuous Batching (Orca), Chunked Prefill (Sarathi), and Disaggregated Inference (DistServe)

3.

Hardware & Roofline Model WIP

First principles of LLM inference arithmetic, memory-bandwidth-bound vs compute-bound workloads, and the Roofline model.

4.

Context Scaling & Flash Attention WIP

Context-length VRAM curves, Flash Attention benefits, and RoPE scaling side effects

5.

Quantization & Offloading WIP

AWQ vs GPTQ tradeoffs, INT8/INT4 quantization, activation offloading, and mixed-precision decoding

6.

Speculative Decoding WIP

Accelerating LLM inference using draft models, rejection sampling math, and state-of-the-art methods like EAGLE-3.

7.

llama.cpp WIP

Port of LLaMA in C/C++ for CPU and mixed CPU/GPU inference

8.

Ollama WIP

Get up and running with large language models locally

9.

vLLM Anatomy & Production Systems WIP

Architecture of vLLM, the industry-standard LLM serving engine. Required reading for ML systems interviews.

10.

SGLang & Serving Framework Comparisons WIP

Structured Generation Language (SGLang), RadixAttention, and comparing vLLM vs TGI vs SGLang.

11.

TensorRT-LLM & Kernel-Level Optimizations WIP

NVIDIA's framework for high-performance LLM inference, kernel fusion, and FlashInfer.

12.

AirLLM WIP

Run huge LLMs on a single consumer GPU

13.

KTransformers WIP

Flexible framework for running massive LLMs locally via CPU/GPU heterogeneous computing

14.

Interview Prep: Systems Design & Questions WIP

Cheat sheet for LLM serving interview questions and the 10K RPS system design prompt.