AirLLM

Run huge LLMs on a single consumer GPU

Overview

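AirLLM is a tool for running very large models (for example, 70B-parameter LLMs) on a single consumer GPU with limited VRAM (e.g., 8GB). It does this through layer-wise execution and swapping: layers are loaded and run one at a time, so the full model never has to fit in GPU memory at once.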
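To make the idea of layer-wise execution and swapping concrete, here is a minimal sketch in plain Python. It is not AirLLM's code or API: the toy "model" is a stack of small matrices, and every name in it (`save_layers`, `layerwise_forward`, `weights_gb`, and so on) is hypothetical. It only illustrates the general pattern of keeping each layer in its own file and holding one layer in memory at a time. The `weights_gb` helper shows why this matters: 70B parameters at 2 bytes each come to about 140 GB of weights, far more than 8GB of VRAM.

```python
import os
import pickle
import random
import tempfile


def weights_gb(num_params, bytes_per_param=2):
    """Size of the weights alone, in GB (10**9 bytes); 2 bytes assumes 16-bit precision."""
    return num_params * bytes_per_param / 1e9


def make_toy_model(num_layers=4, dim=8, seed=0):
    """Build a toy model: a list of dim x dim weight matrices (hypothetical stand-in for an LLM)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        for _ in range(num_layers)
    ]


def save_layers(layers, directory):
    """Write each layer to its own file, standing in for per-layer checkpoint shards."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def full_forward(layers, x):
    """Baseline: every layer is resident in memory for the whole pass."""
    for weights in layers:
        x = apply_layer(weights, x)
    return x


def layerwise_forward(paths, x):
    """Layer-wise pass: load one layer, apply it, drop it, then move to the next."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)  # swap this layer in
        x = apply_layer(weights, x)
        del weights  # swap it out before loading the next layer
    return x


if __name__ == "__main__":
    layers = make_toy_model()
    x = [1.0] * 8
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        out = layerwise_forward(paths, x)
    # Both passes apply the same layers in the same order.
    print(out == full_forward(layers, x))
    print(weights_gb(70e9))
```

The trade-off in this pattern is speed for memory: peak usage is bounded by the largest single layer plus activations, but every forward pass pays the cost of reading all layers from storage again.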

Mechanism

TODO: Add content on how AirLLM achieves low VRAM inference.

Usage

TODO: Add content on usage.