llama.cpp

WIP nlp-llms deployment-serving gguf inference llama-cpp llm nlp serving

Port of LLaMA in C/C++ for CPU and mixed CPU/GPU inference

Overview

llama.cpp runs large language models on consumer hardware without requiring a high-end GPU. It focuses on CPU inference, Apple Silicon optimization, and mixed CPU/GPU execution, and it loads models stored in the GGUF format.
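As a rough illustration of what a GGUF file looks like on disk, the sketch below reads only the fixed-size header at the start of a file. It assumes the GGUF version 2 or later layout (a 4-byte `GGUF` magic, a little-endian 32-bit version, then 64-bit tensor and metadata key/value counts). The names `GGUFHeader` and `read_gguf_header` are made up for this note and are not part of llama.cpp.

```python
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


class GGUFHeader(NamedTuple):
    """Hypothetical container for the fixed-size GGUF header fields."""
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed header from a binary stream positioned at offset 0.

    Assumption: GGUF version >= 2, where both counts are little-endian
    unsigned 64-bit integers. Version 1 used 32-bit counts and is rejected.
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic = {magic!r})")

    raw_version = stream.read(4)
    if len(raw_version) != 4:
        raise ValueError("truncated GGUF header (version)")
    (version,) = struct.unpack("<I", raw_version)
    if version < 2:
        raise ValueError(f"unsupported GGUF version {version}")

    raw_counts = stream.read(16)
    if len(raw_counts) != 16:
        raise ValueError("truncated GGUF header (counts)")
    tensor_count, metadata_kv_count = struct.unpack("<QQ", raw_counts)

    return GGUFHeader(version, tensor_count, metadata_kv_count)
```

To inspect a real model, open it in binary mode and pass the file object, for example `with open("model.gguf", "rb") as f: print(read_gguf_header(f))`, where `model.gguf` is a placeholder path. The metadata key/value pairs and tensor descriptions that follow the header are not parsed here.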

TODO: Add details on GGUF and usage.