TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Running Llama 2 on CPU Inference Locally for Document Q&A

Kenneth Leung · Published in TDS Archive · 11 min read · Jul 18, 2023
Photo by NOAA on Unsplash

Third-party commercial large language model (LLM) providers like OpenAI’s GPT-4 have democratized LLM use via simple API calls. However, teams may still require self-managed or private deployments for model inference within enterprise perimeters, typically for reasons of data privacy and regulatory compliance.

Fortunately, the proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.

When we host open-source models on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, the costs can easily spiral out of control.

In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs on CPU for local inference and retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the highly performant Llama 2 chat model in this project.
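To make the retrieval-augmented generation idea concrete before diving in, here is a toy sketch of the flow in plain Python. The keyword-overlap retriever, prompt template, and function names are illustrative stand-ins of my own, not the article's actual pipeline, which would use embeddings, a vector store, and the quantized Llama 2 model:

```python
# Toy sketch of the retrieval-augmented generation (RAG) flow for document Q&A.
# A real pipeline would embed documents, search a vector store, and pass the
# prompt to an LLM; here a simple word-overlap retriever stands in for search.

def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, context_docs):
    """Assemble the prompt the LLM would receive: retrieved context plus the question."""
    context = "\n".join(context_docs)
    return (
        "Use the following context to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Llama 2 is an open-source large language model released by Meta.",
    "Quantization reduces model size by lowering numeric precision.",
]
query = "What is Llama 2?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

The key design point this illustrates: the LLM never sees the whole corpus, only the question plus the few passages the retriever judged most relevant.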

Contents

(1) Quick Primer on Quantization
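As a taste of what the quantization primer covers, the sketch below shows symmetric 8-bit quantization in plain Python. It is illustrative only; the function names are my own, and real LLM quantization schemes (such as the GGML formats used for CPU inference) are considerably more involved, but the core precision trade-off is the same:

```python
# Illustrative 8-bit quantization: map float weights to int8 codes and back.
# Storing one byte per weight instead of 2-4 bytes is what shrinks the model
# enough to fit in RAM for CPU inference, at the cost of small rounding error.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Note that the round trip is lossy: each restored weight can differ from the original by up to half the scale factor, which is the rounding error the quantized model must tolerate.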




Written by Kenneth Leung

Senior Data Scientist at Boston Consulting Group | Top Tech Author | 2M+ reads on Medium | linkedin.com/in/kennethleungty | github.com/kennethleungty
