TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Running Llama 2 on CPU Inference Locally for Document Q&A

Kenneth Leung · Published in TDS Archive · 11 min read · Jul 18, 2023
Photo by NOAA on Unsplash

Third-party commercial large language model (LLM) providers like OpenAI’s GPT-4 have democratized LLM use via simple API calls. However, teams may still require self-managed or private deployments for model inference within enterprise perimeters, typically for reasons of data privacy and regulatory compliance.

Fortunately, the proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on these third-party providers.

When we host open-source models on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, the costs can easily spiral out of control.

In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs on CPU for local inference and retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the highly performant Llama 2 chat model in this project.
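To make the retrieval-augmented generation idea concrete before diving in, here is a toy sketch of the flow in plain Python. The keyword-overlap retriever, prompt template, and function names are illustrative stand-ins of my own, not the article's actual pipeline, which would use embeddings, a vector store, and the quantized Llama 2 model:

```python
# Toy sketch of the retrieval-augmented generation (RAG) flow for document Q&A.
# A real pipeline would embed documents, search a vector store, and pass the
# prompt to an LLM; here a simple word-overlap retriever stands in for search.

def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, context_docs):
    """Assemble the prompt the LLM would receive: retrieved context plus the question."""
    context = "\n".join(context_docs)
    return (
        "Use the following context to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Llama 2 is an open-source large language model released by Meta.",
    "Quantization reduces model size by lowering numeric precision.",
]
query = "What is Llama 2?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

The key design point this illustrates: the LLM never sees the whole corpus, only the question plus the few passages the retriever judged most relevant.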

Contents

(1) Quick Primer on Quantization
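As a taste of what the quantization primer covers, the sketch below shows symmetric 8-bit quantization in plain Python. It is illustrative only; the function names are my own, and real LLM quantization schemes (such as the GGML formats used for CPU inference) are considerably more involved, but the core precision trade-off is the same:

```python
# Illustrative 8-bit quantization: map float weights to int8 codes and back.
# Storing one byte per weight instead of 2-4 bytes is what shrinks the model
# enough to fit in RAM for CPU inference, at the cost of small rounding error.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Note that the round trip is lossy: each restored weight can differ from the original by up to half the scale factor, which is the rounding error the quantized model must tolerate.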




Written by Kenneth Leung

Senior Data Scientist at Boston Consulting Group | Top Tech Author | 2M+ reads on Medium | linkedin.com/in/kennethleungty | github.com/kennethleungty
