Amrits Blog

🤖 Machine Learning | Analytics | Tech 👨‍💻


Retrieval-Augmented Generation Run Locally With Llama 2


Researchers introduced the technique of Retrieval-Augmented Generation (RAG) in the 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".

Before we jump into the code, let's take a moment to answer the following questions:

  • Why do we use RAG?
  • Why would we run models locally?
  • How can I access Llama 2?
  • I don't have a GPU, what can I do?

Why do we use RAG?

Picture this: you want to ask an LLM questions about your extensive financial report, such as "What were the earnings for Q1?" or "Why has revenue fallen over the past 4 months?"

Well, you've got a problem: your financial report was not included in the model's initial training data. Retraining the model to include up-to-date information can be extremely costly, and having to retrain every quarter quickly makes this an impractical approach.

Generally speaking, these models do not have access to the outside world, nor can they readily keep up to date with new information.

The solution is to use RAG. By allowing the model to dynamically access external repositories of information, you can provide recent context, which lets the model answer questions with greater reliability and reduces the risk of hallucinations.
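As a rough illustration of the idea, here is a minimal sketch in plain Python. It is not the pipeline built later in this post: the `retrieve` and `build_prompt` helpers, the sample report lines and the word-overlap scoring are all made up for illustration, standing in for a real embedding search and a real LLM call.

```python
# Toy illustration of retrieval-augmented prompting (all names are hypothetical).
import re


def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


# Invented report lines, used only to have something to search
report = [
    "Q1 earnings were 2.1 million.",
    "Headcount grew by ten people in March.",
    "The office moved to a new building.",
]

question = "What were the earnings for Q1?"
top_docs = retrieve(question, report, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The model never needs retraining here: only the text placed in the prompt changes when the report changes.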

Why do we run models locally?

The biggest reason is privacy.

When working with clients, especially large ones, privacy becomes a massive point of contention. Sending sensitive data to other companies via an API is not something compliance would sign off on, so being able to keep the data and models in-house is a priority.

There are some other added benefits too, such as:

  • Reduced long term costs
  • Greater control and customisation
  • Lower latency as data no longer needs to be transmitted

How can I access Llama 2?

Head over to this form here. You'll shortly receive emails from Meta with a unique link to download the weights.

I don't have a GPU, what can I do?

The best option for individuals who can't access a GPU is to use Google Colab. You'll be able to use a T4 GPU absolutely free.

The source code for this project can be found here.

Now, without further ado, let's jump into the code.

Section 1: Dependencies

Here is the list of dependencies used in this project, with a link to learn more:


It is recommended that you use a virtual environment for this project; you can find out how to do that here.

To install the dependencies, run the following in your terminal:

pip install -r requirements.txt

or run the following in a code cell in your Jupyter Notebook:

!pip install -r requirements.txt

Section 2: Building the Embedding Pipeline

Consider embeddings to be the coordinates of words in an n-dimensional space. Words which are semantically similar, such as 'light, happy, fun', will cluster together, while words with the opposite meaning, such as 'dark, unhappy, boring', may be placed at the opposite end of a spectrum.
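To make the "coordinates" idea concrete, here is a small sketch with hand-picked 2-D points. The coordinates are invented for illustration (real embeddings are learned and have hundreds of dimensions), and `nearest` is a hypothetical helper; the point is only that nearby vectors stand for related words.

```python
import math

# Hand-picked 2-D "embeddings" (hypothetical values, not from a real model)
toy_embeddings = {
    "happy":   (0.9, 0.8),
    "fun":     (0.8, 0.9),
    "unhappy": (-0.9, -0.7),
    "boring":  (-0.8, -0.9),
}


def nearest(word, embeddings):
    """Return the other word whose point is closest (Euclidean distance)."""
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: math.dist(embeddings[word], embeddings[w]))


for word in toy_embeddings:
    print(f"{word} -> {nearest(word, toy_embeddings)}")
```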

To illustrate this we can use the following image by researchers in the paper "Cross-domain sentiment-aware word embeddings for review sentiment analysis".

[Figure: An example]

In this case we will be using the all-MiniLM-L6-v2 model, which maps sentences to a 384-dimensional vector space. You can access the link and try using the Inference API to compare sentence similarities.

Let's load the model onto our GPU and embed two example sentences. Each sentence will be converted into a list whose length equals the embedding dimension, which as stated above is 384.

See for yourself when we implement this.


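# Load the embedding model onto the GPU if one is available
# (assumption: LangChain's HuggingFaceEmbeddings wrapper, inferred from embed_documents)
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
embed_model = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs={'device': device}
)

# Let's now use the model to embed two sentences
docs = [
    "this is one document",
    "and another document"
]

# Embeddings will be a list with one nested list of 384 values per sentence
embeddings = embed_model.embed_documents(docs)
# Extract the number of dimensions per sentence
number_of_dimensions = len(embeddings[0])

print(f"We have {len(embeddings)} embeddings, each with {number_of_dimensions} dimensions.")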


We have 2 embeddings, each with 384 dimensions.

Section 3: Building the Vector Index

In order for the model to successfully retrieve our information we will need to store our embeddings in a vector database. To do this it is recommended that you use Pinecone's free tier.

It's good practice to store your API key in a .env file in the following format:
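A minimal sketch of such a file, and of reading it with the standard library, is shown below. The name `PINECONE_API_KEY`, the placeholder values and the `parse_env` helper are assumptions for illustration; `HF_AUTH_TOKEN` is the name read later in this post. In practice a package such as python-dotenv does the same job.

```python
import os

# Example .env contents (placeholder values; PINECONE_API_KEY is an assumed name)
example_env = """
# comments and blank lines are ignored
PINECONE_API_KEY=your-pinecone-key
HF_AUTH_TOKEN=your-hugging-face-token
"""


def parse_env(text):
    """Parse NAME=value lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        values[name.strip()] = value.strip()
    return values


settings = parse_env(example_env)
os.environ.update(settings)  # make the values visible to os.environ.get
print(sorted(settings))
```

Keep the real .env file out of version control so the keys are never committed.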


First we will instantiate the Pinecone class with your API key and then create a new index for this project.

Creating an index requires four values:

  • index_name: Any arbitrary name that you can identify this index with
  • dimension: The number of dimensions the model creates for each sentence; in our case this value will be 384
  • metric: This represents the method of measurement used to calculate the distance between vectors in the database. We will use cosine which is typically used in text analysis. You can learn more here.
  • spec: We will use the default specs associated with a free account.
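For intuition about the cosine metric, here is a small pure-Python sketch. Pinecone computes this internally; the helper name `cosine_similarity` and the sample vectors are just for illustration. Cosine similarity compares the direction of two vectors and ignores their length.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
print(cosine_similarity(v, [2.0, 4.0, 6.0]))      # same direction, different length
print(cosine_similarity(v, [-1.0, -2.0, -3.0]))   # opposite direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular
```

Values close to 1 mean the vectors point the same way, values close to -1 mean they point in opposite directions, and 0 means they are unrelated in direction.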

Let's create our index and connect to it


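import os
from pinecone import Pinecone, PodSpec

# Instantiate the client (assumes the key is stored as PINECONE_API_KEY)
pinecone = Pinecone(api_key=os.environ.get('PINECONE_API_KEY'))
# Create the index
index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes().names():
    pinecone.create_index(
        name=index_name, dimension=384, metric='cosine',
        spec=PodSpec(environment='gcp-starter')  # assumed free-tier spec
    )
# Check if the index is ready to use
if pinecone.describe_index(index_name).status['ready']:
    print("Ready to go!")
    # Connect to the index
    index = pinecone.Index(index_name)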


Ready to go!
{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Section 4: Load & Embed the Dataset

We will be using the jamescalam/llama-2-arxiv-papers-chunked dataset which contains excerpts from the llama-2 paper split into chunks.

Once we obtain the data via the Hugging Face datasets library we will convert it into a pandas DataFrame, then extract the IDs, embeddings and metadata to be sent to Pinecone in batches.
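The batching pattern used for the upload can be sketched on its own. The helper name `batch_ranges` is hypothetical; it only shows how `min` keeps the last batch from running past the end of the data.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        end = min(n_rows, start + batch_size)  # clamp the final batch
        yield start, end


print(list(batch_ranges(10, 4)))  # -> [(0, 4), (4, 8), (8, 10)]
```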


from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data = data.to_pandas()

batch_size = 32  # assumed value; tune to your memory and rate limits

# Iterate through each batch in the data
for i in range(0, len(data), batch_size):
    # Calculate the final index for each batch avoiding an index error for the final batch
    i_end = min(len(data), i + batch_size)
    # Extract the current batch
    batch = data.iloc[i:i_end]
    # Create a unique ID from doi + chunk_id
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    # Extract Text data and create embeddings
    texts = [row['chunk'] for _, row in batch.iterrows()]
    embeddings = embed_model.embed_documents(texts)
    # Generate Meta Data
    metadata = [
        {
            'text': row['chunk'],
            'source': row['source'],
            'title': row['title']
        } for _, row in batch.iterrows()
    ]
    # Upload to Pinecone
    index.upsert(vectors=list(zip(ids, embeddings, metadata)))

# Inspect the index after the upload
index.describe_index_stats()



{'dimension': 384,
 'index_fullness': 0.04448,
 'namespaces': {'': {'vector_count': 4448}},
 'total_vector_count': 4448}

We can see our total_vector_count went from 0 to 4,448.

Section 5: Initialize the Large Language Model

Now that our vector index has been set up, the next step is to load our LLM along with its tokenizer.

We will also be using the bitsandbytes library to quantize the model to work with less GPU memory.

Quantization is the process of reducing the precision of tensors to lower memory requirements and speed up inference. This comes at the cost of some accuracy, but for our use case this is okay.

Below I've demonstrated what happens to a tensor value after quantization.

Weight Value Before Quantization | Weight Value After Quantization
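As a rough illustration of the effect, the sketch below rounds a few weights onto a small set of evenly spaced 4-bit levels using simple absmax scaling. This is a toy under stated assumptions, not the NF4 scheme that bitsandbytes applies; the weight values and the helper names `quantize_absmax` and `dequantize` are invented.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    max_level = 2 ** (bits - 1) - 1          # 7 for 4 bits
    scale = max(abs(w) for w in weights) / max_level
    return [round(w / scale) for w in weights], scale


def dequantize(levels, scale):
    """Recover approximate floats from the integer levels."""
    return [q * scale for q in levels]


weights = [0.7, -0.33, 0.1, 0.02]  # hypothetical weight values
levels, scale = quantize_absmax(weights)
restored = dequantize(levels, scale)

for w, q, r in zip(weights, levels, restored):
    print(f"before: {w:+.4f}  level: {q:+d}  after: {r:+.4f}")
```

Small weights collapse onto the nearest level, which is where the loss of accuracy comes from.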

First, we will set up the quantization configuration from bitsandbytes with the following settings:

  • load_in_4bit=True : This will load the weights in 4-bit precision
  • bnb_4bit_quant_type=nf4 : This will use NF4, the 4-bit NormalFloat data type
  • bnb_4bit_use_double_quant=True : This will allow us to use nested quantization where the quantization constants are quantized again
  • bnb_4bit_compute_dtype=bfloat16 : This will set computations to use the bfloat16 data type
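A quick back-of-the-envelope calculation shows why 4-bit loading matters here. It counts weight storage only (activations, the KV cache and quantization constants add overhead), and the 13 billion parameter figure comes from the Llama-2-13b-chat-hf model used in this post. The helper `weight_memory_gb` is a hypothetical name.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # Llama-2-13b, taken from the model name

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the float16 weights alone would not fit on a 16 GB T4, while the 4-bit version leaves room to spare.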

The model configuration will simply include the name of the model and your Hugging Face access token.

Let's use these configurations to load our model and set it to evaluation mode.


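import os
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

# Set quantization configuration
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Load Hugging Face Token
hugging_face_token = os.environ.get('HF_AUTH_TOKEN')
# Set model configuration
# (older transformers releases take use_auth_token instead of token)
model_config = transformers.AutoConfig.from_pretrained(
    model_id, token=hugging_face_token
)

# Load model with quantization and model configurations
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    quantization_config=quantization_config,
    device_map='auto',  # assumption: let accelerate place the layers on the GPU
    token=hugging_face_token
)

# Set model to evaluation mode
model.eval()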

Now that the model is loaded, let's load its tokenizer and construct the pipeline. We will also test to see if our implementation is working.

# Load the Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id, token=hugging_face_token
)

# Construct the Pipeline
from langchain.llms import HuggingFacePipeline

generate_text = transformers.pipeline(
    task='text-generation',
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    temperature=0.01,  # 'randomness' of outputs; values close to 0 are near-deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

llm = HuggingFacePipeline(pipeline=generate_text)
llm(prompt="Explain to me the difference between nuclear fission and fusion.")
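To see what the `temperature` setting does, here is a small sketch of a temperature-scaled softmax in plain Python. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper, not part of transformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches 0, almost all of the probability moves to the highest-scoring token, so the output becomes close to deterministic.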

Section 6: Implementing RAG

Now that we have our model loaded let's allow our model to access the Vector Index.

We will create a Pinecone object and connect it to a LangChain pipeline so that querying is easier.

from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Metadata field that holds the chunk text
text_field = 'text'

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)

# Use our RAG pipeline
rag_pipeline('what is so special about llama 2?')

There we go! You are now able to provide large language models with additional context without the hassle of retraining.