Running Llama 3 8B using NVIDIA NGC Containers in the Cloud

Created 2024-05-05


by Georg R. Pollak

In this tutorial, we’ll explore how to deploy and run the Llama 3 8B model using NVIDIA NGC containers in the cloud. We’ll cover setting up a virtual machine, configuring Docker, and running a Jupyter Notebook in your environment.

Setting Up a Virtual Machine on Paperspace

To get started, sign up or log into your Paperspace account. Once logged in:

  1. Navigate to the Core section and click on VMs.
  2. Click on Create and select your preferred machine type.
  3. The cheapest machine for testing is an M4000 at around $0.40 per hour.
    Sidenote: a simple prompt takes around 10 minutes on an M4000.
  4. Choose a public template that includes Ubuntu as the operating system.
  5. Configure your storage, RAM, and network settings as needed and then launch the VM.
    Sidenote: 50 GB of storage may not be enough.

Installing Docker and Configuring User Permissions

Once your VM is up and running, connect to it via SSH. Install Docker and add your user to the Docker group:

sudo apt-get update
sudo apt-get install docker.io
sudo usermod -aG docker $USER
newgrp docker

Log out and log back in to ensure the changes take effect.
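
To confirm that Docker now works without sudo, you can run a quick sanity check (hello-world is Docker's standard test image):

docker --version
docker run --rm hello-world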

Installing and Configuring NGC CLI

The NGC CLI tool facilitates interaction with NVIDIA's cloud resources. Follow these steps to install and configure it:

AMD64 Linux Install

The NGC CLI binary for Linux is supported on Ubuntu 16.04 and later distributions.

  1. Click "Download CLI" to download the zip file that contains the binary, transfer the zip file to a directory where you have permissions, and then unzip and execute the binary. Alternatively, you can download, unzip, and install from the command line: move to a directory where you have execute permissions and run the following command:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.4/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
  2. Check the md5 hashes of the extracted files to ensure the download wasn't corrupted:
find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
  3. Check the zip file's SHA256 hash as an additional integrity check. Run the following command:
sha256sum ngccli_linux.zip

Compare with the following value, which can also be found in the Release Notes of the Resource:

2c86681048ab8e2980bdd6aa6c17f086eff988276ad90a28f1307b69fdb50252
  4. After verifying the value, make the NGC CLI binary executable and add the ngc-cli directory to your PATH:
chmod u+x ngc-cli/ngc
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
  5. Configure NGC CLI for your account so that you can run commands against the registry. Enter the following command, providing your API key when prompted:
ngc config set
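
To verify the installation and the settings you just saved, the CLI can print its version and echo the current configuration:

ngc --version
ngc config current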

Creating an NVIDIA API Key and Logging into NVIDIA GPU Cloud (NGC)

  1. Visit the NVIDIA NGC website and create an account if you haven't already.
  2. Once logged in, go to the API Key section in your profile and generate a new API key.
  3. Back in your terminal on the VM, authenticate with NGC using Docker:
docker login nvcr.io
# When prompted, use '$oauthtoken' as the username and your API key as the password.

Now you should be able to pull images from the official NVIDIA GPU Cloud container registry 🎉.
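
As a quick test, you can pre-pull the base image used in the next section (it is several gigabytes, so this may take a while):

docker pull nvcr.io/nvidia/pytorch:24.04-py3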

Setting Up the Docker Environment

Before running JupyterLab with the Llama 3 8B model, you need to set up the necessary software environment using Docker. Create a file called Dockerfile in your home directory and paste the following content into it:

# Use the base image from the NVIDIA NGC container registry
FROM nvcr.io/nvidia/pytorch:24.04-py3
# Install JupyterLab
RUN pip install jupyterlab
# Set the working directory
WORKDIR /pytorch
# Expose port for JupyterLab
EXPOSE 8888
# Start JupyterLab
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

The base image used in the Dockerfile is nvcr.io/nvidia/pytorch:24.04-py3 from the NVIDIA catalog.

This image is hosted on NVIDIA's NGC container registry, which provides optimized containers for a variety of uses, particularly AI and deep learning applications. The tag 24.04-py3 indicates that the image is built with Python 3 support and corresponds to the April 2024 release. It ships with the necessary CUDA and PyTorch libraries pre-installed and configured to take full advantage of NVIDIA GPUs. We use JupyterLab to work with our code and results directly in a web browser on our local machine via port 8888.
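
Before building on top of the image, you can confirm that it sees the GPU and ships a CUDA-enabled PyTorch with a one-off container run (a minimal sanity check, assuming the NVIDIA drivers on the host are working):

docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.04-py3 python -c "import torch; print(torch.__version__, torch.cuda.is_available())"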

Remote Access and Container Deployment

After setting up your Docker environment, you will need to log out and log back into your machine using port forwarding:

ssh -L 8888:localhost:8888 paperspace@<ip-address>
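
If you reconnect often, an SSH host alias saves retyping the forwarding flag. A minimal sketch for ~/.ssh/config (the alias llama-vm is made up; substitute your VM's IP):

Host llama-vm
    HostName <ip-address>
    User paperspace
    LocalForward 8888 localhost:8888

After that, ssh llama-vm connects with the port forwarding already in place.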

Building and Running Your Docker Container

  1. Build the Docker image:
docker build -t <img_name> .
  2. Check that the image was built properly:
docker images
  3. Run your Docker container:
docker run --gpus all --rm --net=host --privileged -v /home/paperspace/pytorch:/pytorch -p 8888:8888 <img_name>

Explanation of the flags used:

  • --gpus all: Allows the Docker container to use all available GPUs.
  • --rm: Automatically removes the container when it stops. This is useful for not accumulating container instances.
  • --net=host: Uses the host's network stack inside the container. This simplifies networking by making the container use the host's IP and port directly.
  • --privileged: Gives extended privileges to this container, which can be necessary for certain operations like accessing the GPU.
  • -p 8888:8888: Maps port 8888 of the container to port 8888 on the host, facilitating access to JupyterLab via a web browser. Note that with --net=host this mapping is redundant, since the container already shares the host's ports; it matters only if you drop host networking, as in the leaner variant sketched after this list.
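
If you prefer tighter isolation, the same setup usually works without host networking and extended privileges, relying on the explicit port mapping instead (a sketch, assuming the NVIDIA Container Toolkit is configured on the host; adjust the mounted path to your VM):

docker run --gpus all --rm -v /home/paperspace/pytorch:/pytorch -p 8888:8888 <img_name>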

Accessing JupyterLab

Once the container is running, JupyterLab will be accessible from your local machine’s web browser. Simply open your browser and go to the URL displayed in the terminal, typically something like

http://localhost:8888/?token=<numberstring>
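
If the token has scrolled out of view, you can recover the URL from the container's logs:

docker ps                    # find the container ID
docker logs <container-id>   # the startup output includes the access URL with the token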

Creating and Running the Notebook

Before using the Llama 3 models, ensure you have created a Hugging Face account and accepted the Llama 3 license conditions on the model page. This step is essential, because the model weights are gated and must be downloaded through Hugging Face by LlamaIndex and the associated libraries.
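
As an alternative to passing the token into every call as done below, you can register it once per session with the huggingface_hub client (a sketch; huggingface_hub ships as a dependency of transformers):

from huggingface_hub import login

login(token="Your Hugging Face token")  # same token as hf_token below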

import torch
!pip install llama-index
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface
!nvidia-smi
    Sun May  5 11:20:21 2024       
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Quadro M4000                   On  | 00000000:00:05.0 Off |                  N/A |
    | 46%   26C    P8              12W / 120W |   7319MiB /  8192MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
                                                                                             
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    +---------------------------------------------------------------------------------------+
hf_token = "Your Hugging Face token"
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=hf_token,
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
 
# generate_kwargs parameters are taken from https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)
%%time
response = llm.complete("What is Accenture known for?")
print(response)
Accenture is a multinational professional services company that provides a range of services, including strategy,
consulting, digital, technology, and operations.
The company is known for its expertise in helping clients navigate the digital age and achieve their business goals
through the use of technology and innovation.
    
Accenture has a strong reputation for its work in a variety of industries, including:
1. Technology: Accenture has a long history of working with technology companies and helping
   them navigate the rapidly changing technology landscape.
2. Financial Services: Accenture has a strong presence in the financial services industry, providing services to banks,
   insurance companies, and other financial institutions.
3. Healthcare: Accenture has a significant presence in the healthcare industry, providing services to hospitals,
   health systems, and pharmaceutical companies.
4. Retail: Accenture has a strong presence in the retail industry, providing services to retailers and
   consumer goods companies.
5. Energy: Accenture has a significant presence in the energy industry, providing services to oil and gas companies,
   utilities, and renewable energy companies.
    
Accenture is also known for its innovative approach to consulting and its use of digital technologies, such as artificial intelligence,
blockchain, and the Internet of Things (IoT). The company has a strong focus on sustainability and has made a commitment to reducing
CPU times: user 8min 43s, sys: 1min 14s, total: 9min 58s
Wall time: 9min 57s
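
Because a full completion takes nearly ten minutes on the M4000, streaming the tokens as they are generated gives much earlier feedback. A sketch of a follow-up cell, assuming the installed llama-index version supports streaming for HuggingFaceLLM:

# Stream the completion token by token instead of waiting for the full response
response_iter = llm.stream_complete("What is Accenture known for?")
for chunk in response_iter:
    print(chunk.delta, end="", flush=True)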