
🚀 "Mastering Llama 3: Unleashing Its Potential with Local Execution" 🚀

🔮 "A Comprehensive Guide to Harnessing Llama 3's Capabilities Right from Your Desktop" 🔮

Story Highlights: 👋

  • 🎩 Meta unveils the Llama 3 family: base and instruct models in 8B and 70B sizes (four models in total), with strong benchmark performance.

  • 🎩 Llama 3 models excel across various benchmarks, showcasing superior performance in their respective weight categories.

  • 🎩 Technical enhancements like an increased vocabulary size and Grouped Query Attention (GQA) set Llama 3 apart in the realm of text generation.

  • 🎩 Running Llama 3 locally becomes feasible with advancements in model quantization, facilitated by open-source tools like Ollama and HuggingFace Transformers.

Who, What, When, Where, and Why 📝

  • 💡 Who: Meta's Llama 3 family of AI language models

  • 💡 What: This newsletter explores the latest milestone in AI language models with Meta’s Llama 3 family, detailing advancements, benchmarks, and practical implementations for running these models locally.

  • 💡 When: Shortly after Meta's public release of Llama 3 in April 2024

  • 💡 Where: Accessible through various channels, including online platforms and newsletters

  • 💡 Why: To provide readers with insights into the advancements of Llama 3 models, their performance benchmarks, and actionable steps for running them locally.

Introduction 💬

Embark on a journey into the cutting-edge realm of AI language models with Meta’s groundbreaking Llama 3 family.

From expanding vocabulary sizes to practical deployment using open-source tools, this newsletter delves into the technical intricacies and performance benchmarks of Llama 3.

Learn the ins and outs of deploying and running these models locally, tapping into their potential on consumer hardware.

Learning Objectives 🌐

Grasp the pivotal advancements and benchmarks of the Llama 3 models, including their performance compared to predecessors and peers.

Master the deployment and local execution of Llama 3 models using accessible open-source tools like HuggingFace Transformers and Ollama.

Explore the technical enhancements within Llama 3, such as increased vocabulary size and the implementation of Grouped Query Attention, and comprehend their implications for text generation tasks.

Garner insights into the applications and future advancements of Llama 3 models, including their open-source nature, multi-modal capabilities, and ongoing improvements in fine-tuning and performance.

Table of Contents

  • 🌈 Introduction to Llama 3

  • 🌈 Performance Highlights

  • 🌈 Running Llama 3 Locally

  • 🌈 Using HuggingFace

  • 🌈 Using Ollama

  • 🌈 Conclusion and Key Takeaways

Introduction to Llama 3 💥

Enter the era of Llama 3, a major step forward in open language modeling. With pre-trained base and chat models available in 8B and 70B sizes, Llama 3 brings substantial advancements. Featuring an expanded vocabulary of 128k tokens (up from 32k in Llama 2) and Grouped Query Attention (GQA) across all model sizes, Llama 3 produces more coherent and comprehensive responses than its predecessors.
Meta's dedication to pushing the boundaries of natural language processing is evident in its training regimen: over 15 trillion tokens of pre-training data, even for the 8B model.
With plans for multi-modal models and even larger variants on the horizon, the Llama 3 series opens a new chapter in AI language modeling.
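
As a quick sanity check, you can inspect the expanded tokenizer yourself with the Transformers library. A minimal sketch, assuming you have accepted Meta's license for the gated meta-llama/Meta-Llama-3-8B repo and are logged in to the HuggingFace Hub:

from transformers import AutoTokenizer

# Gated repo: requires accepting the Llama 3 license and `huggingface-cli login`
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(len(tokenizer))  # vocabulary size: roughly 128k entries (Llama 2 used 32k)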

Performance Highlights 🌟

  • 🎉 Llama 3 models excel in diverse tasks, from creative writing to coding, setting new performance standards.

  • 🎉 The 8B Llama 3 model significantly outperforms earlier Llama generations, approaching the performance of Llama 2's 70B model.

  • 🎉 Notably, the Llama 3 70B model surpasses closed models such as Gemini Pro 1.5 and Claude 3 Sonnet on several of Meta's reported benchmarks.

  • 🎉 Llama 3's open weights allow easy access, fine-tuning, and commercial use under Meta's community license.

Running Llama 3 Locally 📊

Running Llama 3 locally is now more accessible than ever: thanks to advances in model quantization, the models can operate within consumer hardware. Which method suits you best depends on your hardware specifications.
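
To see why quantization is the enabler here, consider a back-of-the-envelope estimate of the memory needed just to hold the weights of an 8B-parameter model (KV cache and activations add overhead on top of this):

# Approximate weight-only memory for an 8B-parameter model at various precisions
params = 8e9
for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: {params * bytes_per_param / 1e9:.0f} GB")

#  fp16: 16 GB   (out of reach for most consumer GPUs)
#  int8: 8 GB
# 4-bit: 4 GB    (fits on a free-tier Colab GPU)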

Let’s explore two approaches using open-source tools:

Using HuggingFace ☀️

HuggingFace offers seamless support for Llama 3 models: you can pull them straight from the HuggingFace Hub with the Transformers library. Here's a step-by-step guide to running a 4-bit-quantized 8B instruct model on Colab's free tier:

Step 1: Install Libraries

Upgrade the Transformers library and install the dependencies needed for 4-bit loading.

!pip install "transformers==4.40.0" --upgrade
!pip install accelerate bitsandbytes

Step 2: Load Model

Load the desired model into a text-generation pipeline.

import transformers
import torch

# A community 4-bit (bitsandbytes) quantization of the 8B instruct model,
# small enough for Colab's free-tier GPU
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)

Step 3: Send Queries

Send queries to the model for inference.

messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {
        "role": "user",
        "content": (
            "Generate an approximately fifteen-word sentence "
            "that describes all this data: "
            "Midsummer House eatType restaurant; "
            "Midsummer House food Chinese; "
            "Midsummer House priceRange moderate; "
            "Midsummer House customer rating 3 out of 5; "
            "Midsummer House near All Bar One"
        ),
    },
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
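
# For reference, apply_chat_template renders the messages into Llama 3's chat
# format, which looks roughly like this (illustrative and abridged):
#
#   <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
#   You are a helpful assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>
#
#   Generate an approximately fifteen-word sentence ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
# add_generation_prompt=True appends the trailing assistant header so the model
# continues with its reply.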

terminators = [
    pipeline.tokenizer.eos_token_id,
    # Llama 3 instruct models end each turn with <|eot_id|>, so stop there too
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns prompt + completion, so slice off the prompt
print(outputs[0]["generated_text"][len(prompt):])

Output of the query:

Here is a 15-word sentence that summarizes the data:

Midsummer House is a moderate-priced Chinese eatery with a 3-star rating near All Bar One.

Step 4: Install Gradio and Run Code

You can wrap this in a Gradio app to get an interactive chat interface. Install Gradio (pip install gradio) and run the code below.

import gradio as gr

messages = []  # running chat history in the format expected by apply_chat_template

def add_text(history, text):
    global messages
    history = history + [[text, ""]]  # use lists, not tuples, so we can stream into them
    messages = messages + [{"role": "user", "content": text}]
    return history, ""  # second output clears the textbox

def generate(history):
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    messages = messages + [{"role": "assistant", "content": response_msg}]

    # Stream the reply character by character for a typing effect
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )

    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot,
    )

demo.queue()
demo.launch(debug=True)
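
Two design details worth noting in this sketch: add_text returns an empty string as its second output so the textbox clears after each submission, and generate is a generator function, so yielding the partially built history after every character gives the Chatbot a streaming, typewriter-style display.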

Using Ollama 🎨

Ollama provides another avenue for running LLMs locally. Follow these steps to utilize Ollama:

Step 1: Starting Local Server

Download and install Ollama; it runs a local server in the background (listening on port 11434 by default). Then start a model:

ollama run llama3:instruct      # 8B instruct model
ollama run llama3:70b-instruct  # 70B instruct model
ollama run llama3               # 8B pre-trained model
ollama run llama3:70b           # 70B pre-trained model

Step 2: Query Through API

Query the local server through API requests.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Step 3: JSON Response

Receive a JSON response containing the model's output, along with timing metadata (for example, eval_count and eval_duration, from which you can compute tokens per second).

{
  "model": "llama3",
  "created_at": "2024-04-19T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
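
If you prefer Python to curl, here is an equivalent minimal sketch; it assumes the requests package is installed and the Ollama server is running on its default port:

import requests

# Same endpoint and payload as the curl example above
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])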

Key Takeaways 📧

  • 💻 Meta introduces the Llama 3 family: base and instruct models in 8B and 70B sizes, with strong benchmark performance.

  • 💻 Llama 3 models boast superior performance across benchmarks in their respective weight categories.

  • 💻 With an enhanced tokenizer and Grouped Query Attention (GQA), Llama 3 sets new standards for text generation.

  • 💻 Despite their size, Llama 3 models can be run on consumer hardware using quantization via open-source tools like Ollama and HuggingFace Transformers.

Wrap It Up 📘

As we conclude our journey through Llama 3, it's evident that we're witnessing a paradigm shift in AI language modeling. With the ability to run these advanced models locally, the possibilities are endless.
Whether you're a developer, researcher, or enthusiast, Llama 3 opens doors to innovation and accessibility like never before.

Why Does It Matter to You and What Actions Can You Take? 🤖

  • 🚀 Stay updated with the latest advancements in AI language models by following Meta's developments closely.

  • 🚀 Experiment with running Llama 3 models locally using open-source tools like HuggingFace Transformers and Ollama.

  • 🚀 Explore the potential applications of Llama 3 models in your industry or field of interest, and leverage their capabilities for innovation and efficiency.

Generative AI Tools 📧

  1.  🎥 AI Autocomplete by Shortwave - Finish your sentences based on your previous emails.

  2. 🤖 Supadash - AI-generated dashboard and charts from your database.

  3. 👩🏼‍🦰 Daydream - BI for C-level, finance and ops, with AI.

  4. 📝 Whisperfusion - Seamless conversations with AI with ultra-low latency.

  5. ✈️ Podnotes - Turn podcasts into content with AI.

About Think Ahead With AI (TAWAI) 🤖

Empower Your Journey With Generative AI.

"You're at the forefront of innovation. Dive into a world where AI isn't just a tool, but a transformative journey. Whether you're a budding entrepreneur, a seasoned professional, or a curious learner, we're here to guide you."

Founded with a vision to democratize Generative AI knowledge, Think Ahead With AI is more than just a platform.

It's a movement.
It’s a commitment.
It’s a promise to bring AI within everyone's reach.

Together, we explore, innovate, and transform.

Our mission is to help marketers, coaches, professionals and business owners integrate Generative AI and use artificial intelligence to skyrocket their careers and businesses. 🚀

TAWAI Newsletter By:

Sujata Ghosh
 Gen. AI Explorer

“TAWAI is your trusted partner in navigating the AI Landscape!” 🔮🪄

- Think Ahead With AI (TAWAI)