🤖 Want Faster, Cheaper AI? Discover the Secret Sauce of Model Quantization! 🤖

🚀 How Analytics.gov is Revolutionizing AI Hosting with AWS Sagemaker and Quantization 🚀

Story Highlights 🧑‍💻

  • 💡 Faster, Cheaper AI: Discover how quantization boosts AI model performance while slashing costs.

  • 💡 Open-Source Perks: Learn why open-source models are the go-to for security and control.

  • 💡 AWS Sagemaker: Understand how this tool simplifies AI model deployment.

  • 💡 Real Results: See the impressive benchmarks that prove quantization's effectiveness.

Context at a Glance 🧠

  • ๐Ÿ” Who: Analytics.gov (AG), GovTech Singaporeโ€™s Data Science and Artificial Intelligence Division (DSAID)

  • ๐Ÿ” What: Using AWS Sagemaker to host quantized Large Language Models (LLMs)

  • ๐Ÿ” When: Current AI and machine learning trends

  • ๐Ÿ” Where: Singaporeโ€™s government agencies

  • ๐Ÿ” Why: To improve AI model efficiency and reduce costs

The AI Revolution with Quantization 📈

Struggling with slow and costly AI models?

Get ready to supercharge your AI game with a trick so simple you'll wonder why you didn't think of it sooner.

Welcome to the world of model quantization!

Introducing Analytics.gov (AG), a pioneering MLOps platform from GovTech Singapore's DSAID, empowering the Whole-of-Government (WOG) with secure, streamlined ML and AI capabilities.

AG Sagemaker Simple Architecture for MLOps

Leveraging Government Commercial Cloud (GCC) 2.0, AG facilitates easy access to compute resources and managed AI services directly from government-issued laptops.

AG's tailored features include simplified setup of production-ready inference endpoints for quantized models via AWS Sagemaker. This translates to swift deployment, slashing setup time from weeks to minutes and democratizing GenAI adoption across government agencies.

This article examines AG's role in optimizing LLM operations: we cover model quantization, seamless hosting on AWS Sagemaker, and benchmarks that demonstrate performance and cost-efficiency.

Why Bother with Open-Source Models? ⏱️

Open-source LLMs are worth the effort for three main reasons:

1. 📊 Security & Sensitivity: Open-source models can be hosted privately, safeguarding sensitive data from third-party providers.

2. 📊 Controlled Output Generation: With locally hosted open-source models, users have precise control over output, unlike closed-source models reliant on commercial APIs.

3. 📊 Variety: Hugging Face hosts over 600k models, including ones tailored by major companies like Meta and Google, as well as specialized variants like AI Singapore's SEA-LION model for Southeast Asian languages.

The Open-Source AI Dilemma 🖥️

Hosting Open-Source Large Language Models (LLMs) presents several challenges:

1. 📑 Memory Requirements: LLMs demand significant GPU memory, with larger models needing multiple GPUs, making hosting them resource-intensive and costly.

2. 📑 Computation Requirements: Larger models require more computational power for each task, resulting in slower inference speeds and increased hosting expenses.

3. 📑 Inference Speed: The slower inference speeds of larger models negatively impact user experience and reduce throughput for applications like text summarization and report generation.

To maximize inference speed, models have to be fully loaded into GPU memory; any movement between disk and GPU memory, or between CPU and GPU memory, introduces overhead that can substantially slow down inference.

LLMs require massive amounts of memory to host: the bigger the LLM, the more GPU memory is required to host it. Most large models demand multiple GPUs to hold fully in memory, making hosting them an extremely resource-intensive and expensive task.

Naturally, as the size of the model increases, more computation is required for each inference task.

Consequently, the larger the LLM, the lower the inference speed.

Just how big are these models?

The size of these LLMs can be estimated with the following formula (note: this is a naïve estimation, and actual model sizes are almost always slightly larger):

Model size (GB) ≈ number of parameters (in billions) × bits per weight ÷ 8

For example, a 7-billion-parameter model stored at 16 bits per weight works out to roughly 7 × 16 ÷ 8 = 14 GB.

Using the formula, we can estimate the model size for some popular models:
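As a rough illustration, the snippet below applies the naïve formula in Python. The parameter counts are approximate round numbers for a few popular open models, so the printed sizes are estimates only; real checkpoints are slightly larger.

# Naive size estimate: parameters (billions) x bits per weight / 8 bits per byte.
def estimate_size_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * bits_per_weight / 8

# Approximate parameter counts for a few popular open models.
models = {"Mistral-7B": 7, "Llama-2-13B": 13, "Llama-2-70B": 70}

for name, params_b in models.items():
    print(
        f"{name}: ~{estimate_size_gb(params_b, 16):.0f} GB at FP16, "
        f"~{estimate_size_gb(params_b, 8):.0f} GB at 8-bit, "
        f"~{estimate_size_gb(params_b, 4):.1f} GB at 4-bit"
    )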

Quantization: The AI Diet Plan 🌟

What if I told you that you could cut your AI model's size in half without major side effects?

Quantization is a method for shrinking model size by reducing the number of bits used to store each weight, typically from 16 bits down to 8 or fewer. This decreases storage requirements and memory usage, leading to faster inference speeds and lower costs.

For instance, converting from FP16 to 8-bit can halve the model size, enhancing performance and cost-efficiency. While reducing precision can affect output quality, moderate quantization like 8-bit often has minimal impact.

More aggressive quantization may compromise quality, but techniques like AWQ can mitigate this.

Different frameworks like GGUF, GPTQ, EXL2, and AWQ cater to various needs, with options optimized for CPU-only systems or GPUs, balancing performance and quality.

Quantization reduces the number of bits required to store each number; this shrinks the storage size of the model because fewer bits are used to store each model weight.

However, using fewer bits per weight means the precision of the weights is reduced. This is why quantization is aptly described in most articles as "reducing the precision of model weights".

For visual learners, here is π represented in different precisions (values obtained using a floating point calculator):
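You can reproduce the effect with the short sketch below, which casts π to progressively lower-precision floats using NumPy. NumPy has no FP8 type, so the 8-bit case discussed next is not shown here.

import numpy as np

# Print pi after casting it to progressively lower-precision floating point formats.
for dtype in (np.float64, np.float32, np.float16):
    print(f"{np.dtype(dtype).name}: {dtype(np.pi)}")

# float64: 3.141592653589793
# float32: 3.1415927
# float16: 3.140625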

Note: Modern quantization methods may use bespoke number formats rather than the FP series to quantize models. These can go as low as 1-bit quantization (Q1).

As seen above, the precision of π is reduced as the number of bits decreases. This affects not only the number of decimal places, but also the approximation of the number itself.

For example, 3.141592502593994 cannot be represented exactly in FP8, so it has to be rounded off to the nearest value that FP8 can represent, 3.125. This is known as floating point error.

How AWS Sagemaker Makes It Easy 🔥

AWS SageMaker Endpoints enable hosting of model inference with native tools, offering benefits like easy auto-scaling configuration, seamless updates with zero downtime, and flexibility via custom containers.

  1. 💬 How It Works - SageMaker Endpoints utilize inference containers based on the SageMaker Inference Toolkit library, supporting diverse frameworks such as TensorRT-LLM for intricate LLMs and their quantized versions.

  2. 💬 Custom Containers - Custom containers accommodate different inference engines.

    For instance, AG utilizes a custom container to host GGUF models with the Llama-cpp-python inference engine, requiring only minimal code changes to comply with SageMaker endpoint standards (see the sketch below).
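Inside such a container, the engine-level inference call itself is thin. The following is an illustrative sketch only, not AG's actual container code; the model path is a placeholder, and n_gpu_layers=999 mirrors the N_GPU_LAYERS environment variable used later.

from llama_cpp import Llama

# Load a GGUF model, offloading all layers to the GPU.
llm = Llama(model_path="/opt/ml/model/<model file>.gguf", n_gpu_layers=999)

# Serve a single completion request.
output = llm("Explain model quantization in one sentence.", max_tokens=128)
print(output["choices"][0]["text"])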

Hosting Quantized Models: Easy as Pie

Hosting a quantized LLM in AG's Sagemaker environment is simplified to a few lines of code, allowing users to focus on developing their LLM use cases without the complexity of managing the backend infrastructure.

# Code will vary depending on how you have curated your own custom container.
from sagemaker.model import Model

endpoint_name = "<Name of endpoint>"
image_uri = "<ECR Image URI to Llama-cpp-python Image>"
model_artifact_location = "<S3 Path to Model Artifacts>"
model_file_path_in_container = "<Path to model file>"
role = "<SageMaker execution role ARN>"

# All other ENV variables are defined in the documentation.
model_endpoint = Model(
    image_uri=image_uri,
    model_data=model_artifact_location,
    role=role,
    env={
        "MODEL": model_file_path_in_container,  # model file the container loads
        "N_GPU_LAYERS": "999",  # offload all layers to the GPU
        "INVOCATIONS_ROUTE": "/v1/completions",  # route invocations to the completions API
    },
)

model_endpoint.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=endpoint_name,
)
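Once the endpoint is up, it can be called from the same SDK. The snippet below is a minimal sketch; the request payload assumes the container exposes an OpenAI-style completions route, as configured via INVOCATIONS_ROUTE above.

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Attach to the deployed endpoint and send a completion request.
predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict({
    "prompt": "Explain model quantization in one sentence.",
    "max_tokens": 128,
})
print(response)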

The Proof is in the Pudding: Benchmarks 🧩

We benchmarked various models to compare performance and cost-efficiency.

For instance, using GGUF on older Nvidia T4s from the g4dn instance families yielded significant cost savings while maintaining performance. GPTQ on ExllamaV2, hosted on newer Nvidia A10g instances, demonstrated even greater speed improvements.

The following are the specifications for the benchmarking:

Note - ExllamaV2 refers to the inference engine, while EXL2 is the quantization format native to ExllamaV2; in this case, ExllamaV2 also supports inference for GPTQ. ExllamaV2 is only benchmarked with Q4_0, as some Q8_0 quants are not available on Hugging Face.
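For readers who want to run a similar comparison against their own endpoints, here is a minimal throughput-measurement sketch. It is not AG's benchmarking harness: the endpoint name and prompt are placeholders, and it assumes the container returns an OpenAI-style response with a "usage" field.

import json
import time
import boto3

runtime = boto3.client("sagemaker-runtime")

def measure_throughput(endpoint_name: str, prompt: str, max_tokens: int = 256, runs: int = 5) -> float:
    """Return a rough output-tokens-per-second estimate for a completions-style endpoint."""
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"prompt": prompt, "max_tokens": max_tokens}),
        )
        total_time += time.perf_counter() - start
        body = json.loads(response["Body"].read())
        # OpenAI-style responses report the generated token count under "usage".
        total_tokens += body.get("usage", {}).get("completion_tokens", max_tokens)
    return total_tokens / total_time

print(measure_throughput("<Name of endpoint>", "Summarise why quantization speeds up LLM inference."))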

Wrap It Up: 🎯

In summary, quantization offers significant advantages for large language models (LLMs), such as reducing memory requirements, increasing inference speeds, and lowering costs, while maintaining output quality.

AG's functionalities on AWS Sagemaker Endpoints facilitate the creation and management of production-ready quantized Open LLM APIs for government agencies, streamlining deployment processes and enhancing accessibility, efficiency, and cost-effectiveness.

Moving forward, AG plans to enhance its GenAI capabilities by integrating closed-source models like Azure OpenAI and VertexAI's Gemini, alongside existing AWS Bedrock services. This integration will empower users to customize models according to their requirements, fostering innovation in the public sector.

Why It Matters and What You Should Do 📢

Quantization is like the ultimate life hack for AI hosting: better performance, lower costs, and minimal quality loss. By integrating this with AWS Sagemaker Endpoints, AG is making advanced AI more accessible to government agencies.

This isn't just about saving money; it's about driving innovation and making AI a powerful tool for public good.

Actionable Insights: 📥

  • 📬 Embrace Quantization: Start implementing quantization to cut costs and improve performance.

  • 📬 Leverage AWS Sagemaker: Utilize AWS Sagemaker Endpoints for hassle-free model deployment.

  • 📬 Prioritize Security: Opt for open-source models to maintain control and security of your data.

  • 📬 Benchmark Regularly: Conduct regular benchmarks to monitor and enhance model performance.

Quote: "The future belongs to those who prepare for it today."

So, what are you waiting for?

Dive into the world of quantization and watch your AI models soar!

By transforming your AI strategy with quantization, you're not just keeping up with the times; you're ahead of the curve.

Happy quantizing! 🚀

Generative AI Tools 📧

  1. 🎥 Whisper - real-time in-browser speech recognition.

  2. 🤖 Driver AI - explains millions of lines of code in minutes instead of months.

  3. 📍 Touring - the ultimate app for travelers, works literally everywhere.

  4. ✈️ Cartwheel - generates 3D animations from scratch to power up creators.

  5. 👩🏼‍🦰 Riffo - AI renaming to organize messy files.


About Think Ahead With AI (TAWAI) 🤖

Empower Your Journey With Generative AI.

"You're at the forefront of innovation. Dive into a world where AI isn't just a tool, but a transformative journey. Whether you're a budding entrepreneur, a seasoned professional, or a curious learner, we're here to guide you."

Founded with a vision to democratize Generative AI knowledge, Think Ahead With AI is more than just a platform.

It's a movement.
It's a commitment.
It's a promise to bring AI within everyone's reach.

Together, we explore, innovate, and transform.

Our mission is to help marketers, coaches, professionals and business owners integrate Generative AI and use artificial intelligence to skyrocket their careers and businesses. 🚀

TAWAI Newsletter By:

Sujata Ghosh
 Gen. AI Explorer

"TAWAI is your trusted partner in navigating the AI Landscape!" 🔮🪄

- Think Ahead With AI (TAWAI)