Deep learning continues to transform fields ranging from natural language processing (NLP) to computer vision.
However, as these models grow in size and complexity, their memory and compute demands on hardware continue to skyrocket. In light of this, there are promising strategies to overcome these challenges, one of which is quantization. This lowers the precision of the numbers used in the model without a noticeable loss in performance.
In this article, I will dive into the theory underlying this strategy and show a practical implementation of 8‑bit quantization on a large model; in this case, we will use the IBM Granite model with BitsAndBytes for quantization.
Introduction
The rapid growth of deep learning has resulted in an arms race of models boasting billions of parameters, which, in most cases, achieve stellar performance but require enormous computational resources.
As engineers and researchers look for methods to make these large models more efficient, quantization has proven to be an incredibly effective solution. By lowering the bit width of number representations from 32‑bit floating point to lower‑bit integers (such as 8‑bit), quantization decreases the overall model size, speeds up inference, and cuts energy consumption, all while maintaining high output accuracy.
I will explore the concepts and techniques behind 8‑bit quantization in this article. I will explain the approach’s benefits, outline the theory behind it, and walk you through the process step by step.
I will then show you a practical application: quantizing the IBM Granite model using BitsAndBytes.
Understanding quantization
At its core, quantization is the process of mapping input values from a large set (usually continuous and high-precision) to a much smaller, discrete set with lower precision. In deep learning, this typically means converting 32‑bit floating‑point numbers to lower‑bit integer alternatives, most commonly 8‑bit.
The result is a massive reduction in memory usage and computation time.
Benefits of quantization
Lower memory footprint: Lower precision means that each parameter requires much less memory.
Increased speed: Integer math is generally much faster than floating‑point operations (FLOPs), especially on hardware optimized for low‑bit computations.
Energy efficiency: Lower‑precision computations consume far less power, making them ideal for mobile and edge devices.
Types of quantization
Uniform quantization: This method maps a range of floating‑point values uniformly to integer values.
Non‑uniform quantization: Uses a more complicated mapping based on the distribution of the weights or activations of the network.
Symmetric vs. asymmetric quantization:
Symmetric: Uses the same scale and zero‑point for positive and negative values.
Asymmetric: Allows different scales and zero‑points, which is useful for distributions that are not centered around zero.
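To make the symmetric/asymmetric distinction concrete, the following NumPy sketch (with made-up data that is deliberately not centered at zero) quantizes the same values both ways and compares the reconstruction error:

import numpy as np

x = np.random.randn(10_000).astype(np.float32) + 1.5   # distribution not centered at zero

# Symmetric: a single scale, zero-point fixed at 0.
scale_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / scale_sym), -128, 127)
x_sym = q_sym * scale_sym                               # dequantize

# Asymmetric: scale and zero-point derived from the observed [min, max].
scale_asym = (x.max() - x.min()) / 255
zero_point = np.round(-x.min() / scale_asym) - 128      # maps x.min() to -128
q_asym = np.clip(np.round(x / scale_asym) + zero_point, -128, 127)
x_asym = (q_asym - zero_point) * scale_asym             # dequantize

# The asymmetric mapping wastes fewer of the 256 levels on this shifted data.
print("symmetric error: ", np.abs(x - x_sym).mean())
print("asymmetric error:", np.abs(x - x_asym).mean())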
Why 8‑bit quantization?
In 8‑bit quantization, each weight or activation in the model is represented with 8 bits, offering 256 discrete values.
This approach balances compression and precision, enabling:
Memory savings: Reducing each parameter from 32 bits to 8 bits can cut the memory footprint by up to 75%.
Speed gains: Many hardware accelerators and CPUs are fully optimized for 8‑bit arithmetic, which massively improves inference times.
Minimal accuracy loss: With careful calibration and potentially fine‑tuning, the degradation in performance with 8-bit quantization is often minimal.
Deployment on edge devices: The reduced model size and faster computations make 8‑bit quantized models perfect for devices with limited computational resources.
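The memory-savings figure above follows from simple arithmetic. A back-of-the-envelope estimate for a model of roughly 2 billion parameters (weights only, ignoring activations and runtime overhead) looks like this:

params = 2_000_000_000          # roughly a 2-billion-parameter model
fp32_gb = params * 4 / 1e9      # 4 bytes per parameter in 32-bit floating point
int8_gb = params * 1 / 1e9      # 1 byte per parameter in 8-bit integer
print(f"FP32: ~{fp32_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB, "
      f"reduction: {1 - int8_gb / fp32_gb:.0%}")   # reduction: 75%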
Theoretical underpinnings of quantization
Quantization is deeply rooted in signal processing and numerical analysis. The objective is to reduce precision while controlling the quantization error: the difference between the original value and its quantized version.
Quantization error
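Put concretely, under the affine (scale and zero-point) scheme described in the next subsection, the quantization error is the gap between a value and its quantized-then-dequantized reconstruction. A standard formulation for signed 8-bit integers (written in plain notation, with x̂ denoting the reconstruction) is:

q = clip(round(x / S) + Z, -128, 127)    (quantize)
x̂ = S × (q − Z)                          (dequantize)
e = x − x̂                                (quantization error)

Good calibration keeps |e| small across the values the model actually encounters.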
Scale and zero‑point
A linear mapping is normally used to perform quantization:
Scale (S): Sets the step size between our quantized values.
Zero‑point (Z): The integer value assigned to the real number zero.
The process normally involves a calibration phase to determine the optimal scale and zero‑point values. This is then followed by the actual quantization of weights and activations.
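As a minimal sketch of this calibrate-then-quantize flow (the tensor and ranges below are illustrative, not tied to any particular model), the scale and zero-point can be derived from an observed min/max and then applied like so:

import torch

def calibrate(x, qmin=-128, qmax=127):
    # Derive scale and zero-point from the observed range of the tensor.
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

weights = torch.randn(256, 256)
scale, zero_point = calibrate(weights)
q = quantize(weights, scale, zero_point)
error = (weights - dequantize(q, scale, zero_point)).abs().max()
print(scale, zero_point, error)  # the reconstruction error stays small relative to the range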
Quantization Aware Training (QAT) vs. Post‑Training Quantization (PTQ)
Quantization Aware Training (QAT): This integrates a simulated quantization into the training process, allowing the model to adapt its weights to quantization noise.
Post‑Training Quantization (PTQ): Applies quantization to a pre‑trained model using calibration data. PTQ is simpler and faster to implement, but it may incur a slightly larger accuracy drop compared to QAT.
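The simulated quantization that QAT relies on is often implemented as "fake quantization": weights are quantized and immediately dequantized in the forward pass, so the network trains against the rounding noise. A minimal symmetric 8-bit sketch (ignoring the straight-through estimator that real QAT uses for gradients) might look like this:

import torch

def fake_quantize(w, num_bits=8):
    # Quantize, then immediately dequantize, so training "sees" the rounding
    # noise while all arithmetic stays in floating point.
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = torch.randn(4, 4)
print((w - fake_quantize(w)).abs().max())  # the noise the model learns to absorb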
Steps in 8‑bit quantization
Applying 8‑bit quantization includes some essential steps:
Preprocessing and calibration
Step 1: Investigate the model’s dynamic range
Before quantization, we need to know the weight and activation ranges:
Collect statistics: Pass a portion of the dataset through the model to collect statistics (min, max, mean, standard deviation) for all the layers.
Establish ranges: Based on these statistics, define quantization ranges, possibly clipping outliers to obtain a tighter range.
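One way to gather these statistics in PyTorch is with forward hooks on the layers of interest; the toy model and random calibration batches below are placeholders standing in for a real model and dataset:

import torch
import torch.nn as nn

# Collect per-layer output statistics (min/max) with forward hooks while
# running a few calibration batches.
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        s = stats.setdefault(name, {"min": float("inf"), "max": float("-inf")})
        s["min"] = min(s["min"], output.min().item())
        s["max"] = max(s["max"], output.max().item())
    return hook

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
handles = [m.register_forward_hook(make_hook(name))
           for name, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for _ in range(10):                 # a handful of calibration batches
        model(torch.randn(64, 16))

for h in handles:
    h.remove()
print(stats)                            # per-layer ranges used to set quantization ranges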
Step 2: Calibration
Calibration is the process of selecting the best scale and zero-point for each tensor or layer:
Min/max calibration: Uses the minimum and maximum values that were observed.
Percentile calibration: Uses a percentile (e.g., the 99.9th percentile) to avoid outliers.
Calibration must be done carefully, since poor choices will result in a significant loss of accuracy.
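To see why percentile calibration helps, the sketch below compares the two range choices on synthetic activations with a few injected outliers (purely illustrative data):

import numpy as np

acts = np.random.randn(100_000).astype(np.float32)
acts[:10] *= 50.0                        # a few rare outliers

minmax_range = (acts.min(), acts.max())
lo, hi = np.percentile(acts, [0.1, 99.9])

# The percentile range is far tighter, so the 256 int8 levels cover the
# values the model actually produces instead of being stretched by outliers.
print("min/max range:   ", minmax_range)
print("percentile range:", (lo, hi))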
Quantization Aware Training vs. Post‑Training Quantization
Quantization Aware Training (QAT):
Advantages: Greater accuracy, as the model learns to compensate for quantization distortion.
Disadvantages: Requires modifying the training procedure and extra computation.
Post‑Training Quantization (PTQ):
Advantages: Much easier to implement because the model is already pre-trained.
Disadvantages: Can sometimes result in a greater reduction in accuracy, especially in precision-sensitive models.
For most big models, a small loss of accuracy from PTQ is fine, while mission-critical applications can use QAT.
8-bit quantization applied
No matter which deep learning environment—PyTorch, TensorFlow, or ONNX—the concepts of 8‑bit quantization remain the same.
Practical considerations
Before implementing quantization, consider the following:
Hardware support
Ensure that the target hardware (CPUs, GPUs, or special accelerators like TPUs) natively supports 8‑bit operations.
Libraries
PyTorch: Provides built-in support for QAT and PTQ through its dedicated quantization module.
TensorFlow Lite: Offers utilities to convert models to an 8‑bit quantized format, especially for embedded and mobile applications.
ONNX Runtime: Supports quantized models for use across different platforms.
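As a quick taste of the PyTorch quantization module mentioned above, dynamic post-training quantization of the linear layers in a toy model (intended for CPU inference) can be as short as the following sketch:

import torch
import torch.nn as nn

# Dynamic PTQ: Linear weights are stored as int8 and activations are
# quantized on the fly at inference time.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)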
Model structure
Not all the layers in the model are created equal when quantized. Convolutional and fully connected layers will generally be fine, but some activation and normalization layers may need special treatment.
Fine-tuning
Fine-tuning the quantized model on a small calibration dataset can help restore any performance loss due to quantization noise.
BitsAndBytes: A specialized library for 8‑bit quantization
BitsAndBytes is a standalone library that further streamlines the 8‑bit quantization process for very large models. Frameworks like PyTorch offer native quantization support, but BitsAndBytes provides additional optimizations designed for converting 32‑bit floating‑point weights into 8‑bit integers.
With a simple config flag (e.g., load_in_8bit=True), it enables significant reductions in memory usage and speeds up inference without requiring massive code modifications.
Integrating BitsAndBytes with your workflow
For seamless integration, BitsAndBytes can be used alongside popular frameworks like PyTorch. When you configure your model with BitsAndBytes, you simply specify the quantization configuration during model loading.
This tells the system to automatically convert the weights from 32‑bit floating point to 8‑bit integers on the fly, reducing the overall memory footprint by up to 75% and enhancing inference speed, which is ideal for deployment in resource-constrained environments.
For example, by setting up your model with:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
you can achieve a quick switch to 8‑bit precision. This approach not only optimizes memory usage but also maintains high performance, making it a valuable addition to your deep learning workflow.
Case study: Quantizing IBM Granite with 8‑bit using BitsAndBytes
IBM Granite is a 2‑billion‑parameter model designed for instruction‑following tasks. Given its size, quantizing IBM Granite to 8‑bit reduces its memory footprint significantly while maintaining good performance.
IBM Granite quantization: Example code
The following is the code segment for configuring IBM Granite with 8‑bit quantization:
# Set up the IBM Granite model using 8-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ibm-granite/granite-3.1-2b-instruct"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="balanced",  # Adjust as needed based on available GPU memory.
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Code breakdown
Model selection:
The model_name variable specifies the IBM Granite model to be used for instruction-following tasks.
Quantization setup:
BitsAndBytesConfig(load_in_8bit=True) activates 8‑bit quantization. This flag tells the model loader to convert the 32‑bit floating‑point weights to 8‑bit integers as the model is loaded.
Model loading:
AutoModelForCausalLM.from_pretrained() loads the model using the specified configuration. The parameter device_map="balanced" helps distribute the model across available GPUs, and torch_dtype=torch.float16 ensures that any remaining computation uses half‑precision.
Tokenizer initialization:
The tokenizer is instantiated with AutoTokenizer.from_pretrained(model_name), which guarantees that input text undergoes the correct preprocessing for the quantized model. This method not only lowers the memory usage of the model by as much as 75%, it also increases inference speed, making it particularly suitable for deployment in memory-limited settings, such as edge devices.
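To sanity-check the quantized setup, a short generation call might look like the following; the prompt and generation settings here are illustrative rather than taken from any official example:

# Run a short generation with the quantized model (illustrative prompt/settings).
prompt = "Explain 8-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))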
Barriers and best practices
Even though 8-bit quantization is highly advantageous, it also has some challenges:
Challenges
Accuracy degradation
Some models can suffer from a loss of accuracy after quantization due to quantization noise.
Calibration difficulty
Determining appropriate calibration data and techniques is important and can be difficult, especially for models with a broad dynamic range.
Hardware constraints
Ensure that your target deployment platform fully supports 8‑bit operations; otherwise, performance will be disappointing.
Best practices
Full calibration
Use a representative data set to accurately calibrate the model’s weights and activations.
Layer-by-layer analysis
Determine which layers are sensitive to quantization and evaluate whether to keep them at a higher precision.
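With BitsAndBytes, one way to act on such an analysis is the llm_int8_skip_modules option of BitsAndBytesConfig, which keeps the named modules out of the 8‑bit conversion. The module name below is purely illustrative; pick real candidates by inspecting your own model:

from transformers import BitsAndBytesConfig

# Keep quantization-sensitive modules (here, a hypothetical "lm_head") in
# higher precision while the rest of the model is loaded in 8-bit.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)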
Progressive evaluation
Quantization is not a one-shot fix. Iterate on your strategy, experimenting with different calibration techniques and potentially combining PTQ with QAT.
Use framework tools
Utilize the high-level quantization utilities integrated into frameworks such as PyTorch and TensorFlow, as they are continually improved and updated.
Fine‑tuning
If possible, optimize the quantized model on a subset of your data to recover any performance loss due to quantization.
Conclusion
Quantization, and 8‑bit quantization in particular, is a powerful technique for reducing the memory footprint and accelerating the inference of large models. By converting 32‑bit floating‑point values to 8‑bit integers, you can achieve significant memory savings and speedups with minimal accuracy loss.
In this article, we discussed the theoretical foundations of quantization and walked through the steps involved in preprocessing, calibration, and choosing between quantization-aware training and post-training quantization.
We then gave practical examples using popular frameworks, finishing with a case study: quantizing the IBM Granite model using BitsAndBytes.
As deep learning models continue to grow in size, mastering techniques like 8‑bit quantization will be essential for deploying efficient state‑of‑the‑art systems, from the data center down to edge devices.
Whether you are an AI researcher or a deployment engineer, understanding how to optimize large models is an essential skill in today’s AI landscape.
Applying 8-bit quantization with tools such as BitsAndBytes reduces the computational and memory overhead of large models like IBM Granite, enabling more scalable, efficient, and energy-friendly deployment across diverse applications and hardware platforms.
Happy quantizing, and may every bit and byte help your models become leaner, faster, and more efficient!