Deep learning continues to transform fields ranging from natural language processing (NLP) to computer vision.
However, as these models grow in size and complexity, their memory and compute demands on hardware continue to skyrocket. In light of this, there are promising strategies to overcome these challenges, one of which is quantization. This lowers the precision of the numbers used in the model without a noticeable loss in performance.
In this article, I will dive into the theory underlying this strategy and show a practical implementation of 8‑bit quantization on a large model; in this case, we will use the IBM Granite model with BitsAndBytes for quantization.
Introduction
The rapid growth of deep learning has resulted in an arms race of models boasting billions of parameters, which, in most cases, achieve stellar performance but require enormous computational resources.
As engineers and researchers look for methods to make these large models more efficient, quantization has proven to be an incredibly effective solution. By lowering the bit width of number representations from 32‑bit floating point to lower‑bit integers (such as 8‑bit), quantization decreases the overall model size, speeds up inference, and cuts energy consumption, all while maintaining high output accuracy.
I will explore the concepts and techniques behind 8‑bit quantization in this article. I will explain the approach’s benefits, outline the theory behind it, and walk you through the process step by step.
I will then show you a practical application: quantizing the IBM Granite model using BitsAndBytes.
Understanding quantization
At its core, quantization is the process of mapping input values from a large set (usually continuous and high-precision) to a much smaller, discrete set with lower precision. In deep learning, this typically means converting 32‑bit floating‑point numbers to lower‑bit integer alternatives, most commonly 8‑bit.
The result is a massive reduction in memory usage and computation time.
Benefits of quantization
Lower memory footprint: Lower precision means that each parameter requires much less memory.
Increased speed: Integer math is generally much faster than floating‑point operations (FLOPs), especially on hardware optimized for low‑bit computations.
Energy efficiency: Lower‑precision computations consume far less power, making them ideal for mobile and edge devices.
Types of quantization
Uniform quantization: This method maps a range of floating‑point values uniformly to integer values.
Non‑uniform quantization: Uses a more complicated mapping based on the distribution of the weights or activations of the network.
Symmetric vs. asymmetric quantization:
Symmetric: Uses the same scale and zero‑point for positive and negative values.
Asymmetric: Allows different scales and zero‑points, which is useful for distributions that are not centered around zero.
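To make the symmetric/asymmetric distinction concrete, the following NumPy sketch (with made-up data that is deliberately not centered at zero) quantizes the same values both ways and compares the reconstruction error:

import numpy as np

x = np.random.randn(10_000).astype(np.float32) + 1.5   # distribution not centered at zero

# Symmetric: a single scale, zero-point fixed at 0.
scale_sym = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / scale_sym), -128, 127)
x_sym = q_sym * scale_sym                               # dequantize

# Asymmetric: scale and zero-point derived from the observed [min, max].
scale_asym = (x.max() - x.min()) / 255
zero_point = np.round(-x.min() / scale_asym) - 128      # maps x.min() to -128
q_asym = np.clip(np.round(x / scale_asym) + zero_point, -128, 127)
x_asym = (q_asym - zero_point) * scale_asym             # dequantize

# The asymmetric mapping wastes fewer of the 256 levels on this shifted data.
print("symmetric error: ", np.abs(x - x_sym).mean())
print("asymmetric error:", np.abs(x - x_asym).mean())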
Why 8‑bit quantization?
In 8‑bit quantization, each weight or activation in the model is represented with 8 bits, offering 256 discrete values.
This approach balances compression and precision, enabling:
Memory savings: Reducing each parameter from 32 bits to 8 bits can cut the memory footprint by up to 75%.
Speed gains: Many hardware accelerators and CPUs are fully optimized for 8‑bit arithmetic, which massively improves inference times.
Minimal accuracy loss: With careful calibration and potentially fine‑tuning, the degradation in performance with 8-bit quantization is often minimal.
Deployment on edge devices: The reduced model size and faster computations make 8‑bit quantized models perfect for devices with limited computational resources.
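The memory-savings figure above follows from simple arithmetic. A back-of-the-envelope estimate for a model of roughly 2 billion parameters (weights only, ignoring activations and runtime overhead) looks like this:

params = 2_000_000_000          # roughly a 2-billion-parameter model
fp32_gb = params * 4 / 1e9      # 4 bytes per parameter in 32-bit floating point
int8_gb = params * 1 / 1e9      # 1 byte per parameter in 8-bit integer
print(f"FP32: ~{fp32_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB, "
      f"reduction: {1 - int8_gb / fp32_gb:.0%}")   # reduction: 75%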
Theoretical underpinnings of quantization
Quantization is deeply rooted in signal processing and numerical analysis. The objective is to reduce precision while controlling the quantization error: the difference between the original value and its quantized version.
Quantization error
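Put concretely, under the affine (scale and zero-point) scheme described in the next subsection, the quantization error is the gap between a value and its quantized-then-dequantized reconstruction. A standard formulation for signed 8-bit integers (written in plain notation, with x̂ denoting the reconstruction) is:

q = clip(round(x / S) + Z, -128, 127)    (quantize)
x̂ = S × (q − Z)                          (dequantize)
e = x − x̂                                (quantization error)

Good calibration keeps |e| small across the values the model actually encounters.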
Scale and zero‑point
A linear mapping is normally used to perform quantization:
Scale (S): Sets the step size between our quantized values.
Zero‑point (Z): The integer value assigned to the real number zero.
The process normally involves a calibration phase to determine the optimal scale and zero‑point values. This is then followed by the actual quantization of weights and activations.
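As a minimal sketch of this calibrate-then-quantize flow (the tensor and ranges below are illustrative, not tied to any particular model), the scale and zero-point can be derived from an observed min/max and then applied like so:

import torch

def calibrate(x, qmin=-128, qmax=127):
    # Derive scale and zero-point from the observed range of the tensor.
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

weights = torch.randn(256, 256)
scale, zero_point = calibrate(weights)
q = quantize(weights, scale, zero_point)
error = (weights - dequantize(q, scale, zero_point)).abs().max()
print(scale, zero_point, error)  # the reconstruction error stays small relative to the range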
Quantization Aware Training (QAT) vs. Post‑Training Quantization (PTQ)
Quantization Aware Training (QAT): This integrates a simulated quantization into the training process, allowing the model to adapt its weights to quantization noise.
Post‑Training Quantization (PTQ): Applies quantization to a pre‑trained model using calibration data. PTQ is simpler and faster to implement, but it may incur a slightly larger accuracy drop compared to QAT.
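The simulated quantization that QAT relies on is often implemented as "fake quantization": weights are quantized and immediately dequantized in the forward pass, so the network trains against the rounding noise. A minimal symmetric 8-bit sketch (ignoring the straight-through estimator that real QAT uses for gradients) might look like this:

import torch

def fake_quantize(w, num_bits=8):
    # Quantize, then immediately dequantize, so training "sees" the rounding
    # noise while all arithmetic stays in floating point.
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = torch.randn(4, 4)
print((w - fake_quantize(w)).abs().max())  # the noise the model learns to absorb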
Steps in 8‑bit quantization
Applying 8‑bit quantization includes some essential steps:
Preprocessing and calibration
Step 1: Investigate the model’s dynamic range
Before quantization, we need to know the weight and activation ranges:
Collect statistics: Pass a portion of the dataset through the model to collect statistics (min, max, mean, standard deviation) for all the layers.
Establish ranges: Based on these statistics, define quantization ranges, possibly clipping outliers to obtain a tighter range.
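One way to gather these statistics in PyTorch is with forward hooks on the layers of interest; the toy model and random calibration batches below are placeholders standing in for a real model and dataset:

import torch
import torch.nn as nn

# Collect per-layer output statistics (min/max) with forward hooks while
# running a few calibration batches.
stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        s = stats.setdefault(name, {"min": float("inf"), "max": float("-inf")})
        s["min"] = min(s["min"], output.min().item())
        s["max"] = max(s["max"], output.max().item())
    return hook

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
handles = [m.register_forward_hook(make_hook(name))
           for name, m in model.named_modules() if isinstance(m, nn.Linear)]

with torch.no_grad():
    for _ in range(10):                 # a handful of calibration batches
        model(torch.randn(64, 16))

for h in handles:
    h.remove()
print(stats)                            # per-layer ranges used to set quantization ranges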
Step 2: Calibration
Calibration is the process of selecting the best scale and zero-point for each tensor or layer:
Min/max calibration: Uses the minimum and maximum values that were observed.
Percentile calibration: Uses a percentile (e.g., the 99.9th percentile) to avoid outliers.
Calibration must be done carefully, since poor choices will result in a significant loss of accuracy.
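To see why percentile calibration helps, the sketch below compares the two range choices on synthetic activations with a few injected outliers (purely illustrative data):

import numpy as np

acts = np.random.randn(100_000).astype(np.float32)
acts[:10] *= 50.0                        # a few rare outliers

minmax_range = (acts.min(), acts.max())
lo, hi = np.percentile(acts, [0.1, 99.9])

# The percentile range is far tighter, so the 256 int8 levels cover the
# values the model actually produces instead of being stretched by outliers.
print("min/max range:   ", minmax_range)
print("percentile range:", (lo, hi))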
Quantization Aware Training vs. Post‑Training Quantization
Quantization Aware Training (QAT):
Advantages: Greater accuracy, as the model learns to compensate for quantization distortion.
Disadvantages: Requires modifying the training procedure and extra computation.
Post‑Training Quantization (PTQ):
Advantages: Much easier to implement because the model is already pre-trained.
Disadvantages: Can sometimes result in a greater reduction in accuracy, especially in precision-sensitive models.
For most big models, a small loss of accuracy from PTQ is fine, while mission-critical applications can use QAT.
8-bit quantization applied
No matter which deep learning environment—PyTorch, TensorFlow, or ONNX—the concepts of 8‑bit quantization remain the same.
Practical considerations
Before implementing quantization, consider the following:
Hardware support
Ensure that the target hardware (CPUs, GPUs, or special accelerators like TPUs) natively supports 8‑bit operations.
Libraries
PyTorch: Provides built-in support for QAT and PTQ through its dedicated quantization module.
TensorFlow Lite: Offers utilities to convert models to an 8‑bit quantized format, especially for embedded and mobile applications.
ONNX Runtime: Supports quantized models for use across different platforms.
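As a quick taste of the PyTorch quantization module mentioned above, dynamic post-training quantization of the linear layers in a toy model (intended for CPU inference) can be as short as the following sketch:

import torch
import torch.nn as nn

# Dynamic PTQ: Linear weights are stored as int8 and activations are
# quantized on the fly at inference time.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)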
Model structure
Not all the layers in the model are created equal when quantized. Convolutional and fully connected layers will generally be fine, but some activation and normalization layers may need special treatment.
Fine-tuning
Fine-tuning the quantized model on a small calibration dataset can help restore any performance loss due to quantization noise.
BitsAndBytes: A specialized library for 8‑bit quantization
BitsAndBytes is a standalone library that further streamlines the 8‑bit quantization process for very large models. Frameworks like PyTorch offer native quantization support, but BitsAndBytes provides additional optimizations designed for converting 32‑bit floating‑point weights into 8‑bit integers.
With a simple config flag (e.g., load_in_8bit=True), it enables significant reductions in memory usage and speeds up inference without requiring massive code modifications.
Integrating BitsAndBytes with your workflow
For seamless integration, BitsAndBytes can be used alongside popular frameworks like PyTorch. When you configure your model with BitsAndBytes, you simply specify the quantization configuration during model loading.
This tells the system to automatically convert the weights from 32‑bit floating point to 8‑bit integers on the fly, reducing the overall memory footprint by up to 75% and enhancing inference speed, which is ideal for deployment in resource-constrained environments.
For example, by setting up your model with:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
you can achieve a quick switch to 8‑bit precision. This approach not only optimizes memory usage but also maintains high performance, making it a valuable addition to your deep learning workflow.
Case study: Quantizing IBM Granite with 8‑bit using BitsAndBytes
IBM Granite is a 2‑billion‑parameter model designed for instruction‑following tasks. Given its size, quantizing IBM Granite to 8‑bit reduces its memory footprint significantly while maintaining good performance.
IBM Granite quantization: Example code
The following is the code segment for configuring IBM Granite with 8‑bit quantization:
# Set up the IBM Granite model using 8-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "ibm-granite/granite-3.1-2b-instruct"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="balanced",  # Adjust as needed based on available GPU memory.
    torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Code breakdown
Model selection:
The model_name variable specifies the IBM Granite model to be used for instruction-following tasks.
Quantization setup:
BitsAndBytesConfig(load_in_8bit=True) activates 8‑bit quantization. This flag tells the model loader to convert the 32‑bit floating‑point weights to 8‑bit integers as the model is loaded.
Model loading:
AutoModelForCausalLM.from_pretrained() loads the model using the specified configuration. The parameter device_map="balanced" helps distribute the model across available GPUs, and torch_dtype=torch.float16 ensures that any remaining computation uses half‑precision.
Tokenizer initialization:
The tokenizer is instantiated with AutoTokenizer.from_pretrained(model_name), which guarantees that input text undergoes the correct preprocessing for the quantized model. This method not only lowers the memory usage of the model by as much as 75%, it also increases inference speed, making it particularly suitable for deployment in memory-limited settings, such as edge devices.
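To sanity-check the quantized setup, a short generation call might look like the following; the prompt and generation settings here are illustrative rather than taken from any official example:

# Run a short generation with the quantized model (illustrative prompt/settings).
prompt = "Explain 8-bit quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))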
Barriers and best practices
Even though 8-bit quantization is highly advantageous, it also has some challenges:
Challenges
Accuracy degradation
Some models can suffer from a loss of accuracy after quantization due to quantization noise.
Calibration difficulty
Determining appropriate calibration data and techniques is important and can be difficult, especially for models with a broad dynamic range.
Hardware constraints
Ensure that your target deployment platform fully supports 8‑bit operations; otherwise, performance will be disappointing.
Best practices
Full calibration
Use a representative data set to accurately calibrate the model’s weights and activations.
Layer-by-layer analysis
Determine which layers are sensitive to quantization and evaluate whether to keep them at a higher precision.
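With BitsAndBytes, one way to act on such an analysis is the llm_int8_skip_modules option of BitsAndBytesConfig, which keeps the named modules out of the 8‑bit conversion. The module name below is purely illustrative; pick real candidates by inspecting your own model:

from transformers import BitsAndBytesConfig

# Keep quantization-sensitive modules (here, a hypothetical "lm_head") in
# higher precision while the rest of the model is loaded in 8-bit.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],
)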
Progressive evaluation
Quantization is not a one-shot fix. Iterate on your strategy, experimenting with different calibration techniques and potentially combining PTQ with QAT.
Use framework tools
Utilize the high-level quantization utilities integrated into frameworks such as PyTorch and TensorFlow, as they are continually improved and updated.
Fine‑tuning
If possible, optimize the quantized model on a subset of your data to recover any performance loss due to quantization.
Conclusion
Quantization, and 8‑bit quantization in particular, is a powerful technique for reducing the memory footprint and accelerating the inference of large models. By converting 32‑bit floating‑point values to 8‑bit integers, you can achieve significant memory savings and speedups with minimal accuracy loss.
In this article, we discussed the theoretical foundations of quantization and walked through the steps involved in preprocessing, calibration, and choosing between quantization-aware training and post-training quantization.
We then gave practical examples using popular frameworks, finishing with a case study: quantizing the IBM Granite model using BitsAndBytes.
As deep learning models continue to grow in size, mastering techniques like 8‑bit quantization will be essential for deploying efficient state‑of‑the‑art systems, from the data center down to edge devices.
Whether you are an AI researcher or a deployment engineer, understanding how to optimize large models is an essential skill in today’s AI landscape.
Applying 8-bit quantization with tools such as BitsAndBytes reduces the computational and memory overhead of large models like IBM Granite, enabling more scalable, efficient, and energy-friendly deployment across diverse applications and hardware platforms.
Happy quantizing, and may every bit and byte help your models become leaner, faster, and more efficient!