GGML / ML Tensor Library
by Stephen M. Walker II, Co-Founder / CEO
What is GGML?
GGML is a C library for machine learning, particularly focused on enabling large models and high-performance computations on commodity hardware.
It was created by Georgi Gerganov and is designed to perform fast and flexible tensor operations, which are fundamental in machine learning tasks.
GGML supports various quantization formats, including 16-bit float and integer quantization (4-bit, 5-bit, 8-bit, etc.), which can significantly reduce the memory footprint and computational cost of models.
Optimized for Apple M Chips
The library is optimized for Apple Silicon (M1 and M2) as well as for x86 architectures, where it uses AVX/AVX2 intrinsics to accelerate computations. GGML is noted for having no third-party dependencies and for allocating no memory during runtime, which simplifies integration and deployment.
GGML files contain a quantized representation of model weights, which allows for faster inference on CPUs due to lower RAM and bandwidth requirements. Quantizing to 4 bits means the model weights take up roughly one-quarter of the space they would occupy as 16-bit floats, at the cost of reduced precision.
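As a rough back-of-the-envelope illustration of those savings (assuming a 7B-parameter model and ignoring the small per-block scale metadata that GGML's quantization formats add), the arithmetic looks like this:
# Rough weight-storage estimate for a 7B-parameter model at different precisions.
# Real GGML quantization formats add a little per-block metadata (scales),
# so actual files are slightly larger than these figures.
n_params = 7_000_000_000
for name, bits in [("fp16", 16), ("8-bit", 8), ("5-bit", 5), ("4-bit", 4)]:
    size_gb = n_params * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.1f} GB")
# fp16: ~14.0 GB, 8-bit: ~7.0 GB, 5-bit: ~4.4 GB, 4-bit: ~3.5 GB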
Iterating Quickly
The legacy GGML file format packages all model data, including vocabulary, architecture, and weights, in a binary layout following a defined versioned structure, which facilitates the deployment of trained models. This format has been superseded by the GGUF file format.
GGML.ai, the company behind the library, emphasizes a minimal and open-core approach, with the codebase available under the MIT license. The company is actively seeking to hire developers to further the development of on-device inference technologies.
Llama.cpp
GGML, in collaboration with llama.cpp, streamlines the inference of Llama models on CPUs. Developed by Georgi Gerganov, llama.cpp is a C/C++ library that efficiently processes GGML-formatted models, facilitating the execution of large language models such as LLaMa, Vicuna, or Wizard on personal computers without requiring a GPU. While the GGML file format has been replaced by the more advanced GGUF format, the integration of GGML with llama.cpp remains a robust solution for CPU-based model inference.
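As a minimal sketch of what this looks like in practice, the snippet below uses the llama-cpp-python bindings, a separate Python package that wraps llama.cpp; the model path is a placeholder for whatever quantized GGUF file you have downloaded locally:
from llama_cpp import Llama  # pip install llama-cpp-python
# Load a quantized model from disk; the path is a placeholder.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_0.gguf")
# Run CPU inference and print the completion.
output = llm("Q: What is GGML? A:", max_tokens=64)
print(output["choices"][0]["text"])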
How can I get started with GGML?
To get started with GGML, you'll need to install the ggml-python library, which is a Python interface for the GGML tensor library developed by Georgi Gerganov. GGML is designed for machine learning and is written in C/C++, making it fast, portable, and easily embeddable. It supports various hardware acceleration systems like BLAS, CUDA, OpenCL, and Metal, and it also supports quantized inference for reduced memory footprint and faster inference.
Here are the steps to get started:
- Installation Requirements — You'll need Python 3.7+ and a C compiler (gcc, clang, MSVC, etc.).
- Installation — You can install ggml-python using pip. This will compile ggml using cmake, which requires a C compiler installed on your system.
- Basic Example — Here's a simple example of using ggml-python's low-level API to compute the value of a function, in this case f(x) = a * x (the exact graph-building calls can vary slightly between ggml-python versions):
import ggml
import ctypes
# Allocate a new context with 16 MB of memory
params = ggml.ggml_init_params(mem_size=16 * 1024 * 1024, mem_buffer=None)
ctx = ggml.ggml_init(params=params)
# Instantiate input tensors
x = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
a = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
# Build the computation graph for f = a * x
f = ggml.ggml_mul(ctx, a, x)
gf = ggml.ggml_new_graph(ctx)
ggml.ggml_build_forward_expand(gf, f)
# Set the inputs and compute the graph on a single thread
ggml.ggml_set_f32(x, 2.0)
ggml.ggml_set_f32(a, 3.0)
ggml.ggml_graph_compute_with_ctx(ctx, gf, 1)
# Read back the result
print(ggml.ggml_get_f32_1d(f, 0))  # 6.0
# Free the context
ggml.ggml_free(ctx)
- Next Steps — To learn more about ggml-python, check out the API Reference, Examples, and Code Completion Server. You can also check out the CLIP Embeddings example, which shows how to implement CLIP text/image embeddings using ggml-python.
- Development — If you want to contribute to the development of ggml-python, you can clone the repository from GitHub.
Remember, GGML is designed primarily for CPU use; if you want to use a GPU, you'll need to run the model with PyTorch instead, or you can use llama.cpp with a BLAS backend to offload layers to the GPU (see the sketch below). If you're interested in training your own GGML model from scratch, you can refer to this tutorial.
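For the GPU-offload path, a hedged sketch using the llama-cpp-python bindings might look like the following; the n_gpu_layers value and model path are illustrative, and the right number of layers depends on your VRAM and on llama.cpp being built with a GPU-capable backend such as cuBLAS or Metal:
from llama_cpp import Llama
# Offload the first 32 transformer layers to the GPU; the rest stay on the CPU.
# Requires llama.cpp compiled with a GPU backend (e.g., cuBLAS or Metal).
llm = Llama(
    model_path="./models/llama-2-13b.Q4_0.gguf",  # placeholder path
    n_gpu_layers=32,
)
print(llm("Explain quantization in one sentence.", max_tokens=48)["choices"][0]["text"])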
What are llama.cpp and whisper.cpp?
llama.cpp and whisper.cpp are C++ implementations of two different AI models.
llama.cpp is a C++ library for running large language models (LLMs). It provides Python bindings, allowing Python programs to access the C++ API. The library is known for its quick inference and for being easy to embed as a library or to use via the example binaries. It supports inference for many LLMs, which can be accessed on Hugging Face. The library is optimized for CPU usage, but it also supports GPU acceleration using various BLAS backends, and it even has specific optimizations for macOS with Apple Silicon. The simplicity of its installation and usage, along with its ability to run large models with 1B+ parameters, has made it popular among developers.
whisper.cpp, on the other hand, is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model. Whisper is an ASR system trained on a large amount of multilingual and multitask supervised data. The whisper.cpp implementation is lightweight and has no dependencies, making it easy to use in various projects. It has been used for tasks such as audio transcription in iOS apps. The project is also known for its portability, with examples of running the model on devices like an iPhone, a Raspberry Pi 4, and even in a web page via WebAssembly.
Both llama.cpp and whisper.cpp have been praised for their simplicity and ease of use, lowering the barrier for developers to work with large language models and speech recognition models, respectively.
How can I get started with llama.cpp and whisper.cpp?
To get started with llama.cpp and whisper.cpp, you'll need to set up your development environment, clone the respective repositories, and acquire the necessary models. Here's a step-by-step guide for each:
Getting Started with llama.cpp
- Install the required packages — If you're using macOS, you can use Homebrew to install the necessary packages. Open Terminal and run the following commands:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake python@3.10 git wget
- Clone the llama.cpp repository — In Terminal, run the following commands:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
- Download the LLaMA 2 model — Meta has released the model to the public. You can request access and download Llama 2 through a site provided by Meta. Once approved, you'll receive a download link. Place the downloaded model in the models folder inside the llama.cpp folder.
- Install Python dependencies — Ensure you're running Python 3.10. In the llama.cpp folder in Terminal, create a virtual environment and install the necessary Python packages:
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio sentencepiece numpy
- Run the model — You can use a convenient script to run the 7B model in a Python environment.
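One way to do that, as a rough sketch, is to wrap the compiled llama.cpp binary with Python's subprocess module; the binary name and flags reflect llama.cpp at the time of writing, and the model filename is a placeholder for your converted, quantized 7B weights:
import subprocess
# Invoke the llama.cpp example binary built by `make`.
# The model filename is a placeholder for the converted/quantized 7B weights.
result = subprocess.run(
    ["./main", "-m", "./models/7B/ggml-model-q4_0.gguf",
     "-p", "Building a website can be done in 10 simple steps:", "-n", "128"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)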
How do I use a community Llama 2 variant?
The leading Llama 2 community variant models are:
- MythoMax-L2-13B — This is a popular community variant of the Llama 2 13B model. It is known for its intelligence and strong storytelling capabilities.
- Nous-Hermes-Llama2 — This is another community variant of the Llama 2 13B model. It is also recognized for its impressive storytelling abilities.
- Zephyr 7B — This model is not based on Llama 2; it instead uses the Mistral 7B model as its foundation. It is optimized for capability rather than for safe output generation.
These community variants have been developed by researchers and developers who have taken advantage of the open-source nature of the Llama 2 models. They have fine-tuned the base models to create variants that are better suited to specific tasks or use cases. The open-source license of Llama 2 has been a significant factor in enabling this kind of innovation and customization.
Getting Started with Whisper.cpp
- Clone the Whisper.cpp repository — This will provide you with the source code.
- Acquire the Whisper Model — You'll need to fetch a Whisper model. The specifics of this step may vary depending on the source of the model.
- Install and deploy Whisper.cpp — Detailed instructions for this step can vary depending on your specific use case and deployment environment. You may find additional resources on platforms like GitHub, Reddit, or specific tutorials online; a minimal invocation sketch follows this list.
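As a rough sketch, assuming you have built whisper.cpp with make, downloaded a model such as ggml-base.en.bin with the project's download script, and have a 16 kHz WAV file to transcribe, a run can be scripted from Python like this (paths are placeholders based on the whisper.cpp samples):
import subprocess
# Transcribe a WAV file with the whisper.cpp example binary.
# The model and audio paths mirror the whisper.cpp samples and are placeholders.
subprocess.run(
    ["./main", "-m", "models/ggml-base.en.bin", "-f", "samples/jfk.wav"],
    check=True,
)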
Remember, both llama.cpp and whisper.cpp are written in C/C++, so you'll need a C/C++ compiler installed on your system. If you're using macOS, you can install the necessary compilers via Homebrew. If you encounter any issues, consider checking the official documentation or relevant discussion forums for troubleshooting tips.
How does GGML handle Large Language Models (LLMs)?
GGML, the underlying technology for llama.cpp and whisper.cpp, is a C-based tensor library for machine learning. It's designed to handle Large Language Models (LLMs) efficiently by balancing GPU and CPU usage. This design is particularly beneficial when the model size exceeds the available VRAM, as GGML can leverage system RAM to outperform other methods. For example, a GGML 33B model can outperform a GPTQ 13B model, even when multiple layers are swapped to the system RAM.
However, due to its design to utilize both GPU and CPU, GGML models may have slower token generation and longer start times compared to models that primarily use the GPU. To optimize performance, it's crucial to correctly configure the GGML loader by assigning a number of threads equal to or smaller than the actual CPU core count.
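As a rough illustration of that thread configuration using the llama-cpp-python bindings (the n_threads parameter name comes from those bindings, other GGML-based loaders expose an equivalent setting, and the model path is a placeholder):
import os
from llama_cpp import Llama
# Use the physical core count as an upper bound for GGML worker threads.
# os.cpu_count() reports logical cores, so halving it is a common heuristic.
n_threads = max(1, (os.cpu_count() or 2) // 2)
llm = Llama(
    model_path="./models/llama-2-7b.Q4_0.gguf",  # placeholder path
    n_threads=n_threads,
)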
GGML supports 16-bit float and integer quantization, including 4-bit, 5-bit, and 8-bit, and features built-in optimization algorithms like ADAM and L-BFGS. It's optimized for various architectures, including x86 with AVX / AVX2 intrinsics, and offers web support via WebAssembly and WASM SIMD. GGML operates without third-party dependencies and avoids memory allocations during runtime, contributing to its performance.
On Apple Silicon, for instance, a 7B LLaMA model with 4-bit quantization and a 3.5 GB footprint generates tokens at about 43 ms per token (roughly 23 tokens per second) on an M1 Pro with 8 CPU threads. Similarly, a 13B LLaMA model with 4-bit quantization and a 6.8 GB footprint runs at about 73 ms per token (roughly 14 tokens per second) on the same hardware.
How does GGML work with other libraries?
GGML is a tensor library for machine learning developed by Georgi Gerganov. It is written in C/C++ and is designed to be fast, portable, and easily embeddable, making use of various hardware acceleration systems like BLAS, CUDA, OpenCL, and Metal. It supports quantized inference for reduced memory footprint and faster inference. GGML operates seamlessly across various platforms, including Mac, Windows, Linux, iOS, Android, web browsers, and even Raspberry Pi.
Compared to other machine learning libraries, GGML has several unique features and advantages. It doesn't require a specific format for the model file, meaning you can convert the model file from any other framework (like TensorFlow, PyTorch, etc.) into a binary file in any format that's easy for you to handle later. It also supports a wide range of models like Whisper and LLaMa.
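To make the "any format that's easy for you to handle" point concrete, here is a small, self-contained sketch (not GGML's own file layout, just an illustration) that dumps a dictionary of NumPy weight tensors into a simple custom binary file:
import struct
import numpy as np
# Toy weights, standing in for tensors exported from PyTorch or TensorFlow.
weights = {
    "embedding": np.random.rand(4, 8).astype(np.float32),
    "output": np.random.rand(8, 2).astype(np.float32),
}
with open("toy-model.bin", "wb") as f:
    f.write(struct.pack("<I", len(weights)))                     # number of tensors
    for name, tensor in weights.items():
        encoded = name.encode("utf-8")
        f.write(struct.pack("<I", len(encoded)))                 # name length
        f.write(encoded)                                         # tensor name
        f.write(struct.pack("<I", tensor.ndim))                  # number of dims
        f.write(struct.pack(f"<{tensor.ndim}I", *tensor.shape))  # shape
        f.write(tensor.tobytes())                                # raw float32 data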
However, GGML also has some limitations. It is still in the development phase and currently lacks comprehensive documentation, which can make it hard for new users to start using it quickly. Reusing the source code across different models can be difficult due to the unique structure of each model.
In terms of performance, GGML has been found to deliver steady inference speeds, with one user reporting about 82 tokens per second on an RTX 4090 with 24 GB of VRAM. However, performance can vary depending on the specific model and the hardware used. For instance, when comparing GGML with GPTQ: if you can fit the entire model plus context in VRAM, then GPTQ is going to be significantly faster. If not, GGML ranges from faster to significantly faster depending on how many layers you have to offload.
GGML is a versatile and portable machine learning library that offers several advantages over other libraries, particularly in terms of its flexibility with model file formats and its support for a wide range of models. However, its performance can vary depending on the specific model and hardware used, and it currently lacks comprehensive documentation, which may pose challenges for new users.
What is the GGUF file format?
The GGUF file format is a new extensible binary format for AI models, specifically designed for models developed in frameworks like PyTorch. It was introduced in August 2023 and is focused on fast loading, flexibility, and single-file convenience. The GGUF file format is used with the LLaMA and Llama-2 AI models and runs on llama.cpp.
The GGUF file format improves on previous formats like GGML and GGJT by offering better tokenization, support for special tokens, metadata, and extensibility. It uses a key-value structure for things like hyperparameters instead of just a list of values, making it more flexible and extensible for future changes. New information can be added without breaking compatibility with existing GGUF models.
The GGUF file format is designed to be easy to use, requiring just a small amount of code to load models. There's no need for external libraries as the file contains the full model information. It is a successor to the GGML, GGMF, and GGJT file formats and is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new features can be added to GGML without breaking compatibility with older models.
In terms of usage, GGUF is meant for models that you want to use for inference with llama.cpp or related systems. The GGUF file contains all information needed to load and run the model.
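As an illustration of the format's self-describing layout, the short sketch below reads the fixed-size header fields at the start of a GGUF file (magic, version, tensor count, and metadata key-value count, per the published GGUF specification); the file path is a placeholder:
import struct
# Read the fixed-size GGUF header: magic, version, tensor count, metadata count.
# Counts are 64-bit in GGUF v2 and later (they were 32-bit in v1).
with open("model.gguf", "rb") as f:  # placeholder path
    magic = f.read(4)
    assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
    version, = struct.unpack("<I", f.read(4))
    n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata key-value pairs")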
What are the advantages associated with using GGML?
GGML is a tensor library for machine learning that offers several advantages for large-scale model training and high-performance computing on commodity hardware. Here are some of the key benefits:
- Cross-Platform Compatibility — GGML is written in C, which allows it to operate seamlessly across various platforms, including Mac, Windows, Linux, iOS, Android, web browsers, and even Raspberry Pi.
- Edge Computing — GGML is designed so that no memory is allocated during runtime. It supports half-precision (16-bit) float and integer quantization, which gives developers better control over memory usage and performance. This is crucial for making machine learning models work well on edge devices, where efficient resource use is important.
- Model Format Flexibility — Unlike other machine learning inference frameworks, GGML doesn't require a specific format for the model file. This means you can convert the model file from any other framework (like TensorFlow, PyTorch, etc.) into a binary file in any format that's easy for you to handle later.
- Large-Scale Model Training — GGML is designed to cater to the needs of machine learning experts, providing a comprehensive range of features and optimizations for large-scale model training and high-performance computing on commodity hardware.
- Quantization Support — GGML supports a number of different quantization strategies (e.g., 4-bit, 5-bit, and 8-bit quantization), each of which offers a different trade-off between efficiency and output quality; a sketch of applying one of these strategies follows this list.
- CPU Inferencing — GGML allows for CPU inferencing, which is beneficial when GPU resources are limited, and makes it possible to run large language models effectively on consumer hardware.
- Better Response Quality — Within a given hardware budget, larger GGML models can provide better response quality than smaller ones. For instance, a GGML 33B model can produce better output than a 7B model, even if the latter has faster inference.
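As a sketch of how those quantization strategies are applied in practice with llama.cpp's quantize tool (the binary and file names below are illustrative and depend on how you built llama.cpp and converted the model):
import subprocess
# Re-quantize an fp16 GGUF model to 4-bit (q4_0) using llama.cpp's quantize tool.
# Other strategies (q5_0, q8_0, ...) trade a larger file for higher fidelity.
subprocess.run(
    ["./quantize", "models/7B/ggml-model-f16.gguf",
     "models/7B/ggml-model-q4_0.gguf", "q4_0"],
    check=True,
)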
However, it's important to note that GGML is still in the development phase and currently lacks comprehensive documentation, which can make it hard for new users to start using it quickly. Also, reusing the source code across different models can be difficult due to the unique structure of each model. Despite these challenges, GGML's benefits make it a promising tool for machine learning applications.
What are the disadvantages of using GGML?
GGML is a tensor library for machine learning that is known for its efficient operation on CPUs and its ability to handle large models on commodity hardware. It is written in C and supports automatic differentiation, making it suitable for model training and inference in cross-platform applications. However, despite its advantages, GGML has several limitations compared to other machine learning libraries.
- Documentation — GGML is still in the development phase and currently lacks comprehensive documentation. This can make it challenging for new users to start using it quickly.
- Code Reusability — Reusing source code across different models can be difficult due to the unique structure of each model. GGML does not provide a universal guide for this, so users often need to write their own inference code, particularly when working with custom models developed in-house. This process requires a deep understanding of how to work with mathematical matrices and the structure of ML models.
- Quantization — GGML files contain a quantized representation of model weights, which lowers output quality. The benefit is reduced RAM requirements and faster inference on the CPU, but the trade-off is reduced precision, since fewer bits are used per weight.
- GPU Support — While GGML is efficient on CPUs, its performance on GPUs is not yet fully explored. Some discussions suggest that GGML-based inference code might not outperform established libraries like PyTorch or Triton in terms of GPU performance. However, there are ongoing efforts to add GPU support to GGML, and this is expected to enhance its performance.
- Model Format — GGML does not require a specific format for the model file, which means you can convert the model file from any other framework (like TensorFlow, PyTorch, etc.) into a binary file in any format that's easy for you to handle later. While this offers flexibility, it might also introduce additional steps in the model deployment process, especially for users accustomed to standard formats like ONNX or TensorFlow's SavedModel.
While GGML has several advantages, such as its efficient operation on CPUs and its flexibility in handling different model formats, it also has limitations, including a lack of comprehensive documentation, challenges in code reusability, and limited GPU support. These factors should be considered when choosing a machine learning library for a specific use case.