GGML is designed to be used in conjunction with llama.cpp, a library also created by Georgi Gerganov and written in C/C++ for efficient inference of Llama models. llama.cpp can load GGML models and run them on a CPU. Note that the GGML file format has since been superseded by the GGUF file format.
What is GGML?
GGML is a C library for machine learning created by Georgi Gerganov. It provides the foundational building blocks for machine learning, such as tensors, along with a binary format for distributing large language models (LLMs), and is designed for fast and flexible tensor operations and machine learning tasks.
How can I get started with GGML?
To get started with GGML, you'll need to install the ggml-python library, which is a Python interface for the GGML tensor library developed by Georgi Gerganov. GGML is designed for machine learning and is written in C/C++, making it fast, portable, and easily embeddable. It supports various hardware acceleration systems like BLAS, CUDA, OpenCL, and Metal, and it also supports quantized inference for reduced memory footprint and faster inference.
Here are the steps to get started:
- Installation Requirements: You'll need Python 3.7+ and a C compiler (gcc, clang, MSVC, etc.).
- Installation: You can install ggml-python using pip. This will compile ggml using cmake, which requires a C compiler installed on your system.
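For example, assuming pip points at a Python 3.7+ environment (the package is published on PyPI as ggml-python):
pip install ggml-python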
- Basic Example: Here's a simple example of using ggml-python's low-level API to compute the value of a function (here, f(x) = a*x^2 + b with scalar tensors):
import ggml
import ctypes
# Allocate a new context with 16 MB of memory
params = ggml.ggml_init_params(mem_size=16 * 1024 * 1024, mem_buffer=None)
ctx = ggml.ggml_init(params=params)
# Instantiate tensors for f(x) = a*x^2 + b
x = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
a = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
b = ggml.ggml_new_tensor_1d(ctx, ggml.GGML_TYPE_F32, 1)
# Build the computation graph for f
f = ggml.ggml_add(ctx, ggml.ggml_mul(ctx, a, ggml.ggml_mul(ctx, x, x)), b)
gf = ggml.ggml_new_graph(ctx)
ggml.ggml_build_forward_expand(gf, f)
# Set the input values and compute: f(2) = 3 * 2^2 + 4 = 16
ggml.ggml_set_f32(x, 2.0)
ggml.ggml_set_f32(a, 3.0)
ggml.ggml_set_f32(b, 4.0)
ggml.ggml_graph_compute_with_ctx(ctx, gf, 1)
print(ggml.ggml_get_f32_1d(f, 0))  # 16.0
# Free the context
ggml.ggml_free(ctx)
- Next Steps: To learn more about ggml-python, check out the API Reference, Examples, and Code Completion Server. You can also check out the CLIP Embeddings example, which shows how to implement CLIP text/image embeddings using ggml-python.
- Development: If you want to contribute to the development of ggml-python, you can clone the repository from GitHub.
Remember, GGML is designed primarily for CPU inference; if you want to run a model entirely on the GPU, a framework such as PyTorch is generally a better fit. You can also use llama.cpp built with a BLAS backend to offload layers to the GPU. If you're interested in training your own GGML model from scratch, you can refer to this tutorial.
What are llama.cpp and whisper.cpp?
llama.cpp and whisper.cpp are C++ implementations of two different AI models.
llama.cpp is a C++ library for running large language models (LLMs). It provides Python bindings, allowing Python programs to access the C++ API. The library is known for its fast inference and for being easy to embed as a library or use via the example binaries. It supports inference for many LLMs, which can be accessed on Hugging Face. The library is optimized for CPU usage but also supports GPU acceleration using various BLAS backends, and it includes specific optimizations for macOS on Apple Silicon. The simplicity of its installation and usage, along with its ability to run large models with 1B+ parameters, has made it popular among developers.
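As a quick illustration of those Python bindings, here is a minimal sketch using the llama-cpp-python package; the model path is hypothetical, so point it at any GGUF model you have downloaded:
from llama_cpp import Llama
# Load a local GGUF model (hypothetical path); keep the context small for a quick test
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=512)
# Run a single completion request and print the generated text
output = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])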
whisper.cpp, on the other hand, is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model. Whisper is an ASR system trained on a large amount of multilingual and multitask supervised data. The whisper.cpp implementation is lightweight and has no third-party dependencies, making it easy to use in a variety of projects. It has been used for tasks such as audio transcription in iOS apps, and the project is known for its portability, with examples of running the model on an iPhone, a Raspberry Pi 4, and even in a web page via WebAssembly.
Both llama.cpp and whisper.cpp have been praised for their simplicity and ease of use, lowering the barrier for developers to work with large language models and speech recognition models respectively.
How can I get started with llama.cpp and whisper.cpp?
To get started with llama.cpp and whisper.cpp, you'll need to set up your development environment, clone the respective repositories, and acquire the necessary models. Here's a step-by-step guide for each:
Getting Started with llama.cpp
- Install the required packages: If you're using macOS, you can use Homebrew to install the necessary packages. Open Terminal and run the following commands:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake python@3.10 git wget
- Clone the llama.cpp repository: In Terminal, run the following commands:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
- Download the Llama 2 model: Meta has released the model to the public. You can request access and download Llama 2 through a site provided by Meta. Once approved, you'll receive a download link. Place the downloaded model in the models folder inside the llama.cpp folder.
- Install Python dependencies: Ensure you're running Python 3.10. In the llama.cpp folder in Terminal, create a virtual environment and install the necessary Python packages:
python3 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio sentencepiece numpy
- Run the model: You can use a convenient script to run the 7B model in a Python environment; a sketch of the typical convert, quantize, and run commands follows below.
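The exact script names have changed across llama.cpp releases, so treat the following as a hedged sketch of the usual flow rather than the definitive commands: convert the downloaded weights, quantize them to 4 bits, and run the main example binary.
python3 convert.py models/7B/
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
./main -m ./models/7B/ggml-model-q4_0.gguf -n 128 -p "The first man on the moon was"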
How do I use a community Llama 2 variant?
The leading Llama 2 community variant models are:
- MythoMax-L2-13B: This is a popular community variant of the Llama 2 13B model, known for being smart and for its strong storytelling capabilities.
- Nous-Hermes-Llama2: This is another community variant of the Llama 2 13B model. It is also recognized for its impressive storytelling abilities.
- Zephyr 7B: This model is not based on Llama 2; it uses the Mistral 7B model as its foundation instead. The model is optimized for capability rather than for safe output generation.
These community variants have been developed by researchers and developers who have taken advantage of the openly available weights of the Llama 2 models. They have fine-tuned the base models to create variants that are better suited to specific tasks or use cases. Llama 2's permissive community license has been a significant factor in enabling this kind of innovation and customization.
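In practice, running one of these community variants with llama.cpp usually means downloading a pre-quantized GGUF build from Hugging Face and pointing the main binary at it. The repository and file names below are illustrative; check the model's Hugging Face page for the exact file to download.
wget https://huggingface.co/TheBloke/MythoMax-L2-13B-GGUF/resolve/main/mythomax-l2-13b.Q4_K_M.gguf -P models/
./main -m models/mythomax-l2-13b.Q4_K_M.gguf -n 256 -p "Write the opening scene of a mystery story."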
Getting Started with whisper.cpp
- Clone the whisper.cpp repository: This will provide you with the source code.
- Acquire the Whisper model: You'll need to fetch a Whisper model converted to whisper.cpp's ggml format. The specifics of this step may vary depending on the source of the model.
- Install and deploy whisper.cpp: Detailed instructions for this step can vary depending on your specific use case and deployment environment; a sketch of the typical build-and-run commands follows this list. You may also find additional resources on platforms like GitHub, Reddit, or specific tutorials online.
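As a rough sketch of the common path (script and binary names may differ between whisper.cpp releases), the repository's own download script fetches a converted ggml model, and the bundled sample audio can be transcribed right after building:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
./main -m models/ggml-base.en.bin -f samples/jfk.wav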
Remember, both llama.cpp and whisper.cpp are written in C/C++, so you'll need a C/C++ compiler installed on your system. If you're using macOS, you can install the necessary compilers via Homebrew. If you encounter any issues, consider checking the official documentation or relevant discussion forums for troubleshooting tips.
How does GGML handle Large Language Models (LLMs)?
GGML is used by llama.cpp and whisper.cpp, and is written in C. It supports 16-bit float and integer quantization (e.g., 4-bit, 5-bit, 8-bit), automatic differentiation, built-in optimization algorithms (e.g., ADAM, L-BFGS), and is optimized for Apple Silicon.
When it comes to handling large language models (LLMs), GGML is designed to balance between GPU and CPU usage. This is particularly useful when the model size exceeds the available VRAM, as GGML can outperform other methods by using system RAM instead of VRAM. For instance, a GGML 33B model can still be faster than a GPTQ 13B model with multiple layers being swapped to the system RAM.
However, it's important to note that GGML models can be slower than other models when run on a GPU, as they are designed to also utilize the CPU. This can lead to slower token generation and longer start times for generating tokens. The performance of GGML models can be improved by correctly assigning the number of threads to the GGML loader, which should be equal to or smaller than the actual CPU core count.
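For example, with llama.cpp (one of the main consumers of GGML), both the thread count and the number of layers offloaded to the GPU are explicit flags; the model file name below is illustrative and the values should be tuned to your hardware:
# -t sets CPU threads (keep it at or below your physical core count); -ngl sets how many layers are offloaded to the GPU
./main -m models/llama-2-13b.Q4_K_M.gguf -t 8 -ngl 32 -n 128 -p "Hello"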
GGML is also optimized for different architectures, including x86 architectures utilizing AVX / AVX2 intrinsics, and offers web support via WebAssembly and WASM SIMD. It has no third-party dependencies and zero memory allocations during runtime, which can contribute to its performance.
In terms of specific performance stats on Apple Silicon, a 7B LLaMA model with 4-bit quantization and 3.5 GB size can generate tokens at a rate of 43 ms per token on an M1 Pro with 8 CPU threads, while a 13B LLaMA model with 4-bit quantization and 6.8 GB size can generate tokens at a rate of 73 ms per token on the same hardware.
In short, GGML balances work between the GPU and CPU and adapts to the available hardware; however, it's important to configure the GGML loader correctly to ensure optimal performance.
How does GGML work with other libraries?
GGML is a tensor library for machine learning developed by Georgi Gerganov. It is written in C/C++ and is designed to be fast, portable, and easily embeddable, making use of various hardware acceleration systems like BLAS, CUDA, OpenCL, and Metal. It supports quantized inference for reduced memory footprint and faster inference. GGML operates seamlessly across various platforms, including Mac, Windows, Linux, iOS, Android, web browsers, and even Raspberry Pi.
Compared to other machine learning libraries, GGML has several unique features and advantages. It doesn't require a specific format for the model file, meaning you can convert a model from any other framework (such as TensorFlow or PyTorch) into a binary file in whatever layout is easiest for you to handle later. It also supports a wide range of models, such as Whisper and LLaMA.
However, GGML also has some limitations. It is still in the development phase and currently lacks comprehensive documentation, which can make it hard for new users to start using it quickly. Reusing the source code across different models can be difficult due to the unique structure of each model.
In terms of performance, GGML has been found to have a steady inference speed, with one user reporting about 82 tokens per second on an RTX 4090 with 24 GB of VRAM. However, performance can vary depending on the specific model and the hardware used. For instance, when comparing GGML with GPTQ: if you can fit the entire model plus context in VRAM, GPTQ is going to be significantly faster; if not, GGML ranges from faster to significantly faster, depending on how many layers you have to offload.
GGML is a versatile and portable machine learning library that offers several advantages over other libraries, particularly in terms of its flexibility with model file formats and its support for a wide range of models. However, its performance can vary depending on the specific model and hardware used, and it currently lacks comprehensive documentation, which may pose challenges for new users.
What is the GGUF file format?
The GGUF file format is a new extensible binary format for AI models, specifically designed for models developed in frameworks like PyTorch. It was introduced in August 2023 and is focused on fast loading, flexibility, and single-file convenience. The GGUF file format is used with the LLaMA and Llama-2 AI models and runs on llama.cpp.
The GGUF file format improves on previous formats like GGML and GGJT by offering better tokenization, support for special tokens, metadata, and extensibility. It uses a key-value structure for things like hyperparameters instead of just a list of values, making it more flexible and extensible for future changes. New information can be added without breaking compatibility with existing GGUF models.
The GGUF file format is designed to be easy to use, requiring just a small amount of code to load models. There's no need for external libraries as the file contains the full model information. It is a successor to the GGML, GGMF, and GGJT file formats and is designed to be unambiguous by containing all the information needed to load a model. It is also designed to be extensible, so that new features can be added to GGML without breaking compatibility with older models.
In terms of usage, GGUF is meant for models that you want to use for inference with llama.cpp or related systems. The GGUF file contains all information needed to load and run the model.
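To make the layout concrete, here is a minimal sketch in Python (standard library only) that reads the fixed-size GGUF header; the magic bytes, version, tensor count, and key-value count described above sit at the very start of the file. This follows the published GGUF layout for version 2 and later, but for real work the gguf Python package or llama.cpp's own loaders are the safer choice.
import struct

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor count, and metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                                # b"GGUF" for a valid file
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))        # format version (uint32, little-endian)
        n_tensors, = struct.unpack("<Q", f.read(8))      # number of tensors (uint64)
        n_kv, = struct.unpack("<Q", f.read(8))           # number of metadata key-value pairs (uint64)
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}

print(read_gguf_header("models/llama-2-7b.Q4_K_M.gguf"))  # hypothetical path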
What are the advantages associated with using GGML?
GGML is a tensor library for machine learning that offers several advantages for large-scale model training and high-performance computing on commodity hardware. Here are some of the key benefits:
- Cross-Platform Compatibility: GGML is written in C, which allows it to operate seamlessly across various platforms, including Mac, Windows, Linux, iOS, Android, web browsers, and even Raspberry Pi.
- Edge Computing: GGML performs no memory allocations during runtime. It supports half-precision floats and integer quantization, which gives developers better control over memory usage and performance. This is crucial for making machine learning models work well on edge devices, where efficient resource use is important.
- Model Format Flexibility: Unlike other machine learning inference frameworks, GGML doesn't require a specific format for the model file. This means you can convert a model from any other framework (such as TensorFlow or PyTorch) into a binary file in whatever layout is easiest for you to handle later.
- Large-Scale Model Training: GGML is designed to cater to the needs of machine learning experts, providing a comprehensive range of features and optimizations for large-scale model training and high-performance computing on any commodity hardware.
- Quantization Support: GGML supports a number of different quantization strategies (e.g., 4-bit, 5-bit, and 8-bit quantization), each of which offers a different trade-off between efficiency and output quality; see the sketch after this list.
- CPU Inferencing: GGML allows for CPU inferencing, which is beneficial when GPU resources are limited and makes it possible to run large language models effectively on consumer hardware.
- Better Response Quality: Larger GGML models can provide better response quality than smaller models. For instance, a GGML 33B model can produce better output than a 7B model, even if the 7B model has faster inference.
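To make the quantization trade-off concrete, here is a hedged sketch using the quantize tool that ships with llama.cpp; the file names are illustrative, and running the tool without arguments lists the supported quantization types.
./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q4_0.gguf q4_0   # smallest and fastest, lowest fidelity of the three
./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q5_1.gguf q5_1   # middle ground
./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-q8_0.gguf q8_0   # largest, closest to the f16 original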
However, it's important to note that GGML is still in the development phase and currently lacks comprehensive documentation, which can make it hard for new users to start using it quickly. Also, reusing the source code across different models can be difficult due to the unique structure of each model. Despite these challenges, GGML's benefits make it a promising tool for machine learning applications.
What are the disadvantages of using GGML?
GGML is a tensor library for machine learning that is known for its efficient operation on CPUs and its ability to handle large models on commodity hardware. It is written in C and supports automatic differentiation, making it suitable for model training and inference in cross-platform applications. However, despite its advantages, GGML has several limitations compared to other machine learning libraries.
- Documentation: GGML is still in the development phase and currently lacks comprehensive documentation. This can make it challenging for new users to start using it quickly.
- Code Reusability: Reusing source code across different models can be difficult due to the unique structure of each model. GGML does not provide a universal guide for this, so users often need to write their own inference code, particularly when working with custom models developed in-house. This requires a solid understanding of matrix operations and of the structure of ML models.
- Quantization: GGML files contain a quantized representation of model weights, which results in lower output quality. The benefit is lower RAM requirements and faster inference on the CPU, but storing weights in fewer bits means less precision.
- GPU Support: While GGML is efficient on CPUs, its performance on GPUs is not yet fully explored. Some discussions suggest that GGML-based inference code might not outperform established libraries like PyTorch or Triton in terms of GPU performance. However, there are ongoing efforts to add GPU support to GGML, and it's expected that this will enhance its performance.
- Model Format: GGML does not require a specific format for the model file, which means you can convert a model from any other framework (such as TensorFlow or PyTorch) into a binary file in whatever layout is easiest for you to handle later. While this offers flexibility, it might also introduce additional steps in the model deployment process, especially for users accustomed to standard formats like ONNX or TensorFlow's SavedModel.
While GGML has several advantages, such as its efficient operation on CPUs and its flexibility in handling different model formats, it also has limitations, including a lack of comprehensive documentation, challenges in code reusability, and limited GPU support. These factors should be considered when choosing a machine learning library for a specific use case.