Models
An “AI” lives in its model: a large data file that a backend or interface loads in order to perform inference. Thousands upon thousands of models exist, each with its own unique behavior.
For a number of reasons, an exhaustive list of models cannot be provided. Instead, this page aims to give a basic orientation to model types and how to select one.
Hardware Requirements
The most important question in model selection is “Can your hardware run it?” The rough rule of thumb is that, at 4-bit quantization (Q4*), a model needs about 1GB of RAM per billion parameters. It is strongly recommended to perform inference on the GPU (using its VRAM), as CPU inference from system RAM is far slower.
On mobile devices or other unified memory architectures, a better estimate is 1.5GB of RAM per billion parameters.
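The two rules of thumb above can be sketched as a small helper. The function name and interface are illustrative, not from any particular library; the constants (1GB and 1.5GB per billion parameters) are the estimates stated above.

```python
def estimate_ram_gb(params_billions, unified_memory=False):
    """Rough RAM estimate for running a Q4-quantized model.

    Rule of thumb: ~1 GB per billion parameters on a discrete GPU,
    ~1.5 GB per billion on unified-memory systems (e.g. mobile).
    """
    per_billion = 1.5 if unified_memory else 1.0
    return params_billions * per_billion

# A 12B model at Q4 needs roughly:
print(estimate_ram_gb(12))        # discrete GPU: 12.0 GB
print(estimate_ram_gb(12, True))  # unified memory: 18.0 GB
```

These are ballpark figures only; context length and the backend's own overhead add to the total.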
Quantization
Roughly explained, a model is a collection of numbers (weights). When first created, these are stored as 16- or 32-bit floating point numbers. They can be reduced to fewer bits (commonly 2–8), which shrinks the memory footprint and speeds up inference at the cost of some quality.
Fortunately, these trade-offs are non-linear. 4-bit quantization is considered the “sweet spot” with minimal losses in model quality exchanged for large reductions in size and memory requirements. Currently, Q4_K_M is frequently recommended as an ideal quantization format.
Oft-Recommended Models
The most popular recommendation is Mistral Nemo 12B Instruct, with a 4-bit quantization available here.
