Mistral-7B is a high-performance 7B parameter language model featuring sliding window attention, Grouped Query Attention (GQA), and RoPE position embeddings for efficient long-context processing.
Features
Sliding window attention: Efficient processing of long contexts (4096-token window)
Grouped Query Attention (GQA): Reduced KV cache memory with a 4:1 query-to-key-value head ratio
RoPE position embeddings: Rotary position encoding for better extrapolation
Quantization support: 4-bit and 8-bit quantized models available
SwiGLU activation: Improved MLP activation function
Installation
Add to your Cargo.toml:
[dependencies]
mistral-mlx = { path = "../mistral-mlx" }
mlx-rs = "0.18"
Quick start
Download model
Download the 4-bit quantized model (recommended):
huggingface-cli download mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --local-dir ./models/Mistral-7B-4bit
Or the full-precision model:
huggingface-cli download mlx-community/Mistral-7B-Instruct-v0.3-bf16 \
  --local-dir ./models/Mistral-7B
Run text generation
cargo run --release --example generate_mistral -- \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--prompt "Explain machine learning in simple terms"
Use in your code
use mistral_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::NewAxis;

let mut model = load_model("./models/Mistral-7B-4bit")?;
let tokenizer = load_tokenizer("./models/Mistral-7B-4bit")?;

// Prepare the prompt with the Mistral Instruct format
let formatted = format!("[INST] {} [/INST]", "Hello");
let encoding = tokenizer.encode(formatted.as_str(), true)?;
let prompt = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

// Generate
let mut cache = Vec::new();
let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt);
for token in generator.take(50) {
    let token = token?;
    print!("{}", tokenizer.decode(&[token.item::<u32>()], true)?);
}
Architecture
Mistral uses an optimized transformer architecture:
Mistral-7B
├── Embedding (vocab_size=32000, dim=4096)
├── 32x DecoderLayer
│   ├── input_layernorm
│   ├── Attention (GQA with sliding window)
│   │   ├── q_proj (32 heads)
│   │   ├── k_proj (8 heads, GQA)
│   │   ├── v_proj (8 heads, GQA)
│   │   └── o_proj
│   ├── post_attention_layernorm
│   └── MLP (SwiGLU)
│       ├── gate_proj
│       ├── up_proj
│       └── down_proj
└── norm
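The layers follow standard pre-norm residual wiring: each sublayer's input is RMS-normalized, and the sublayer's output is added back to the residual stream. A minimal structural sketch in plain Rust (identity stubs stand in for the real RMSNorm, GQA attention, and SwiGLU MLP; this illustrates the wiring, not the mistral-mlx implementation):
// Identity stubs: in the real model these are RMSNorm, sliding-window
// GQA attention, and the SwiGLU MLP respectively.
fn rms_norm(x: &[f32]) -> Vec<f32> { x.to_vec() }
fn attention(x: &[f32]) -> Vec<f32> { x.to_vec() }
fn mlp(x: &[f32]) -> Vec<f32> { x.to_vec() }

fn decoder_layer(x: &[f32]) -> Vec<f32> {
    // h = x + Attention(input_layernorm(x))
    let h: Vec<f32> = x.iter().zip(attention(&rms_norm(x))).map(|(a, b)| a + b).collect();
    // out = h + MLP(post_attention_layernorm(h))
    h.iter().zip(mlp(&rms_norm(&h))).map(|(a, b)| a + b).collect()
}

fn main() {
    println!("{:?}", decoder_layer(&[0.5; 4]));
}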
Key optimizations
Sliding window attention
Mistral uses a 4096-token attention window instead of full attention:
// Traditional attention: quadratic in sequence length
// Attention complexity: O(n²)
// Mistral sliding window: linear with fixed window
// Attention complexity: O(n × window_size)
// window_size = 4096 tokens
This allows efficient processing of long sequences while maintaining local context awareness.
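To make the masking concrete, here is a small self-contained sketch (an illustration, not the mistral-mlx mask code): each position i may attend only to positions j with j <= i and i - j < window:
// Build an additive sliding-window causal mask: allowed positions get 0.0,
// disallowed positions get -inf (so softmax assigns them zero weight).
fn sliding_window_mask(seq_len: usize, window: usize) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| {
                    // causal (j <= i) AND within the window (i - j < window)
                    if j <= i && i - j < window {
                        0.0
                    } else {
                        f32::NEG_INFINITY
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    // With window = 3, position 4 can attend only to positions 2, 3, 4.
    for row in sliding_window_mask(6, 3) {
        println!("{:?}", row);
    }
}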
Grouped Query Attention (GQA)
32 query heads share 8 key-value heads (4:1 ratio):
Query: 32 heads × 128 dims = 4096 dims
Key: 8 heads × 128 dims = 1024 dims
Value: 8 heads × 128 dims = 1024 dims
KV cache reduction: 4x smaller than full MHA
This reduces memory usage without significant quality loss.
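The savings are easy to verify from the shapes quoted above. A back-of-the-envelope sketch, assuming a bf16 cache (2 bytes per element):
// KV cache footprint for Mistral-7B shapes (32 layers, 128-dim heads).
fn kv_cache_bytes(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    tokens: usize,
    bytes_per_elem: usize,
) -> usize {
    // factor 2 = one tensor for keys, one for values
    2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
}

fn main() {
    let window = 4096;
    let mha = kv_cache_bytes(32, 32, 128, window, 2); // full multi-head attention
    let gqa = kv_cache_bytes(32, 8, 128, window, 2);  // grouped query attention
    println!("MHA: {} MiB", mha / (1024 * 1024)); // 2048 MiB
    println!("GQA: {} MiB", gqa / (1024 * 1024)); // 512 MiB -> 4x smaller
}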
SwiGLU activation
Replaces traditional ReLU/GELU in MLP:
// Traditional MLP
let hidden = gelu(linear1(x));
let output = linear2(hidden);

// SwiGLU MLP (better performance)
let gate = gate_proj(x);
let up = up_proj(x);
let hidden = silu(gate) * up; // SwiGLU
let output = down_proj(hidden);
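Here silu is the SiLU (swish) nonlinearity, silu(x) = x · sigmoid(x); a scalar reference definition for clarity:
// SiLU (swish): x * sigmoid(x) = x / (1 + e^(-x))
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

fn main() {
    println!("{}", silu(1.0));  // ~0.731: positive inputs pass mostly through
    println!("{}", silu(-5.0)); // ~-0.033: negative inputs are squashed toward 0
}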
Code example
From examples/generate_mistral.rs:
use mistral_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::NewAxis;
use clap::Parser;

#[derive(Parser)]
struct Args {
    #[arg(long, default_value = "mlx-community/Mistral-7B-Instruct-v0.3-4bit")]
    model: String,
    #[arg(long)]
    prompt: String,
    #[arg(long, default_value = "100")]
    max_tokens: usize,
    #[arg(long, default_value = "0.7")]
    temperature: f32,
}

fn main() -> anyhow::Result<()> {
    let args = Args::parse();

    // Load model and tokenizer
    let tokenizer = load_tokenizer(&args.model)?;
    let mut model = load_model(&args.model)?;

    // Mistral Instruct format: [INST] prompt [/INST]
    let formatted = format!("[INST] {} [/INST]", args.prompt);
    let encoding = tokenizer.encode(formatted.as_str(), true)?;
    let prompt_tokens = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

    println!("Prompt ({} tokens): {}", encoding.get_ids().len(), args.prompt);
    println!("---");

    // Generate
    let mut cache = Vec::new();
    let generator = Generate::<KVCache>::new(
        &mut model,
        &mut cache,
        args.temperature,
        &prompt_tokens,
    );

    let mut token_ids = Vec::new();
    let eos_token: u32 = 2; // </s>
    for token in generator.take(args.max_tokens) {
        let token = token?;
        let token_id = token.item::<u32>();
        if token_id == eos_token {
            break;
        }
        token_ids.push(token_id);
    }

    let response = tokenizer.decode(&token_ids, true)?;
    println!("{}", response);
    Ok(())
}
Supported models
Mistral-7B-v0.3 (bf16)
Parameters: 7B
Size: 14 GB
Precision: bfloat16
Use case: Maximum quality

huggingface-cli download \
  mlx-community/Mistral-7B-Instruct-v0.3-bf16 \
  --local-dir ./models/Mistral-7B

Mistral-7B-v0.3 (4-bit)
Parameters: 7B
Size: 4 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware

huggingface-cli download \
  mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --local-dir ./models/Mistral-7B-4bit
Model variants
| Model | Parameters | Notes |
|---|---|---|
| Mistral-7B-v0.1 | 7B | Original release |
| Mistral-7B-v0.2 | 7B | Improved instruction following |
| Mistral-7B-v0.3 | 7B | Latest version (recommended) |
All variants are available in both base and Instruct versions. Use Instruct for chat applications.
Benchmark results (Apple M3 Max)
| Model | Memory | Tokens/sec (Prefill) | Tokens/sec (Decode) |
|---|---|---|---|
| Mistral-7B (bf16) | 14 GB | ~180 tok/s | 40 tok/s |
| Mistral-7B (4-bit) | 4 GB | ~220 tok/s | 55 tok/s |
4-bit quantization provides:
3.5x less memory (4 GB vs 14 GB)
1.4x faster generation (55 vs 40 tok/s)
Minimal quality degradation
Sliding window efficiency
Mistral’s 4096-token sliding window enables efficient long-context processing:
| Context Length | Memory (bf16) | Memory (4-bit) | Speed |
|---|---|---|---|
| 2K tokens | 14 GB | 4 GB | 40 tok/s |
| 4K tokens | 15 GB | 4.5 GB | 38 tok/s |
| 8K tokens | 16 GB | 5 GB | 35 tok/s |
Memory grows linearly with context length, but only slowly, thanks to GQA and the sliding window.
Prompt format
Mistral Instruct models use a specific chat format:
// Single turn
let prompt = "[INST] Your question here [/INST]";

// Multi-turn conversation
let prompt = "[INST] First question [/INST] First response [INST] Follow-up question [/INST]";
Always wrap user messages in [INST]...[/INST] tags for best results.
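For multi-turn conversations the turns are simply concatenated, as shown above. A small helper makes this less error-prone; format_mistral_chat here is an illustrative sketch, not part of the mistral-mlx API:
// Assemble a multi-turn Mistral Instruct prompt from past
// (user, assistant) exchanges plus the current user message.
fn format_mistral_chat(history: &[(String, String)], user_msg: &str) -> String {
    let mut prompt = String::new();
    // Each past exchange: [INST] user [/INST] assistant
    for (user, assistant) in history {
        prompt.push_str(&format!("[INST] {} [/INST] {} ", user, assistant));
    }
    // The current user message, left open for the model to answer
    prompt.push_str(&format!("[INST] {} [/INST]", user_msg));
    prompt
}

fn main() {
    let history = vec![("First question".to_string(), "First response".to_string())];
    println!("{}", format_mistral_chat(&history, "Follow-up question"));
    // [INST] First question [/INST] First response [INST] Follow-up question [/INST]
}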
Converting models
Convert from HuggingFace with quantization:
pip install mlx-lm
# 4-bit quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
# Without quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3
Model configuration
Mistral-7B configuration:
{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "intermediate_size": 14336,
  "sliding_window": 4096,
  "vocab_size": 32000,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-05
}
Key parameters:
GQA ratio: 32:8 = 4:1 (4x KV cache reduction)
Sliding window: 4096 tokens
Large intermediate size: 14336 dims (3.5× hidden size)
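load_model presumably parses this file internally; if you want to inspect the values yourself, here is a standalone sketch using serde (assumes serde with the derive feature and serde_json as dependencies; MistralConfig is a hypothetical name, not a mistral-mlx type):
use serde::Deserialize;

// Hypothetical struct mirroring the config.json fields shown above.
// serde ignores any extra fields in the file by default.
#[derive(Deserialize, Debug)]
struct MistralConfig {
    hidden_size: usize,
    num_hidden_layers: usize,
    num_attention_heads: usize,
    num_key_value_heads: usize,
    intermediate_size: usize,
    sliding_window: usize,
    vocab_size: usize,
    rope_theta: f32,
    rms_norm_eps: f32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("./models/Mistral-7B-4bit/config.json")?;
    let cfg: MistralConfig = serde_json::from_str(&raw)?;
    // head_dim = hidden_size / num_attention_heads = 4096 / 32 = 128
    println!("GQA ratio {}:{}", cfg.num_attention_heads, cfg.num_key_value_heads);
    Ok(())
}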
API reference
Loading functions
pub fn load_model(model_dir: impl AsRef<Path>) -> Result<Model, Error>
pub fn load_tokenizer(model_dir: impl AsRef<Path>) -> Result<Tokenizer, Error>
Generation
pub struct Generate<C: KeyValueCache> {
    // fields omitted
}

impl<C: KeyValueCache> Generate<C> {
    pub fn new(
        model: &mut Model,
        cache: &mut Vec<C>,
        temperature: f32,
        prompt: &Array,
    ) -> Self
}
Generate is an iterator that yields sampled tokens one at a time. A temperature of 0.0 enables greedy sampling.
Re-exported types
pub use mlx_rs_core::{
    KVCache, ConcatKeyValueCache, KeyValueCache,
    create_attention_mask, scaled_dot_product_attention,
    initialize_rope, AttentionMask, SdpaMask,
};
Benchmarking
Run the benchmark example:
cargo run --release --example benchmark_mistral -- \
./models/Mistral-7B-4bit --iterations 10
This measures:
Model load time
Prompt processing speed
Token generation speed
Memory usage
Troubleshooting
Poor output quality
Make sure to use the Mistral Instruct format:
// ❌ Wrong
let prompt = "What is the capital of France?";

// ✅ Correct
let prompt = "[INST] What is the capital of France? [/INST]";
Out of memory
Mistral-7B (bf16) requires 16GB+ memory. Solutions:
Use 4-bit quantized model
Close other applications
Reduce max generation length
Slow generation speed
Use --release build mode (essential for performance)
Verify Metal GPU is active in Activity Monitor
Use 4-bit model for faster inference
Update to latest macOS for Metal optimizations
Related models
Mixtral - MoE variant with 8x7B experts
Qwen3-8B - Alternative 8B dense model