ExLlamaV2

Fast inference library for running LLMs locally on NVIDIA GPUs

Grade: D
Score: 47/100

Type
Execution: aot
Interface: sdk

About

ExLlamaV2 is a fast inference library for running LLMs locally on modern NVIDIA GPUs. It is optimized for low VRAM usage through its EXL2 quantization format, supports dynamic batching, and delivers high tokens-per-second throughput.
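
A minimal usage sketch, based on the Python API shown in the library's own examples (the model path here is a placeholder; any locally stored EXL2-quantized model directory works):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: point this at a local EXL2-quantized model directory
model_dir = "/path/to/model-exl2-4.0bpw"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Lazy cache allocation lets load_autosplit spread weights across available VRAM
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator is the component behind the dynamic batching
# mentioned above: it schedules concurrent generation jobs on one cache
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Once upon a time,", max_new_tokens=200)
print(output)
```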

Performance

Cold Start: 1000ms
Base Memory: 300MB
Startup Overhead: 200ms

Last Verified

Date: Jan 18, 2026
Method: manual test

Languages

Python, C++, CUDA

Details

Isolation: process
Maturity: stable
License: MIT

Links