ExLlamaV2

Fast inference library for running LLMs locally on NVIDIA GPUs

Grade: D
Score: 47/100

Type
Execution: aot
Interface: sdk

About

ExLlamaV2 is a fast inference library for running LLMs locally on modern NVIDIA GPUs. It is optimized for low VRAM usage through its EXL2 quantization format, supports dynamic batching, and delivers high tokens-per-second throughput.
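
A minimal usage sketch, based on the Python API shown in the library's own examples (the model path here is a placeholder; any locally stored EXL2-quantized model directory works):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: point this at a local EXL2-quantized model directory
model_dir = "/path/to/model-exl2-4.0bpw"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Lazy cache allocation lets load_autosplit spread weights across available VRAM
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# The dynamic generator is the component behind the dynamic batching
# mentioned above: it schedules concurrent generation jobs on one cache
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Once upon a time,", max_new_tokens=200)
print(output)
```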

Performance

Cold Start: 1000ms
Base Memory: 300MB
Startup Overhead: 200ms

Last Verified

Date: Jan 18, 2026
Method: manual test

Languages

Python, C++, CUDA

Details

Isolation: process
Maturity: stable
License: MIT

Links