🤗 Assisted Generation Demo

  • Model: meta-llama/Llama-3.1-8B (4-bit quantization)
  • Assistant Model: meta-llama/Llama-3.2-1B (FP16)
  • Recipe for a good speedup: a) a large size gap — the main model should have >10x the assistant's parameter count; b) an assistant trained on similar data, so its drafts are usually accepted; c) the CPU not being a bottleneck.
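
Under the hood, the assistant drafts a few tokens cheaply, and the main model verifies the whole draft in a single forward pass; with greedy verification, the output is identical to what the main model alone would produce, just faster. Below is a minimal pure-Python sketch of that draft-and-verify loop — the `target_next`/`draft_next` functions are toy stand-ins invented for illustration, not the transformers API:

```python
V = 50  # toy vocabulary size

def target_next(ctx):
    # deterministic stand-in for the big model's greedy next token
    return (sum(ctx) * 31 + 7) % V

def draft_next(ctx):
    # stand-in for the small assistant: agrees with the target most of the time
    t = target_next(ctx)
    return (t + 1) % V if len(ctx) % 5 == 0 else t

def greedy_generate(prompt, n_new):
    # reference: plain greedy decoding with the big model only
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx

def assisted_generate(prompt, n_new, k=4):
    ctx = list(prompt)
    calls = 0      # big-model verification passes (the expensive step)
    accepted = 0   # draft tokens the big model agreed with
    while len(ctx) - len(prompt) < n_new:
        # 1) assistant drafts up to k tokens autoregressively (cheap)
        tmp = list(ctx)
        draft = []
        for _ in range(k):
            t = draft_next(tmp)
            draft.append(t)
            tmp.append(t)
        # 2) one big-model pass verifies the whole draft window;
        #    matching tokens are accepted, the first mismatch is replaced
        calls += 1
        for t in draft:
            correct = target_next(ctx)
            ctx.append(correct)   # always keep the big model's token
            if correct != t:
                break             # rest of the draft window is discarded
            accepted += 1
    return ctx[:len(prompt) + n_new], calls, accepted
```

Because every kept token is the big model's own greedy choice, `assisted_generate` reproduces `greedy_generate` exactly while making far fewer big-model passes — this is why the recipe above matters: a much smaller assistant makes drafting nearly free, and similar training keeps the acceptance rate high. In 🤗 transformers, the same mechanism is enabled by passing `assistant_model=` to `model.generate(...)`.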

Generation Settings
  [interactive sliders omitted; ranges 1–500 and 0–2]

Tokens per second
  [live throughput readout omitted]