🤗 Assisted Generation Demo
Model: meta-llama/Llama-3.1-8B (4-bit quantization)
Assistant Model: meta-llama/Llama-3.2-1B (FP16)
Recipe for a good speedup: a) the main model has >10x more parameters than the assistant; b) the assistant was trained on similar data; c) the CPU is not a bottleneck.
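A minimal sketch of this pairing with the transformers API (the model names come from above; the exact quantization config and device placement used by the demo are assumptions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Main model in 4-bit, assistant in FP16, as listed above.
    # Assisted generation requires both models to share a tokenizer,
    # which Llama 3.1 and Llama 3.2 do.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B",
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
    assistant = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-1B",
        torch_dtype=torch.float16,
        device_map="auto",
    )

    inputs = tokenizer("A sequence: one, two, three,", return_tensors="pt").to(model.device)
    # Passing assistant_model switches generate() into assisted generation:
    # the small model drafts candidate tokens, the large model verifies them.
    outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))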
[Interface: a Prompt textbox prefilled with "A sequence: one, two, three,", a Model output textbox, and a Submit button.]
Generation Settings
- Use Assisted Generation (checkbox)
- Max New Tokens (slider: 1–500)
- Temperature (slider: 0–2; 0.0 = greedy decoding)
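These controls map onto generation arguments roughly as below; a sketch assuming the demo follows the common convention of switching to greedy decoding at temperature 0.0 (the helper name is hypothetical):

    def generation_kwargs(max_new_tokens: int, temperature: float) -> dict:
        # Temperature 0.0 disables sampling entirely (greedy decoding);
        # any positive value enables sampling at that temperature.
        if temperature == 0.0:
            return {"max_new_tokens": max_new_tokens, "do_sample": False}
        return {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "temperature": temperature,
        }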
Tokens per second (throughput readout)
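The readout can be reproduced by timing generate() and dividing by the number of new tokens; a sketch reusing model, assistant, and inputs from the loading example above (how the demo measures it internally is an assumption):

    import time

    start = time.perf_counter()
    outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=50)
    elapsed = time.perf_counter() - start

    # Count only newly generated tokens, excluding the prompt.
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/s")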