
How much VRAM does the 8B model use?


In general you can swap B for GB (assuming q8 quantization, which is roughly one byte per parameter), so 8GB VRAM can probably just about work.


If you don't want to quantize at all, you need to double it for fp16: 16GB.


Yes, but I think it's standard to do inference at q8, not fp16.


As a general rule, you can use 5 bits per parameter with negligible loss of capability, or 4 bits for slightly worse results. This depends on how good quantization methods are in general and on the specific model.
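
For anyone who wants to plug in other sizes, the arithmetic in this thread can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate for the weights alone; the function name `weight_memory_gb` is made up for illustration. It assumes exactly the nominal bits per parameter, whereas real quantized formats usually store some extra scale data, and the KV cache and activations need memory on top of this.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough memory needed for the model weights alone, in GB (10^9 bytes).

    Hypothetical helper: ignores KV cache, activations, and any
    per-block overhead that real quantization formats add.
    """
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# The 8B model at the precisions mentioned in the thread.
for label, bits in [("fp16", 16), ("q8", 8), ("5-bit", 5), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(8, bits):.1f} GB")
```

So the "swap B for GB" rule is just the 8-bit case, fp16 doubles it, and 4 to 5 bits per parameter brings an 8B model down to roughly 4 to 5 GB before overhead.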



