Learning LLM optimization notes (WIP)
Inference
To be read
- https://dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
- https://huggingface.co/blog/assisted-generation
General thoughts
- vLLM docs
- ColossalAI
- Medusa
- SGLang
You will find basically all the concepts in the docs of these projects. I think understanding the internals is important, but maybe, just maybe, I will be good enough if I just know how to apply things.
Speculative decoding
I am referring to this video for an overview of the KV cache, continuous batching, and speculative decoding (using a draft model and a target model, Medusa, the n-gram method).
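Of those concepts, the KV cache is the easiest to sketch in a few lines. This is a toy model, not real attention: `project_kv` stands in for the K/V projection matrices, and the counter just shows why caching turns quadratic prefix work into linear work during decoding.

```python
# Toy KV cache sketch. In a transformer, each new token's attention needs the
# keys/values of ALL previous tokens; caching them means each decode step does
# one new projection instead of reprojecting the whole prefix.

def project_kv(token_id):
    # Hypothetical stand-in for the K/V projection of one layer.
    return (token_id * 2, token_id * 3)  # (key, value)

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []
        self.projections = 0  # count projection work, to show the saving

    def step(self, token_id):
        # Project only the NEW token, append to the cache.
        k, v = project_kv(token_id)
        self.projections += 1
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values  # attention would consume these

def projections_without_cache(tokens):
    # Naive decoding: reproject the entire prefix at every step.
    count = 0
    for t in range(1, len(tokens) + 1):
        for tok in tokens[:t]:
            project_kv(tok)
            count += 1
    return count

cache = KVCache()
for tok in [5, 7, 9, 11]:
    cache.step(tok)

print(cache.projections)                      # 4 (one per new token)
print(projections_without_cache([5, 7, 9, 11]))  # 10 = 1+2+3+4
```

The trade, of course, is memory: the cache grows linearly with sequence length, which is exactly what paged/blocked KV-cache schemes in vLLM are about.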
Another good read, which I understood maybe 50% of on my first pass, is PyTorch's hitchhikers-guide-speculative-decoding.
What is a kernel
Kernels are very low-level stuff. Maybe I can look into them in the long term; for now, PyTorch- and Triton-level code is fine.
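The mental model is worth pinning down even before touching CUDA or Triton: a kernel is a function written from the point of view of one thread/element, and the hardware launches it across a whole grid of indices in parallel. A plain-Python sketch (names are illustrative, not any real API; the "parallel" launch is just a loop here):

```python
# Kernel mental model in plain Python: write the per-element body,
# let a launcher run it over every index.

def add_kernel(i, x, y, out):
    # Per-thread body: each invocation touches exactly one index.
    out[i] = x[i] + y[i]

def launch(kernel, n, *args):
    # A GPU would run these n invocations concurrently across threads;
    # we simulate sequentially.
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(add_kernel, 3, x, y, out)
print(out)  # [11.0, 22.0, 33.0]
```

Triton keeps roughly this shape (a per-program-id body plus a launch grid) but operates on blocks of elements, which is why it is a gentler on-ramp than raw CUDA.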
If I think about it, it's time for me to go into layers deeper than the API. I mean, product-level stuff is fine, but a deeper-layer moat is liberating and might give me more ideas.