Learning LLM optimization notes (WIP)
Inference
To be read
- https://dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
- https://huggingface.co/blog/assisted-generation
General thoughts
- vLLM docs
- ColossalAI
- Medusa
- SGLang
You will find basically all the concepts in the docs of these projects. I think understanding the internals is important, but maybe, just maybe, I will be good enough if I just know how to apply things.
Speculative decoding
I am referring to this video for an overview of the KV cache, continuous batching, and speculative decoding (using a draft model and a target model, Medusa, the n-gram method).
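Of those concepts, the KV cache is the easiest to sketch in a few lines. This is a toy model, not real attention: `project_kv` stands in for the K/V projection matrices, and the counter just shows why caching turns quadratic prefix work into linear work during decoding.

```python
# Toy KV cache sketch. In a transformer, each new token's attention needs the
# keys/values of ALL previous tokens; caching them means each decode step does
# one new projection instead of reprojecting the whole prefix.

def project_kv(token_id):
    # Hypothetical stand-in for the K/V projection of one layer.
    return (token_id * 2, token_id * 3)  # (key, value)

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []
        self.projections = 0  # count projection work, to show the saving

    def step(self, token_id):
        # Project only the NEW token, append to the cache.
        k, v = project_kv(token_id)
        self.projections += 1
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values  # attention would consume these

def projections_without_cache(tokens):
    # Naive decoding: reproject the entire prefix at every step.
    count = 0
    for t in range(1, len(tokens) + 1):
        for tok in tokens[:t]:
            project_kv(tok)
            count += 1
    return count

cache = KVCache()
for tok in [5, 7, 9, 11]:
    cache.step(tok)

print(cache.projections)                      # 4 (one per new token)
print(projections_without_cache([5, 7, 9, 11]))  # 10 = 1+2+3+4
```

The trade, of course, is memory: the cache grows linearly with sequence length, which is exactly what paged/blocked KV-cache schemes in vLLM are about.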
Another good read, which I understood maybe 50% of on my first pass, is PyTorch's hitchhikers-guide-speculative-decoding.
What is a kernel
Kernels are very low-level stuff. Maybe I can look into them in the long term; for now, PyTorch- and Triton-level code is fine.
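The mental model is worth pinning down even before touching CUDA or Triton: a kernel is a function written from the point of view of one thread/element, and the hardware launches it across a whole grid of indices in parallel. A plain-Python sketch (names are illustrative, not any real API; the "parallel" launch is just a loop here):

```python
# Kernel mental model in plain Python: write the per-element body,
# let a launcher run it over every index.

def add_kernel(i, x, y, out):
    # Per-thread body: each invocation touches exactly one index.
    out[i] = x[i] + y[i]

def launch(kernel, n, *args):
    # A GPU would run these n invocations concurrently across threads;
    # we simulate sequentially.
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(add_kernel, 3, x, y, out)
print(out)  # [11.0, 22.0, 33.0]
```

Triton keeps roughly this shape (a per-program-id body plus a launch grid) but operates on blocks of elements, which is why it is a gentler on-ramp than raw CUDA.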
If I think about it, it's time for me to go into layers deeper than the API. I mean, product-level stuff is fine, but a deeper-layer moat is liberating and might give me more ideas.