[WIP] Rpc split row #16020
base: master
…ort split-mode row
Is anybody still working on a backend-agnostic row splitting implementation?
I don't think anyone is working on this atm. Will tag @slaren and @JohannesGaessler in case they are aware of any ongoing efforts.
@koush had an initial implementation (#13818 (comment)), but I am not sure if that's still being worked on.
My current priorities not specific to CUDA are automating how to distribute tensors to GPUs (by reusing the code from #15860), and then I intend to get back to working on backend-agnostic tensor parallelism. In parallel I'm refactoring and deduplicating the FlashAttention CUDA code and optimizing it for AMD. Since I've already invested the effort to read the AMD ISA documentation, I'll probably buy an RDNA4 GPU and implement better support for the AMD equivalent of tensor cores.
A backend-agnostic approach would be much more valuable in the big picture, if you ask me. On the other hand, if I can get in touch with @koush and sync up on the current situation, I would gladly get on board with another PR to make this feature work in both standalone and cluster mode.
The PR
This work has one goal: supporting row-splitting mode on RPC clusters.
TL;DR
I got bored, wanted to contribute and join you guys on this beautiful journey. I hope you welcome me.
Details & Background
This is heavily WIP, including this description. I will keep working on this PR and make sure it fits well with the rest of the project.
For the background:
Metal devices have only one GPU. This is a bit tricky, because row splitting has no use on a single device/backend. But the ultimate goal is to have it, so that with RPC, devices can run inference together more efficiently and faster. For this, I worked on both sides: I implemented a very, very early stage of row-wise splitting mode on the Metal backend and then made it work with RPC too.

The current PR has the implementation, but be aware that the performance is unacceptable, and it gets worse with every device you add to the cluster. I will keep inspecting the PR and read the sources I can find to build the domain knowledge I need to solve this.
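For readers new to the idea, here is a minimal sketch of what row splitting means in general. It is plain Python and not the code in this PR. The names `split_rows`, `matvec`, and `row_split_matvec` are hypothetical, and the "devices" are just slices of a list. Each device holds a contiguous block of rows of the weight matrix, computes its part of the output, and the parts are concatenated.

```python
def split_rows(n_rows, weights):
    """Return one (start, end) row range per device, proportional to weights.

    `weights` is a hypothetical per-device share (e.g. relative memory or speed).
    """
    total = sum(weights)
    bounds = [0]
    acc = 0.0
    for w in weights:
        acc += w
        # Cumulative rounding keeps the ranges contiguous and non-overlapping.
        bounds.append(round(n_rows * acc / total))
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]


def matvec(rows, x):
    """Multiply a block of matrix rows by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def row_split_matvec(W, x, weights):
    """Compute W @ x by giving each simulated device its own slice of rows."""
    out = []
    for start, end in split_rows(len(W), weights):
        # Each "device" only sees W[start:end]; every device needs all of x.
        partial = matvec(W[start:end], x)
        # Gather step: on a real cluster this result would cross the network.
        out.extend(partial)
    return out


W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
result = row_split_matvec(W, x, weights=[3, 1])
```

In this toy version the gather is a list append. On an RPC cluster it would presumably be a network transfer per split tensor, which is one place where per-device overhead could accumulate, though I have not measured that here.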
Tests & Results:
In any case, all comments, suggestions, and support are very welcome.