
@LeaveNhA commented Sep 16, 2025

The PR

This work has one goal: supporting row-split mode on RPC clusters.

TL;DR

I got bored, wanted to contribute and join you guys on this beautiful journey. I hope you welcome me.

Details & Background

This is heavily WIP, including this description. I will keep working on this PR and make sure it fits well with the rest of the project.

For the background:

Metal devices have only one GPU. This is a bit tricky, because row splitting has no use on a single device/backend. But the ultimate goal is to have it so that, with RPC, several devices can run inference together efficiently and faster. For this, I worked on both sides: I implemented a very, very early stage of row-wise split mode in the Metal backend and then made it work with RPC too.
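To illustrate the general idea of row splitting mentioned above, here is a minimal sketch in plain Python. It is not the code from this PR and not how llama.cpp implements it; the helper names (`split_rows`, `matvec`, `row_split_matvec`) are hypothetical, and the "devices" are only simulated by a loop. Each device would get a contiguous slice of the weight matrix rows, compute its part of the output from the full input vector, and the parts are then concatenated.

```python
def split_rows(n_rows, n_devices):
    """Return contiguous [start, end) row ranges, one per (hypothetical) device."""
    base, extra = divmod(n_rows, n_devices)
    ranges, start = [], 0
    for d in range(n_devices):
        # The first `extra` devices take one additional row each.
        end = start + base + (1 if d < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def matvec(rows, x):
    """Plain matrix-vector product over a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def row_split_matvec(W, x, n_devices):
    """Compute W @ x by giving each simulated device a slice of W's rows."""
    out = []
    for start, end in split_rows(len(W), n_devices):
        # In a real cluster this slice would live on its own device,
        # and the partial result would have to be gathered afterwards.
        out.extend(matvec(W[start:end], x))
    return out


W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
x = [1, -1]
result = row_split_matvec(W, x, 2)
```

With a single device the slices collapse into the whole matrix, which is why row splitting only pays off once there is more than one device to spread the rows over.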

The current PR has the implementation, but be aware: the performance is unacceptable, and it gets worse with every device you add to the cluster. I will inspect the PR and read whatever sources I can find to gain the domain knowledge I need to solve this.

Tests & Results:

❯ ./build-rpc-split-mode-row-release/bin/llama-bench -m ../llama.cpp.org.new.rpc/hfmodels/models/llama-2-7b.Q4_0.gguf --split-mode row --rpc 127.0.0.1:50052
| model                          |       size |     params | backend    | threads |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ----: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Metal,BLAS,RPC |       8 |   row |           pp512 |         54.29 ± 6.55 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Metal,BLAS,RPC |       8 |   row |           tg128 |          0.64 ± 0.01 |

build: 997e3047 (6444)

In any case, all comments, suggestions, and support are very welcome.

@github-actions bot added the labels examples, ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) on Sep 16, 2025
@LeaveNhA changed the title from "Rpc split row" to "[WIP] Rpc split row" on Sep 16, 2025
@jeffbolznv
Collaborator

Is anybody still working on a backend-agnostic row splitting implementation?

@ggerganov
Member

> Is anybody still working on a backend-agnostic row splitting implementation?

I don't think anyone is working on this atm. Will tag @slaren and @JohannesGaessler in case they are aware of any ongoing efforts.

@slaren
Member

slaren commented Sep 16, 2025

@koush had an initial implementation (#13818 (comment)), but I am not sure if that's still being worked on.

@JohannesGaessler
Collaborator

My current priority outside of CUDA-specific work is automating how tensors are distributed to GPUs (by reusing the code from #15860); after that I intend to get back to working on backend-agnostic tensor parallelism.

In parallel, I'm refactoring and deduplicating the FlashAttention CUDA code and optimizing it for AMD. Since I've already invested the effort to read the AMD ISA documentation, I'll probably buy an RDNA4 GPU and implement better support for the AMD equivalent of tensor cores.

@LeaveNhA
Author

> Is anybody still working on a backend-agnostic row splitting implementation?

A backend-agnostic approach would be much more valuable in the big picture, if you ask me.

On the other hand, if I can get in touch with @koush and sync up on the current situation, I would gladly get on board with another PR to make this feature work in both standalone and cluster mode.
