Block Cyclic Distribution in MPI
Created on: January 10, 2025
I am trying to distribute my matrix in block cyclic fashion. I learned a lot from this question (MPI IO Reading and Writing Block Cyclic Matrix), but that is not what I really need.
Let me explain my problem.
Suppose I have this matrix of dimension 12 x 12 which I want to distribute over a processor grid of dimension 2 x 3, such that the first processor gets the bolded elements (1, 2, 7, 8, 13, 14, 19, 20, 49, 50, 55, 56, 61, 62, 67, 68, 97, 98, 103, 104, 109, 110, 115, 116):
A =
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120
121 122 123 124 125 126 127 128 129 130 131 132
133 134 135 136 137 138 139 140 141 142 143 144
So, basically, I want to partition my matrix in blocks of dimensions 2 x 2, and then distribute those blocks to processors (numbered from 1 to 6) in this way:
1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6
I tried to achieve that as in the question linked above, but the problem is that my local array for the first processor is formed column-wise, i.e. it looks like this:
1, 13, 49, 61, 97, 109, 2, 14, 50, 62, 98, 110, 7, 19, 55, 67, 103, 115, 8, 20, 56, 68, 104, 116
This is my C code:
```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

#define N 12
#define P 2
#define Q 3

int main(int argc, char **argv) {
    int rank;
    int size;
    double *A;
    int A_size;
    MPI_Datatype filetype;
    MPI_File fin;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /*
     * Reading from file.
     */
    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = N;  /* no. of rows in global array */
    gsizes[1] = N;  /* no. of columns in global array */

    distribs[0] = MPI_DISTRIBUTE_CYCLIC;
    distribs[1] = MPI_DISTRIBUTE_CYCLIC;

    dargs[0] = 2;   // no. of rows in block
    dargs[1] = 2;   // no. of cols in block

    psizes[0] = P;  /* no. of processes in vertical dimension of process grid */
    psizes[1] = Q;  /* no. of processes in horizontal dimension of process grid */

    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
                           MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "A.txt", MPI_MODE_RDONLY, MPI_INFO_NULL, &fin);
    MPI_File_set_view(fin, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    A_size = (N * N) / (P * Q);
    A = (double *) malloc(A_size * sizeof(double));
    MPI_File_read_all(fin, A, A_size, MPI_DOUBLE, &status);

    MPI_File_close(&fin);

    printf("\n======\ni = %d\n", rank);
    printf("A : ");
    for (int i = 0; i < A_size; i++) {
        printf("%lg ", A[i]);
    }

    MPI_Finalize();
    return 0;
}
```
What I really want is for those 2 x 2 blocks to be stored consecutively, i.e. for the local array of the first processor to look like this:
1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ...
I assume that I will need to define another MPI_Datatype (like vector or subarray), but I just cannot figure out how I would achieve that.
Edit
I think I have partially solved my problem. Basically, each processor will end up with a 4 x 6 matrix in Fortran order, and then with MPI_Type_create_subarray(...) I can easily extract a 2 x 2 block.
But I also want each processor to send its block-row to every processor in the same column, and its block-column to every processor in the same row. The processors are numbered in the grid
1 2 3
4 5 6
so, for example, in the first step, processor 1 should send its block-row
1 2 7 8
13 14 19 20
to processor 4; and its block-column
1 2
13 14
49 50
61 62
97 98
109 110
to processors 2 and 3.
I created a Cartesian communicator and used MPI_Cart_sub() to create row-wise and column-wise communicators, too.
I think I should use MPI_Bcast(), but I do not know how to combine MPI_Bcast() with MPI_Type_create_subarray(). I would first have to copy the extracted subarray into some local_array and then Bcast(local_array). However, MPI_Type_create_subarray() gives me only a "view" of the subarray, not the data itself, so the best solution I came up with is an Isend/Irecv pair from root to root.
Is there a more elegant solution?
Below is an annotated discussion of the main points in your question plus some strategies you can use to organize your data and communicate the sub-blocks more elegantly.
When you use
```c
MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
                       MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
```
the data is read from file in Fortran order (column-major ordering). The DARRAY type in MPI will interpret “global” indexing in Fortran fashion, and then map the piece that belongs to each process into your local buffer.
Hence, the layout of each process’s local portion A
is in column-major order within that buffer. The first elements you see in memory (e.g., 1, 13, 49, 61, …) reflect how Fortran indexing jumps down columns first.
If you want the internal storage of the local array in C row-major order, you can try changing MPI_ORDER_FORTRAN to MPI_ORDER_C. However, be aware that DARRAY with MPI_ORDER_C can sometimes yield unexpected distributions if you are mixing it with code or data arranged in Fortran style. In practice, many people leave the DARRAY as is (Fortran ordering) and then do a local reorder later, or define a different MPI datatype for the final rearranged layout.
Strictly speaking, a single MPI_Type_create_darray call will not rearrange the sub-blocks in memory so that each of your 2×2 tiles appears contiguously in row-major order. The DARRAY logic is designed to pick out which elements each rank owns, but not necessarily to reorder them in memory the way you want.
Approach A: read with DARRAY, then reorder locally.

1. Use MPI_Type_create_darray(...) to read in your chunk of the matrix. You end up with a local buffer A (size = (N*N)/(P*Q) elements) in Fortran-like ordering.
2. Locally copy/rearrange A into a new buffer A_reordered that you control.

This two-step approach is by far the most common in practice because it is simpler to debug, even though it requires a local copy.
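For concreteness, here is a sketch of what that local rearrangement could look like. It assumes the local buffer holds an (N/P) x (N/Q) matrix in column-major order (which matches the output you printed: 1, 13, 49, 61, 97, 109 is exactly the first local column) and that you want each 2 x 2 block stored contiguously in the order you listed (1, 13, 2, 14, 49, 61, 50, 62, ...). The helper name reorder_blocks is just an illustration:

```c
/* Sketch: gather each b x b block of a column-major (lrows x lcols) local
 * matrix into a contiguous run of b*b elements, walking the blocks column
 * by column.  For rank 0 in your example this produces
 * 1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ...                   */
void reorder_blocks(const double *A, double *A_reordered,
                    int lrows, int lcols, int b)
{
    int k = 0;
    for (int bc = 0; bc < lcols / b; bc++)          /* block column          */
        for (int br = 0; br < lrows / b; br++)      /* block row             */
            for (int jc = 0; jc < b; jc++)          /* column inside a block */
                for (int jr = 0; jr < b; jr++)      /* row inside a block    */
                    A_reordered[k++] = A[(bc * b + jc) * lrows + (br * b + jr)];
}

/* e.g. reorder_blocks(A, A_reordered, N / P, N / Q, 2); */
```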
Approach B: you could instead define a “chain” of derived datatypes. But doing so is typically far more complicated than a local rearrangement in memory: you would need careful offset/stride calculations so that each sub-block arrives in a contiguous segment. Usually, HPC programmers accept the simpler approach A.
You mentioned that, after reading data, you want to send certain block-rows or block-columns from one processor to another. For instance, in a 2×3 grid:
```text
+-----+-----+-----+
|  1  |  2  |  3  |
+-----+-----+-----+
|  4  |  5  |  6  |
+-----+-----+-----+
```
You already discovered that creating 2D Cartesian sub-communicators via MPI_Cart_sub is the right step:

```c
// 2D communicator "comm2d" of size 2 x 3
MPI_Comm comm2d;
int dims[2] = {P, Q};
int periods[2] = {0, 0};   // usually no wrap-around
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm2d);

// Then create row-wise communicator (height = 1)
int remain_dims[2] = {0, 1};   // keep dimension 1, drop dimension 0
MPI_Comm rowcomm;
MPI_Cart_sub(comm2d, remain_dims, &rowcomm);

// Then create column-wise communicator (width = 1)
remain_dims[0] = 1;   // keep dimension 0, drop dimension 1
remain_dims[1] = 0;
MPI_Comm colcomm;
MPI_Cart_sub(comm2d, remain_dims, &colcomm);
```
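One small detail worth noting: the root argument of a broadcast on rowcomm or colcomm has to be a rank within that sub-communicator; with the sub-communicators created above you can derive it from the Cartesian coordinates, for example:

```c
/* Sketch: this process's coordinates in the 2 x 3 grid.  With the
 * MPI_Cart_sub calls above, a process's rank inside rowcomm is its column
 * coordinate, and its rank inside colcomm is its row coordinate. */
int coords[2];
MPI_Cart_coords(comm2d, rank, 2, coords);

int my_row = coords[0];   /* rank within colcomm */
int my_col = coords[1];   /* rank within rowcomm */
```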
There are two main ways to communicate a sub-block once the communicators are set up:

1. Local copy + MPI_Bcast (or MPI_Send/MPI_Recv). The root copies the sub-block into a small contiguous buffer (block_buffer), then calls MPI_Bcast(block_buffer, block_size, MPI_DOUBLE, root, rowcomm) or a series of MPI_Send/MPI_Recv. The receivers unpack the data from block_buffer.

2. Use a derived datatype describing that sub-block. Define a subarray type that points into A at the correct offsets so that a call like

   ```c
   MPI_Bcast(A, 1, block_subarray_type, root, rowcomm);
   ```

   actually sends (and receives) just those 2×2 elements.
Most people opt for the first approach (local copy into a small buffer) because it is much simpler and typically not a performance bottleneck, especially for relatively small 2×2 or 4×4 blocks. Trying to maintain a derived datatype for each possible sub-block or offset is more error-prone.
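As an illustration of that first approach, here is a rough sketch of a helper that packs a two-row block-row into a contiguous buffer and broadcasts it. The helper name bcast_block_row is made up, and it assumes the local matrix is stored column-major with lrows rows and lcols columns, which is what your printed output suggests; in your grid the block-row travels to the process in the same column (processor 1 to processor 4), so the communicator passed in would be colcomm:

```c
#include <mpi.h>

/* Sketch: pack the 2-row block-row starting at local row `first_row` of a
 * column-major (lrows x lcols) matrix into block_buffer, then broadcast it
 * over `comm` (colcomm in your setup). */
static void bcast_block_row(const double *A, int lrows, int lcols,
                            int first_row, int root, MPI_Comm comm,
                            double *block_buffer /* 2 * lcols doubles */)
{
    int me;
    MPI_Comm_rank(comm, &me);

    if (me == root)                      /* only the sender needs to pack */
        for (int r = 0; r < 2; r++)
            for (int c = 0; c < lcols; c++)
                block_buffer[r * lcols + c] = A[c * lrows + (first_row + r)];

    /* every rank in the sub-communicator participates in the broadcast */
    MPI_Bcast(block_buffer, 2 * lcols, MPI_DOUBLE, root, comm);
}
```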
Can you combine MPI_Bcast() with MPI_Type_create_subarray(), as you asked? Yes, absolutely. The MPI standard allows you to define a subarray (or vector) datatype that describes where the elements live in your local array A. For example, you could do something like:

```c
MPI_Datatype sub_block_t;

// Suppose your local array is 4 x 6 in Fortran order, and you want to broadcast
// the block-row consisting of the top 2 rows and 4 columns.
int sizes[2]    = {4, 6};   // full local "matrix" shape
int subsizes[2] = {2, 4};   // shape of the sub-block to broadcast
int starts[2]   = {0, 0};   // offset from top-left corner

MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &sub_block_t);
MPI_Type_commit(&sub_block_t);

// Now you can broadcast:
MPI_Bcast(A, 1, sub_block_t, root, rowcomm);
```
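The same idea applies to the block-column you want to send to processors 2 and 3. The sketch below keeps the hypothetical 4 x 6 Fortran-ordered local array from the example above (so the block-column is all rows of the leftmost two columns) and broadcasts over rowcomm, since those processors sit in the same grid row; adjust the sizes to your actual local shape:

```c
/* Sketch: subarray describing the first block-column (all rows, leftmost
 * two columns) of the same hypothetical 4 x 6 Fortran-ordered local array. */
MPI_Datatype sub_col_t;
int col_sizes[2]    = {4, 6};   // full local "matrix" shape
int col_subsizes[2] = {4, 2};   // the whole first block-column
int col_starts[2]   = {0, 0};   // starts at the top-left corner

MPI_Type_create_subarray(2, col_sizes, col_subsizes, col_starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &sub_col_t);
MPI_Type_commit(&sub_col_t);

// Broadcast along the process row (processors 2 and 3 in your numbering):
MPI_Bcast(A, 1, sub_col_t, root, rowcomm);
MPI_Type_free(&sub_col_t);
```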
Such a broadcast operates on the sub-block in place, directly within A (assuming A is allocated identically on all processes). However, if your processes do not want to place the broadcast data at the exact same indices in memory, you either need a different subarray type (with different starts) on each rank, or you end up copying through a temporary buffer anyway.
Hence the simplest path is almost always:

1. Copy the sub-block into a small contiguous buffer on the root.
2. MPI_Bcast that buffer with a simple (count, MPI_DOUBLE) call.
3. Copy the data out of that buffer on the receiving ranks.

Given the small size of blocks (2×2, 4×4, etc.), this is usually faster to implement and less error-prone.
To summarize:

- Keep the DARRAY read as it is (or switch to MPI_ORDER_C if you absolutely want row-major, but keep in mind that you can still end up with a non-intuitive layout in memory).
- For sub-block communication, either copy each sub-block into a contiguous buffer and do an MPI_Bcast (or MPI_Send/MPI_Recv) of that contiguous buffer,
- or define a derived datatype (MPI_Type_create_subarray or MPI_Type_vector) that describes your sub-block’s location in memory and use MPI_Bcast(A, 1, sub_block_type, root, comm) to send/receive in-place.

In practice, most codes keep the DARRAY approach for reading/writing files (since it nicely hides the complexity of global distribution) and then do local copies (or a simpler derived data type) for sub-block communication. Zero-copy solutions via complicated derived datatypes are possible but rarely worth the trouble unless you are chasing very specific performance bottlenecks.