Block Cyclic Distribution in MPI
Created on: January 10, 2025
Clarifying the problem
The MPI-based parallel code distributes a 12 x 12 matrix across a 2 x 3 grid using a darray distribution. Each process's local array is initially in column-major order but needs to be rearranged into contiguous 2 x 2 blocks. Broadcasting blocks over rows or columns from the root process is also required.
Broadcast or buffer?
MPI allows for direct subarray broadcasting or copying into a local buffer. Defining derived MPI datatypes can describe the sub-blocks to broadcast. Choosing MPI_DISTRIBUTE_BLOCK or MPI_DISTRIBUTE_CYCLIC in the darray arguments sets the distribution pattern.
Preparing for a targeted layout
To ensure a specific local array layout, I recommend defining MPI_ORDER_C when creating the darray. For row-wise broadcast, use MPI_Cart_sub to form row and column comms, then employ MPI_Bcast with a subarray type.
Crafting a derived datatype
To broadcast a specific sub-block, define a derived datatype using MPI_Type_create_subarray or MPI_Type_vector, commit it, and employ MPI_Bcast. Receiving processes will maintain the sub-block's original memory layout.
Evaluating simple vs. advanced
I'm assessing whether to use a local copy and standard MPI_Bcast for the sub-block or define a derived data type for a more nuanced approach.
Opting for simplicity
I’m inclined to use a local copy in a contiguous array for broadcasting, which is straightforward and effective.
Below is an annotated discussion of the main points in your question plus some strategies you can use to organize your data and communicate the sub-blocks more elegantly.
When you use
MPI_Type_create_darray(P*Q, rank, 2, gsizes, distribs, dargs, psizes,
                       MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
the data is read from file in Fortran order (column-major ordering). The DARRAY type in MPI will interpret “global” indexing in Fortran fashion, and then map the piece that belongs to each process into your local buffer.
Hence, the layout of each process's local portion A is in column-major order within that buffer. The first elements you see in memory (e.g., 1, 13, 49, 61, …) reflect how Fortran indexing jumps down columns first.
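For concreteness, here is a minimal sketch of what that darray read might look like for the 12 x 12 matrix on the 2 x 3 grid with 2 x 2 blocks. The cyclic distribution arguments, the file name matrix.bin, and variables such as local_count are assumptions chosen to match the problem as described, not taken from your code.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 12, P = 2, Q = 3;              /* global size and process grid       */
    int gsizes[2]   = {N, N};
    int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
    int dargs[2]    = {2, 2};                    /* assumed 2 x 2 block-cyclic blocks  */
    int psizes[2]   = {P, Q};

    MPI_Datatype filetype;
    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
                           MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    int local_count = (N * N) / (P * Q);         /* 24 doubles per rank                */
    double *A = malloc(local_count * sizeof *A);

    /* File name "matrix.bin" is an assumption for this sketch. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.bin", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, A, local_count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(A);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}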
If you want the internal storage of the local array in C row-major order, you can try changing MPI_ORDER_FORTRAN to MPI_ORDER_C. However, be aware that DARRAY with MPI_ORDER_C can sometimes yield unexpected distributions if you are mixing it with code or data arranged in Fortran style. In practice, many people leave the DARRAY as is (Fortran ordering) and then do a local reorder later, or define a different MPI datatype for the final rearranged layout.
Strictly speaking, a single MPI_Type_create_darray call will not rearrange the sub-blocks in memory so that each of your 2×2 tiles appears contiguously in row-major order. The DARRAY logic is designed to pick out which elements each rank owns, but not necessarily to reorder them in memory the way you want.
Approach A: read with DARRAY, then reorder locally.

1. Use MPI_Type_create_darray(...) to read in your chunk of the matrix. You end up with a local buffer A (size = (N*N)/(P*Q) elements) in Fortran-like ordering.
2. Locally rearrange A into a new buffer A_reordered that you control (a sketch of this repacking loop follows below).

This two-step approach is by far the most common in practice because it is simpler to debug, even though it requires a local copy.
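As an illustration of step 2, the repacking loop might look like the following minimal sketch. The names repack_tiles, A_reordered, local_rows, local_cols, and b are assumptions for this example; the only given is that A is stored column-major.

#include <stdlib.h>

/* Hedged sketch: repack the column-major local buffer A (local_rows x local_cols)
 * so that each b x b tile is stored contiguously, tile after tile, with the
 * elements inside each tile laid out row-major.  All names are illustrative.   */
void repack_tiles(const double *A, double *A_reordered,
                  int local_rows, int local_cols, int b)
{
    size_t k = 0;
    for (int tr = 0; tr < local_rows / b; ++tr)        /* tile row                */
        for (int tc = 0; tc < local_cols / b; ++tc)    /* tile column             */
            for (int i = 0; i < b; ++i)                /* row inside the tile     */
                for (int j = 0; j < b; ++j) {          /* column inside the tile  */
                    int row = tr * b + i;
                    int col = tc * b + j;
                    /* column-major: element (row, col) lives at col*local_rows + row */
                    A_reordered[k++] = A[(size_t)col * local_rows + row];
                }
}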
Approach B: a chain of derived datatypes.

You could instead define a “chain” of derived datatypes so that each sub-block arrives in a contiguous segment as the data is read; a hedged fragment of what that might involve is shown after this paragraph. But doing so is typically far more complicated than a local rearrangement in memory: you would need careful offset/stride calculations for every tile. Usually, HPC programmers accept the simpler approach A.
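Purely for illustration, here is the kind of building block such a chain would be made of: an MPI_Type_vector describing one 2×2 tile of a column-major local array, with its extent resized so tiles could be addressed back to back. This is a fragment, not a working solution; local_rows is assumed, and picking the right extents and per-tile offsets is exactly the fiddly part.

/* Illustrative fragment only, not a complete chain.  In a column-major local
 * array with local_rows rows, a 2 x 2 tile is 2 columns of 2 contiguous
 * doubles, with a stride of local_rows elements between the columns.        */
MPI_Datatype tile_t, tile_resized_t;
MPI_Type_vector(2, 2, local_rows, MPI_DOUBLE, &tile_t);

/* Shrink the extent so consecutive tiles could be chained back to back; the
 * correct extent and the per-tile offsets depend on the layout you want.    */
MPI_Type_create_resized(tile_t, 0, 2 * 2 * sizeof(double), &tile_resized_t);
MPI_Type_commit(&tile_resized_t);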
You mentioned that, after reading data, you want to send certain block-rows or block-columns from one processor to another. For instance, in a 2×3 grid:
+-----+-----+-----+
| 1 | 2 | 3 |
+-----+-----+-----+
| 4 | 5 | 6 |
+-----+-----+-----+
You already discovered that creating 2D Cartesian sub-communicators via MPI_Cart_sub is the right step:
// 2D communicator "comm2d" of size 2 x 3
int dims[2] = {P, Q};
int periods[2] = {0, 0};   // usually no wrap-around
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm2d);

// Then create row-wise communicator (height = 1)
int remain_dims[2] = {0, 1};   // keep dimension 1, drop dimension 0
MPI_Comm rowcomm;
MPI_Cart_sub(comm2d, remain_dims, &rowcomm);

// Then create column-wise communicator (width = 1)
remain_dims[0] = 1;   // keep dimension 0, drop dimension 1
remain_dims[1] = 0;
MPI_Comm colcomm;
MPI_Cart_sub(comm2d, remain_dims, &colcomm);
There are two main ways:
1. Local copy + MPI_Bcast (or MPI_Send/MPI_Recv)
- Copy the sub-block you want to communicate into a small contiguous buffer (e.g., block_buffer).
- Broadcast it with MPI_Bcast(block_buffer, block_size, MPI_DOUBLE, root, rowcomm) or a series of MPI_Send/MPI_Recv.
- Receiving processes then read (or copy out) the data from block_buffer.

2. Use a derived datatype describing that sub-block
- Define a datatype (subarray or vector) that points into the local array A at the correct offsets so that a call like

MPI_Bcast(A, 1, block_subarray_type, root, rowcomm);

actually sends (and receives) just those 2×2 elements.
Most people opt for the first approach (local copy into a small buffer) because it is much simpler and typically not a performance bottleneck, especially for relatively small 2×2 or 4×4 blocks. Trying to maintain a derived datatype for each possible sub-block or offset is more error-prone.
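As a hedged sketch of that first approach, assuming a column-major local array A with local_rows rows, a 2×2 block starting at (row0, col0) on the root, and the rowcomm from above (all of these names are illustrative):

/* Hedged sketch of approach 1: pack a 2 x 2 block, broadcast it along the row
 * communicator, and let every rank work from the small contiguous buffer.
 * A, local_rows, row0, col0, root, and rowcomm are assumed to exist already. */
double block_buffer[4];
int my_row_rank;
MPI_Comm_rank(rowcomm, &my_row_rank);

if (my_row_rank == root) {
    /* A is column-major: element (r, c) lives at A[c*local_rows + r]. */
    for (int j = 0; j < 2; ++j)
        for (int i = 0; i < 2; ++i)
            block_buffer[j*2 + i] = A[(col0 + j)*local_rows + (row0 + i)];
}

/* Everyone in the row (including the root) ends up with the same 4 values. */
MPI_Bcast(block_buffer, 4, MPI_DOUBLE, root, rowcomm);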
Yes, absolutely. The MPI standard allows you to define a subarray (or vector) datatype that describes where the elements live in your local array A. For example, you could do something like:
MPI_Datatype sub_block_t;

// Suppose your local array is 4 x 6 in Fortran order, and you want to broadcast
// the block-row consisting of the top 2 rows and 4 columns.
int sizes[2]    = {4, 6};   // full local "matrix" shape
int subsizes[2] = {2, 4};   // shape of the sub-block to broadcast
int starts[2]   = {0, 0};   // offset from top-left corner

MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &sub_block_t);
MPI_Type_commit(&sub_block_t);

// Now you can broadcast:
MPI_Bcast(A, 1, sub_block_t, root, rowcomm);
Every process in rowcomm then receives those eight values directly into the same positions of its own local array A (assuming A is allocated identically on all processes). However, if your processes do not want to place the broadcasted data at exactly the same indices in memory, you either need a differently-offset derived datatype on each receiver, or you fall back to copying through a contiguous buffer.
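For what it's worth, the first option is legal MPI: a broadcast only requires matching type signatures, not identical type maps, so a receiver may build its own subarray type with a different starts offset. A hedged sketch follows; the offsets are purely illustrative, and A, root, and rowcomm are assumed from above.

/* Illustrative only: a non-root rank in rowcomm lands the incoming 2 x 4 block
 * in the bottom two rows of its 4 x 6 local array instead of the top two.     */
MPI_Datatype recv_block_t;
int rsizes[2]    = {4, 6};
int rsubsizes[2] = {2, 4};
int rstarts[2]   = {2, 0};        /* a different offset than the root used */

MPI_Type_create_subarray(2, rsizes, rsubsizes, rstarts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &recv_block_t);
MPI_Type_commit(&recv_block_t);

/* Same type signature (8 doubles) as the root's sub_block_t, so it matches. */
MPI_Bcast(A, 1, recv_block_t, root, rowcomm);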
Hence the simplest path is almost always:

1. copy the sub-block into a small contiguous buffer, then
2. MPI_Bcast that buffer with a simple (count, MPI_DOUBLE) call.

Given the small size of blocks (2×2, 4×4, etc.), this is usually faster to implement and less error-prone.
To summarize:

- Keep the DARRAY read in Fortran order and reorder locally if you need contiguous 2×2 tiles (or switch to MPI_ORDER_C if you absolutely want row-major, but keep in mind that you can still end up with a non-intuitive layout in memory).
- To communicate block-rows or block-columns, either copy the sub-block into a contiguous buffer and do an MPI_Bcast (or MPI_Send/MPI_Recv) of that contiguous buffer, or define a derived datatype (MPI_Type_create_subarray or MPI_Type_vector) that describes your sub-block's location in memory and call MPI_Bcast(A, 1, sub_block_type, root, comm) to send/receive in-place.

In practice, most codes keep the DARRAY approach for reading/writing files (since it nicely hides the complexity of global distribution) and then do local copies (or a simpler derived datatype) for sub-block communication. Zero-copy solutions via complicated derived datatypes are possible but rarely worth the trouble unless you are chasing very specific performance bottlenecks.