Block Cyclic Distribution in MPI

Created on: January 10, 2025

Answered using O1 (Chat01)

Question

I am trying to distribute my matrix in block cyclic fashion. I learned a lot from this question (MPI IO Reading and Writing Block Cyclic Matrix), but that is not what I really need.

Let me explain my problem.

Suppose I have this matrix of dimension 12 x 12, which I want to distribute over a processor grid of dimension 2 x 3 such that the first processor gets the bolded elements:

A =

1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84
85 86 87 88 89 90 91 92 93 94 95 96
97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120
121 122 123 124 125 126 127 128 129 130 131 132
133 134 135 136 137 138 139 140 141 142 143 144

So, basically, I want to partition my matrix into blocks of dimension 2 x 2, and then distribute those blocks to the processors (numbered from 1 to 6) in this way:

1 2 3 1 2 3
4 5 6 4 5 6
1 2 3 1 2 3
4 5 6 4 5 6

I tried to achieve that as in the question linked above, but the problem is that the local array of the first processor is formed column-wise, i.e. it looks like this:

1, 13, 49, 61, 97, 109, 2, 14, 50, 62, 98, 110, 7, 19, 55, 67, 103, 115, 8, 20, 56, 68, 104, 116
This is my C code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "mpi.h"

#define N 12
#define P 2
#define Q 3

int main(int argc, char **argv) {
    int rank;
    int size;

    double *A;
    int A_size;

    MPI_Datatype filetype;
    MPI_File fin;

    MPI_Status status;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /**
     * Reading from file.
     */
    int gsizes[2], distribs[2], dargs[2], psizes[2];

    gsizes[0] = N; /* no. of rows in global array    */
    gsizes[1] = N; /* no. of columns in global array */

    distribs[0] = MPI_DISTRIBUTE_CYCLIC;
    distribs[1] = MPI_DISTRIBUTE_CYCLIC;

    dargs[0] = 2; /* no. of rows in block */
    dargs[1] = 2; /* no. of cols in block */

    psizes[0] = P; /* no. of processes in vertical dimension of process grid   */
    psizes[1] = Q; /* no. of processes in horizontal dimension of process grid */

    MPI_Type_create_darray(P * Q, rank, 2, gsizes, distribs, dargs, psizes,
                           MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "A.txt",
                  MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fin);

    MPI_File_set_view(fin, 0, MPI_DOUBLE, filetype, "native",
                      MPI_INFO_NULL);

    A_size = (N * N) / (P * Q);
    A = (double*) malloc(A_size * sizeof(double));
    MPI_File_read_all(fin, A, A_size,
                      MPI_DOUBLE, &status);

    MPI_File_close(&fin);

    printf("\n======\ni = %d\n", rank);
    printf("A : ");
    for (int i = 0; i < A_size; i++) {
        printf("%lg ", A[i]);
    }
    printf("\n");

    free(A);

    MPI_Finalize();
    return 0;
}

What I really want is for those 2 x 2 blocks to be stored consecutively, i.e. the local array of the first processor should look like this:

1, 13, 2, 14, 49, 61, 50, 62, 97, 109, 98, 110, ...

I assume that I will need to define another MPI_Datatype (like a vector or subarray), but I just cannot figure out how to achieve that.

Edit
I think I have partially solved my problem. Basically, each processor ends up with a 6 x 4 local matrix in Fortran order, and then with MPI_Type_create_subarray(...) I can easily extract a 2 x 2 block.

But I want each processor to send its block-row to every processor in the same column, and its block-column to every processor in the same row. The processors are numbered in the grid

1 2 3
4 5 6
so, for example, in the first step, processor 1 should send its block-row

1 2 7 8
13 14 19 20
to processor 4; and its block-column

1 2
13 14
49 50
61 62
97 98
109 110
to processors 2 and 3.

I created a Cartesian communicator and used MPI_Cart_sub() to create row-wise and column-wise communicators as well.

I think I should use MPI_Bcast(), but I do not know how to combine MPI_Bcast() with MPI_Type_create_subarray(). I could first copy the extracted subarray into some local_array and then Bcast(local_array). However, MPI_Type_create_subarray() gives me only a "view" of the subarray, not the data itself, so the best solution I came up with is an Isend/Irecv from root to root.

Is there a more elegant solution?

Thinking

Clarifying the problem

The MPI-based parallel code distributes a 12 x 12 matrix across a 2 x 3 grid using a darray distribution. Each process's local array is initially in column-major order but needs to be rearranged into contiguous 2 x 2 blocks. Broadcasting blocks over rows or columns from the root process is also required.

Broadcast or buffer?

MPI allows for direct subarray broadcasting or copying into a local buffer. Defining derived MPI data types can specialize sub-block broadcasting. Using MPI_DISTRIBUTE_BLOCK or MPI_DISTRIBUTE_CYCLIC aligns the pattern.

Preparing for a targeted layout

To ensure a specific local array layout, I recommend defining MPI_ORDER_C when creating the darray. For row-wise broadcast, use MPI_Cart_sub to form row and column comms, then employ MPI_Bcast with a subarray type.

Crafting a derived datatype

To broadcast a specific sub-block, define a derived datatype using MPI_Type_create_subarray or MPI_Type_vector, commit it, and employ MPI_Bcast. Receiving processes will maintain the sub-block's original memory layout.

Evaluating simple vs. advanced

I'm assessing whether to use a local copy and standard MPI_Bcast for the sub-block or define a derived data type for a more nuanced approach.

Opting for simplicity

I’m inclined to use a local copy in a contiguous array for broadcasting, which is straightforward and effective.

Thought about block cyclic distribution in MPI for 14 seconds

Answer

Below is an annotated discussion of the main points in your question plus some strategies you can use to organize your data and communicate the sub-blocks more elegantly.

1. Why does my local array come out column-wise?

When you use

MPI_Type_create_darray( P*Q, rank, 2, gsizes, distribs, dargs, psizes, MPI_ORDER_FORTRAN, MPI_DOUBLE, &filetype );

the data is read from file in Fortran order (column-major ordering). The DARRAY type in MPI will interpret “global” indexing in Fortran fashion, and then map the piece that belongs to each process into your local buffer.

Hence, the layout of each process’s local portion A is in column-major order within that buffer. The first elements you see in memory (e.g., 1, 13, 49, 61, …) reflect how Fortran indexing jumps down columns first.

If you want the internal storage of the local array in C row-major order, you can try changing the MPI_ORDER_FORTRAN to MPI_ORDER_C. However, be aware that DARRAY with MPI_ORDER_C can sometimes yield unexpected distributions if you are mixing it with code or data arranged in Fortran style. In practice, many people leave the DARRAY as is (Fortran ordering) and then do a local reorder later, or define a different MPI datatype for the final rearranged layout.

2. How do I get 2×2 blocks written consecutively within each process’s local array?

Strictly speaking, a single MPI_Type_create_darray call will not rearrange the sub-blocks in memory so that each of your 2×2 tiles appears contiguously in row-major order. The DARRAY logic is designed to pick out which elements each rank owns, but not necessarily to reorder them in memory the way you want.

Approach A: Two-step read

  1. Step 1: Use MPI_Type_create_darray(...) to read in your chunk of the matrix. You end up with a local buffer A (size = (N*N)/(P*Q) elements) in Fortran-like ordering.
  2. Step 2: Rearrange locally into your “tiling” layout.
    • You already know exactly which 2×2 blocks belong to you and in what order.
    • Create a small function that loops over your local 2D region in the correct increments and writes out the data in the order you want.
    • This is often easiest: just do a local copy from the DARRAY buffer A into a new buffer A_reordered that you control.

This two-step approach is by far the most common in practice because it is simpler to debug, even though it requires a local copy.
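
For concreteness, here is a minimal sketch of such a local rearrangement. It assumes the local piece is N/P = 6 rows by N/Q = 4 columns stored column-major (as in your printed output), uses a block size of 2, and introduces illustrative names (local_rows, local_cols, NB, A_reordered) that are not in your original code:

/* Repack the Fortran-ordered DARRAY buffer A into A_reordered so that each
 * NB x NB block is contiguous: blocks are visited column by column over the
 * local block grid, and each block is stored column-major, reproducing the
 * order 1, 13, 2, 14, 49, 61, 50, 62, ... requested above. */
int local_rows = N / P;   /* 6 */
int local_cols = N / Q;   /* 4 */
int NB = 2;               /* block size */
double *A_reordered = (double*) malloc(local_rows * local_cols * sizeof(double));

int k = 0;
for (int bj = 0; bj < local_cols / NB; bj++)        /* block column        */
    for (int bi = 0; bi < local_rows / NB; bi++)    /* block row           */
        for (int j = 0; j < NB; j++)                /* column inside block */
            for (int i = 0; i < NB; i++)            /* row inside block    */
                A_reordered[k++] = A[(bi * NB + i) + (bj * NB + j) * local_rows];

If you prefer the elements of each 2 x 2 tile in a different order (e.g., row-major within a tile), only the two innermost loops change.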

Approach B: A more advanced single-step read

You could define a “chain” of derived datatypes:

  1. One describing the entire 12×12 in block-cyclic distribution,
  2. Another describing how to reorder each rank’s sub-block into 2×2 tiles,
  3. Possibly combined with a subarray or vector inside the file view.

But doing so is typically far more complicated than a local rearrangement in memory. You would need careful offset/stride calculations so that each sub-block arrives in a contiguous segment. Usually, HPC programmers accept the simpler approach A.

3. Distributing (broadcasting) sub-blocks among processes

You mentioned that, after reading data, you want to send certain block-rows or block-columns from one processor to another. For instance, in a 2×3 grid:

+-----+-----+-----+
|  1  |  2  |  3  |
+-----+-----+-----+
|  4  |  5  |  6  |
+-----+-----+-----+
  • Processor 1 should send its block-row (the horizontal strip of its local blocks) down its process column to processor 4.
  • Processor 1 should also send its block-column (the vertical strip of its local blocks) across its process row to processors 2 and 3, etc.

You already discovered that creating 2D Cartesian sub-communicators via MPI_Cart_sub is the right step:

// 2D communicator "comm2d" of size 2 x 3
MPI_Comm comm2d;
int dims[2] = {P, Q};
int periods[2] = {0, 0};   // usually no wrap-around
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm2d);

// Row-wise communicator: keep dimension 1, drop dimension 0
int remain_dims[2] = {0, 1};
MPI_Comm rowcomm;
MPI_Cart_sub(comm2d, remain_dims, &rowcomm);

// Column-wise communicator: keep dimension 0, drop dimension 1
remain_dims[0] = 1;
remain_dims[1] = 0;
MPI_Comm colcomm;
MPI_Cart_sub(comm2d, remain_dims, &colcomm);
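
As a side note, here is a minimal sketch (assuming the comm2d, rowcomm, and colcomm created above) of how each process can find its grid coordinates and its ranks inside the sub-communicators; those sub-communicator ranks are what you later pass as root to MPI_Bcast:

int myrank, coords[2];
MPI_Comm_rank(comm2d, &myrank);
MPI_Cart_coords(comm2d, myrank, 2, coords);   /* coords[0] = grid row, coords[1] = grid column */

int row_rank, col_rank;
MPI_Comm_rank(rowcomm, &row_rank);            /* my rank among the Q processes of my grid row    */
MPI_Comm_rank(colcomm, &col_rank);            /* my rank among the P processes of my grid column */

/* e.g. to broadcast from the leftmost process of each row, use root = 0 in rowcomm */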

How do we actually send the subarray?

There are two main ways:

  1. Local copy + MPI_Bcast (or MPI_Send/MPI_Recv)

    • Extract the 2×2 (or bigger) sub-block into a contiguous temporary buffer (e.g., block_buffer).
    • Call MPI_Bcast(block_buffer, block_size, MPI_DOUBLE, root, rowcomm) or a series of MPI_Send/MPI_Recv.
    • Everyone then has that sub-block in block_buffer.
  2. Use a derived datatype describing that sub-block

    • Define a derived datatype that points into your local array A at the correct offsets so that a call like
      MPI_Bcast(A, 1, block_subarray_type, root, rowcomm);
      actually sends (and receives) just those 2×2 elements.
    • The advantage is “zero-copy” at the sender. The disadvantage is that all processes (the sender and receivers) must define and commit the exact same derived type so that each rank's MPI library understands how to pick out (and/or place) the elements.

Most people opt for the first approach (local copy into a small buffer) because it is much simpler and typically not a performance bottleneck, especially for relatively small 2×2 or 4×4 blocks. Trying to maintain a derived datatype for each possible sub-block or offset is more error-prone.
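
As a concrete illustration of the first option, here is a minimal sketch that packs the 2 x 4 block-row from the top of a 6 x 4 column-major local array A and broadcasts it down the process column (using colcomm from above; block_buffer and the choice of root are illustrative, not from your code):

/* Pack the top block-row (2 rows x N/Q columns) of the column-major local
 * array A into a contiguous buffer, then broadcast it within this grid column. */
int local_rows = N / P;                 /* 6 rows in the local array             */
int strip_cols = N / Q;                 /* 4 columns in this process's block-row */
double block_buffer[2 * (N / Q)];       /* 2 x 4 doubles, contiguous             */

int col_rank;
MPI_Comm_rank(colcomm, &col_rank);
int root = 0;                           /* the process in grid row 0 owns the strip here */

if (col_rank == root) {
    int k = 0;
    for (int j = 0; j < strip_cols; j++)      /* pack column by column        */
        for (int i = 0; i < 2; i++)           /* top two rows = one block-row */
            block_buffer[k++] = A[i + j * local_rows];
}
MPI_Bcast(block_buffer, 2 * strip_cols, MPI_DOUBLE, root, colcomm);
/* every process in this grid column now holds the 2 x 4 strip in block_buffer */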

Can I Bcast a subarray directly without copying?

Yes, absolutely. The MPI standard allows you to define a subarray (or vector) datatype that describes where the elements live in your local array A. For example, you could do something like:

MPI_Datatype sub_block_t;
// Suppose your local array is 6 x 4 in Fortran order, and you want to broadcast
// the block-row consisting of the top 2 rows and 4 columns.
int sizes[2]    = {6, 4};   // full local "matrix" shape
int subsizes[2] = {2, 4};   // shape of the sub-block to broadcast
int starts[2]   = {0, 0};   // offset from top-left corner
MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_FORTRAN, MPI_DOUBLE, &sub_block_t);
MPI_Type_commit(&sub_block_t);

// Now you can broadcast:
MPI_Bcast(A, 1, sub_block_t, root, rowcomm);
  • On the sending side, MPI will “walk” that sub_block_t in memory, sending only the 2×4 region.
  • On the receiving side, the same call will deposit those elements into the corresponding “positions” in A (assuming A is allocated identically on all processes).

However, if your processes do not want to place the broadcasted data at the exact same indices in memory, you either:

  1. Use the same derived type but also an “origin” offset (which can be tricky).
  2. Or define an equivalent type on the receiving side that references a different region.
  3. Or simply do a local copy on the receiver.

Hence the simplest path is almost always:

  • Root: copy sub-block to a small contiguous buffer
  • MPI_Bcast that buffer with a simple (count, MPI_DOUBLE) call
  • Receiver: store that buffer wherever it likes

Given the small size of blocks (2×2, 4×4, etc.), this is usually faster to implement and less error-prone.


Summary and Recommendations

  1. Accept the Fortran ordering in the DARRAY read (or set MPI_ORDER_C if you absolutely want row-major, but keep in mind that you can still end up with a non-intuitive layout in memory).
  2. If you want each 2×2 block contiguous in your local buffer, do a post-read local rearrange. This is far simpler than trying to build a fancy DARRAY that also re-tiles memory.
  3. For sending sub-blocks to other ranks (block-rows or block-columns):
    • Easiest method:
      1. Extract the sub-block into a small contiguous buffer on the sender,
      2. Use MPI_Bcast (or MPI_Send/MPI_Recv) of that contiguous buffer,
      3. Place it in the desired location on the receiver.
    • More advanced method:
      • Build a custom MPI derived datatype (e.g., MPI_Type_create_subarray or MPI_Type_vector) that describes your sub-block’s location in memory.
      • Call MPI_Bcast(A, 1, sub_block_type, root, comm) to send/receive in-place.
      • Make sure every process sees the same derived type description or an equivalent mapping for correct reception.

In practice, most codes keep the DARRAY approach for reading/writing files (since it nicely hides the complexity of global distribution) and then do local copies (or a simpler derived data type) for sub-block communication. Zero-copy solutions via complicated derived datatypes are possible but rarely worth the trouble unless you are chasing very specific performance bottlenecks.
