Programming Examples

Code examples

We give examples of some of the core functionalities of Mamba including:

  • memory space allocation

  • Mamba array construction

  • array tiling

  • tile access

Allocations

An Allocation object provides an abstract container for a memory allocation in a specific memory space. To create an allocation object for CPU and GPU, we must first request memory space. Note that in this step the size, limits, default behaviours, etc can be set.

mmbMemSpaceConfig *dram_space_config;
mmb_memspace_config_create_default (&dram_space_config);
mmbMemSpace *dram_space, *gpu_space;
mmb_request_space(MMB_DRAM, MMB_EXECUTION_CONTEXT_DEFAULT,
                   dram_space_config, &dram_space);
mmb_request_space(MMB_GDRAM, MMB_GPU_CUDA, NULL, &gpu_space);

Next, request a memory interface for the allocation:

mmbMemInterfaceConfig dram_interface_config =
 {.provider = MMB_PROVIDER_DEFAULT, .strategy = MMB_POOLED};
mmbMemInterface *dram_interface, *gpu_interface;
mmb_request_interface(dram_space, &dram_interface_config,
                       &dram_interface);
mmb_request_interface(gpu_space, NULL, &gpu_interface);

Finally, allocate a buffer on the CPU, and another buffer on the GPU:

mmbAllocation *host_buffer, *gpu_buffer;
mmbAllocation *host_buffer, *gpu_buffer;
mmb_allocate (n_bytes, dram_interface, &host_buffer);
mmb_allocate (n_bytes, gpu_interface, &gpu_buffer):
fill_host_buffer (host_buffer);

A generic copy function between the buffers can be performed as follows:

mmb_copy (gpu_buffer, host_buffer);

Arrays

A Mamba array is an array-like data structure that forms the core abstraction of the Mamba library. Subsets of the array may be duplicated or moved between memories. During construction we can specificy different array distributions such as block cyclic across tiles, pre-tiled across different spaces, etc. We construct a regular 2D array layout:

mmbLayout *layout;
mmb_layout_create_regular_nd(sizeof(float), 2, MMB_ROWMAJOR,
                             MMB_PADDING_NONE, &layout);

To construct a Mamba array of size MxN:

size_t adims[2] = {array_size_M, array_size_N};
mmbDimensions array_size = {2, adims};
mmbArray *array;
mmb_array_create(&array_size, layout, dram_interface,
                 MMB_READ_WRITE, &array);

Fill the newly created array with data:

fill_array(array);

And move the entire array to the GPU:

mmb_array_migrate(array, gpu_interface);

Note that the gpu_interface could be an interface to a single GPU, or additional options can be provided using e.g. device index.

Array Tiles

Mamba arrays may be decomposed into subsets, so-called array tiles, for iteration or movement between memory spaces. This decomposition process is called tiling an array. To tile an array, we first create an array tile:

mmb_array_tile_2d(array, tile_size_M, tile_size_N);

To loop over the tiling, request a tile at each index (iteration over tile sets using iterator objects such as schedules, prefetching, automatic sizing, etc. is also possible):

mmbArrayTile *tile;
mmbDimensions *tiling_dims;
mmb_tiling_dimensions(mba, &tiling_dims);
for (size_t ti = 0; ti < tiling_dims->d[0]; ++ti) {
 for (size_t tj = 0; tj < tiling_dims->d[1]; ++tj) {
   mmb_tile_at_2d(mba, ti, tj, &tile);

Within the loop, we can now duplicate tile data on the GPU. As we created a 2D array tile, the duplicate uses e.g. cudaMemcpy2D implicitly:

mmbArrayTile *duplicate_tile;
mmb_tile_duplicate(tile, gpu_interface,
                  MMB_READ_WRITE, &duplicate_title);

Run the GPU kernel:

run_cude_kernel(duplicate_tile);

Merge the duplicate tile back to the original via overwrite and close the for loops:

   mmb_tile_merge(duplicate_tile, MMB_OVERWRITE);
 }
}

Tile Access

Tile access can be performed using multiple approaches - with or without direct pointer access, as well as with user indexing. While macros are not necessary, they can provide convenient indexing for non-standard layouts such as block-cyclic tiles.

void zero_tile(mmbArrayTile *t) {
 float *ptr = mmb_tile_get_ptr(t);
 for(size_t i = t->lower[0]; i < t->upper[0]; i++)
   for(size_t j = t->lower[1]; j < t->upper[1]; j++){
     // Without direct pointer access
     MMB_IDX_2D(t, i, j, float) = 0;
     // OR: With direct pointer access
     ptr[MMB_IDX_EXPR_2D(i,j)] = 0;
     // OR: With user indexing
     ptr[i * t->dim[1] + j)] = 0; }
}

Fortran tiles can pass in an appropriately dimensioned pointer for regular indexing:

block
 real, pointer, dimension(:,:) :: tp
 type(mmbTileData) tile_mdata
 call mmb_tile_get_mdata(tile_c,tile_mdata,tp)
 do j=tile_mdata%lower(1),tile_mdata%upper(1)
   do i=tile_mdata%lower(2),tile_mdata%upper(2)
     tp(i,j) = 0.0
   end do
 end do
end block

Tile metadata is by default located in the CPU local space. The API to request a space-local handle for e.g. GPU-local tile metadata to pass into a kernel would be used as follows:

extern "C" void run_cuda_kernel(mmbArrayTile *tile){
 size_t block_size = 16;
 dim3 block_dim = dim3(block_size, block_size)
 dim3 grid_dim = (tile->dim[0] / block_width,
                 tile->dim[1] / block_height);
 mmbArrayTile *dev_tile;
 mmb_tile_get_space_local_handle(tile, &dev_tile);
 cuda_compute_kernel<<<grid_dim, block_dim>>>(dev_tile);
}

Examples Overview

Examples are found in mamba/build/examples/, or /path/to/install/dir/examples. Each example is shown in C and fortran, and briefly described here with instructions on use.

1d_array_copy

This shows the construction, tiled initialisation, and copy of a 1d mamba array to another 1d mamba array with matching layout and size, with full error checking.

Source file: examples/c/1d_array_copy.c | examples/fortran/1d_array_copy.f90

Usage: ./1d_array_copy | ./1d_array_copy_f

1d_array_copy_wrapped

The same as 1d_array_copy but using arrays contructed from existing user pointers.

Source file: examples/c/1d_array_copy_wrapped.c | examples/fortran/1d_array_copy_wrapped.f90

Usage: ./1d_array_copy_wrapped | ./1d_array_copy_wrapped_f

tile_duplicate

This shows construction of a 1d array, tiling, duplication and merging of tiles.

Source file: examples/c/tile_duplicate.c

Usage: ./tile_duplicate

matrix_multiply

This demonstrates a tiled matrix multiply using 3 mamba arrays constructed on top of pre-initialised (with random or identity values) matrix buffers.

Source file: examples/c/matrix_multiply.c

Usage:

(all args optional): ./matrix_multiply -v (for verbose mode) -t N (for tile size NxN) -m N (for matrix size NxN) -i (use identity for matrix B)

matrix_multiply_cuda (C only)

This demonstrates a tiled matrix multiply using multiple mamba arrays constructed on top of pre-initialised (with random or identity values) matrix buffers. This example also present how to allocate and use memory on different memory devices (DRAM, GPU, HBM, …), and how to copy from one memory tier to an other. This example shows as well how to use different strategies and/or different memory providers.

This example works the same as the matrix_multiply example, excepted that it requires extra steps to pass the data to the actual kernel (in addition to allocate the data on the GPU memory, the tiling information needs to be forwarded as well). The CUDA file only deals with this forwarding (the packing is done in examples/c/matrix_multiply_cuda.c). For now the tiles are not executed in parallel, but it is a work in progress.

Source files: examples/c/matrix_multiply_cuda.c, examples/c/matrix_multiply_cuda_ker.cu, examples/c/matrix_multiply_cuda.h

Usage:

(all args optional): ./matrix_multiply_cuda -v (for verbose mode) -t N (for tile size NxN) -m N (for matrix size NxN) -i (use identity for matrix B)

loop description (C only)

This example demonstrates the description of a loop using the loop description, followed by PET/ISL based polyhedral analysis of the loop with dependence computation. The loop description, auxiliary analysis information and calculated loop dependencies are output to the terminal.

Source files: examples/c/loop_description.c

Usage: ./loop_description

report_mem_state (C only)

This example show the output of the function mmb_dump_memory_state that dump to the FILE * given as parameter the current state of the memory system as retained by the MAMBA Memory Manager.

Source file: examples/c/report_mem_state.c

Usage: ./report_mem_state