Update HPC docs

This commit is contained in:
Nathan Mannall
2025-06-27 17:59:45 +01:00
Parent e9a247e4dc
Commit 591dba9db8
4 changed files with 175 additions and 123 deletions

View file

@@ -75,7 +75,7 @@ Package overview
* ``requirements.txt`` is a configuration file for pip that sets up a Python environment with all the required Python packages for gprMax.
* ``setup.py`` is the centre of all activity in building, distributing, and installing gprMax, including building and compiling the Cython extension modules.
.. _installation:
Installation
============

View file

@@ -1,14 +1,17 @@
.. _accelerators:
******************
OpenMP/CUDA/OpenCL
******************
**********************
OpenMP/MPI/CUDA/OpenCL
**********************
The most computationally intensive parts of gprMax, which are the FDTD solver loops, have been parallelized using different CPU and GPU accelerators to offer performance and flexibility.
1. `OpenMP <http://openmp.org>`_ which supports multi-platform shared memory multiprocessing.
2. `NVIDIA CUDA <https://developer.nvidia.com/cuda-toolkit>`_ for NVIDIA GPUs.
3. `OpenCL <https://www.khronos.org/api/opencl>`_ for a wider range of CPU and GPU hardware.
2. `OpenMP <http://openmp.org>`_ + `MPI <https://mpi4py.readthedocs.io/en/stable/>`_ enables parallelism beyond shared memory multiprocessing (e.g. multiple nodes on an HPC system).
3. `NVIDIA CUDA <https://developer.nvidia.com/cuda-toolkit>`_ for NVIDIA GPUs.
4. `OpenCL <https://www.khronos.org/api/opencl>`_ for a wider range of CPU and GPU hardware.
Each of these approaches to acceleration has different characteristics and hardware/software support. While all of them can offer increased performance, OpenMP + MPI can also increase the modelling capabilities of gprMax when running on a multi-node system (e.g. HPC environments). It does this by distributing models across multiple nodes, increasing the total amount of memory available and allowing larger models to be simulated.
Additionally, the Message Passing Interface (MPI) can be utilised to implement a simple task farm that distributes a series of models as independent tasks. This can be useful in many GPR simulations where a B-scan (composed of multiple A-scans) is required. Each A-scan can be task-farmed as an independent model, and within each model, OpenMP or CUDA can still be used for parallelism. This creates mixed-mode OpenMP/MPI or CUDA/MPI environments.
@@ -24,29 +27,94 @@ OpenMP
No additional software is required to use OpenMP as it is part of the standard installation of gprMax.
By default, gprMax will try to determine and use the maximum number of OpenMP threads (usually the number of physical CPU cores) available on your machine. You can override this behaviour in two ways: firstly, gprMax will check to see if the ``#cpu_threads`` command is present in your input file; if not, gprMax will check to see if the environment variable ``OMP_NUM_THREADS`` is set. This can be useful if you are running gprMax in a High-Performance Computing (HPC) environment where you might not want to use all of the available CPU cores.
By default, gprMax will try to determine and use the maximum number of OpenMP threads (usually the number of physical CPU cores) available on your machine. You can override this behaviour in two ways: firstly, gprMax will check to see if the ``#omp_threads`` command is present in your input file; if not, gprMax will check to see if the environment variable ``OMP_NUM_THREADS`` is set. This can be useful if you are running gprMax in a High-Performance Computing (HPC) environment where you might not want to use all of the available CPU cores.
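For example, a minimal illustrative run that limits gprMax to four OpenMP threads from the shell (the thread count and input file here are arbitrary choices):

.. code-block:: console

    (gprMax)$ export OMP_NUM_THREADS=4
    (gprMax)$ python -m gprMax examples/cylinder_Ascan_2D.in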
MPI
===
By default, the MPI task farm functionality is turned off. It can be used with the ``-taskfarm`` command line option, which specifies the total number of MPI tasks, i.e. master + workers, for the MPI task farm. This option is most usefully combined with ``-n`` to allow individual models to be farmed out using an MPI task farm, e.g. to create a B-scan with 60 traces and use MPI to farm out each trace: ``(gprMax)$ python -m gprMax examples/cylinder_Bscan_2D.in -n 60 -taskfarm 61``.
No additional software is required to use MPI as it is part of the standard installation of gprMax. However, you will need to :ref:`build h5py with MPI support<h5py_mpi>` if you plan to use the MPI domain decomposition functionality.
Software required
-----------------
There are two ways to use MPI with gprMax:
The following steps provide guidance on how to install the extra components to allow the MPI task farm functionality with gprMax:
- Domain decomposition - divides a single model across multiple MPI ranks.
- Task farm - distribute multiple models as independent tasks to each MPI rank.
1. Install MPI on your system.
.. _mpi_domain_decomposition:
Linux/macOS
^^^^^^^^^^^
It is recommended to use `OpenMPI <http://www.open-mpi.org>`_.
Domain decomposition
--------------------
Microsoft Windows
^^^^^^^^^^^^^^^^^
It is recommended to use `Microsoft MPI <https://docs.microsoft.com/en-us/message-passing-interface/microsoft-mpi>`_. Download and install both the .exe and .msi files.
Open a Terminal (Linux/macOS) or Command Prompt (Windows), navigate into the top-level gprMax directory, and if it is not already active, activate the gprMax conda environment: ``conda activate gprMax``
2. Install the ``mpi4py`` Python module. Open a Terminal (Linux/macOS) or Command Prompt (Windows), navigate into the top-level gprMax directory, and if it is not already active, activate the gprMax conda environment :code:`conda activate gprMax`. Run :code:`pip install mpi4py`
Run one of the 2D test models:
.. code-block:: console
(gprMax)$ mpirun -n 4 python -m gprMax examples/cylinder_Ascan_2D.in --mpi 2 2 1
The ``--mpi`` argument passed to gprMax takes three integers to define the number of MPI processes in the x, y, and z dimensions to form a Cartesian grid. The product of these three numbers should equal the number of MPI ranks. In this case ``2 x 2 x 1 = 4``.
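As a further sketch (using ``mymodel.in`` as a placeholder for a 3D input file), a 2 x 2 x 2 decomposition across 8 ranks could be requested with:

.. code-block:: console

    (gprMax)$ mpirun -n 8 python -m gprMax mymodel.in --mpi 2 2 2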
.. _fractal_domain_decomposition:
Decomposition of Fractal Geometry
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are some restrictions when using MPI domain decomposition with
:ref:`fractal user objects <fractals>`.
.. warning::
gprMax will throw an error during the model build phase if the MPI
decomposition is incompatible with the model geometry.
**#fractal_box**
When a ``#fractal_box`` has a mixing model attached, it will perform parallel fast Fourier transforms (FFTs) as part of its construction. To support this, the MPI domain decomposition of the fractal box must have size one in at least one dimension:
.. _fractal_domain_decomposition_figure:
.. figure:: ../../images_shared/fractal_domain_decomposition.png
Example slab and pencil decompositions. These decompositions could
be specified with ``--mpi 8 1 1`` and ``--mpi 3 3 1`` respectively.
.. note::
This does not necessarily mean the whole model domain needs to be
divided this way. So long as the volume covered by the fractal box
is divided into either slabs or pencils, the model can be built.
This includes the volume covered by attached surfaces added by the
``#add_surface_water``, ``#add_surface_roughness``, or
``#add_grass`` commands.
**#add_surface_roughness**
When adding surface roughness, a parallel fast Fourier transform is
applied across the 2D surface of a fractal box. Therefore, the MPI
domain decomposition across the surface must be size one in at least one
dimension.
For example, in figure :numref:`fractal_domain_decomposition_figure`, surface
roughness can be attached to any surface when using the slab
decomposition. However, if using the pencil decomposition, it could not
be attached to the XY surfaces.
**#add_grass**
Domain decomposition of grass is not currently supported. Grass can
still be built in a model so long as it is fully contained within a
single MPI rank.
Task farm
---------
By default, the MPI task farm functionality is turned off. It can be enabled with the ``--taskfarm`` command line option. This option is most usefully combined with ``-n`` to allow individual models to be farmed out using an MPI task farm, e.g. to create a B-scan with 60 traces and use MPI to farm out each trace:
.. code-block:: console
(gprMax)$ python -m gprMax examples/cylinder_Bscan_2D.in -n 60 --taskfarm
CUDA
@@ -68,7 +136,7 @@ Open a Terminal (Linux/macOS) or Command Prompt (Windows), navigate into the top
Run one of the test models:
.. code-block:: none
.. code-block:: console
(gprMax)$ python -m gprMax examples/cylinder_Ascan_2D.in -gpu
@@ -95,7 +163,7 @@ Open a Terminal (Linux/macOS) or Command Prompt (Windows), navigate into the top
Run one of the test models:
.. code-block:: none
.. code-block:: console
(gprMax)$ python -m gprMax examples/cylinder_Ascan_2D.in -opencl
@@ -115,10 +183,10 @@ Example
For example, to run a B-scan that contains 60 A-scans (traces) on a system with 4 GPUs:
.. code-block:: none
.. code-block:: console
(gprMax)$ python -m gprMax examples/cylinder_Bscan_2D.in -n 60 -taskfarm 5 -gpu 0 1 2 3
(gprMax)$ python -m gprMax examples/cylinder_Bscan_2D.in -n 60 --taskfarm -gpu 0 1 2 3
.. note::
The argument given with ``-taskfarm`` is the number of MPI tasks, i.e. master + workers, for the MPI task farm. So in this case, 1 master (CPU) and 4 workers (GPU cards). The integers given with the ``-gpu`` argument are the NVIDIA CUDA device IDs for the specific GPU cards to be used.
When running a task farm, one MPI rank runs on the CPU as a coordinator (master) while the remaining worker ranks each use their own GPU. Therefore the number of MPI ranks should equal the number of GPUs + 1. The integers given with the ``-gpu`` argument are the NVIDIA CUDA device IDs for the specific GPU cards to be used.

View file

@@ -4,11 +4,47 @@
HPC
***
High-performance computing (HPC) environments usually require jobs to be submitted to a queue using a job script. The following are examples of job scripts for an HPC environment that uses `Open Grid Scheduler/Grid Engine <http://gridscheduler.sourceforge.net/index.html>`_, and are intended as general guidance to help you get started. Using gprMax in an HPC environment is heavily dependent on the configuration of your specific HPC/cluster, e.g. the names of parallel environments (``-pe``) and compiler modules will depend on how they were defined by your system administrator.
Using gprMax in an HPC environment is heavily dependent on the configuration of your specific HPC/cluster, e.g. compiler modules, programming environments, and job submission processes will vary between systems.
.. note::
General details about the types of acceleration available in gprMax are shown in the :ref:`accelerators` section.
OpenMP example
==============
Installation
============
Full installation instructions for gprMax can be found in the :ref:`Getting Started guide <installation>`; however, HPC systems' programming environments can vary (and often have pre-installed software). For example, the following can be used to install gprMax on `ARCHER2, the UK National Supercomputing Service <https://www.archer2.ac.uk/>`_:
.. code-block:: console
$ git clone https://github.com/gprMax/gprMax.git
$ cd gprMax
$ module load PrgEnv-gnu
$ module load cray-python
$ module load cray-fftw
$ module load cray-hdf5-parallel
$ export CC=cc
$ export CXX=CC
$ export FC=ftn
$ python -m venv --system-site-packages --prompt gprMax .venv
$ source .venv/bin/activate
(gprMax)$ python -m pip install --upgrade pip
(gprMax)$ HDF5_MPI='ON' python -m pip install --no-binary=h5py h5py
(gprMax)$ python -m pip install -r requirements.txt
(gprMax)$ python -m pip install -e .
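As an optional check (a quick sketch, assuming the parallel build above succeeded), you can confirm that h5py was built against MPI by inspecting its build configuration, which should print ``True``:

.. code-block:: console

    (gprMax)$ python -c "import h5py; print(h5py.get_config().mpi)"
    True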
.. tip::
Consult your system's documentation for site-specific information.
Job Submission examples
=======================
High-performance computing (HPC) environments usually require jobs to be submitted to a queue using a job script. The following are examples of job scripts for an HPC environment that uses `Open Grid Scheduler/Grid Engine <http://gridscheduler.sourceforge.net/index.html>`_, and are intended as general guidance to help you get started. The names of parallel environments (``-pe``) and compiler modules will depend on how they were defined by your system administrator.
OpenMP
^^^^^^
:download:`gprmax_omp.sh <../../toolboxes/Utilities/HPC/gprmax_omp.sh>`
@@ -20,22 +56,15 @@ Here is an example of a job script for running models, e.g. A-scans to make a B-
In this example 10 models will be run one after another on a single node of the cluster (on this particular cluster a single node has 16 cores/threads available). Each model will be parallelised using 16 OpenMP threads.
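Stripped of the scheduler directives, the core of such a script amounts to setting the thread count and running the models (a minimal sketch; the downloadable script above is the reference, and ``mymodel.in`` is a placeholder input file):

.. code-block:: bash

    export OMP_NUM_THREADS=16          # OpenMP threads per model
    python -m gprMax mymodel.in -n 10  # run 10 models one after another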
MPI + OpenMP
============
There are two ways to use MPI with gprMax:
- Domain decomposition - divides a single model across multiple MPI ranks.
- Task farm - distribute multiple models as independent tasks to each MPI rank.
.. _mpi_domain_decomposition:
MPI domain decomposition example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MPI domain decomposition
^^^^^^^^^^^^^^^^^^^^^^^^
Here is an example of a job script for running a model across multiple tasks in an HPC environment using MPI. The behaviour of most of the variables is explained in the comments in the script.
.. note::
This example is based on the `ARCHER2 <https://www.archer2.ac.uk/>`_ system and uses the `SLURM <https://slurm.schedmd.com/>`_ scheduler.
.. literalinclude:: ../../toolboxes/Utilities/HPC/gprmax_omp_mpi.sh
:language: bash
:linenos:
@@ -51,61 +80,14 @@ In this example, the model will be divided across 8 MPI ranks in a 2 x 2 x 2 pat
The ``--mpi`` argument is passed to gprMax and takes three integers to define the number of MPI processes in the x, y, and z dimensions to form a Cartesian grid.
The ``NSLOTS`` variable, which is required to set the total number of slots/cores for the parallel environment ``-pe mpi``, is usually the number of MPI tasks multiplied by the number of OpenMP threads per task. In this example, the number of MPI tasks is 8 and the number of OpenMP threads per task is 16, so 128 slots are required.
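A sketch of how these numbers fit together in a grid engine script (the parallel environment name ``mpi`` and the input file are placeholders that depend on your system):

.. code-block:: bash

    #$ -pe mpi 128             # 8 MPI tasks x 16 OpenMP threads = 128 slots
    export OMP_NUM_THREADS=16  # OpenMP threads per MPI task
    mpirun -n 8 python -m gprMax mymodel.in --mpi 2 2 2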
Decomposition of Fractal Geometry
---------------------------------
There are some restrictions when using MPI domain decomposition with
:ref:`fractal user objects <fractals>`.
.. warning::
gprMax will throw an error during the model build phase if the MPI
decomposition is incompatible with the model geometry.
**#fractal_box**
When a ``#fractal_box`` has a mixing model attached, it will perform parallel fast Fourier transforms (FFTs) as part of its construction. To support this, the MPI domain decomposition of the fractal box must have size one in at least one dimension:
.. _fractal_domain_decomposition:
.. figure:: ../../images_shared/fractal_domain_decomposition.png
Example slab and pencil decompositions. These decompositions could
be specified with ``--mpi 8 1 1`` and ``--mpi 3 3 1`` respectively.
Unlike the grid engine examples, here we specify the number of CPUs per task (16) and the number of tasks (8), rather than the total number of CPUs/slots.
.. note::
This does not necessarily mean the whole model domain needs to be
divided this way. So long as the volume covered by the fractal box
is divided into either slabs or pencils, the model can be built.
This includes the volume covered by attached surfaces added by the
``#add_surface_water``, ``#add_surface_roughness``, or
``#add_grass`` commands.
Some restrictions apply to the domain decomposition when using fractal geometry as explained :ref:`here <fractal_domain_decomposition>`.
**#add_surface_roughness**
When adding surface roughness, a parallel fast Fourier transform is
applied across the 2D surface of a fractal box. Therefore, the MPI
domain decomposition across the surface must be size one in at least one
dimension.
For example, in figure :numref:`fractal_domain_decomposition`, surface
roughness can be attached to any surface when using the slab
decomposition. However, if using the pencil decomposition, it could not
be attached to the XY surfaces.
**#add_grass**
Domain decomposition of grass is not currently supported. Grass can
still be built in a model so long as it is fully contained within a
single MPI rank.
MPI task farm example
^^^^^^^^^^^^^^^^^^^^^
MPI task farm
^^^^^^^^^^^^^
:download:`gprmax_omp_taskfarm.sh <../../toolboxes/Utilities/HPC/gprmax_omp_taskfarm.sh>`
@@ -122,8 +104,8 @@ The ``--taskfarm`` argument is passed to gprMax which takes the number of MPI ta
The ``NSLOTS`` variable, which is required to set the total number of slots/cores for the parallel environment ``-pe mpi``, is usually the number of MPI tasks multiplied by the number of OpenMP threads per task. In this example, the number of MPI tasks is 11 and the number of OpenMP threads per task is 16, so 176 slots are required.
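The corresponding slot calculation in a grid engine script is roughly (a sketch; the parallel environment name is site-specific):

.. code-block:: bash

    #$ -pe mpi 176             # 11 MPI tasks x 16 OpenMP threads = 176 slots
    export OMP_NUM_THREADS=16  # OpenMP threads per MPI task (each gprMax model)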
Job array example
=================
Job array
^^^^^^^^^
:download:`gprmax_omp_jobarray.sh <../../toolboxes/Utilities/HPC/gprmax_omp_jobarray.sh>`

View file

@@ -1,37 +1,39 @@
#!/bin/sh
#####################################################################################
### Change to current working directory:
#$ -cwd
### Specify runtime (hh:mm:ss):
#$ -l h_rt=01:00:00
### Email options:
#$ -m ea -M joe.bloggs@email.com
### Resource reservation:
#$ -R y
### Parallel environment ($NSLOTS):
#$ -pe mpi 128
#!/bin/bash
### Job script name:
#$ -N gprmax_omp_mpi.sh
#####################################################################################
#SBATCH --job-name="gprMax MPI demo"
### Initialise environment module
. /etc/profile.d/modules.sh
### Number of MPI tasks:
#SBATCH --ntasks=8
### Load and activate Anaconda environment for gprMax, i.e. Python 3 and required packages
module load anaconda
source activate gprMax
### Number of CPUs (OpenMP threads) per task:
#SBATCH --cpus-per-task=16
### Load OpenMPI
module load openmpi
### Runtime limit:
#SBATCH --time=0:10:0
### Set number of OpenMP threads per MPI task (each gprMax model)
export OMP_NUM_THREADS=16
### Partition and quality of service to use (these control the type and
### amount of resources you are allowed to request):
#SBATCH --partition=standard
#SBATCH --qos=standard
### Run gprMax with input file
cd $HOME/gprMax
mpirun -n 8 python -m gprMax mymodel.in --mpi 2 2 2
### Hints to control MPI task layout:
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:block
# Set number of OpenMP threads from SLURM environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Ensure the cpus-per-task option is propagated to srun commands
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# Load system modules
module load PrgEnv-gnu
module load cray-python
# Load Python virtual environment
source .venv/bin/activate
# Run gprMax with input file
srun python -m gprMax my_model.in --mpi 2 2 2
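Once saved (for example as ``gprmax_omp_mpi.sh``), a SLURM job script like the one above is typically submitted and monitored with the standard SLURM tools:

.. code-block:: console

    $ sbatch gprmax_omp_mpi.sh
    $ squeue -u $USER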