The thumbnail image was created with ChatGPT-4o.
This post explains how to install FEniCSx with CUDA-enabled PETSc using Spack on an HPC system (ALCF Polaris).
1. Accelerating FEniCSx simulations using GPUs
Suppose we are running an FEM simulation and want to obtain the solution field at the next timestep.
Each timestep involves two steps:
1) Assembly
2) Solving the linear system $Ax = b$
The assembly step computes the element matrix $A_e$ and element vector $b_e$ for each element, applies boundary conditions to $A_e$ and $b_e$, and constructs the global matrix $A$ and global vector $b$. A sparse solver from one of the linear algebra backends (e.g. PETSc, Eigen, or CuPy) then solves the linear system to obtain the solution at the next timestep.
Generally, for systems with many degrees of freedom, such as LES simulations, assembly takes 10~20% of the total computation time while solving the linear system takes 80~90%. It is therefore more effective to accelerate the linear solve.
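As a concrete illustration, here is a minimal sketch of the two steps for a Poisson problem in DOLFINx (boundary conditions are omitted for brevity; names such as fem.functionspace and uh.x.petsc_vec assume a recent DOLFINx release and may differ slightly between versions):
from mpi4py import MPI
from petsc4py import PETSc
import ufl
from dolfinx import fem, mesh
from dolfinx.fem.petsc import assemble_matrix, assemble_vector

domain = mesh.create_unit_square(MPI.COMM_WORLD, 64, 64)
V = fem.functionspace(domain, ("Lagrange", 1))
u, v = ufl.TrialFunction(V), ufl.TestFunction(V)
a = fem.form(ufl.inner(ufl.grad(u), ufl.grad(v)) * ufl.dx)
L = fem.form(ufl.inner(fem.Constant(domain, PETSc.ScalarType(1.0)), v) * ufl.dx)

# Step 1: Assemble element contributions into the global matrix A and vector b
A = assemble_matrix(a)
A.assemble()
b = assemble_vector(L)
b.ghostUpdate(addv=PETSc.InsertMode.ADD, mode=PETSc.ScatterMode.REVERSE)

# Step 2: Solve the linear system Ax = b with a PETSc Krylov solver
ksp = PETSc.KSP().create(domain.comm)
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
uh = fem.Function(V)
ksp.solve(b, uh.x.petsc_vec)   # uh.vector in older DOLFINx versions
uh.x.scatter_forward()
The PETSc types discussed next are what move this second, dominant step onto the GPU.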
PETSc provides GPU support for solving linear systems. To use it, the types of petsc4py.PETSc.Vec and petsc4py.PETSc.Mat must be set to GPU-compatible types with petsc4py.PETSc.Vec.setType('cuda') and petsc4py.PETSc.Mat.setType('aijcusparse'), respectively:
A.setType("aijcusparse")  # cuSPARSE-backed matrix
x.setType("cuda")         # CUDA-backed vectors
b.setType("cuda")
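Alternatively, the same types can be selected at runtime through PETSc's options database, as long as the objects call setFromOptions(). A small sketch (the script name solve.py is just a placeholder):
import sys
import petsc4py
petsc4py.init(sys.argv)      # forward command-line options to PETSc
from petsc4py import PETSc

# Run e.g.:  python solve.py -vec_type cuda -mat_type aijcusparse
A = PETSc.Mat().createAIJ([100, 100], comm=PETSc.COMM_WORLD)
A.setFromOptions()           # picks up -mat_type if it was passed
A.setUp()

x = PETSc.Vec().create(comm=PETSc.COMM_WORLD)
x.setSizes(100)
x.setFromOptions()           # picks up -vec_type if it was passed
This is convenient for switching between CPU and GPU runs without editing the code.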
2. Installing FEniCSx with CUDA-enabled PETSc
To use these GPU types, PETSc must be configured with CUDA support. In addition, the hypre option should be enabled when configuring PETSc so that the hypre preconditioner can be used.
With Spack, PETSc with the cuda and hypre variants can be installed using the spack.yaml file below. A CUDA-enabled build of mpich should also be installed.
# This is a Spack Environment file.
#
# It describes a set of packages to be installed, along with
# configuration settings.
spack:
  specs:
  - petsc +cuda +hypre cuda_arch=80
  - hypre +cuda cuda_arch=80
  - fenics-dolfinx+adios2
  - py-fenics-dolfinx cflags=-O3 fflags=-O3
  - py-torch+cuda cuda_arch=80
  - mpich +cuda
  - py-pip
  view: true
  concretizer:
    unify: true
Building py-torch from source can take 40-60 minutes.
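With this spack.yaml in place, a typical Spack workflow looks roughly like the following (the environment name fenicsx-cuda is arbitrary; compiler and module details on your system may differ):
spack env create fenicsx-cuda spack.yaml   # create an environment from the file above
spack env activate fenicsx-cuda            # activate it
spack concretize -f                        # resolve the dependency graph
spack install                              # build everything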
We can test FEniCSx with the CUDA-enabled PETSc using the following code:
from petsc4py import PETSc
vec = PETSc.Vec().create(comm=PETSc.COMM_WORLD)
vec.setSizes(100)
vec.setType("mpicuda")
vec.setFromOptions()
vec.set(1.0)
print("Vec type:", vec.getType())
mat = PETSc.Mat().createAIJ([100, 100])
mat.setFromOptions()
mat.setType("mpiaijcusparse")
mat.setUp()
print("Mat type:", mat.getType())
Result
hjkim@x3206c0s25b0n0:/grand/NeuralDE/hjkim> mpiexec -n 2 python mpitest.py
Warning: Permanently added 'x3206c0s25b0n0.hsn.cm.polaris.alcf.anl.gov,10.201.0.255' (ECDSA) to the list of known hosts.
Vec type: mpicuda
Mat type: mpiaijcusparse
Vec type: mpicuda
Mat type: mpiaijcusparse
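Beyond this type check, the +hypre variant in the Spack spec makes hypre's BoomerAMG preconditioner available from petsc4py. A minimal sketch of selecting it (not used in the benchmarks below):
from petsc4py import PETSc

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setType("cg")
pc = ksp.getPC()
pc.setType("hypre")            # requires PETSc built with +hypre
pc.setHYPREType("boomeramg")   # algebraic multigrid preconditioner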
3. Performance check
We can measure the benefit of GPU acceleration by comparing GPU runs against CPU runs.
CPU execution code
from petsc4py import PETSc
from mpi4py import MPI
import time
comm = PETSc.COMM_WORLD
mpi_comm = MPI.COMM_WORLD
rank = comm.getRank()
mpi_comm.Barrier()
t0 = MPI.Wtime()
n = 1000000
A = PETSc.Mat().createAIJ([n, n], comm=comm)
A.setType("mpiaij") # CPU matrix
A.setFromOptions()
A.setUp()
start, end = A.getOwnershipRange()
# Tridiagonal 1D Laplacian: 2 on the diagonal, -1 on the off-diagonals
for i in range(start, end):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()
b = PETSc.Vec().create(comm=comm)
b.setSizes(n)
b.setFromOptions()
b.set(1.0)
x = b.duplicate()
ksp = PETSc.KSP().create(comm=comm)
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
ksp.setFromOptions()
ksp.solve(b, x)
mpi_comm.Barrier()
t1 = MPI.Wtime()
elapsed = t1 - t0
max_elapsed = mpi_comm.reduce(elapsed, op=MPI.MAX, root=0)
sum_elapsed = mpi_comm.reduce(elapsed, op=MPI.SUM, root=0)
if rank == 0:
    print(f"[CPU] Total Wall Time (Wtime): {max_elapsed:.4f} sec")
    print(f"[CPU] Sum of All Ranks' Time: {sum_elapsed:.4f} sec")
    print(f"[CPU] Residual norm: {ksp.getResidualNorm():.2e}")
GPU execution code
import os
import sys
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
from petsc4py import PETSc
import time
import torch
comm = PETSc.COMM_WORLD
mpi_comm = MPI.COMM_WORLD
rank = comm.getRank()
mpi_comm.Barrier()
t0 = MPI.Wtime()
if torch.cuda.is_available():
    print(f"[Rank {rank}] Using GPU {torch.cuda.current_device()} - {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
    print(f"[Rank {rank}] No GPU available")
n = 10000000
A = PETSc.Mat().createAIJ([n, n], comm=comm)
A.setType("mpiaijcusparse") # GPU matrix
A.setFromOptions()
A.setUp()
start, end = A.getOwnershipRange()
# Tridiagonal 1D Laplacian: 2 on the diagonal, -1 on the off-diagonals
for i in range(start, end):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()
b = PETSc.Vec().create(comm=comm)
b.setSizes(n)
b.setFromOptions()
b.setType("cuda") # GPU vector
b.set(1.0)
x = b.duplicate()
ksp = PETSc.KSP().create(comm=comm)
ksp.setOperators(A)
ksp.setType("cg")
ksp.getPC().setType("jacobi")
ksp.setFromOptions()
ksp.solve(b, x)
mpi_comm.Barrier()
t1 = MPI.Wtime()
elapsed = t1 - t0
max_elapsed = mpi_comm.reduce(elapsed, op=MPI.MAX, root=0)
sum_elapsed = mpi_comm.reduce(elapsed, op=MPI.SUM, root=0)
if rank == 0:
    print(f"[GPU] Total Wall Time (Wtime): {max_elapsed:.4f} sec")
    print(f"[GPU] Sum of All Ranks' Time: {sum_elapsed:.4f} sec")
    print(f"[GPU] Residual norm: {ksp.getResidualNorm():.2e}")
Results
$n=10^6$
CPU (# of ranks = # of CPUs)
# of ranks 1: 50.46 sec
# of ranks 2: 26.09 sec
# of ranks 8: 6.60 sec
# of ranks 16: 3.47 sec
# of ranks 32: 2.36 sec
GPU (# of ranks = # of GPUs)
# of ranks 1: 3.23 sec
# of ranks 2: 2.7 sec
$n = 10^7$
GPU (# of ranks = # of GPUs)
# of ranks 1: 284.67 sec
# of ranks 4: 73.04 sec