A lot of the rewrites are ill inspired but have good intentions. I think it is worth to rewrite codes to adopt GPUs, since a lot of the original algorithms and paradigms might scale and work poorly on GPUs. I have been part of rewrites and ports and porting a codebase to GPUs without willing to rewrite the architecture ends up producing a difficult to work with result. It doesn’t result in the 20x speedup that you’d want and it might make the CPU slower or basically duplicate the codebase.
I feel that a good rewrite focuses on keeping the algorithms as much as possible while creating new APIs that use these algorithms in a safe way so that you can spend more time designing an architecture that will use the algorithms. For example, this bit for some PPM fluxes:
do concurrent(j=1:ny, i=3:nx - 2) &
local(dh_m1, dh_0, dh_p1, h_left, h_right)
call ppm_limited_slope(bs%h(i - 2, j), bs%h(i - 1, j), bs%h(i, j), dh_m1)
call ppm_limited_slope(bs%h(i - 1, j), bs%h(i, j), bs%h(i + 1, j), dh_0)
call ppm_limited_slope(bs%h(i, j), bs%h(i + 1, j), bs%h(i + 2, j), dh_p1)
h_left = 0.5_wp*(bs%h(i - 1, j) + bs%h(i, j)) - (dh_0 - dh_m1)/6.0_wp
h_right = 0.5_wp*(bs%h(i, j) + bs%h(i + 1, j)) - (dh_p1 - dh_0)/6.0_wp
call ppm_cell_limiter(bs%h(i, j), h_left, h_right)
if (do_pos) call ppm_limit_pos(bs%h(i, j), h_left, h_right, h_min_pos)
this%h_face_right_x%data(i, j, 1) = h_left
this%h_face_left_x%data(i + 1, j, 1) = h_right
end do
Before, the code for ppm_limited_slope
was inline in the do concurrent, repeated 3 times like:
dh_left = h_i - h_im1
dh_right = h_ip1 - h_i
dh_centered = 0.5_wp*(dh_left + dh_right)
if (dh_left*dh_right > 0.0_wp) then
dh = sign(min(abs(dh_centered), &
2.0_wp*abs(dh_left), &
2.0_wp*abs(dh_right)), &
dh_centered)
else
dh = 0.0_wp
end if
So I spent the time first refactoring so that I could create pure subroutines
that are callable from a do concurrent to get to the first bit of code. Then all the data needed to compute the fluxes can be just passed in by: pure subroutine continuity_compute_fluxes_barotropic(grid, metrics, this, bs)
The extent of object orientation for me is data structures that hold all necessary stuff for a computation. Their only type bound procedures are initialization, data transfer, finalization, data deletion:
type :: continuity_t
logical :: is_init = .false.
!! True between `init` and `destroy`.
integer :: ppm_variant = PPM_VARIANT_H3_MONO
!! Active PPM scheme variant.
logical :: monotone = .true.
!! Enforce monotonicity on the reconstructed face values.
real(wp) :: h_min = 1.0e-6_wp
!! Lower clip on cell-centred thickness during the update.
!! Also used as the floor for `ppm_limit_pos`
real(wp) :: cfl_max = 0.5_wp
!! Soft cap on per-face CFL before falling back to upwind.
logical :: use_ppm_limit_pos = .false.
!! Positivity limiter
! ---- Face-reconstruction workspace ----
!
! h_face_left_x(i, j, k) — value AT east face i extrapolated
! from the LEFT-side cell (i-1, j, k),
! i.e. h_R of cell i-1.
! h_face_right_x(i, j, k) — value AT east face i extrapolated
! from the RIGHT-side cell (i, j, k),
! i.e. h_L of cell i.
!
! Upwind: the kernel picks left if u >= 0 (left cell donates),
! right if u < 0.
type(scratch_3d_buffer_t) :: h_face_left_x
!! Left-state thickness at east faces.
type(scratch_3d_buffer_t) :: h_face_right_x
!! Right-state thickness at east faces.
type(scratch_3d_buffer_t) :: h_face_left_y
!! Left-state thickness at north faces.
type(scratch_3d_buffer_t) :: h_face_right_y
!! Right-state thickness at north faces.
contains
procedure :: init => continuity_init
procedure :: destroy => continuity_destroy
procedure :: enter_data => continuity_enter_data
procedure :: exit_data => continuity_exit_data
end type continuity_t
So that I can then, for example use the scratch_3d_buffer_t
like:
! East-face shapes: (nx+1, ny, nz)
call this%h_face_left_x%init(nx + 1, ny, nz, "continuity_h_face_left_x")
call this%h_face_right_x%init(nx + 1, ny, nz, "continuity_h_face_right_x")
! North-face shapes: (nx, ny+1, nz)
call this%h_face_left_y%init(nx, ny + 1, nz, "continuity_h_face_left_y")
call this%h_face_right_y%init(nx, ny + 1, nz, "continuity_h_face_right_y")
this%is_init = .true.
All the simplifications I’ve done make it so that my RK2 step looks like:
do stage = 1, 2
call run_stage_split(grid, metrics, dyn, eos, cor, ct, pgf, hv, bd, ss, &
va, hd, vd, vmix, ms, dt, n_inner, &
sf=sf, stage=stage, vcoord=vcoord, bc=bc, t=t, &
lateral_mix=lateral_mix, epbl=epbl, kshear=kshear)
end do
call rk2_average(ms)
if (allocated(ms%tracers)) then
do it = 1, size(ms%tracers)
call rk2_average_field_3d(ms%tracers(it)%hTr0, ms%tracers(it)%hTr, &
size(ms%tracers(it)%hTr, 1), &
size(ms%tracers(it)%hTr, 2), &
size(ms%tracers(it)%hTr, 3))
end do
end if
It is a bit useless to expose a public API for a continuity PPM solver, since it is quite specific. But then being able to expose a do_one_dynamics_step(handle)
API so that someone can call it from Python and have it run on the GPU? That is what a million dollar refactor should get you. All of this results in me being able to write an integration test that is super easy to read:
pure subroutine ocean_dyn_step_barotropic(grid, metrics, dyn, cor, ct, bs, dt)
!! Public only for the unit-test suite (no production module imports it);
!! ignore when developing production code in other modules.
type(hgrid_t), intent(in) :: grid
type(ocean_metrics_t), intent(in) :: metrics
type(ocean_dyn_t), intent(inout) :: dyn
type(coriolis_adv_t), intent(inout) :: cor
type(continuity_t), intent(inout) :: ct
type(barotropic_cgrid_state_t), intent(inout) :: bs
real(wp), intent(in) :: dt
integer :: i, j, nx, ny, nx_face, ny_uface, nx_vface, ny_face
nx = grid%nx_total
ny = grid%ny_total
nx_face = size(bs%u_face_x, 1)
ny_uface = size(bs%u_face_x, 2)
nx_vface = size(bs%v_face_y, 1)
ny_face = size(bs%v_face_y, 2)
! ---- 1. Save u^n into h0 / u_face_x0 / v_face_y0 ----
do concurrent(j=1:ny, i=1:nx)
bs%h0(i, j) = bs%h(i, j)
end do
do concurrent(j=1:ny_uface, i=1:nx_face)
bs%u_face_x0(i, j) = bs%u_face_x(i, j)
end do
do concurrent(j=1:ny_face, i=1:nx_vface)
bs%v_face_y0(i, j) = bs%v_face_y(i, j)
end do
! ---- 2. Stage 1: tendencies at u^n, FE step -> u^(1) ----
call continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
call coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
call continuity_apply_fluxes_barotropic(bs, dt)
call coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)
! ---- 3. Stage 2: tendencies at u^(1), FE step -> u^(1) + dt*L(u^(1)) ----
call continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
call coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
call continuity_apply_fluxes_barotropic(bs, dt)
call coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)
! ---- 4. RK2 average: u^(n+1) = 1/2 * (u^n + (u^(1) + dt*L(u^(1)))) ----
do concurrent(j=1:ny, i=1:nx)
bs%h(i, j) = 0.5_wp*(bs%h0(i, j) + bs%h(i, j))
end do
do concurrent(j=1:ny_uface, i=1:nx_face)
bs%u_face_x(i, j) = 0.5_wp*(bs%u_face_x0(i, j) + bs%u_face_x(i, j))
end do
do concurrent(j=1:ny_face, i=1:nx_vface)
bs%v_face_y(i, j) = 0.5_wp*(bs%v_face_y0(i, j) + bs%v_face_y(i, j))
end do
dyn%outer_step_count = dyn%outer_step_count + 1
end subroutine ocean_dyn_step_barotropic
Which to me is the magic of Fortran, if you had gone the full object oriented way in C++ would you get (AI generated)…because a lot of the new rewrites focus on migrating to C++:
template <class Op>
concept BarotropicOperator = requires(Op op, const HGrid& g, const OceanMetrics& m,
BarotropicCGridState& s, wp dt) {
op.compute(g, m, s);
op.apply(s, dt);
};
template <BarotropicOperator Cont, BarotropicOperator Cor>
class BarotropicRK2Stepper {
public:
BarotropicRK2Stepper(Cont& cont, Cor& cor) : cont_(cont), cor_(cor) {}
void step(const HGrid& grid, const OceanMetrics& metrics,
OceanDyn& dyn, BarotropicCGridState& s, wp dt) {
s.saveSnapshot(); // 1. save u^n
forwardEulerStage(grid, metrics, s, dt); // 2. u* = u^n + dt L(u^n)
forwardEulerStage(grid, metrics, s, dt); // 3. u* + dt L(u*)
s.averageWithSnapshot(); // 4. Heun average
++dyn.outerStepCount;
}
private:
void forwardEulerStage(const HGrid& grid, const OceanMetrics& metrics,
BarotropicCGridState& s, wp dt) {
cont_.compute(grid, metrics, s);
cor_ .compute(grid, metrics, s);
cont_.apply(s, dt);
cor_ .apply(s, dt);
}
Cont& cont_;
Cor& cor_;
};
But now, unless you’re very familiar with C++ and objects this might read like a foreign language. Because each object abstracts away what compute is doing you get that indirection that is very nice for generality but a bit difficult to cope if you’re starting out. I think heavily from the academic perspective, new students often without experience at all.
I feel that the Fortran implementation has a simplicity to the reader that is difficult to get with C++ unless you write C++ that looks more akin to C. I feel that the Fortran can look like how someone would write a python code for this to prototype: (also AI generated)
def ocean_dyn_step_barotropic(grid, metrics, dyn, cor, ct, bs, dt):
bs.h0[:] = bs.h
bs.u_face_x0[:] = bs.u_face_x
bs.v_face_y0[:] = bs.v_face_y
continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
continuity_apply_fluxes_barotropic(bs, dt)
coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)
bs.h[:] = 0.5 * (bs.h0 + bs.h)
bs.u_face_x[:] = 0.5 * (bs.u_face_x0 + bs.u_face_x)
bs.v_face_y[:] = 0.5 * (bs.v_face_y0 + bs.v_face_y)
dyn.outer_step_count += 1
So, basically my idea of a refactor/port/modernization is hide the complexity away through nice interfaces without abstracting too much that the algorithm gets lost. I like the idea of data structures carrying data used in computation and then use procedural style calls for this. I am sure you could rewrite my dynamics step to look nicer. My main goal is: write your code such that a new student can read it and compare it to the algorithm they see in a book/paper