@@ 695,7 695,7 @@ all cells are cubic. $C_{\rm CFL}$, $C_{
$C_{\rm com}$ are the safety factors with typical values of $\about 0.5$
except for $C_{\rm com} \sim 0.01$.
-Equations (\ref{eq:dt_CFL1}) and (\ref{eq:dt_CFL2}) specify the
+Equations~(\ref{eq:dt_CFL1}) and~(\ref{eq:dt_CFL2}) specify the
Courant-Friedrichs-Lewy (CFL) condition of hydrodynamic schemes,
where $v_x$, $v_y$, $v_z$ are the fluid velocities and $c_s$ is the
sound speed. \eref{eq:dt_CFL1} applies to the RTVD and CTU schemes and
@@ 721,8 721,8 @@ adaptive time-step integration, we only
and particles on the targeted level, with an additional constraint
that the physical time on a child level cannot be ahead of that
on its parent level. Moreover, as mentioned in \sref{subsec:AMR},
-we allow the minimum time-step determined from Equations
-(\ref{eq:dt_CFL1}) -- (\ref{eq:dt_com}) to vary by a small fraction
+we allow the minimum time-step determined from
+Equations~(\ref{eq:dt_CFL1})~--~(\ref{eq:dt_com}) to vary by a small fraction
(typically $\about 10\%$) to help synchronize adjacent levels.
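+As an illustration, the per-level time-step selection described above
+could be sketched as follows (the function and variable names are
+hypothetical and do not represent the actual implementation in
+\gamer):
+\begin{verbatim}
+#include <algorithm>
+
+// Illustrative sketch: take the smallest time-step among all
+// constraints and allow it to stretch by at most ~sync_tol (~10%)
+// so that a child level can land exactly on, but never run ahead
+// of, the physical time of its parent level.
+double GetLevelTimeStep( double dt_CFL, double dt_acc, double dt_com,
+                         double t_level, double t_parent,
+                         double sync_tol = 0.1 )
+{
+   double dt      = std::min( { dt_CFL, dt_acc, dt_com } );
+   double dt_sync = t_parent - t_level;  // step reaching the parent time
+
+   if ( dt_sync <= dt*( 1.0 + sync_tol ) )   dt = dt_sync;
+
+   return std::min( dt, dt_sync );
+}
+\end{verbatim}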
Finally, as mentioned in \sref{subsec:hydro}, we further reduce the
time-step by a fixed ratio ($0.8$ by default) if the previously
@@ 961,7 961,7 @@ We also use GPUs to compute the time-ste
-i.e.\ Eqs. (\ref{eq:dt_CFL1}) -- (\ref{eq:dt_acc}), which otherwise
+i.e.\ Eqs.~(\ref{eq:dt_CFL1})~--~(\ref{eq:dt_acc}), which otherwise
would take a surprisingly large fraction of simulation time (e.g.\
see the timing results of various operations shown in the end of
-\sref{subsec:merger_performance} and \sref{subsec:agora_performance}).
+Sections~\ref{subsec:merger_performance} and~\ref{subsec:agora_performance}).
-However, currently we have not ported any particle routines to GPUs,
-which will be investigated in the future.
+However, we have not yet ported any particle routines to GPUs;
+this will be investigated in the future.
@@ 990,8 990,8 @@ single-precision performance is about 2
double-precision performance, but this ratio can be noticeably
higher on older GPUs. We typically find that single precision provides
a satisfactory accuracy, as demonstrated in the comparison simulations
-shown in Sections \ref{subsec:merger_performance} and
-\ref{subsec:agora_performance}, except for applications requiring
+shown in Sections~\ref{subsec:merger_performance}
+and~\ref{subsec:agora_performance}, except for applications requiring
either an extremely large dynamic range or resolving extremely small
perturbations. Therefore, we adopt single precision throughout this
paper unless otherwise specified.
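+For example, one common way to support both precisions in a single
+code base is a compile-time type alias; the sketch below is only an
+illustration and does not necessarily reflect how \gamer\ implements
+this option:
+\begin{verbatim}
+// Select the floating-point type at compile time; grid data,
+// particle data, and the GPU solvers can then all be declared
+// with `real'.
+#ifdef DOUBLE_PRECISION
+typedef double real;
+#else
+typedef float  real;
+#endif
+\end{verbatim}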
@@ 1158,8 1158,8 @@ particles between neighboring patches) c
the MPI synchronization between grid and particle routines.
We typically adopt $W_{\rm par} = 1.0-2.0$. The optimal value depends
on the adopted physics, for example, whether or not the radiative
-library \grackle\ is included. See Sections
-\ref{subsec:merger_performance} and \ref{subsec:agora_performance}
+library \grackle\ is included. See
+Sections~\ref{subsec:merger_performance} and~\ref{subsec:agora_performance}
for some comparisons between the simulation performance with and
without applying these optimizations.
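+As a rough illustration, if the workload of a patch is modeled as its
+cell count plus $W_{\rm par}$ times its particle count (an assumed
+cost model, which may differ from the exact estimate adopted by
+\gamer), the corresponding weighting could be computed as:
+\begin{verbatim}
+// Illustrative per-patch workload estimate with an assumed cost
+// model: every cell counts as one unit and every particle counts
+// as W_par units.
+double EstimatePatchLoad( long n_cells, long n_particles, double w_par )
+{
+   return (double)n_cells + w_par*(double)n_particles;
+}
+\end{verbatim}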
@@ 1276,6 1276,9 @@ in a three-dimensional Kelvin-Helmholtz
strong scaling performance of \gamer\ with \flash\ in binary cluster merger
simulations (\sref{subsec:merger}), and with \enzo\ in isolated disk
galaxy simulations (\sref{subsec:agora}).
+Finally, we discuss the scalability of \gamer\
+(\sref{subsec:discuss_scalability}).
+
% 4-1
@@ 1741,8 1744,8 @@ floating-point accuracy, do not have a s
\label{subsec:merger_performance}
Based on the very consistent physical results between \gamer\
-and \flash, as shown in Figs.~\ref{fig:merger_temperature_slice} and
-\ref{fig:merger_profile}, here we compare their strong scaling performance
+and \flash, as shown in Figs.~\ref{fig:merger_temperature_slice}
+and~\ref{fig:merger_profile}, here we compare their strong scaling performance
on Blue Waters. In order to have a fair
comparison between the codes with and without GPU-acceleration, we run
-\gamer\ on the XK nodes while \flash\ on the XE nodes: each XK node
+\gamer\ on the XK nodes and \flash\ on the XE nodes: each XK node
@@ 1900,7 1903,7 @@ total wall time (upper left),
maximum CPU memory consumption per MPI process (upper right),
parallel efficiency (lower left), and
doubling efficiency (lower right).
-See Equations (\ref{eq:pe_strong}) and (\ref{eq:de_strong}) for the
+See Equations~(\ref{eq:pe_strong}) and~(\ref{eq:de_strong}) for the
definitions of parallel and doubling efficiencies in strong scaling.
Note that the minimum number of nodes adopted in \flash\ is 16 instead
of 1 due to its much larger memory consumption. Therefore, for a proper
@@ 1963,8 1966,8 @@ routines and MPI communication become th
as these operations exhibit poor scalability.
-Note that, for better clarification, the performance shown here does
+Note that, for clarity, the performance shown here does
not consider particle weights for load balancing, which is therefore
-different from the optimized performance shown in Figs.
-~\ref{fig:merger_strong_scaling} and \ref{fig:merger_strong_scaling_metrics}.
+different from the optimized performance shown in
+Figs.~\ref{fig:merger_strong_scaling} and~\ref{fig:merger_strong_scaling_metrics}.
See text for details.
}
\label{fig:merger_time_fraction}
@@ 1998,8 2001,8 @@ the latter of which also includes transf
found that by removing that extra MPI synchronization between grid and
particle routines, the overall performance is improved by $\about 37\%$.
Having $W_{\rm par}=2.0$ further improves the performance by $\about 10\%$.
-The performance shown in Figs.~\ref{fig:merger_strong_scaling} and
-~\ref{fig:merger_strong_scaling_metrics} has incorporated these
+The performance shown in Figs.~\ref{fig:merger_strong_scaling}
+and~\ref{fig:merger_strong_scaling_metrics} has incorporated these
optimizations. These findings reveal the importance of balancing the
workload of both grids and particles simultaneously, as discussed in
\sref{subsec:load_balancing}.
@@ 2258,8 2261,8 @@ The simulations with \gamer\ (upper pane
show very similar filamentary structures. Subtle differences are
expected to some extent because of the stochastic star formation
and the different numerical implementations
-(see \tref{table:agora_setup}). See Figs.~\ref{fig:agora_profile}
---~\ref{fig:agora_sfr} for more quantitative comparisons between the
+(see \tref{table:agora_setup}). See Figs.~\ref{fig:agora_profile}~--~\ref{fig:agora_sfr}
+for more quantitative comparisons between the
two codes. At late times, a significant fraction of gas
has collapsed and merged into large gravitationally bound clouds and
there are no prominent spiral arms, mainly because we do not
@@ 2381,7 2384,7 @@ that found in the AGORA comparison proje
this level of consistency could be achieved after including feedback.
-The agreement between the simulations results of \gamer\ and \enzo,
-as verified in Figs.~\ref{fig:agora_density} --~\ref{fig:agora_sfr},
-demonstrate the consistent numerical setup adopted for this code
+The agreement between the simulation results of \gamer\ and \enzo,
+as verified in Figs.~\ref{fig:agora_density}~--~\ref{fig:agora_sfr},
+demonstrates the consistent numerical setup adopted for this code
comparison experiment, including, for example, the initial condition,
spatial and temporal resolution, and grid refinement criteria. It also
@@ 2503,7 2506,7 @@ total wall time (upper left),
maximum CPU memory consumption per MPI process (upper right),
parallel efficiency (lower left), and
doubling efficiency (lower right).
-See Equations (\ref{eq:pe_strong}) and (\ref{eq:de_strong}) for the
+See Equations~(\ref{eq:pe_strong}) and~(\ref{eq:de_strong}) for the
definitions of parallel and doubling efficiencies in strong scaling.
\gamer\ and \enzo\ exhibit very similar parallel
scalability for $\Nnode \le 32$, and \gamer\ scales noticeably better
@@ 2611,6 2614,51 @@ other two grid solvers (i.e.\ hydrodynam
+% 4-5
+% ----------------------------------------------
+\subsection{Discussion on scalability}
+\label{subsec:discuss_scalability}
+
+As shown in Sections~\ref{subsec:merger_performance}
+and~\ref{subsec:agora_performance}, \gamer\ not only runs
+substantially faster but also scales as well as or better than
+\flash\ and \enzo. This finding is highly non-trivial.
+Generally, one would expect a GPU-accelerated code, whose computation
+time has been greatly reduced, to show relatively poor parallel
+scaling, since MPI communication then takes a significantly larger
+fraction of the total time. The fact that
+\gamer\ still exhibits reasonably good scalability compared to both
+\flash\ and \enzo\ indicates that \gamer\ is also better optimized in
+MPI communication and load balancing, which is conceivably due to a
+combination of the following features:
+
+\begin{itemize}
+\item Hybrid OpenMP/MPI parallelization (see \sref{subsec:hybrid}).
+ It reduces inter-node communication and therefore improves the
+ parallel scalability, especially when using a large number of
+      nodes (see the sketch after this list).
+\item Fixed patch size (compared to \enzo). It greatly simplifies
+      the parallel manipulation of the AMR hierarchy and load balancing,
+ especially in massively parallel simulations.
+\item Parallel AMR structure. \gamer\ does not require duplicating
+ the entire AMR hierarchy on each MPI process
+ (see \sref{subsec:load_balancing}), which can improve parallel
+      scalability, particularly in extremely large simulations.
+\item Load balancing with particles (see \sref{subsec:load_balancing},
+ especially \fref{fig:particle_load_balancing}). We take into
+ account the weighting of particles in load balancing, and
+ further minimize the MPI synchronization between grid and
+ particle routines.
+\end{itemize}
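+
+The sketch below illustrates the hybrid OpenMP/MPI pattern referred
+to in the first item; the types and function names are hypothetical
+and do not represent the actual interface of \gamer:
+\begin{verbatim}
+#include <mpi.h>
+#include <omp.h>
+#include <vector>
+
+struct Patch;                                 // one fixed-size AMR patch
+void ExchangeBufferData( std::vector<Patch*>&, MPI_Comm );  // MPI part
+void EvolvePatch( Patch* );                                 // compute part
+
+void AdvanceLevel( std::vector<Patch*>& patches, MPI_Comm comm )
+{
+// exchange patch boundary data between MPI processes
+   ExchangeBufferData( patches, comm );
+
+// the patches owned by this process are shared among OpenMP threads
+#  pragma omp parallel for schedule( dynamic )
+   for ( int p = 0; p < (int)patches.size(); p++ )
+      EvolvePatch( patches[p] );
+}
+\end{verbatim}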
+
+Note that the hybrid OpenMP/MPI implementation and the parallel AMR
+structure also reduce the CPU memory overhead associated with
+parallelization. This reduction can be important for simulations
+running on systems with high core counts per node.
+
+
+
% section 5
% ----------------------------------------------
\section{Summary and Future Work}
@@ 2801,24 2849,20 @@ approximately a constant of $4-5$ with $
to $5-8$ when using more than 32 nodes, suggesting that \gamer\ not
only runs faster but also scales noticeably better than \enzo.
-The finding that \gamer\ not only runs substantially faster but also
-scales equally well or better than \flash\ and \enzo\ is highly
-non-trivial. Generally, for GPU-accelerated codes whose computation
-time has been greatly reduced, one would expect them to show
-relatively poor parallel scaling because the MPI communication
-should take a significantly larger fraction of time. So, the fact that
-\gamer\ still exhibits reasonably good scalability compared to both
-\flash\ and \enzo\ indicates that \gamer\ is also better optimized in
-MPI communication and load balancing, which is conceivably due to a
-combination of the following features:
+The fact that \gamer\ not only runs substantially faster but also
+scales as well as or better than \flash\ and \enzo\ indicates that
+\gamer\ is also better optimized in MPI communication and load
+balancing, which is conceivably due to a combination of the following
+features:
\begin{itemize}
\item Hybrid OpenMP/MPI parallelization (see \sref{subsec:hybrid}).
It reduces inter-node communication and therefore improves the
parallel scalability, especially when using a large number of
nodes.
-\item Fixed patch size. It greatly simplifies the parallel manipulation
- of AMR hierarchy, especially in massively parallel simulations.
+\item Fixed patch size (compared to \enzo). It greatly simplifies
+      the parallel manipulation of the AMR hierarchy and load balancing,
+ especially in massively parallel simulations.
Moreover, we do not require duplicating the entire AMR
hierarchy on each MPI process (see \sref{subsec:load_balancing}).
\item Load balancing with particles (see \sref{subsec:load_balancing},
@@ 2829,8 2873,8 @@ combination of the following features:
\end{itemize}
We have identified several performance bottlenecks from the detailed
-timing analysis conducted in this work (e.g.\ see Figs.
-~\ref{fig:merger_time_fraction} and~\ref{fig:agora_time_fraction}),
+timing analysis conducted in this work (e.g.\ see
+Figs.~\ref{fig:merger_time_fraction} and~\ref{fig:agora_time_fraction}),
-including load imbalance due to particles, \grackle\ library, MPI
+including load imbalance due to particles, the \grackle\ library, MPI
communication, and CPU performance when preparing the input data
for GPU solvers. To improve performance further, we are currently