d4163b9c5891 — Hsi-Yu Schive 7 years ago
Add the subsection 4-5 "Discussion on scalability"
2 files changed, 77 insertions(+), 33 deletions(-)

M manuscript.pdf
M manuscript.tex
M manuscript.pdf +0 -0

        
M manuscript.tex +77 -33
@@ 695,7 695,7 @@ all cells are cubic. $C_{\rm CFL}$, $C_{
 $C_{\rm com}$ are the safety factors with typical values of $\about 0.5$
 except for $C_{\rm com} \sim 0.01$.
 
-Equations (\ref{eq:dt_CFL1}) and (\ref{eq:dt_CFL2}) specify the
+Equations~(\ref{eq:dt_CFL1}) and~(\ref{eq:dt_CFL2}) specify the
 Courant-Friedrichs-Lewy (CFL) condition of hydrodynamic schemes,
 where $v_x$, $v_y$, $v_z$ are the fluid velocities and $c_s$ is the
 sound speed. \eref{eq:dt_CFL1} applies to the RTVD and CTU schemes and

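For reference, a CFL-type constraint of the kind referenced in this hunk
typically takes a form similar to (a generic sketch only; the exact
expressions adopted in the manuscript are Equations~(\ref{eq:dt_CFL1})
and~(\ref{eq:dt_CFL2})):

\[
\Delta t \;\le\; C_{\rm CFL}\,
\frac{\Delta x}{\max\!\left(|v_x|,\,|v_y|,\,|v_z|\right) + c_s},
\]

where $\Delta x$ is the cell width, $v_x$, $v_y$, $v_z$ are the fluid
velocities, $c_s$ is the sound speed, and $C_{\rm CFL} \sim 0.5$ is the
safety factor quoted above.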
          
@@ 721,8 721,8 @@ adaptive time-step integration, we only 
 and particles on the targeted level, with an additional constraint
 that the physical time on a child level cannot be ahead of that
 on its parent level. Moreover, as mentioned in \sref{subsec:AMR},
-we allow the minimum time-step determined from Equations
-(\ref{eq:dt_CFL1}) -- (\ref{eq:dt_com}) to vary by a small fraction
+we allow the minimum time-step determined from
+Equations~(\ref{eq:dt_CFL1})~--~(\ref{eq:dt_com}) to vary by a small fraction
 (typically $\about 10\%$) to help synchronize adjacent levels.
 Finally, as mentioned in \sref{subsec:hydro}, we further reduce the
 time-step by a fixed ratio ($0.8$ by default) if the previously

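The level-synchronization logic described in this hunk can be pictured
with a minimal sketch (generic C++ code, not an excerpt from the \gamer\
source; all names are hypothetical placeholders):

#include <algorithm>

// Sketch of time-step selection on one refinement level.
// dt_min   : minimum time-step from the CFL-type and other criteria
// t_level  : current physical time on this (child) level
// t_parent : current physical time on its parent level
// tol      : allowed fractional increase (~10%) to help synchronize levels
double choose_time_step(double dt_min, double t_level, double t_parent,
                        double tol = 0.1)
{
    // A child level must never evolve ahead of its parent level.
    double gap = t_parent - t_level;
    double dt  = std::min(dt_min, gap);

    // If stretching dt by at most ~10% would land exactly on the parent
    // time, do so to keep adjacent levels synchronized.
    if (gap > dt && gap <= dt*(1.0 + tol))
        dt = gap;

    return dt;
}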
          
@@ 961,7 961,7 @@ We also use GPUs to compute the time-ste
 i.e.\ Eqs. (\ref{eq:dt_CFL1}) -- (\ref{eq:dt_acc}), which otherwise
 would take a surprisingly large fraction of simulation time (e.g.\
 see the timing results of various operations shown at the end of
-\sref{subsec:merger_performance} and \sref{subsec:agora_performance}).
+Sections~\ref{subsec:merger_performance} and~\ref{subsec:agora_performance}).
 However, currently we have not ported any particle routines to GPUs,
 which will be investigated in the future.
 

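For context, the time-step computation referred to in this hunk amounts
to a minimum reduction over all cells followed by a global reduction
across MPI processes. A schematic CPU-side version is sketched below
(generic code; the manuscript offloads the per-cell part to GPUs, and
the function and variable names here are hypothetical):

#include <mpi.h>
#include <algorithm>
#include <vector>

// dt_cell[] holds per-cell time-step estimates from the CFL-type criteria.
double global_min_dt(const std::vector<double>& dt_cell)
{
    double dt_local = 1.0e30;   // local minimum on this MPI process

    // Thread-parallel minimum over the cells owned by this process.
    #pragma omp parallel for reduction(min:dt_local)
    for (long i = 0; i < (long)dt_cell.size(); ++i)
        dt_local = std::min(dt_local, dt_cell[i]);

    // Global minimum across all MPI processes.
    double dt_global;
    MPI_Allreduce(&dt_local, &dt_global, 1, MPI_DOUBLE, MPI_MIN,
                  MPI_COMM_WORLD);
    return dt_global;
}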
          
@@ 990,8 990,8 @@ single-precision performance is about 2 
 double-precision performance, but this ratio can be noticeably
 higher on older GPUs. We typically find that single precision provides
 satisfactory accuracy, as demonstrated in the comparison simulations
-shown in Sections \ref{subsec:merger_performance} and
-\ref{subsec:agora_performance}, except for applications requiring
+shown in Sections~\ref{subsec:merger_performance}
+and~\ref{subsec:agora_performance}, except for applications requiring
 either an extremely large dynamic range or resolving extremely small
 perturbations. Therefore, we adopt single precision throughout this
 paper unless otherwise specified.

          
@@ 1158,8 1158,8 @@ particles between neighboring patches) c
 the MPI synchronization between grid and particle routines.
 We typically adopt $W_{\rm par} = 1.0-2.0$. The optimal value depends
 on the adopted physics, for example, whether or not the radiative
-library \grackle\ is included. See Sections
-\ref{subsec:merger_performance} and \ref{subsec:agora_performance}
+library \grackle\ is included. See
+Sections~\ref{subsec:merger_performance} and~\ref{subsec:agora_performance}
 for some comparisons between the simulation performance with and
 without applying these optimizations.
 

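The particle weighting $W_{\rm par}$ discussed in this hunk can be
illustrated with a small sketch (an assumption for illustration only;
the exact load estimate used by \gamer\ is defined in
\sref{subsec:load_balancing}): each patch's workload is approximated by
its cell count plus $W_{\rm par}$ times its particle count.

// Hypothetical per-patch load estimate for load balancing.
struct Patch {
    long n_cells;      // e.g. 8^3 cells for a fixed patch size
    long n_particles;  // particles currently residing in this patch
};

double patch_load(const Patch& p, double w_par)   // w_par ~ 1.0-2.0
{
    return (double)p.n_cells + w_par*(double)p.n_particles;
}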
          
@@ 1276,6 1276,9 @@ in a three-dimensional Kelvin-Helmholtz 
 strong scaling performance of \gamer\ with \flash\ in binary cluster merger
 simulations (\sref{subsec:merger}), and with \enzo\ in isolated disk
 galaxy simulations (\sref{subsec:agora}).
+Finally, we discuss the scalability of \gamer\
+(\sref{subsec:discuss_scalability}).
+
 
 
 % 4-1

          
@@ 1741,8 1744,8 @@ floating-point accuracy, do not have a s
 \label{subsec:merger_performance}
 
 Based on the very consistent physical results between \gamer\
-and \flash, as shown in Figs.~\ref{fig:merger_temperature_slice} and
-\ref{fig:merger_profile}, here we compare their strong scaling performance
+and \flash, as shown in Figs.~\ref{fig:merger_temperature_slice}
+and~\ref{fig:merger_profile}, here we compare their strong scaling performance
 on Blue Waters. In order to have a fair
 comparison between the codes with and without GPU acceleration, we run
 \gamer\ on the XK nodes and \flash\ on the XE nodes: each XK node

          
@@ 1900,7 1903,7 @@ total wall time (upper left),
 maximum CPU memory consumption per MPI process (upper right),
 parallel efficiency (lower left), and
 doubling efficiency (lower right).
-See Equations (\ref{eq:pe_strong}) and (\ref{eq:de_strong}) for the
+See Equations~(\ref{eq:pe_strong}) and~(\ref{eq:de_strong}) for the
 definitions of parallel and doubling efficiencies in strong scaling.
 Note that the minimum number of nodes adopted in \flash\ is 16 instead
 of 1 due to its much larger memory consumption. Therefore, for a proper

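For readers without the referenced equations at hand, strong-scaling
parallel efficiency and doubling efficiency are commonly defined along
the lines of

\[
{\rm PE}(N) = \frac{T(N_{\rm min})\,N_{\rm min}}{T(N)\,N},
\qquad
{\rm DE}(N) = \frac{T(N/2)}{2\,T(N)},
\]

where $T(N)$ is the total wall time on $N$ nodes and $N_{\rm min}$ is
the smallest node count in the scaling test. This is a common
convention only; the definitions actually adopted are those of
Equations~(\ref{eq:pe_strong}) and~(\ref{eq:de_strong}).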
          
@@ 1963,8 1966,8 @@ routines and MPI communication become th
 as these operations exhibit poor scalability.
 Note that, for clarity, the performance shown here does
 not consider particle weights for load balancing, which is therefore
-different from the optimized performance shown in Figs.
-~\ref{fig:merger_strong_scaling} and \ref{fig:merger_strong_scaling_metrics}.
+different from the optimized performance shown in
+Figs.~\ref{fig:merger_strong_scaling} and~\ref{fig:merger_strong_scaling_metrics}.
 See text for details.
 }
 \label{fig:merger_time_fraction}

          
@@ 1998,8 2001,8 @@ the latter of which also includes transf
 found that by removing that extra MPI synchronization between grid and
 particle routines, the overall performance is improved by $\about 37\%$.
 Having $W_{\rm par}=2.0$ further improves the performance by $\about 10\%$.
-The performance shown in Figs.~\ref{fig:merger_strong_scaling} and
-~\ref{fig:merger_strong_scaling_metrics} has incorporated these
+The performance shown in Figs.~\ref{fig:merger_strong_scaling}
+and~\ref{fig:merger_strong_scaling_metrics} has incorporated these
 optimizations. These findings reveal the importance of balancing the
 workload of both grids and particles simultaneously, as discussed in
 \sref{subsec:load_balancing}.

          
@@ 2258,8 2261,8 @@ The simulations with \gamer\ (upper pane
 show very similar filamentary structures. Subtle differences are
 expected to some extent because of the stochastic star formation
 and the different numerical implementations
-(see \tref{table:agora_setup}). See Figs.~\ref{fig:agora_profile}
---~\ref{fig:agora_sfr} for more quantitative comparisons between the
+(see \tref{table:agora_setup}). See Figs.~\ref{fig:agora_profile}~--~\ref{fig:agora_sfr}
+for more quantitative comparisons between the
 two codes. At late times, a significant fraction of gas
 has collapsed and merged into large gravitationally bound clouds and
 there are no prominent spiral arms, mainly because we do not

          
@@ 2381,7 2384,7 @@ that found in the AGORA comparison proje
 this level of consistency could be achieved after including feedback.
 
 The agreement between the simulation results of \gamer\ and \enzo,
-as verified in Figs.~\ref{fig:agora_density} --~\ref{fig:agora_sfr},
+as verified in Figs.~\ref{fig:agora_density}~--~\ref{fig:agora_sfr},
 demonstrates the consistent numerical setup adopted for this code
 comparison experiment, including, for example, the initial condition,
 spatial and temporal resolution, and grid refinement criteria. It also

          
@@ 2503,7 2506,7 @@ total wall time (upper left),
 maximum CPU memory consumption per MPI process (upper right),
 parallel efficiency (lower left), and
 doubling efficiency (lower right).
-See Equations (\ref{eq:pe_strong}) and (\ref{eq:de_strong}) for the
+See Equations~(\ref{eq:pe_strong}) and~(\ref{eq:de_strong}) for the
 definitions of parallel and doubling efficiencies in strong scaling.
 \gamer\ and \enzo\ exhibit very similar parallel
 scalability for $\Nnode \le 32$, and \gamer\ scales noticeably better

          
@@ 2611,6 2614,51 @@ other two grid solvers (i.e.\ hydrodynam
 
 
 
+% 4-5
+% ----------------------------------------------
+\subsection{Discussion on scalability}
+\label{subsec:discuss_scalability}
+
+As shown in Sections~\ref{subsec:merger_performance}
+and~\ref{subsec:agora_performance}, \gamer\ not only runs
+substantially faster but also scales as well as or better than
+\flash\ and \enzo. This finding is highly non-trivial.
+Generally, for GPU-accelerated codes whose computation
+time has been greatly reduced, one would expect relatively poor
+parallel scaling, because MPI communication then accounts for a
+significantly larger fraction of the total time. Therefore, the fact that
+\gamer\ still exhibits reasonably good scalability compared to both
+\flash\ and \enzo\ indicates that \gamer\ is also better optimized in
+MPI communication and load balancing, which is conceivably due to a
+combination of the following features:
+
+\begin{itemize}
+\item Hybrid OpenMP/MPI parallelization (see \sref{subsec:hybrid}).
+      It reduces inter-node communication and therefore improves the
+      parallel scalability, especially when using a large number of
+      nodes.
+\item Fixed patch size (in contrast to the variable-sized grid patches
+      adopted by \enzo). It greatly simplifies the parallel manipulation
+      of the AMR hierarchy and load balancing, especially in massively
+      parallel simulations.
+\item Parallel AMR structure. \gamer\ does not require duplicating
+      the entire AMR hierarchy on each MPI process
+      (see \sref{subsec:load_balancing}), which can improve parallel
+      scalability, particularly when running extremely large parallel
+      simulations.
+\item Load balancing with particles (see \sref{subsec:load_balancing},
+      especially \fref{fig:particle_load_balancing}). We take into
+      account the weighting of particles in load balancing, and
+      further minimize the MPI synchronization between grid and
+      particle routines.
+\end{itemize}
+
+Note that the hybrid OpenMP/MPI implementation and the parallel AMR
+structure also reduce the CPU memory overhead associated with
+parallelization. This reduction can be important for simulations
+running on systems with high core counts per node.
+
+
+
 % section 5
 % ----------------------------------------------
 \section{Summary and Future Work}

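The hybrid OpenMP/MPI parallelization highlighted in the new subsection
above can be pictured with a short sketch (generic C++ code, not taken
from the \gamer\ source; patch_t and advance_patch are placeholders):

#include <vector>

struct patch_t { /* fixed-size block of cells, e.g. 8^3 */ };

// Hypothetical per-patch update (hydro sweep, source terms, ...).
void advance_patch(patch_t& p, double dt) { /* placeholder */ }

void advance_level(std::vector<patch_t>& local_patches, double dt)
{
    // One MPI process per node (or per NUMA domain); OpenMP threads
    // share the patches owned by that process, so no intra-node MPI
    // messages are needed, reducing the overall MPI communication.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)local_patches.size(); ++i)
        advance_patch(local_patches[i], dt);

    // Ghost-zone data for patches adjacent to remote neighbors would
    // then be exchanged between MPI processes (e.g. with nonblocking
    // MPI_Isend / MPI_Irecv); omitted here for brevity.
}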
          
@@ 2801,24 2849,20 @@ approximately a constant of $4-5$ with $
 to $5-8$ when using more than 32 nodes, suggesting that \gamer\ not
 only runs faster but also scales noticeably better than \enzo.
 
-The finding that \gamer\ not only runs substantially faster but also
-scales equally well or better than \flash\ and \enzo\ is highly
-non-trivial. Generally, for GPU-accelerated codes whose computation
-time has been greatly reduced, one would expect them to show
-relatively poor parallel scaling because the MPI communication
-should take a significantly larger fraction of time. So, the fact that
-\gamer\ still exhibits reasonably good scalability compared to both
-\flash\ and \enzo\ indicates that \gamer\ is also better optimized in
-MPI communication and load balancing, which is conceivably due to a
-combination of the following features:
+The fact that \gamer\ not only runs substantially faster but also
+scales as well as or better than \flash\ and \enzo\ indicates that
+\gamer\ is also better optimized in MPI communication and load
+balancing, which is conceivably due to a combination of the following
+features:
 
 \begin{itemize}
 \item Hybrid OpenMP/MPI parallelization (see \sref{subsec:hybrid}).
       It reduces inter-node communication and therefore improves the
       parallel scalability, especially when using a large number of
       nodes.
-\item Fixed patch size. It greatly simplifies the parallel manipulation
-      of AMR hierarchy, especially in massively parallel simulations.
+\item Fixed patch size (in contrast to the variable-sized grid patches
+      adopted by \enzo). It greatly simplifies the parallel manipulation
+      of the AMR hierarchy and load balancing, especially in massively
+      parallel simulations.
       Moreover, we do not require duplicating the entire AMR
       hierarchy on each MPI process (see \sref{subsec:load_balancing}).
 \item Load balancing with particles (see \sref{subsec:load_balancing},

          
@@ 2829,8 2873,8 @@ combination of the following features:
 \end{itemize}
 
 We have identified several performance bottlenecks from the detailed
-timing analysis conducted in this work (e.g.\ see Figs.
-~\ref{fig:merger_time_fraction} and~\ref{fig:agora_time_fraction}),
+timing analysis conducted in this work (e.g.\ see
+Figs.~\ref{fig:merger_time_fraction} and~\ref{fig:agora_time_fraction}),
 including load imbalance due to particles, the \grackle\ library, MPI
 communication, and CPU performance when preparing the input data
 for GPU solvers. To improve performance further, we are currently