-
Notifications
You must be signed in to change notification settings - Fork 29
ChaNGa User Guide
ChaNGa is a parallel N-body and Smooth Particle Hydrodynamics (SPH) solver that was initially designed for cosmological simulations of galaxy formation. While the gravity and SPH algorithms are closely based on previously developed codes, ChaNGa ’s unique feature is its implementation in the parallel language, Charm++. The motivation for creating the code was to address the challenges of efficiently performing large cosmological simulations, which are, in turn, required to model the gas and gravitational dynamics over large ranges in scales which are relevant for galaxy formation. While cosmology on large machines is the primary motivation, the code is suitable for addressing a wide range of gravitational or hydrodynamical problems, from isolated galaxies to protoplanetary disks. Furthermore, since even laptops are now multicore machines, the parallel machinery of Charm++ allows ChaNGa to effectively use individual computers as well as scaling up to tens of thousands of cores.
The purpose of this guide is to introduce the user to the practical aspects of performing a simulation using ChaNGa . A description of the algorithms used, the code design, and performance tests can be found in @CharmApps. Here we give a quick overview, and then focus on obtaining, compiling, and running the code. We finish with an overview of software that can be used to analyze the results.
The central feature of ChaNGa is a tree-based gravity solver using a variant of the Barnes-Hut @barnes86 algorithm. In order to efficiently perform cosmological gravity simulations it also includes:
-
Periodic boundary conditions: Ewald summation is used to model a finite volume within a periodic Universe.
-
Multistepping: particle trajectories are calculated on block, power-of-two timestep intervals where each particle has an individual timestep.
Gas dynamics and star formation are also important processes in galaxy formation. ChaNGa includes the following processes:
-
SPH using exact
$k^{th}$ neighbor search. -
Isothermal, adiabatic or cooling in primordial gas.
-
Star formation based on a density criterion
-
Energy, metal and mass Feedback from supernovea and stellar winds.
The collaboration between the Parallel Programming Laboratory at UIUC and the N-body shop at UW was initially funded by an NSF ITR grant. The NASA AISR program also funded some of the development of ChaNGa .
While the gravity algorithm can be simply described as a variant of the Barnes-Hut algorithm, Joachim Stadel’s implementation in PKDGRAV and PKGRAV2 made the algorithm more efficient with an optimized walk, and more accurate by extending the multipole forces to fourth order (hexadecapole). The SPH algorithms are take from James Wadsley’s implementation in GASOLINE@wadsley04. The star formation and feedback is based on Greg Stinson’s work of @Stinson06 which, in turn is based on the work of Neal Katz. Hence, ChaNGa is essentially the SPH and gravity algorithms taken from GASOLINE, with some gravity enhancements from PKDGRAV2.
The public release version of ChaNGa is currently being distributed under the Gnu Public License (GPL) version 2.
First you need to install Charm. Version 7.0.0 or later is required for the version 3.5 release of ChaNGa. The latest development version of Charm is needed for the latest development version of ChaNGa.
ChaNGa itself can be downloaded from the N-body shop site or from github. The most recent version of ChaNGa can be obtained from the git archive. If you install the development version, you will generally also need to install the development version of Charm.
When downloading from the git repository, you will need both the changa
and
the utility
modules:
git clone http://github.com/N-BodyShop/changa.git
git clone http://github.com/N-BodyShop/utility.git
Download the latest Charm with:
git clone http://github.com/UIUC-PPL/charm.git
It is best to make sure you compile the most up-to-date stable release. So before you compile, from within the charm directory do:
git checkout v7.0.0
However, if you are looking for the latest/greatest performance enhancements, or you are using the development version of ChaNGa, you may need to use the development branch. In this case do:
git checkout master
Charm needs to be built first. This can be done with the build
script in the top level of the charm distribution.
ChaNGa uses a couple of Charm libraries that may
not be built by default. To be sure that these are built, use ChaNGa
as
the first argument to the build command. A typical build command will be:
./build ChaNGa netlrts-linux-x86_64 --with-production
This example is for a 64 bit linux cluster that is only connected by ethernet, or an isolated multi-core machine. Usually one wants to use the high performance interconnect. A convenient way to use that interconnect is to build with the provided MPI libraries:
./build ChaNGa mpi-linux-x86_64 --with-production
Charm has native drivers for many high performance networks. A native Infiniband build can be gotten with:
./build ChaNGa verbs-linux-x86_64 --with-production
There are also drivers for most flavors of Crays: mpi-crayxt
for the XT4
and XT5, gni-crayxe
for the XE6/XK7 and "gni-crayxc" for the XC30.
For clusters that are composed of SMP nodes with many cores, charm can take
advantage of the shared memory among cores to reduce messages and memory use.
Adding smp
as an argument to build
will enable this feature. E.g., for an Infiniband cluster:
./build ChaNGa verbs-linux-x86_64 smp --with-production
ChaNGa itself is compiled by first using the configure
script to generate
a Makefile. ./configure --help
in the changa directory will give you the
available options. Features of note include --enable-cooling
: adiabatic gas
is the default, so a cooling module can be selected with this option, (e.g.,
--enable-cooling=cosmo
). For simulations with large dynamic range, greater
than 1 million in spatial scales, --enable-bigkeys
will be needed to handle
the very deep trees.
configure
needs to be able to find the charm distribution. If charm
and changa
are both subdirectories of a common directory, this will happen
automatically. Otherwise set the environment variable CHARM_DIR
to point
at the charm source directory.
Once configure
runs successfully, make
will compile the code, usually
producing two executables: charmrun
and ChaNGa
. It is generally a good
idea to do a make clean
before running make
after rerunning configure
or updating from the git repository.
Some small test problems are included with the ChaNGa distribution to allow one to quickly determine that the program is operating properly.
In the subdirectory teststep there is a King model problem that checks
whether energy is conserved while multistepping a King model realized
with 36000 particles. Typing the command make test
in this directory
will run ChaNGa on this problem and compare the energy conservation with
a benchmark integration. The energies should agree to three significant
figures. By default, the test is run on two cores. Edit the file
runtest.sh
to run this test on a different number of cores.
In the subdirectory testcosmo is a small cosmological volume of an
run.sh
will run this
simulation to mf.sh
will run a halo finder on the
In the subdirectory testshock is a simple test of the gas dynamics using
a shock tube. make test
will run the shock tube simulation and create a
file, pres.out
that has pressure and density as a function of distance
along the shock tube. This can be compared with a standard calculation
in pres\_gasoline.out
.
In the subdirectory testcollapse is a combined test of gravity and SPH
via modelling the collapse of a self-gravitating sphere of adiabatic
gas. make test
will run the simulation, and calls tipsy
to create a
radial profile of the results. A python script is used to compare the
resulting profile, in particular comparing the inner entropy profile and
the outer velocity profile with a standard calculation.
###testonestar
Test energy and metal production from stellar evolution.
A non-trivial aspect of running on multi-node parallel machines is getting
the executable running on all the allocated nodes. For the MPI builds, this
is done using the installed MPI infrastructure (usually mpirun
or mpiexec
).
See the documentation specific for the cluster for details.
For net
builds, the charmrun
executable performs this task by using ssh
to start ChaNGa
on all the allocated nodes. charmrun
needs to know what
hosts have been allocated; this is specifed in a nodelist
file which has
the format:
group main ++shell /usr/bin/ssh
host node1
host node2
...
where node1
etc. are the available nodes. On a cluster with a batch
queuing system, these nodes need to be obtained from the batch environment.
Once the nodelist
is constructed, ChaNGa is started with (e.g.):
charmrun +p 128 ++nodelist nodelist ChaNGa test.param
For clusters with an MPI installation, charmrun
can use the MPI
infrastructure instead of using ssh. This is done using the ++mpiexec
flag:
charmrun ++mpiexec ChaNGa test.param
Part of the power of the charm run-time system is its ability to automatically
balance the load among processors by moving parts of the computation around.
In order for this to happen a load balancing strategy needs to be specified.
There are serveral load balancers that work well for ChaNGa. For single
stepping (i.e. all particles on the largest timestep) Orb3DLB_notopo
or
OrbLB
can work well. For multistepping runs MultistepLB_notopo
is an
extension of Orb3D_notopo
that handles multistepping. Loadbalancers are
specified with the +balancer option, e.g.:
charmrun ++mpiexec ChaNGa +balancer MultistepLB_notopo test.param
ChaNGa uses the object migration capabilities of charm for checkpointing: all
the objects in the program are migrated to disk. They are stored in a large
number of files (serveral per core) in a directory ending in either chk0
or
chk1
. The directories are alternated so that if there is a system crash in
the middle of writing one directory, the other is still available. The last
checkpoint written is indicated by an 0 or 1 written to the file
lastcheckpoint
. Restarts are done by specifying +restart and the restart
directory, e.g.:
charmrun ++mpiexec ChaNGa +restart test.chk0
Note that the param file is ignored on restart, and only a select group of
parameters can be changed on restart.
As well as providing diagnostics on how the simulation is progressing, the log
file gives some provenence information. In particular, the first few lines
of the log file give the version of ChaNGa, the command line that was used, the
number of processors used, the date it was compiled and the compile options.
Following that, all the values of all the run parameters are printed, one per
line. After all the parameters are printed, one line is printed every
iLogInterval
timesteps. The line contains the simulation time, the redshift
(if cosmological) various measures of energy, the total angular momentum, and
the wallclock time of the last timestep. Note that the first line will contain
the wallclock time of the first force calculation, while subsequent lines have
the wallclock time of a big step.
The screen output starts with information from the charm runtime system about about the system topology. This can be used to verify that you are running on the expected number and size of compute nodes. Following the ChaNGa version stamp is how ChaNGa has configured the nodes. This can be different than the Charm++ report if, for example you run multiple processes on SMP nodes rather than one process per node.
Following the startup information, timing information is printed for each phase of the calculation. For each smallest timestep there is Domain Decomposition (assigning particles to TreePiece objects), Load Balancing (assigning those pieces to processors to balance the processor load), building trees, and calculating gravity and SPH. Before each large step, the distribution of timestep rungs is also printed.
One of the features of the Charm++ runtime system is the automatic load balancing framework, but there are circumstances where the framework does not work, and some tuning is needed to get good performance. The load balancer developed for ChaNGa outputs some diagnostics to indicate how the load balancer is performing. Consider the following diagnostics:
Orb3dLB_notopo stats: maxObjLoad 6.218762
Orb3dLB_notopo stats: minWall 24.518440 maxWall 24.580847 avgWall 24.551503 maxWall/avgWall 1.001195
Orb3dLB_notopo stats: minIdle 9.332150 maxIdle 15.885612 avgIdle 13.328586 minIdle/avgIdle 0.700161
Orb3dLB_notopo stats: minPred 7.976786 maxPred 16.010665 avgPred 10.970380 maxPred/avgPred 1.459445
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 92.000000 avgPiece 33.032258 maxPiece/avgPiece 2.785156
Orb3dLB_notopo stats: minBg 0.065539 maxBg 1.300723 avgBg 0.224308 maxBg/avgBg 5.798818
Orb3dLB_notopo stats: orb migrated 2435 refine migrated 0 objects
First, all units are in seconds. The maxObjLoad line is saying that the most computationally intensive piece of the tree is taking 6.2 seconds of CPU time. Compare this with the maxWall time on the next line and we see that this is a significant fraction of one processor's time. Looking at the minPred and maxPred numbers says that the next step is predicted to be badly load balanced: some processors have twice as much work to do as others. The minPiece and maxPiece numbers say the the load balancer has moved work around by quite a bit: one processor has 3 times as many particles while another processor only has about a 10th the average number of particles. This indicates that perhaps using more pieces may help with the load balancing. Also, the fact that maxPred is significantly smaller than maxWall indicates that there is significant unused idle time, perhaps due to message latency.
- TreePiece.cpp:(.text+0x62f9): undefined reference to hilbert3d_double You have changed the BIGKEYS configuration (``configure --enable-bigkeys'') without doing a make clean.
-
My simulation runs slowly! Find out why: there is now a script "timings_sph.awk" or "timings_grav.awk" that will summarize where the time is being spent based on the diagnostic output. ``awk -f timings_sph.awk < DIAG'' will give you a summary report.
-
My simulation is running out of memory! (The OOM killer!) There are a number of options that can reduce memory consumption. Note, however, that these are all usually memory/compute speed trade-offs.
-maxbal X
to keep the load balancer from going crazy.-d 3
to lower cache sizes. For SMP runs, fewer processes/node (more threads/process, i.e. increase++ppn
) can help. For example, say you are running on 128 BW nodes, then your aprun command looks like:
aprun -N 4 -n 512 -d 8 ChaNGa ++ppn 7 .... *.param
That's 4 processes/node which is probably optimal since that puts one process on each 8 core socket.
aprun -N 2 -n 256 -d 16 ChaNGa ++ppn 15 .... *.param
will put 2 process on each node, each using 15( +1 for communication) cores. Or
aprun -N 1 -n 128 -d 32 ChaNGa ++ppn 31 .... *.param
will put 1 process on each node, using 31 + 1 cores. This last option
uses the least memory, and is probably the slowest.
The key constraint for units is that ChaNGa requires G = 1. Hence if you choose two out of the three possible dimensions, the third is always chosen for you. The two dimensions that are specifiable in the parameter file are mass in "dMSolUnit" and length in "dKpcUnit". The time dimension is determined from these two. A simple example is from the solar system: if the mass unit is 1 Solar mass ("dMSolUnit = 1") and the length unit is 1 AU ("dKpcUnit = 4.8481368e-09"), then the time unit is (1 year/ (2 pi)), since the Earth moves one radian in one time unit. Another example is convenient unit system for galactic dynamics: with a time unit of 1 gigayear and a length unit of 1 kiloparsec (dKpcUnit = 1), the mass unit is required to be 2.2222858e5 Solar Masses ("dMSolUnit = 2.2222858e5"). Conveniently, 1 kpc/gigayear = .9778 km/s.
For cosmology, the Hubble parameter enters as a unit, so it needs to be consistent with the other unit choices, and with G = 1. Many choices are possible, but our convention is the following:
-
Boxsize=1=length unit
-
Lengths and velocities are comoving.
-
Velocities are peculiar from the Hubble flow.
-
The critical density of the Universe in simulation units is 1.
- The total mass in the box is rho/rho_crit = Omega_{matter} at z = 0.
- From the Friedmann equation, H_0^2 = 8 pi rho_crit / 3; hence, H_0 = sqrt(8 pi/3) = 2.894405 in simulation units.
- The velocity unit gotten from (Hubble velocity across the box/2.894405).
- The time unit is 2.894405/(Hubble parameter).
Omega0 = 1 Box size = 100mpc at z=0 h=0.5
(just in case you are wondering: sigma8 and P(k) are not important)
mass unit= 6.398e16 solar masses
Total mass in the box = 1 code mass unit
box size = 100Mpc/(1+z)
Velocity unit= 1727.47/(1+z) = (50 100 /2.894405)/(1+z)
Omega0 = 0.3 (lambda or non lambda), Box size 100mpc at z=0, h=0.7
If H0=70Km/sec/Mpc units mass is (70/50)^2 6.938e16 = 13.6e16 Msol
Total mass in the box = Omega0 13.6e16 Msol =4.08e16
Velocity Unit: 70 100/2.894405 = 2418.46/(1+z) km/sec
Omega0 = 0.3 (lambda or non lambda), Box size 22.222mpc at z=0, h=0.7
If H0=70Km/sec/Mpc units mass is (22.222/100)^3 (70/50)^2 6.938e16 Msol = 1.492e15 msol
Total mass in the box = 4.476e14
Velocity Unit: 70 22.222/2.894405 = 537.43/(1+z) km/sec
Note: if got the units right for a given box and cosmology then the mass unit scales easily with boxsize^3 and the velocity unit is proportional to the boxsize for any other boxsizes.
Here be Dragons.
Going from comoving distance to physical distance is straightforward: r_phys = a r_cm. Physical velocity requires including the Hubble flow: v_phys = a (v_cm + H r_cm), where H is the hubble constant at the given redshift. Beware that v_circular is calculated differently, and therefore scales differently: v_c_phys = v_c_cm/sqrt(a). Finally, sometimes you want to know the ratio of the density to the critical density (e.g. to calculate the virial radius). This depends on cosmology. For Lambda + Matter critical Universes, we have rho/rho_c = rho_cm/(a^3 Omega_Lambda + Omega_matter), where the Omegas are present values and rho_c at the present is 1 as described above.