Code Description

A. General description:

AMG2013 is a parallel algebraic multigrid solver for linear systems arising
from problems on unstructured grids. See the following papers for details on
the algorithm and its parallel implementation/performance:

   Van Emden Henson and Ulrike Meier Yang, "BoomerAMG: A Parallel Algebraic
   Multigrid Solver and Preconditioner", Appl. Num. Math. 41 (2002),
   pp. 155-177. Also available as LLNL technical report UCRL-JC-141495.

   Hans De Sterck, Ulrike Meier Yang and Jeffrey Heys, "Reducing Complexity
   in Parallel Algebraic Multigrid Preconditioners", SIAM Journal on Matrix
   Analysis and Applications 27 (2006), pp. 1019-1039. Also available as
   LLNL technical report UCRL-JRNL-206780.

   Hans De Sterck, Robert D. Falgout, Josh W. Nolting and Ulrike Meier Yang,
   "Distance-Two Interpolation for Parallel Algebraic Multigrid", Numerical
   Linear Algebra with Applications 15 (2008), pp. 115-139. Also available
   as LLNL technical report UCRL-JRNL-230844.

   U. M. Yang, "On Long Range Interpolation Operators for Aggressive
   Coarsening", Numer. Linear Algebra Appl., 17 (2010), pp. 453-472.
   LLNL-JRNL-417371.

   A. H. Baker, R. D. Falgout, T. V. Kolev, and U. M. Yang, "Multigrid
   Smoothers for Ultraparallel Computing", SIAM J. Sci. Comput., 33 (2011),
   pp. 2864-2887. LLNL-JRNL-473191.

The driver provided with AMG2013 builds linear systems for various
3-dimensional problems, which are described in Section D. To determine when
the solver has converged, the driver uses the relative-residual stopping
criterion

   ||r_k||_2 / ||b||_2 < tol

with tol = 10^-6.

B. Coding:

AMG2013 is written in ISO C. It is an SPMD code that uses MPI. Parallelism
is achieved by data decomposition. The driver provided with AMG2013 achieves
this decomposition by simply subdividing the grid into logical P x Q x R
(in 3D) chunks of equal size.

C. Parallelism:

AMG2013 is a highly synchronous code. The communication and computation
patterns exhibit the surface-to-volume relationship common to many parallel
scientific codes. Hence, parallel efficiency is largely determined by the
size of the data "chunks" mentioned above and by the speed of communication
and computation on the machine. AMG2013 is also memory-access bound, doing
only about 1-2 computations per memory access, so memory-access speeds also
have a large impact on performance.

D. Test problems

Problem 1 (default): The default problem is a Laplace-type problem on an
unstructured domain with an anisotropy in one part. A 2-dimensional
projection of the grid with the corresponding 2-dimensional stencils is
illustrated in the file 'mg_grid_labels.pdf'. The problem is made
3-dimensional by extending the domain uniformly in the z-direction. The
default problem size is 384 unknowns, but this is easily refined on the
amg2013 command line (see "Running the Code" for details). Suggestions for
test runs are given in the section "Suggested Test Runs".

Problem 2 (-laplace): Solves

   - cx u_xx - cy u_yy - cz u_zz = (1/h)^2

on the unit cube with Dirichlet boundary conditions u = 0, where h is the
mesh spacing in each direction. Standard finite differences are used to
discretize the equations, yielding 7-point stencils in 3D. This problem can
also be used to generate 2D or 1D problems by setting the length in one or
two of the directions (nx, ny or nz) to 1. A small sketch of the resulting
stencil is given after the list of test problems below.

Problem 3 (-27pt): Solves a Laplace-type problem using a 27-point stencil.

Problem 4 (-jumps): Solves the PDE

   - a(x,y,z) (u_xx + u_yy + u_zz) = (1/h)^2

on the unit cube with Dirichlet boundary conditions u = 0, where

   a(x,y,z) = 1000   on [0.1,0.9] x [0.1,0.9] x [0.1,0.9],
            = 0.01   on the 8 corner cubes of size 0.1 x 0.1 x 0.1,
            = 1      elsewhere.
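For concreteness, the following small C sketch shows the coefficients that
such a finite-difference discretization produces for one interior row of
Problem 2 (after scaling the equation by h^2). It is illustrative only and
is not taken from the AMG2013 sources; the driver builds these matrices
through routines in parcsr_ls (e.g. par_laplace.c), and the ordering of the
stencil entries below is an assumption made for this example.

   /*
    * Illustrative sketch (not AMG2013 source): coefficients of the standard
    * 7-point stencil for  -cx u_xx - cy u_yy - cz u_zz  on a uniform grid,
    * after multiplying the equation by h^2.  Dirichlet boundaries simply
    * drop the corresponding off-diagonal connections.
    * Assumed entry order: center, west, east, south, north, bottom, top.
    */
   void seven_point_row(double cx, double cy, double cz, double stencil[7])
   {
      stencil[0] = 2.0*(cx + cy + cz);   /* center          */
      stencil[1] = -cx;  stencil[2] = -cx;   /* west / east     */
      stencil[3] = -cy;  stencil[4] = -cy;   /* south / north   */
      stencil[5] = -cz;  stencil[6] = -cz;   /* bottom / top    */
   }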
%==========================================================================
%==========================================================================

Important Kernels in this Distribution

Listed here are the important files used for the linear solver, both
preconditioner and solver. Files that do not take much time, such as
wrappers (files starting with HYPRE_), files that are not used during the
suggested runs, and files that are only used to generate the problems are
not included here. A complete listing of all directories and files, as well
as a short description of each directory, can be found in the next section.

In the 'krylov' directory:

   pcg.c                     functions for the conjugate gradient algorithm
   gmres.c                   functions for the GMRES algorithm

In the 'parcsr_ls' directory:

   par_amg.c                 setup phase of the AMG preconditioner
   par_amg_setup.c           setup phase of the AMG preconditioner
   par_coarsen.c             various coarsening algorithms
   par_strength.c            computes a strength matrix for coarsening and
                             interpolation
   par_indepset.c            independent set function needed for coarsening
   par_interp.c              interpolation algorithms for solvers 0 and 3
   par_lr_interp.c           interpolation algorithms for solvers 1 and 4
   aux_interp.c              auxiliary functions needed in par_lr_interp.c
   par_multi_interp.c        interpolation algorithm for the fine level in
                             solvers 1 and 4
   par_rap.c                 generates the coarse grid operator
   par_rap_communication.c   sets up communication in par_rap.c
   par_amg_solve.c           solve phase of the AMG preconditioner
   par_cycle.c               AMG cycle
   par_relax.c               AMG smoothers

In the 'parcsr_mv' directory:

   par_csr_communication.c   communication routines for global partitioning
   new_commpkg.c             communication routines for assumed partitioning
   par_csr_assumed_part.c    communication routines for assumed partitioning
   par_csr_matrix.c          basic parallel matrix operations
   par_csr_matvec.c          parallel matrix-vector multiplication
   par_csr_matop.c           additional parallel matrix operations
   par_vector.c              basic vector operations

In the 'seq_mv' directory:

   big_csr_matrix.c          basic sequential matrix operations
   csr_matrix.c              basic sequential matrix operations
   csr_matvec.c              sequential matrix-vector multiplication
   csr_matop.c               additional sequential matrix operations
   vector.c                  basic sequential vector operations
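To give a rough idea of what the PCG kernel in krylov/pcg.c computes, here
is a minimal, self-contained sketch of preconditioned conjugate gradients
using the relative-residual stopping criterion quoted in Section A. It is
an illustration only, under assumed callback names (apply_A, apply_M) and a
flat-array vector layout; it is not the hypre interface.

   #include <math.h>
   #include <stdlib.h>

   /* y = A*x or y = M^{-1}*x, supplied by the caller (assumed interface). */
   typedef void (*matvec_fn)(int n, const double *x, double *y, void *data);

   static double dot(int n, const double *x, const double *y)
   {  double s = 0.0; for (int i = 0; i < n; i++) s += x[i]*y[i]; return s; }

   /* Minimal PCG sketch: iterate until ||r||_2 / ||b||_2 < tol. */
   int pcg_sketch(int n, matvec_fn apply_A, void *A, matvec_fn apply_M,
                  void *M, const double *b, double *x, double tol, int maxit)
   {
      double *r = malloc(n*sizeof(double)), *z = malloc(n*sizeof(double));
      double *p = malloc(n*sizeof(double)), *q = malloc(n*sizeof(double));
      double bnorm = sqrt(dot(n, b, b)), rho = 0.0, rho_old = 0.0;
      int it;
      apply_A(n, x, q, A);                          /* r = b - A*x          */
      for (int i = 0; i < n; i++) r[i] = b[i] - q[i];
      for (it = 0; it < maxit; it++)
      {
         if (sqrt(dot(n, r, r)) < tol*bnorm) break; /* ||r||/||b|| < tol    */
         apply_M(n, r, z, M);                       /* z = M^{-1} r         */
         rho = dot(n, r, z);
         if (it == 0)
            for (int i = 0; i < n; i++) p[i] = z[i];
         else
         {
            double beta = rho/rho_old;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta*p[i];
         }
         apply_A(n, p, q, A);
         double alpha = rho/dot(n, p, q);
         for (int i = 0; i < n; i++) { x[i] += alpha*p[i]; r[i] -= alpha*q[i]; }
         rho_old = rho;
      }
      free(r); free(z); free(p); free(q);
      return it;                                    /* iterations performed */
   }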
%==========================================================================
%==========================================================================

Files in this Distribution

NOTE: The AMG2013 code is derived directly from the hypre library, a large
linear solver library that is being developed in the Center for Applied
Scientific Computing (CASC) at LLNL.

In the amg2013 directory the following files are included:

   COPYING.LESSER  COPYRIGHT  HYPRE.h  Makefile  Makefile.include

The following subdirectories are also included:

   docs         documentation
   IJ_mv        linear-algebraic interface routines
   krylov       Krylov solvers, such as PCG and GMRES
   parcsr_ls    routines needed to generate solvers and preconditioners,
                as well as Problems 2-4
   parcsr_mv    parallel matrix and vector routines (ParCSR data structure)
   seq_mv       sequential matrix and vector routines
   sstruct_mv   semistructured matrix and vector routines
                (included to generate Problem 1)
   struct_mv    structured matrix and vector routines
                (included to generate Problem 1)
   test         driver and input file for Problem 1
   utilities    functions for memory allocation, timing, error codes,
                sorting, searching, etc.

In the 'docs' directory the following files are included:

   amg2013.readme  mg_grid_labels.pdf

In the 'IJ_mv' directory the following files are included:

   aux_parcsr_matrix.c  aux_parcsr_matrix.h  aux_par_vector.c
   aux_par_vector.h  headers.h  HYPRE_IJMatrix.c  HYPRE_IJ_mv.h
   HYPRE_IJVector.c  IJMatrix.c  IJ_matrix.h  IJMatrix_parcsr.c  IJ_mv.h
   IJVector.c  IJ_vector.h  IJVector_parcsr.c  Makefile

In the 'krylov' directory the following files are included:

   all_krylov.h  gmres.c  gmres.h  HYPRE_gmres.c  HYPRE_MatvecFunctions.h
   HYPRE_pcg.c  krylov.h  Makefile  pcg.c  pcg.h

In the 'parcsr_ls' directory the following files are included:

   aux_interp.c  aux_interp.h  gen_redcs_mat.c  headers.h
   HYPRE_parcsr_amg.c  HYPRE_parcsr_gmres.c  HYPRE_parcsr_ls.h
   HYPRE_parcsr_pcg.c  Makefile  par_amg.c  par_amg.h  par_amg_setup.c
   par_amg_solve.c  par_cg_relax_wt.c  par_coarsen.c  par_coarse_parms.c
   parcsr_ls.h  par_cycle.c  par_difconv.c  par_indepset.c  par_interp.c
   par_jacobi_interp.c  par_laplace_27pt.c  par_laplace.c  par_lr_interp.c
   par_multi_interp.c  par_nodal_systems.c  par_rap.c
   par_rap_communication.c  par_relax.c  par_relax_interface.c
   par_relax_more.c  par_scaled_matnorm.c  par_stats.c  par_strength.c
   partial.c  par_vardifconv.c  pcg_par.c

In the 'parcsr_mv' directory the following files are included:

   headers.h  HYPRE_parcsr_matrix.c  HYPRE_parcsr_mv.h
   HYPRE_parcsr_vector.c  Makefile  new_commpkg.c  new_commpkg.h
   par_csr_assumed_part.c  par_csr_assumed_part.h  par_csr_communication.c
   par_csr_communication.h  par_csr_matop.c  par_csr_matop_marked.c
   par_csr_matrix.c  par_csr_matrix.h  par_csr_matvec.c  parcsr_mv.h
   par_vector.c  par_vector.h

In the 'seq_mv' directory the following files are included:

   big_csr_matrix.c  csr_matop.c  csr_matrix.c  csr_matrix.h  csr_matvec.c
   genpart.c  headers.h  HYPRE_csr_matrix.c  HYPRE_seq_mv.h  HYPRE_vector.c
   Makefile  seq_mv.h  vector.c  vector.h

In the 'sstruct_mv' directory the following files are included:

   box_map.c  box_map.h  headers.h  HYPRE_sstruct_graph.c
   HYPRE_sstruct_grid.c  HYPRE_sstruct_matrix.c  HYPRE_sstruct_mv.h
   HYPRE_sstruct_stencil.c  HYPRE_sstruct_vector.c  Makefile
   sstruct_axpy.c  sstruct_copy.c  sstruct_graph.c  sstruct_graph.h
   sstruct_grid.c  sstruct_grid.h  sstruct_innerprod.c  sstruct_matrix.c
   sstruct_matrix.h  sstruct_matvec.c  sstruct_mv.h
   sstruct_overlap_innerprod.c  sstruct_scale.c  sstruct_stencil.c
   sstruct_stencil.h  sstruct_vector.c  sstruct_vector.h

In the 'struct_mv' directory the following files are included:

   assumed_part.c  assumed_part.h  box_algebra.c  box_alloc.c
   box_boundary.c  box.c  box.h  box_manager.c  box_manager.h
   box_neighbors.c  box_neighbors.h  box_pthreads.h  communication_info.c
   computation.c  computation.h  grow.c  headers.h  HYPRE_struct_grid.c
   HYPRE_struct_matrix.c  HYPRE_struct_mv.h  HYPRE_struct_stencil.c
   HYPRE_struct_vector.c  Makefile  new_assemble.c
   new_box_neighbors.c  project.c  struct_axpy.c  struct_communication.c
   struct_communication.h  struct_copy.c  struct_grid.c  struct_grid.h
   struct_innerprod.c  struct_io.c  struct_matrix.c  struct_matrix.h
   struct_matrix_mask.c  struct_matvec.c  struct_mv.h
   struct_overlap_innerprod.c  struct_scale.c  struct_stencil.c
   struct_stencil.h  struct_vector.c  struct_vector.h

In the 'test' directory the following files are included:

   amg2013.c  Makefile  sstruct.in.MG.FD

In the 'utilities' directory the following files are included:

   amg_linklist.c  amg_linklist.h  binsearch.c  exchange_data.c
   exchange_data.h  exchange_data.README  general.h  hypre_error.c
   hypre_error.h  hypre_memory.c  hypre_memory.h  hypre_qsort.c
   hypre_smp_forloop.h  HYPRE_utilities.h  Makefile  memory_dmalloc.c
   mpistubs.c  mpistubs.h  qsplit.c  random.c  threading.c  threading.h
   thread_mpistubs.c  thread_mpistubs.h  timer.c  timing.c  timing.h
   umalloc_local.c  umalloc_local.h  utilities.h

%==========================================================================
%==========================================================================

Building the Code

AMG2013 uses a simple Makefile system for building the code. All compiler
and link options are set by modifying the file 'amg2013/Makefile.include'
appropriately. This file is then included in each of the following
makefiles:

   krylov/Makefile
   IJ_mv/Makefile
   parcsr_ls/Makefile
   parcsr_mv/Makefile
   seq_mv/Makefile
   sstruct_mv/Makefile
   struct_mv/Makefile
   test/Makefile
   utilities/Makefile

To build the code, first modify the 'Makefile.include' file appropriately,
then type (in the amg2013 directory)

   make

Other available targets are

   make clean       (deletes .o files)
   make veryclean   (deletes .o files, libraries, and executables)

To configure the code:

1 - to run with MPI only, add '-DTIMER_USE_MPI' to the 'INCLUDE_CFLAGS'
    line in the 'Makefile.include' file and use a valid MPI.
2 - to run with OpenMP and MPI, add the vendor-dependent compilation flag
    for OpenMP.
3 - to use the assumed partition (recommended for several thousand
    processors or more), add '-DHYPRE_NO_GLOBAL_PARTITION'.
4 - to be able to solve problems with more than 2^31-1 unknowns, add
    '-DHYPRE_LONG_LONG'.

%==========================================================================
%==========================================================================

Optimization and Improvement Challenges

This code is memory-access bound. We believe it would be very difficult to
obtain "good" cache reuse with an optimized version of the code.
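To make the memory-bound claim concrete, the sketch below shows a generic
compressed-sparse-row matrix-vector product in the spirit of
seq_mv/csr_matvec.c. It is a simplified illustration, not the hypre
routine: each inner-loop iteration performs one multiply-add but loads a
matrix value, a column index, and an entry of x, which is roughly the 1-2
computations per memory access mentioned in Section C.

   /* Simplified CSR matrix-vector product y = A*x (sketch only).
    * row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i,
    * col_idx[j] is the column and val[j] the value of nonzero j.
    * Performance is bounded by memory traffic, not arithmetic. */
   void csr_matvec(int nrows, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
   {
      for (int i = 0; i < nrows; i++)
      {
         double sum = 0.0;
         for (int j = row_ptr[i]; j < row_ptr[i+1]; j++)
            sum += val[j] * x[col_idx[j]];
         y[i] = sum;
      }
   }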
%==========================================================================
%==========================================================================

Parallelism and Scalability Expectations

AMG2013 has been run on the following platforms:

   BG/Q   - over 1,000,000 MPI processes
   BG/P   - up to 125,000 MPI processes
   Sierra - 13,824 MPI processes and more

Consider increasing both problem size and number of processors in tandem.
On scalable architectures, the time to solution for AMG2013 will initially
increase, then level off at a modest number of processors, remaining
roughly constant for larger numbers of processors. Iteration counts will
also increase slightly for small to modest problem sizes, then level off at
a roughly constant number for larger problem sizes.

For example, we get the following timing results (in seconds) for a 3D
Laplace problem with cx = cy = cz = 1.0, distributed on a logical
P x Q x R processor topology, with a fixed local problem size per process
of 40 x 40 x 40:

   P x Q x R      procs    solver similar to solver 0
   ---------------------------------------------------
   16x16x16        4096         5.75
   20x20x20        8000         6.88
   32x32x32       32768         8.11
   45x45x45       91125        10.48
   50x50x50      125000        10.54

These results were obtained on BG/P using the assumed partition option
-DHYPRE_NO_GLOBAL_PARTITION and -DHYPRE_LONG_LONG.

%==========================================================================
%==========================================================================

Running the Code

The driver for AMG2013 is called `amg2013' and is located in the
amg2013/test subdirectory. Type

   mpirun -np 1 amg2013 -help

to get usage information. This prints out the following:

Usage: amg2013 [<options>]

  -in <filename>      : input file (default is `sstruct.in.AMG.FD')
  -P <Px> <Py> <Pz>   : define processor topology per part
                        Note that for test problem 1, which has 8 parts,
                        this leads to 8*Px*Py*Pz MPI processes!
                        For all other test problems, the total number of
                        MPI processes is Px*Py*Pz.
  -pooldist <p>       : pool distribution to use
  -r <rx> <ry> <rz>   : refine part(s) for default problem
  -b <bx> <by> <bz>   : refine and block part(s) for default problem
  -n <nx> <ny> <nz>   : define size per processor for problems on a cube
  -c <cx> <cy> <cz>   : define anisotropies for Laplace problem
  -laplace            : 3D Laplace problem on a cube
  -27pt               : problem with 27-point stencil on a cube
  -jumps              : PDE with jumps on a cube
  -solver <ID>        : solver ID (default = 0)
                        0 - PCG with AMG precond
                        1 - PCG with diagonal scaling
                        2 - GMRES(10) with AMG precond
                        3 - GMRES(10) with diagonal scaling
  -printstats         : print out detailed info on AMG preconditioner
  -printsystem        : print out the system
  -rhsfromcosine      : solution is cosine function (default);
                        can be used for default problem only
  -rhsone             : rhs is vector with unit components

All of the arguments are optional. The most important option for the
AMG2013 compact application is the `-P' option. It specifies the MPI
process topology on which to run.

For the default problem, there are two possible pool distributions, which
lead to different partitionings of the problem. Pool distribution 0 gives
each process a portion of one of the 8 parts of the test problem, thus
assigning disjoint subdomains to the processes. Pool distribution 1 uses a
more natural partitioning, assigning each process a subdomain in one of the
8 parts; it therefore requires the total number of processes to be a
multiple of 8, i.e. it needs to be run as follows:

   mpirun -np <np> amg2013 -pooldist 1 -P <px> <py> <pz> ...

with <np> = 8*<px>*<py>*<pz>. Both partitionings lead to a load-balanced
distribution of the original problem.

The problem size per MPI process can be increased using the `-r' option,
which defines the refinement factor for the grid on each process in each
direction, or the `-b' option, which increases the number of blocks per
process. For the other three problems (laplace, 27pt and jumps), the `-n'
option specifies the local problem size per MPI process, leading to a
global problem size of <nx>*Px by <ny>*Py by <nz>*Pz.
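The relations just described can be summarized in a small illustrative
program; the values below are assumed example inputs, and the code is not
extracted from amg2013.c.

   /* Illustration only: how -P, -n and -pooldist relate to process counts
    * and global problem sizes, per the description above. */
   #include <stdio.h>

   int main(void)
   {
      int Px = 2, Py = 2, Pz = 2;      /* -P Px Py Pz (assumed example)   */
      int nx = 40, ny = 40, nz = 40;   /* -n nx ny nz (Laplace-type runs) */

      /* -laplace / -27pt / -jumps: one subdomain per process */
      long procs_cube  = (long)Px * Py * Pz;
      long global_size = (long)(nx*Px) * (long)(ny*Py) * (long)(nz*Pz);

      /* default problem with -pooldist 1: 8 parts, so 8*Px*Py*Pz processes */
      long procs_default = 8L * Px * Py * Pz;

      printf("cube problems:   %ld processes, global grid %d x %d x %d "
             "(%ld points)\n", procs_cube, nx*Px, ny*Py, nz*Pz, global_size);
      printf("default problem: %ld processes (pooldist 1)\n", procs_default);
      return 0;
   }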
%==========================================================================
%==========================================================================

Timing Issues

If using MPI, the whole code is timed using the MPI timers. If not using
MPI, standard system timers are used. Timing results are printed to
standard out and are divided into "Setup Phase" times and "Solve Phase"
times. Timings for a few individual routines are also printed out.

%==========================================================================
%==========================================================================

Memory Needed

AMG2013's memory needs are somewhat complicated to describe. They are very
dependent on the type of problem solved and the AMG options used. In
general, solver 1 and solver 4 will need less memory than solver 0 and 3.
When the '-printstats' option is turned on, operator complexities are
displayed. The operator complexity is defined as the sum of the numbers of
nonzeros of the original matrix and all coarse grid matrices, divided by
the number of nonzeros of the original matrix; i.e., storing the original
matrix together with all coarse grid operators requires roughly "operator
complexity" times as much space as the original matrix alone (a small
worked example follows the sample run below). However, this does not
include memory needed for interpolation operators, communication, etc.

%==========================================================================
%==========================================================================

About the Data

AMG2013 requires one input file to generate the default problem; it is
located in the test directory. Apart from this, all control is on the
command line.

%==========================================================================
%==========================================================================

Expected Results

Consider the following run, which was compiled using the options
-DTIMER_USE_MPI -DHYPRE_USING_OPENMP, linking with OpenMP and setting
OMP_NUM_THREADS to 2:

   mpirun -np 8 amg2013 -pooldist 1 -P 1 1 1 -r 4 4 4 -printstats

This is what AMG2013 prints out:

=============================================
SStruct Interface:
=============================================
SStruct Interface:
  SStruct Interface wall clock time = 0.014211 seconds
  SStruct Interface cpu clock time  = 0.010000 seconds

 Number of MPI processes: 8 , Number of OpenMP threads: 2

BoomerAMG SETUP PARAMETERS:

 Max levels = 25
 Num levels = 6

 Strength Threshold = 0.250000
 Interpolation Truncation Factor = 0.000000
 Maximum Row Sum Threshold for Dependency Weakening = 0.900000

 Coarsening Type = HMIS
   Hybrid Coarsening (switch to CLJP when coarsening slows)
 measures are determined locally

 no. of levels of aggressive coarsening: 1

 Interpolation = extended+i interpolation

Operator Matrix Information:

             nonzero         entries per row        row sums
 lev   rows  entries  sparse  min  max   avg      min         max
 ===================================================================
  0   82944   648936   0.000    4    9    7.8  -4.274e-15   3.000e+02
  1    8985   159896   0.002    4   45   17.8  -2.069e-13   9.293e+02
  2    2763    72864   0.010    6  121   26.4  -2.487e-14   1.668e+03
  3    1001    21100   0.021    3  167   21.1   8.298e-02   3.147e+03
  4     320     8354   0.082    2   79   26.1   1.938e-01   1.098e+02
  5      21      171   0.388    4   12    8.1   5.854e+00   6.784e+00

Interpolation Matrix Information:

                    entries/row    min        max           row sums
 lev  rows x cols    min  max     weight     weight       min        max
 =================================================================
  0  82944 x 8985     1   10    1.488e-02  9.980e-01   1.759e-01  1.000e+00
  1   8985 x 2763     1    4    8.769e-03  1.000e+00   1.624e-01  1.000e+00
  2   2763 x 1001     0    4    2.076e-03  1.000e+00   0.000e+00  1.000e+00
  3   1001 x  320     0    4   -4.281e-01  1.452e+00  -7.104e-03  1.000e+00
  4    320 x   21     0    4    2.627e-03  5.150e-02   0.000e+00  1.000e+00

     Complexity:    grid = 1.157817
                operator = 1.404331

BoomerAMG SOLVER PARAMETERS:

  Maximum number of cycles:         1
  Stopping Tolerance:               0.000000e+00
  Cycle type (1 = V, 2 = W, etc.):  1

  Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of partial sweeps:    1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:      8    8     8
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   0
                   Post-CG relaxation (up):   0
                             Coarsest grid:   0

=============================================
Setup phase times:
=============================================
PCG Setup:
  PCG Setup wall clock time = 0.066036 seconds
  PCG Setup cpu clock time  = 0.090000 seconds

System Size / Setup Phase Time: 1.674723e+06

=============================================
Solve phase times:
=============================================
PCG Solve:
  PCG Solve wall clock time = 0.103601 seconds
  PCG Solve cpu clock time  = 0.140000 seconds

AMG2013 Benchmark version 1.0

Iterations = 8
Final Relative Residual Norm = 6.945422e-07

System Size * Iterations / Solve Phase Time: 8.539842e+06
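As a worked example of the operator complexity discussed under "Memory
Needed", the value reported above can be recomputed from the 'entries'
column of the operator matrix table: summing the nonzeros over all levels
and dividing by the level-0 nonzeros reproduces the reported value. The
following is a sketch for illustration only, not AMG2013 code.

   /* Recompute the operator complexity reported in the sample run above:
    * sum of nonzeros over all levels divided by nonzeros on the finest
    * level. */
   #include <stdio.h>

   int main(void)
   {
      /* 'entries' column of the Operator Matrix Information table */
      long nnz[] = { 648936, 159896, 72864, 21100, 8354, 171 };
      long total = 0;
      for (int lev = 0; lev < 6; lev++) total += nnz[lev];
      printf("operator complexity = %.6f\n", (double)total / nnz[0]);
      /* prints 1.404331, matching "operator = 1.404331" above */
      return 0;
   }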
%==========================================================================
%==========================================================================

Suggested Test Runs

1. For the default problem:

   mpirun -np <8*px*py*pz> amg2013 -pooldist 1 -r 12 12 12 -P px py pz

   This will generate a problem with 82,944 variables per MPI process,
   leading to a total system size of 663,552*px*py*pz.

   mpirun -np <8*px*py*pz> amg2013 -pooldist 1 -r 24 24 24 -P px py pz

   This will generate a problem with 663,552 variables per process, leading
   to a total system size of 5,308,416*px*py*pz, and solve it using
   conjugate gradient preconditioned with AMG. To use AMG-GMRES(10)
   instead, append -solver 2.

   The domain (for a 2-dimensional projection of the domain see
   mg_grid_labels.pdf) can be scaled up by increasing the values of px, py
   and pz.

2. For the 7pt 3D Laplace problem:

   mpirun -np <px*py*pz> amg2013 -laplace -n 40 40 40 -P px py pz

   This will generate a problem with 64,000 grid points per MPI process,
   with a domain of size 40*px x 40*py x 40*pz.

   mpirun -np <px*py*pz> amg2013 -laplace -n 80 80 80 -P px py pz

   This will generate a problem with 512,000 grid points per MPI process,
   with a domain of size 80*px x 80*py x 80*pz.

%==========================================================================
%==========================================================================

For further information on AMG2013 contact

   Ulrike Yang
   ph:    (925) 422-2850
   email: umyang@llnl.gov

%==========================================================================
%==========================================================================

Release and Modification Record

LLNL code release number: UCRL-CODE-222953.

See the files COPYRIGHT and COPYING.LESSER for a complete copyright notice,
additional contact information, disclaimer and license.