VASP
The Vienna Ab initio Simulation Package (VASP) [1] is a package for performing electronic structure calculations from first principles, based on density functional theory [2]. In VASP, central quantities such as the one-electron wave functions, the electronic charge density, and the local potential are expressed in plane-wave basis sets, and the interactions between ions and electrons are described using the projector-augmented-wave method [3]. The atomic structures studied with VASP are specified by a unit cell, subject to periodic boundary conditions. Figure 1 illustrates this with a contour plot of the self-consistent charge density in a simple cubic unit cell of Si.
To determine the electronic ground state, VASP makes use of efficient iterative matrix diagonalisation techniques, such as the residual minimisation method with direct inversion of the iterative subspace (RMM-DIIS) used in the benchmarks presented below. These are coupled to highly efficient Broyden and Pulay density mixing schemes to speed up the self-consistency cycle (see [1] for a detailed description of VASP).
Scaling
The computational cost of the RMM-DIIS iterative diagonalisation of the Hamiltonian scales as $N_b N_{pw} \ln N_{pw}$, where $N_b$ is the number of occupied electronic orbitals in the system, $N_{pw}$ is the number of plane waves in the basis set, and $N_{pw} \ln N_{pw}$ is the cost of a Fast Fourier Transform. Since $N_b$ and $N_{pw}$ scale linearly with increasing system size $N$, the fundamental scaling behaviour of the RMM-DIIS is $N^2 \ln N$.
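The scaling estimate above can be made concrete with a small cost model. The following is a minimal sketch, not a description of VASP's internals: the function name and the number of plane waves per atom are hypothetical placeholders, and only the two occupied orbitals per atom are taken from the diamond benchmarks described in this text. It evaluates the relative cost (in arbitrary units) for the benchmark system sizes and shows that doubling the system size raises the estimated cost by slightly more than a factor of four.

```python
import math

def rmm_diis_cost(n_atoms, bands_per_atom=2.0, pw_per_atom=100.0):
    """Relative RMM-DIIS cost N_b * N_pw * ln(N_pw), in arbitrary units.

    bands_per_atom=2 follows the diamond benchmarks in the text;
    pw_per_atom is a hypothetical placeholder, not a VASP value.
    """
    n_b = bands_per_atom * n_atoms    # N_b grows linearly with N
    n_pw = pw_per_atom * n_atoms      # N_pw grows linearly with N
    return n_b * n_pw * math.log(n_pw)

# System sizes used in the benchmarks (atoms per unit cell).
sizes = [256, 512, 1024, 2048, 4096]
costs = [rmm_diis_cost(n) for n in sizes]

# Cost ratio on each doubling of N: 4 * ln(2 N_pw) / ln(N_pw), i.e. a bit above 4.
ratios = [costs[i + 1] / costs[i] for i in range(len(costs) - 1)]

for n, r in zip(sizes[1:], ratios):
    print(f"N = {n:5d}: cost grows by a factor of {r:.2f} relative to N/2")
```

Because the logarithm changes slowly, the model is hard to distinguish from pure $N^2$ scaling over this range of system sizes.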
To obtain a robust algorithm, the one-electron wave functions obtained after several iterations of the RMM-DIIS diagonalisation have to be explicitly orthonormalised. This is done by Cholesky (LU) decomposition, which unfortunately scales as $N_b^2 N_{pw}$ (i.e. $N^3$).
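To illustrate the orthonormalisation step, the following is a minimal NumPy sketch of Cholesky-based orthonormalisation of a set of orbitals. It is an assumption-laden illustration rather than VASP's implementation (which is written in Fortran and uses ScaLAPACK): the function name, array layout, and random test data are hypothetical. The orbitals are stored as columns of a matrix C, the overlap matrix S = C^H C is factorised as S = L L^H, and the orthonormal set is obtained as C L^{-H}.

```python
import numpy as np

def cholesky_orthonormalise(C):
    """Orthonormalise the columns of C (shape: n_pw x n_b).

    Hypothetical helper for illustration; not part of VASP.
    """
    # Overlap matrix between orbitals: costs O(N_b^2 * N_pw).
    S = C.conj().T @ C
    # Cholesky factorisation S = L L^H with L lower triangular: O(N_b^3).
    L = np.linalg.cholesky(S)
    # Solve L X = C^H, so X = L^{-1} C^H and X^H = C L^{-H}.
    # A general solver is used here for brevity; a triangular solver
    # would exploit the structure of L.
    X = np.linalg.solve(L, C.conj().T)
    return X.conj().T

# Random complex "orbitals" as stand-in data (200 plane waves, 8 bands).
rng = np.random.default_rng(0)
C = rng.standard_normal((200, 8)) + 1j * rng.standard_normal((200, 8))

C_new = cholesky_orthonormalise(C)
overlap = C_new.conj().T @ C_new
print(np.allclose(overlap, np.eye(8)))
```

In this sketch the construction of the overlap matrix is the step that scales as $N_b^2 N_{pw}$, while the factorisation itself scales as $N_b^3$; both grow as $N^3$ with system size.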
For very large systems, the orthonormalisation will become the dominant step. The following benchmarks, however, were still strongly characterised by the cost of the RMM-DIIS.
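Why the orthonormalisation eventually dominates can be seen by dividing the two scaling expressions. The estimate below is a sketch that ignores all prefactors, which are implementation and hardware dependent and are not given in this text, so it indicates a trend only and not the system size at which the crossover occurs.

```latex
\frac{T_{\mathrm{orth}}}{T_{\mathrm{RMM\text{-}DIIS}}}
  \sim \frac{N_b^2 \, N_{pw}}{N_b \, N_{pw} \ln N_{pw}}
  = \frac{N_b}{\ln N_{pw}}
  \sim \frac{N}{\ln N}
```

The ratio grows almost linearly with the system size $N$, so the relative weight of the orthonormalisation increases steadily as the unit cell gets larger.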
Figure 2 shows the overall scaling of the self-consistency cycle (red line) and the cost of the orthonormalisation (blue line) with increasing system size (diamond, with $N$ = 256, 512, 1024, 2048, and 4096 atoms in the unit cell; 2 valence states per atom) on 32 cores of the VSC. The green line shows the ratio between the time per iteration in the self-consistency cycle and the time per orthonormalisation step. Clearly, as the system size increases, the cost of orthonormalisation makes up an ever larger part of the total effort. Note that in the examples in Figure 2, the orthonormalisation is not the only part of the complete algorithm that scales as $N^3$. Analysing this further, however, is beyond the scope of the present contribution.
The scaling behaviour of VASP with respect to the number of compute cores is illustrated in Figure 3, for diamond with 1024 (red line), 2048 (blue line), and 4096 (green line) atoms in the unit cell. The black line represents the nominal speedup (linear with respect to the number of cores). For the benchmark systems presented here, VASP scales well up to 64 cores. The two largest systems, with 2048 and 4096 atoms in the unit cell, show a satisfactory speedup up to 128 cores. For all systems under consideration, the speedup deteriorates beyond 128 cores. A more detailed analysis (not shown here) reveals that the part that scales worst with the number of compute cores is the orthonormalisation (Cholesky decomposition from Intel MKL's ScaLAPACK).
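For readers who want to produce a comparable analysis from their own timings, the following sketch computes the speedup and parallel efficiency relative to the smallest core count. The function names are hypothetical, and the timing values are arbitrary placeholders chosen only to make the example run; they are not measurements from the benchmarks described here.

```python
def speedup(t_base, t_p):
    """Speedup of a run taking t_p relative to a baseline run taking t_base."""
    return t_base / t_p

def parallel_efficiency(p_base, t_base, p, t_p):
    """Measured speedup divided by the nominal (linear) speedup p / p_base."""
    return speedup(t_base, t_p) * p_base / p

# Placeholder wall-clock times per iteration (seconds), keyed by core count.
# These numbers are illustrative only, NOT benchmark results.
timings = {32: 100.0, 64: 52.0, 128: 30.0, 256: 24.0}

p_base = min(timings)          # use the smallest core count as the baseline
t_base = timings[p_base]

for p in sorted(timings):
    s = speedup(t_base, timings[p])
    e = parallel_efficiency(p_base, t_base, p, timings[p])
    print(f"{p:4d} cores: speedup {s:5.2f} (nominal {p / p_base:4.1f}), "
          f"efficiency {e:4.2f}")
```

An efficiency close to 1 corresponds to a curve that follows the nominal linear speedup; a falling efficiency indicates that parts of the calculation no longer benefit from additional cores.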
Software
Compiler: Intel Fortran 11.1
Libraries: FFTW, Intel MKL (BLAS, LAPACK, and ScaLAPACK)
Parallelisation: QLogic MPI
References
[1] G. Kresse and J. Furthmüller, Comput. Mat. Sci. 6, 15 (1996); G. Kresse and J. Furthmüller, Phys. Rev. B 54, 11169 (1996).
[2] W. Kohn, Rev. Mod. Phys. 71, 1253 (1999).
[3] P. E. Blöchl, Phys. Rev. B 50, 17953 (1994); G. Kresse and D. Joubert, Phys. Rev. B 59, 1758 (1999).