Elsevier Science Home
Computer Physics Communications Program Library
Full text online from Science Direct
Programs in Physics & Physical Chemistry
CPC Home

[Licence| Download | New Version Template] aedg_v2_0.tar.gz(783 Kbytes)
Manuscript Title: NDL-v2.0: A New Version of the Numerical Differentiation Library for Parallel Architectures
Authors: P.E. Hadjidoukas, P. Angelikopoulos, C. Voglis, D.G. Papageorgiou, I.E. Lagaris
Program title: NDL-v2.0
Catalogue identifier: AEDG_v2_0
Distribution format: tar.gz
Journal reference: Comput. Phys. Commun. 185(2014)2217
Programming language: ANSI Fortran-77, ANSI C, Python.
Computer: Distributed systems (clusters), shared memory systems.
Operating system: Linux, Unix.
Has the code been vectorised or parallelized?: Yes.
RAM: The library uses O(N) internal storage, N being the dimension of the problem. It can use up to O(N2) internal storage for Hessian calculations, if a task throttling factor has not been set by the user.
Keywords: Numerical differentiation, Finite differences, Hessian calculations, OpenMP, MPI, Parallel processing, Distributed computing.
PACS: 02.60.Jh, 02.60.Pn, 02.70.Bf.
Classification: 4.9, 4.14, 6.5.

Does the new version supersede the previous version?: Yes

Nature of problem:
The numerical estimation of derivatives at several accuracy levels is a common requirement in many computational tasks, such as optimization, solution of nonlinear systems, sensitivity analysis. For a large number of scientific and engineering applications, the underlying functions correspond to simulation codes for which analytical estimation of derivatives is difficult or almost impossible. A parallel implementation that exploits systems with multiple CPUs is very important for large scale and computationally expensive problems.

Solution method:
Finite differencing is used with a carefully chosen step that minimizes the sum of the truncation and round-off errors. The parallel versions employ both OpenMP and MPI libraries.

Reasons for new version:
The updated version was motivated by our endeavours to extend a parallel Bayesian uncertainty quantification framework [1], by incorporating higher order derivative information as in most state-of-the-art stochastic simulation methods such as Stochastic Newton MCMC [2] and Riemanian Manifold Hamiltonian MC [3]. The function evaluations are simulations with signicant time-to-solution, which also varies with the input parameters such as in [1, 4]. The runtime of the N-body-type of problem changes considerably with the introduction of a longer cut-off between the bodies. In the first version of the library, the OpenMP-parallel subroutines spawn a new team of threads and distribute the function evaluations with a PARALLEL DO directive. This limits the functionality of the library as multiple concurrent calls require nested parallelism support from the OpenMP environment. Therefore, either their function evaluations will be serialized or processor oversubscription is likely to occur due to the increased number of OpenMP threads. In addition, the Hessian calculations include two explicit parallel regions that compute first the diagonal and then the off-diagonal elements of the array. Due to the barrier between the two regions, the parallelism of the calculations is not fully exploited. These issues have been addressed in the new version by first restructuring the serial code and then running the function evaluations in parallel using OpenMP tasks. Although the MPI-parallel implementation of the first version is capable of fully exploiting the task parallelism of the PNDL routines, it does not utilize the caching mechanism of the serial code and, therefore, performs some redundant function evaluations in the Hessian and Jacobian calculations. This can lead to: a) higher execution times if the number of available processors is lower than the total number of tasks, and b) signicant energy consumption due to wasted processor cycles. Overcoming these drawbacks, which become critical as the time of a single function evaluation increases, was the primary goal of this new version. Due to the code restructure, the MPI-parallel implementation (and the OpenMP-parallel in accordance) avoids redundant calls, providing optimal performance in terms of the number of function evaluations. Another limitation of the library was that the library subroutines were collective and synchronous calls. In the new version, each MPI process can issue any number of subroutines for asynchronous execution. We introduce two library calls that provide global and local task synchronization, similarly to the BARRIER and TASKWAIT directives of OpenMP. The new MPI-implementation is based on TORC, a new tasking library for multicore clusters [5-7]. TORC improves the portability of the software, as it is relies exclusively on the POSIX-Threads and MPI programming interfaces. It allows MPI processes to utilize multiple worker threads, offering a hybrid programming and execution environment similar to MPI+OpenMP, in a completely transparent way. Finally, to further improve the usability of our software, a Python interface has been implemented on top of both the OpenMP and MPI versions of the library. This allows sequential Python codes to exploit shared and distributed memory systems.

Summary of revisions:
The revised code improves the performance of both parallel (OpenMP and MPI) implementations. The functionality and the user-interface of the MPI-parallel version have been extended to support asynchronous execution of multiple PNDL calls, issued by one or multiple MPI processes. A new underlying tasking library increases portability and allows MPI processes to have multiple worker threads. For both implementations, an interface to the Python programming language has been added.

The library uses only double precision arithmetic. The MPI implementation assumes homogeneity of the execution environment provided by the operating system. Specifically, the processes of a single MPI application must have identical address space and a user function resides at the same virtual address. In addition, address space layout randomization should not be used for the application.

Unusual features:
The software takes into account bound constraints, in the sense that only feasible points are used to evaluate the derivatives, and given the level of the desired accuracy, the proper formula is automatically employed.

Running time:
Running time depends on the functions complexity. The test run took 23 ms for the serial distribution, 25 ms for the OpenMP with 2 threads, 53 ms and 1.01 s for the MPI parallel distribution using 2 threads and 2 processes respectively and yield-time for idle workers equal to 10ms.

[1] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Bayesian uncertainty quantication and propagation in molecular dynamics simulations: A high performance computing framework, J. Chem. Phys 137 (14).
[2] H. P. Flath, L. C.Wilcox, V. Akcelik, J. Hill, B. van Bloemen Waanders, O. Ghattas, Fast algorithms for bayesian uncertainty quantication in large-scale linear inverse problems based on low-rank partial hessian approximations, SIAM J. Sci. Comput. 33 (1) (2011) 407-432.
[3] M. Girolami, B. Calderhead, Riemann manifold langevin and hamiltonian monte carlo methods, J. .R. Stat. Soc. Ser. B (Stat. Methodol.) 73 (2) (2011) 123-214.
[4] P. Angelikopoulos, C. Paradimitriou, P. Koumoutsakos, Data driven, predictive molecular dynamics for nanoscale flow simulations under uncertainty, J. Phys. Chem. B 117 (47) (2013) 14808-14816.
[5] P. E. Hadjidoukas, E. Lappas, V. V. Dimakopoulos, A runtime library for platform-independent task parallelism, in: PDP, IEEE, 2012, pp. 229-236.
[6] C. Voglis, P. E. Hadjidoukas, D. G. Papageorgiou, I. Lagaris, A parallel hybrid optimization algorithm for fitting interatomic potentials, Appl. Soft Comput. 13 (12) (2013) 4481-4492.
[7] P. E. Hadjidoukas, C. Voglis, V. V. Dimakopoulos, I. Lagaris, D. G. Papageorgiou, Supporting adaptive and irregular parallelism for non-linear numerical optimization, Appl. Math. Comput. 231 (2014) 544-559.