ADACS Development of PRISM

This repository documents the GPU optimisation investigation conducted by ADACS on the model emulation code PRISM as part of Ellert van der Velden’s 2019A ADACS Software Support Project.

This work was carried-out by ADACS developer Gregory B. Poole.

Implementation Description

The approach taken to move the mlxtend least-squares computation invoked by PRISM to the GPU has involved the following steps:

  1. Some increased coverage of the scikit-cuda library’s coverage of the Magma API. This library’s maintainer has kindly implemented these changes at the request of ADACS.
  2. Slight changes to the mlxtend code base. This has involved:
  1. Addition of a permit_estimator_override(est,func) function which calls function func unless an estimator method of the same name exists, in which case it calls that method:
def permit_estimator_override(est, func):
    est_atr = getattr(est, func.__name__, None)
    if callable(est_atr):
        return est_atr
    else:
        return func
  1. Wrapping of all function calls to cross_val_score and Parrallel in an estimator (for this work, that in sequential_feature_selector.py) with this function. For example, this:
parallel = Parallel(n_jobs=n_jobs, verbose=self.verbose, pre_dispatch=self.pre_dispatch)

becomes this

parallel = permit_estimator_override(self, Parallel)(n_jobs=n_jobs, verbose=self.verbose, pre_dispatch=self.pre_dispatch)

Note

These changes have not been pushed into the official mlxtend repository. A version can be found here, which can be installed as follows:

pip install git+https://github.com/ADACS-Australia/ADACS-SS18B-EvdVelden-mlxtend@adacs
  1. Implementation of a metaclass (provided by this repository; see prism_adacs.mlxtend_cuda.estimator for code and an example of it’s use) which provides all the needed support for CUDA-enabled GPU execution of mlxtend estimators. The user need-only:

    1. construct their estimators as they usual do, but with metaclass=CudaEstimator instead of metaclass=type (Python’s default).
    2. add a fit method to their estimator, which gets called in place of the estimator’s normal fitting routine.

Test Problem

For benchmarking, a 4-dimenstional “xsinx” function of the form \(F(\vec{x})=\prod_{i=1}^{4}(a_ix_i)sin(b_ix_i); \vec{a}=\vec{b}=[2,2]\) was used. A set of random samples \(\vec{s}\) were selected randomly in the range \([0,10]\) and a polynomial fit performed to the values \(S(\vec{s})=F(\vec{s})+G(\sigma=0.5)\), where \(G(\sigma)\) is a Gaussian-random value (centroid at the origin; variance of \(\sigma\)). Figure 1 illustrates a 2-dimensional example.

_images/fit_example.png

Figure 1: An example fit to \(S(\vec{x})\) for the case of 250 randomly selected points and a fit polynomial order of 7.

Results

Due to incomplete coverage of the *_gpu functions of the MAGMA library’s API, the current implementation presented here is far from optimal due to unnecessary host->device communications.

We have investigated benchmarks for two cases:

  1. The “current” implementation presented in this repository
  2. An optimistic “best case” scenario where SVD calculations are assumed to be of no expense and device communication are minimised.

Cases with fit polynomial orders of 5, 7 and 10 have been considered with a number of samples ranging from 100 to 100000. Larger problems were not run because they exceeded the maximum wallclock on OzSTAR, where all testing was conducted.

The runtimes of the current implementation are always poorer when utilizing the GPU, reflecting the cost of suboptimal host-device communication. Only for very large problems are accellerations realised for the best case scenario and even then, only by a maximum factor of approximately 2.

These results are presented in the figure below.

_images/timing_grid.png

Figure 2: GPU speed-ups for the current implementation and for a best-case scenario with minimised host-device communication and computing costs dictated only by least-squares fitting. Values less than 1 represent slower run-times; values greater than 1 represent accellerated run-times.