* Combine thread launches into a single launch per tree for the gpu_hist
algorithm.
* Address deprecation warning.
* Add manual column sampler constructor.
* Turn off OpenMP dynamic thread adjustment to get a guaranteed number of threads.
* Enable OpenMP in CUDA code.
* Optimisations for gpu_hist.
* Use CUDA streams to overlap operations.
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.