Normal prediction with DMatrix is now thread safe with locks. Added inplace prediction is lock free thread safe.
When data is on device (cupy, cudf), the returned data is also on device.
* Implementation for numpy, csr, cudf and cupy.
* Implementation for dask.
* Remove sync in simple dmatrix.
* Move thread local entry into Learner.
This is an attempt to workaround CUDA context issue in static variable, where
the CUDA context can be released before device vector.
* Add PredictionEntry to thread local entry.
This eliminates one copy of prediction vector.
* Don't define CUDA C API in a namespace.