GPU Plugin: Add subsample, colsample_bytree, colsample_bylevel (#1895)

This commit is contained in:
Rory Mitchell
2016-12-23 04:30:36 +13:00
committed by Tianqi Chen
parent cee4aafb93
commit b49b339183
10 changed files with 331 additions and 324 deletions

@@ -9,10 +9,10 @@ https://www.kaggle.com/c/bosch-production-line-performance/data
Copy train_numeric.csv into xgboost/demo/data.
-The subsample parameter can be changed so you can run the script first on a small portion of the data. Processing the entire dataset can take a long time and requires about 8GB of device memory. It is initially set to 0.4, using about 2650/3380MB on a GTX 970.
+The subset parameter changes the proportion of rows loaded from the CSV file. Processing the entire dataset can take a long time and requires about 8GB of device memory. It is initially set to 0.4, using about 2650/3380MB on a GTX 970. Lower the parameter if your device runs out of memory.
```python
-subsample = 0.4
+subset = 0.4
```
Parameters are set as usual except that we set silent to 0 to see how much memory is being allocated on the GPU and we change 'updater' to 'grow_gpu' to activate the GPU plugin.
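
The parameter setup described in this README hunk can be sketched as a plain dictionary. Only `silent` and `updater` come from the text above, and the three sampling keys are the parameters named in the commit title. The objective and all numeric values are illustrative assumptions, not the demo script's actual settings.

```python
# Sketch of a parameter dictionary for the GPU plugin demo.
# Values marked "assumption" are NOT taken from the commit.
params = {
    "objective": "binary:logistic",  # assumption: binary classification task
    "silent": 0,                     # from the README: print GPU memory allocation
    "updater": "grow_gpu",           # from the README: activates the GPU plugin
    "subsample": 0.8,                # assumption: fraction of rows sampled per tree
    "colsample_bytree": 0.8,         # assumption: fraction of columns sampled per tree
    "colsample_bylevel": 0.8,        # assumption: fraction of columns sampled per level
}

# The dictionary would then be passed to training in the usual way, e.g.
#   bst = xgb.train(params, dtrain, num_boost_round)
# where dtrain is an xgb.DMatrix built from the loaded rows.
```

Note that `subset` and `subsample` do different jobs: `subset` limits how many rows are read from the CSV file before training, while `subsample` is the XGBoost parameter that samples from the already loaded rows.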

@@ -5,12 +5,12 @@ import time
import random
from sklearn.cross_validation import StratifiedKFold
-#For sub sampling rows from input file
+#For sampling rows from input file
random_seed = 9
-subsample = 0.4
+subset = 0.4
n_rows = 1183747;
-train_rows = int(n_rows * subsample)
+train_rows = int(n_rows * subset)
random.seed(random_seed)
skip = sorted(random.sample(xrange(1,n_rows + 1),n_rows-train_rows))
data = pd.read_csv("../data/train_numeric.csv", index_col=0, dtype=np.float32, skiprows=skip)
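
The row-sampling step in this hunk can be sketched on its own as follows. This is a minimal Python 3 rendering (`range` instead of `xrange`), and the helper name `rows_to_skip` is hypothetical, not part of the demo script. It returns the 1-based file line numbers to skip, so line 0 (the CSV header) is always kept.

```python
import random


def rows_to_skip(n_rows, subset, seed):
    """Return a sorted list of 1-based data-row numbers to skip.

    n_rows : total number of data rows in the CSV (header excluded)
    subset : proportion of rows to keep, between 0 and 1
    seed   : seed for a local random generator, for reproducibility
    """
    train_rows = int(n_rows * subset)   # number of rows to keep
    rng = random.Random(seed)           # local RNG, leaves global state alone
    # Sample without replacement from rows 1..n_rows; row 0 is the header.
    return sorted(rng.sample(range(1, n_rows + 1), n_rows - train_rows))


# Small illustration with 10 data rows, keeping 40% of them.
skip = rows_to_skip(n_rows=10, subset=0.4, seed=9)

# The list would be passed to pandas as in the script above, e.g.
#   pd.read_csv(path, index_col=0, dtype=np.float32, skiprows=skip)
```

Sampling the rows to skip, rather than the rows to keep, matches the `skiprows` argument of `pd.read_csv`, so the unwanted rows are never materialized in memory.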