Clarifies explanations around Data Interface code

Preston Parry 2015-10-27 22:41:29 -07:00
parent 2e31e97e54
commit 89eafa1b97

@@ -24,32 +24,32 @@ Data Interface
 --------------
 XGBoost python module is able to loading from libsvm txt format file, Numpy 2D array and xgboost binary buffer file. The data will be store in ```DMatrix``` object.
-* To load libsvm text format file and XGBoost binary file into ```DMatrix```, the usage is like
+* To load a libsvm text file or a XGBoost binary file into ```DMatrix```, the command is:
 ```python
 dtrain = xgb.DMatrix('train.svm.txt')
 dtest = xgb.DMatrix('test.svm.buffer')
 ```
-* To load numpy array into ```DMatrix```, the usage is like
+* To load a numpy array into ```DMatrix```, the command is:
 ```python
 data = np.random.rand(5,10) # 5 entities, each contains 10 features
 label = np.random.randint(2, size=5) # binary target
 dtrain = xgb.DMatrix( data, label=label)
 ```
-* Build ```DMatrix``` from ```scipy.sparse```
+* To load a scpiy.sparse array into ```DMatrix```, the command is:
 ```python
 csr = scipy.sparse.csr_matrix((dat, (row, col)))
 dtrain = xgb.DMatrix(csr)
 ```
-* Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time. The usage is like:
+* Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time:
 ```python
 dtrain = xgb.DMatrix('train.svm.txt')
 dtrain.save_binary("train.buffer")
 ```
-* To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` like:
+* To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` by specifying missing values:
 ```python
 dtrain = xgb.DMatrix(data, label=label, missing = -999.0)
 ```
-* Weight can be set when needed, like
+* Weight can be set when needed:
 ```python
 w = np.random.rand(5, 1)
 dtrain = xgb.DMatrix(data, label=label, missing = -999.0, weight=w)
@@ -150,4 +150,4 @@ When you use ``IPython``, you can use ``to_graphviz`` function which converts th
 ```python
 xgb.to_graphviz(bst, num_trees=2)
 ```
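
Note on the Data Interface hunk: the doc mentions the libsvm text format but never shows what such a file looks like. The following is a minimal sketch in plain Python that writes one, with each line in the form `<label> <index>:<value> ...`. The helper names `to_libsvm_line` and `write_libsvm`, the zero-based feature indices, and the reuse of `-999.0` as the missing-value marker are assumptions for illustration, not part of this commit; check which index base your XGBoost version expects.

```python
import os
import tempfile


def to_libsvm_line(label, features, missing=-999.0):
    """Format one row as '<label> <index>:<value> ...'.

    Entries equal to `missing` are skipped, since the sparse format
    represents an absent value by omitting its index (assumed 0-based here).
    """
    pairs = [f"{i}:{v}" for i, v in enumerate(features) if v != missing]
    return " ".join([str(label)] + pairs)


def write_libsvm(path, rows, labels, missing=-999.0):
    """Write one libsvm-style line per (label, row) pair to `path`."""
    with open(path, "w") as fh:
        for label, features in zip(labels, rows):
            fh.write(to_libsvm_line(label, features, missing) + "\n")


# Two rows with three features each; -999.0 marks a missing entry.
rows = [[0.5, -999.0, 1.25], [-999.0, 2.0, 0.0]]
labels = [1, 0]
path = os.path.join(tempfile.mkdtemp(), "train.svm.txt")
write_libsvm(path, rows, labels)
```

A file written this way could then be passed to `xgb.DMatrix(path)`, as in the first bullet of the hunk.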