Documenting CSV loading into DMatrix (#3137)

* Support CSV file in DMatrix

  We'd just need to expose the CSV parser in dmlc-core to the Python wrapper

* Revert extra code; document existing CSV support

  CSV support is already there but undocumented

* Add notice about categorical features
2018-02-28 18:41:10 -08:00
parent d5992dd881
commit 32ea70c1c9
3 changed files with 17 additions and 6 deletions
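
The note this commit adds on categorical features suggests loading the data as a numpy array and one-hot encoding it before building a `DMatrix`. A minimal sketch of that workflow follows, using plain numpy rather than the scikit-learn encoder linked in the doc. The CSV content, the column layout (label first, then one numeric and one integer-coded categorical column), and all variable names are assumptions made for illustration and are not part of the commit.

```python
import io

import numpy as np

# Hypothetical CSV content: label, numeric feature, integer-coded categorical feature.
csv_text = "1,0.5,2\n0,1.5,0\n1,2.5,1\n0,3.5,2\n"

# Load the CSV as a dense numpy array instead of passing the file to DMatrix directly.
raw = np.loadtxt(io.StringIO(csv_text), delimiter=",")
labels = raw[:, 0]
numeric = raw[:, 1:2]
category = raw[:, 2].astype(int)

# One-hot encode: row k of the identity matrix is the indicator vector for category k.
n_categories = int(category.max()) + 1
one_hot = np.eye(n_categories)[category]

# Final feature matrix: numeric column followed by the indicator columns.
features = np.hstack([numeric, one_hot])

try:
    import xgboost as xgb

    # DMatrix accepts a numpy 2D array plus a label vector.
    dtrain = xgb.DMatrix(features, label=labels)
except ImportError:
    dtrain = None  # xgboost not installed; the encoding above still runs.
```

For purely numeric files, the URI form shown in the diff (`'train.csv?format=csv&label_column=0'`) avoids the intermediate array altogether.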
--- a/doc/python/python_intro.md
+++ b/doc/python/python_intro.md
@@ -25,7 +25,9 @@ Data Interface
 --------------
 The XGBoost python module is able to load data from:
 - libsvm txt format file
-- Numpy 2D array, and
+- comma-separated values (CSV) file
+- Numpy 2D array
+- Scipy 2D sparse array, and
 - xgboost binary buffer file.

 The data is stored in a ```DMatrix``` object.
@@ -35,6 +37,16 @@ The data is stored in a ```DMatrix``` object.
 dtrain = xgb.DMatrix('train.svm.txt')
 dtest = xgb.DMatrix('test.svm.buffer')
 ```
+* To load a CSV file into ```DMatrix```:
+```python
+# label_column specifies the index of the column containing the true label
+dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
+dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
+```
+(Note that XGBoost does not support categorical features; if your data contains
+categorical features, load it as a numpy array first and then perform
+[one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).)
+
 * To load a numpy array into ```DMatrix```:
 ```python
 data = np.random.rand(5, 10)  # 5 entities, each contains 10 features