Documenting CSV loading into DMatrix (#3137)

* Support CSV file in DMatrix

  We'd just need to expose the CSV parser in dmlc-core to the Python wrapper

* Revert extra code; document existing CSV support

  CSV support is already there but undocumented

* Add notice about categorical features
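
The categorical-features notice in this change points readers to one-hot encoding before the data reaches `DMatrix`. A minimal sketch of that preprocessing step is below. It uses only the standard library instead of the scikit-learn encoder linked in the docs, and the helper names `one_hot_encode_column` and `dmatrix_csv_uri` are hypothetical, not part of XGBoost. The URI shape (`?format=csv&label_column=0`) is the one documented in this commit.

```python
import csv
import io


def one_hot_encode_column(rows, col):
    """Replace categorical column `col` with one 0/1 column per category.

    Hypothetical helper for illustration; not an XGBoost API.
    """
    # Sort the categories so the output column order is deterministic.
    categories = sorted({row[col] for row in rows})
    encoded = []
    for row in rows:
        indicators = ["1" if row[col] == c else "0" for c in categories]
        encoded.append(row[:col] + indicators + row[col + 1:])
    return encoded, categories


def dmatrix_csv_uri(path, label_column=0):
    """Build the CSV URI string in the form documented by this commit."""
    return "{}?format=csv&label_column={}".format(path, label_column)


# Toy data: label in column 0, a categorical feature in column 1.
raw = "1,red,0.5\n0,blue,1.5\n1,red,2.0\n"
rows = list(csv.reader(io.StringIO(raw)))

encoded_rows, categories = one_hot_encode_column(rows, col=1)
uri = dmatrix_csv_uri("train_encoded.csv", label_column=0)
```

After writing `encoded_rows` to `train_encoded.csv` (for example with `csv.writer`), the resulting `uri` string would be what gets passed to `xgb.DMatrix(...)`, assuming the file contains only numeric values.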
2018-02-28 18:41:10 -08:00
commit 32ea70c1c9
parent d5992dd881
3 changed files with 17 additions and 6 deletions
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -78,6 +78,8 @@ master_doc = 'index'
 # Usually you set "language" from the command line for these cases.
 language = None

+autoclass_content = 'both'
+
 # There are two options for replacing |today|: either, you set today to some
 # non-false value, then it is used:
 #today = ''
--- a/doc/python/python_intro.md
+++ b/doc/python/python_intro.md
@@ -25,7 +25,9 @@ Data Interface
 --------------
 The XGBoost python module is able to load data from:
 - libsvm txt format file
- Numpy 2D array, and
+- comma-separated values (CSV) file
+- Numpy 2D array
+- Scipy 2D sparse array, and
 - xgboost binary buffer file.

 The data is stored in a ```DMatrix``` object.
@@ -35,6 +37,16 @@ The data is stored in a ```DMatrix``` object.
 dtrain = xgb.DMatrix('train.svm.txt')
 dtest = xgb.DMatrix('test.svm.buffer')
 ```
+* To load a CSV file into ```DMatrix```:
+```python
+# label_column specifies the index of the column containing the true label
+dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0')
+dtest = xgb.DMatrix('test.csv?format=csv&label_column=0')
+```
+(Note that XGBoost does not support categorical features; if your data contains
+categorical features, load it as a numpy array first and then perform
+[one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).)
+
 * To load a numpy array into ```DMatrix```:
 ```python
 data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
--- a/python-package/xgboost/core.py
+++ b/python-package/xgboost/core.py
@@ -235,8 +235,6 @@ class DMatrix(object):
                 feature_names=None, feature_types=None,
                 nthread=None):
        """
-        Data matrix used in XGBoost.
-
        Parameters
        ----------
        data : string/numpy array/scipy.sparse/pd.DataFrame
@@ -706,7 +704,7 @@ class DMatrix(object):


 class Booster(object):
-    """"A Booster of of XGBoost.
+    """A Booster of of XGBoost.

    Booster is the model of xgboost, that contains low level routines for
    training, prediction and evaluation.
@@ -716,8 +714,7 @@ class Booster(object):

    def __init__(self, params=None, cache=(), model_file=None):
        # pylint: disable=invalid-name
-        """Initialize the Booster.
-
+        """
        Parameters
        ----------
        params : dict