Document refactor

change badge
2015-08-01 13:47:41 -07:00
parent c43fee541d
commit e8de5da3a5
20 changed files with 286 additions and 184 deletions
--- a/doc/python/python_intro.md
+++ b/doc/python/python_intro.md
@@ -1,32 +1,27 @@
-XGBoost Python Module
-=====================
+Python Package Introduction
+===========================
+This document gives a basic walkthrough of the xgboost Python package.

-This page will introduce XGBoost Python module, including:
-* [Building and Import](#building-and-import)
-* [Data Interface](#data-interface)
-* [Setting Parameters](#setting-parameters)
-* [Train Model](#training-model)
-* [Early Stopping](#early-stopping)
-* [Prediction](#prediction)
-* [API Reference](python_api.md)
+***List of Other Helpful Links***
+* [Python walkthrough code collections](https://github.com/tqchen/xgboost/blob/master/demo/guide-python)
+* [Python API Reference](python_api.rst)

-A [walk through python example](https://github.com/tqchen/xgboost/blob/master/demo/guide-python) for UCI Mushroom dataset is provided.
-
-=
-#### Install
-
-To install XGBoost, you need to run `make` in the root directory of the project and then in the `python-package` directory run
+Install XGBoost
+---------------
+To install XGBoost, follow these steps.

+* Run `make` in the root directory of the project
+* In the `python-package` directory, run
 ```shell
 python setup.py install
 ```
-Then import the module in Python as usual
+
 ```python
 import xgboost as xgb
 ```

-=
-#### Data Interface
+Data Interface
+--------------
 XGBoost python module is able to loading from libsvm txt format file, Numpy 2D array and xgboost binary buffer file. The data will be store in ```DMatrix``` object.

 * To load libsvm text format file and XGBoost binary file into ```DMatrix```, the usage is like
@@ -42,8 +37,8 @@ dtrain = xgb.DMatrix( data, label=label)
 ```
 * Build ```DMatrix``` from ```scipy.sparse```
 ```python
-csr = scipy.sparse.csr_matrix( (dat, (row,col)) )
-dtrain = xgb.DMatrix( csr )
+csr = scipy.sparse.csr_matrix((dat, (row, col)))
+dtrain = xgb.DMatrix(csr)
 ```
 * Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time. The usage is like:
 ```python
@@ -52,18 +47,17 @@ dtrain.save_binary("train.buffer")
 ```
 * To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` like:
 ```python
-dtrain = xgb.DMatrix( data, label=label, missing = -999.0)
+dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
 ```
 * Weight can be set when needed, like
 ```python
-w = np.random.rand(5,1)
-dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=w)
+w = np.random.rand(5, 1)
+dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)
 ```

-
-=
-#### Setting Parameters
-XGBoost use list of pair to save [parameters](parameter.md). Eg
+Setting Parameters
+------------------
+XGBoost uses a list of pairs to store [parameters](../parameter.md). For example:
 * Booster parameters
 ```python
 param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' }
@@ -77,8 +71,9 @@ plst += [('eval_metric', 'ams@0')]
 evallist  = [(dtest,'eval'), (dtrain,'train')]
 ```

-=
-#### Training Model
+Training
+--------
+
 With parameter list and data, you are able to train a model.
 * Training
 ```python
@@ -104,10 +99,11 @@ After you save your model, you can load model file at anytime by using
 bst = xgb.Booster({'nthread':4}) #init model
 bst.load_model("model.bin") # load data
 ```
-=
-#### Early stopping

-If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in `evals`. If there's more than one, it will use the last.
+Early Stopping
+--------------
+If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
+Early stopping requires at least one set in `evals`. If there's more than one, it will use the last.

 `train(..., evals=evals, early_stopping_rounds=10)`

@@ -117,13 +113,14 @@ If early stopping occurs, the model will have two additional fields: `bst.best_s

 This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC).

-=
-#### Prediction
+Prediction
+----------
 After you training/loading a model and preparing the data, you can start to do prediction.
 ```python
-data = np.random.rand(7,10) # 7 entities, each contains 10 features
-dtest = xgb.DMatrix( data, missing = -999.0 )
-ypred = bst.predict( xgmat )
+# 7 entities, each contains 10 features
+data = np.random.rand(7, 10)
+dtest = xgb.DMatrix(data)
+ypred = bst.predict(dtest)
 ```

 If early stopping is enabled during training, you can predict with the best iteration.