[jvm-packages] Allow for bypassing spark missing value check (#4805)

* Allow for bypassing spark missing value check

* Update documentation for dealing with missing values in spark xgboost
cpfarrell
2019-12-18 10:48:20 -08:00
committed by Nan Zhu
parent 27b3646d29
commit bc9d88259f
6 changed files with 134 additions and 40 deletions


@@ -156,24 +156,9 @@ labels. A DataFrame like this (containing vector-represented features and numeri
Dealing with missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Strategies to handle missing values (and therefore overcome issues as above):
In the case that a feature column contains missing values for any reason (could be related to business logic / wrong data ingestion process / etc.), the user should decide on a strategy of how to handle it.
The choice of approach depends on the value representing 'missing', which falls into one of four categories:
1. 0
2. NaN
3. Null
4. Any other value which is not mentioned in (1) / (2) / (3)
We introduce the following approaches to dealing with missing values and the scenarios each one fits:
1. Skip VectorAssembler (using setHandleInvalid = "skip") directly. Used in (2), (3).
2. Keep it (using setHandleInvalid = "keep"), and set the "missing" parameter in XGBClassifier/XGBRegressor as the value representing missing. Used in (2) and (4).
3. Keep it (using setHandleInvalid = "keep") and transform to other irregular values. Used in (3).
4. Nothing to be done, used in (1).
Then, XGBoost will automatically learn what's the ideal direction to go when a value is missing, based on that value and strategy.
XGBoost supports missing values by default (`as described here <https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-value>`_).
If given a SparseVector, XGBoost will treat any values absent from the SparseVector as missing. You can also
tell XGBoost to treat a specific value in your Dataset as missing. By default XGBoost will treat NaN as the value representing missing.
Example of setting a missing value (e.g. -999) to the "missing" parameter in XGBoostClassifier:
@@ -190,11 +175,37 @@ Example of setting a missing value (e.g. -999) to the "missing" parameter in XGB
setFeaturesCol("features").
setLabelCol("classIndex")
.. note:: Using 0 to represent meaningful value
.. note:: Missing values with Spark's VectorAssembler
Because Spark's VectorAssembler transformer only accepts 0 as a missing value, a problem arises when the user has 0 as a meaningful value and there are enough 0's for a SparseVector to be used (if the dataset is represented by a DenseVector, however, the 0 is kept)
If given a Dataset with enough features having a value of 0, Spark's VectorAssembler transformer class will return a
SparseVector where the absent values are meant to indicate a value of 0. This conflicts with XGBoost's default to
treat values absent from the SparseVector as missing. The model would effectively be
treating 0 as missing without declaring it, which can lead to confusion when using the trained model on
other platforms. To avoid this, XGBoost will raise an exception if it receives a SparseVector and the "missing"
parameter has not been explicitly set to 0. To work around this issue the user has three options:
In this case, users are also supposed to transform 0 to some other values to avoid the issue.
1. Explicitly convert the Vector returned from VectorAssembler to a DenseVector to return the zeros to the dataset. If
doing this with missing values encoded as NaN, you will want to set ``setHandleInvalid = "keep"`` on VectorAssembler
in order to keep the NaN values in the dataset. You would then set the "missing" parameter to whatever you want to be
treated as missing. However, this can use a large amount of memory if your dataset is very sparse.
2. Before calling VectorAssembler you can transform the values you want to represent missing into an irregular value
that is not 0, NaN, or Null and set the "missing" parameter to 0. The irregular value should ideally be chosen to be
outside the range of values that your features have.
3. Do not use the VectorAssembler class and instead use a custom way of constructing a SparseVector that allows for
specifying sparsity to indicate a non-zero value. You can then set the "missing" parameter to whatever sparsity
indicates in your Dataset. If this approach is taken you can pass the parameter
``"allow_non_zero_for_missing_value" -> true`` to bypass XGBoost's assertion that "missing" must be zero when given a
SparseVector.
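As a rough sketch of option 3, assume the features for one row arrive as an array of doubles in which NaN marks a missing entry. A SparseVector can then be built so that an absent entry stands for NaN while zeros are stored explicitly. The helper name ``toSparseWithNaNAbsent`` is hypothetical, and the parameter key is copied from the text above, so check it against your XGBoost version.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical helper (not part of Spark or XGBoost): builds a SparseVector in
// which an absent entry stands for NaN (missing) rather than 0.
def toSparseWithNaNAbsent(values: Array[Double]): Vector = {
  // Keep every non-NaN entry, including explicit zeros, together with its index.
  val kept = values.zipWithIndex.filter { case (v, _) => !v.isNaN }
  Vectors.sparse(values.length, kept.map(_._2), kept.map(_._1))
}

// Parameters to pass to XGBoostClassifier / XGBoostRegressor. Sparsity now
// indicates NaN, so "missing" is NaN and the zero check is bypassed.
val xgbParam = Map(
  "missing" -> Float.NaN,
  "allow_non_zero_for_missing_value" -> true
)
```

The helper could be wrapped in a Spark UDF and applied to an array column in place of VectorAssembler. That wiring depends on how your features are stored, so it is left out here.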
Option 1 is recommended if memory constraints are not an issue. Option 3 requires more work to set up but is
guaranteed to give you correct results. Option 2 is quicker to set up, but it may be difficult to find a good
irregular value that does not conflict with your feature values.
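A minimal sketch of option 1 follows, assuming Spark 2.4 or later (where VectorAssembler has ``setHandleInvalid``) and missing values encoded as NaN. The function name ``assembleDense`` and the ``rawFeatures`` column are hypothetical; ``features`` and ``classIndex`` follow the example above.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: assembles the given columns and forces a dense result.
def assembleDense(df: DataFrame, featureCols: Array[String]): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("rawFeatures")
    .setHandleInvalid("keep") // keep rows that contain NaN instead of failing

  // Converting to a DenseVector restores the zeros that a SparseVector omits.
  val toDense = udf((v: Vector) => (v.toDense: Vector))

  assembler.transform(df)
    .withColumn("features", toDense(col("rawFeatures")))
    .drop("rawFeatures")
}

// NaN is declared as the missing value; zeros stay ordinary feature values.
val xgbClassifier = new XGBoostClassifier(Map("missing" -> Float.NaN))
  .setFeaturesCol("features")
  .setLabelCol("classIndex")
```

The dense conversion is what costs memory on very sparse data, which is the trade-off noted for this option.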
.. note:: Using a non-default missing value when using other bindings of XGBoost.
When an XGBoost model is saved in native format, only the booster itself is saved; the value of the missing parameter is not
saved alongside the model. Thus, if a non-default missing parameter is used to train the model in Spark, the user should
take care to use the same missing parameter when using the saved model in another binding.
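One way to guard against this, sketched under the assumption that the model is written to a local path, is to record the missing value in a small sidecar file next to the native model. The helper ``saveWithMissing`` and the ``.missing`` suffix are hypothetical conventions, not XGBoost features.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel

// Hypothetical helper: saves the native booster and writes the missing value
// used during training to "<modelPath>.missing" so other bindings can read it.
def saveWithMissing(
    model: XGBoostClassificationModel,
    modelPath: String,
    missing: Float): Unit = {
  // The native format stores only the booster, not the missing parameter.
  model.nativeBooster.saveModel(modelPath)
  Files.write(
    Paths.get(modelPath + ".missing"),
    missing.toString.getBytes(StandardCharsets.UTF_8))
}
```

Whoever loads the model in another binding would then read the sidecar file and pass its value as the missing parameter at prediction time.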
Training
========