The functions featureValueOfSparseVector or featureValueOfDenseVector could return a Float.NaN if the input vectore was containing any missing values. This would make fail the partition key computation and most of the vectors would end up in the same partition. We fix this by avoid returning a NaN and simply use the row HashCode in this case. We added a test to ensure that the repartition is indeed now uniform on input dataset containing values by checking that the partitions size variance is below a certain threshold. Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
eXtreme Gradient Boosting
Community | Documentation | Resources | Contributors | Release Notes
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples.
License
© Contributors, 2019. Licensed under an Apache-2 license.
Contribute to XGBoost
XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. Checkout the Community Page.
Reference
- Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
- XGBoost originates from research project at University of Washington.
Sponsors
Become a sponsor and get a logo here. See details at Sponsoring the XGBoost Project. The funds are used to defray the cost of continuous integration and testing infrastructure (https://xgboost-ci.net).
Open Source Collective sponsors
Sponsors
Backers
Other sponsors
The sponsors in this list are donating cloud hours in lieu of cash donation.

