When using xgboost4j-spark, I had executors getting killed by YARN for
overrunning their memory limits much more often than I would expect,
based on the memoryOverhead provided. It looks like a significant
amount of this is because DMatrix instances were being created but not
released promptly: they were only released when the GC decided it was
time to clean up the references.

Rather than waiting for the GC, release the DMatrix instances when we know
they are no longer necessary.
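
The idea of releasing eagerly can be sketched as follows. This is a minimal illustration, not the code from this change: `NativeMatrix` and `ReleaseEagerly.withMatrix` are hypothetical names, and `NativeMatrix` is only a stand-in for a wrapper that owns native memory.

```scala
// Hypothetical stand-in for a wrapper around natively allocated memory,
// such as a DMatrix. It is NOT the xgboost4j class.
final class NativeMatrix(val rows: Int) {
  private var released = false

  def isReleased: Boolean = released

  // Frees the native memory; calling it more than once is harmless here.
  def delete(): Unit = {
    released = true
  }
}

object ReleaseEagerly {
  // Runs `body` and releases the matrix as soon as it returns or throws,
  // instead of waiting for the garbage collector to reclaim the wrapper.
  def withMatrix[T](matrix: NativeMatrix)(body: NativeMatrix => T): T =
    try body(matrix)
    finally matrix.delete()

  def main(args: Array[String]): Unit = {
    val matrix = new NativeMatrix(rows = 100)
    val rowCount = withMatrix(matrix)(_.rows)
    println(s"rows=$rowCount released=${matrix.isReleased}")
  }
}
```

The `try`/`finally` form matters because the wrapper object on the JVM heap is small, so the GC may see little reason to collect it even while the native allocation behind it is large.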
* add back train method but mark as deprecated
* fix scalastyle error
* fix the persistence of XGBoostEstimator
* test persistence of a complete pipeline
* fix compilation issue
* do not allow persist custom_eval and custom_obj
* fix the failed test
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* small fix for cleanExternalCache
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* fix several issues in tests
* [jvm-packages] call setGroup for ranking task
* passing groupData through xgBoostConfMap
* fix original comment position
* make groupData param
* remove groupData variable, use xgBoostConfMap directly
* set default groupData value
* add use groupData tests
* reduce rank-demo size
* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex
* add DF use groupData test
* remove unused variable
* add back train method but mark as deprecated
* fix scalastyle error
* first commit in scala binding for fast histo
* java test
* add missed scala tests
* spark training
* add back train method but mark as deprecated
* fix scalastyle error
* local change
* first commit in scala binding for fast histo
* local change
* fix df frame test
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* bump spark version to 2.1
* preserve num_class issues
* fix failed test cases
* revising
* add multi class test
* [jvm-packages] Scala implementation of the Rabit tracker.
A Scala implementation of RabitTracker that is interface-interchangeable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).
* [jvm-packages] Updated Akka dependency in pom.xml.
* Refactored the RabitTracker directory structure.
* Fixed premature stopping of connection handler.
Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
* Added interface IRabitTracker so that user can switch implementations.
* Default timeout duration changes.
* Dependency for Akka tests.
* Removed the main function of RabitTracker.
* A skeleton for testing Akka-based Rabit tracker.
* waitFor() in RabitTracker no longer throws exceptions.
* Completed unit test for the 'start' command of Rabit tracker.
* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)
* Fixed the default timeout duration.
* Use Java container to avoid serialization issues due to intermediate wrappers.
* Added tests for Allreduce/model training using Scala Rabit tracker.
* Added spill-over unit test for the Scala Rabit tracker.
* Fixed a typo.
* Overhaul of RabitTracker interface per code review.
- Removed the no-argument methods start() and waitFor() from IRabitTracker.
- The timeout in start(timeout) is now the worker connection timeout, as a TCP
socket binding timeout is less intuitive.
- Dropped the time unit from the start(...) and waitFor(...) methods; the default
time unit is milliseconds.
- Moved random port number generation into the RabitTrackerHandler.
- Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.
* More code refactoring and comments.
* Unified timeout constants. Readable tracker status code.
* Add comments to indicate that allReduce is for tests only. Removed all other variants.
* Removed unused imports.
* Simplified signatures of training methods.
- Moved TrackerConf into parameter map.
- Changed GeneralParams so that TrackerConf becomes a standalone parameter.
- Updated test cases accordingly.
* Changed monitoring strategies.
* Reverted monitoring changes.
* Update test case for Rabit AllReduce.
* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.
* More comprehensive test cases for exception handling and worker connection timeout.
* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt the tracker in such a case.
* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.
* Reverted scalastyle-config changes.
* Visibility scope change. Interface tweaks.
* Use match pattern to handle tracker_conf parameter.
* Minor clarification in JNI code.
* Clearer intent in match pattern to suppress warnings.
* Removed Future from constructor. Block in start() and waitFor() instead.
* Revert inadvertent comment changes.
* Removed debugging information.
* Updated test cases that are a bit finicky.
* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
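
The tracker interface described in the items above (millisecond timeouts, no exceptions from waitFor, an uncaught-exception hook so the tracker does not hang) might look roughly like the sketch below. `TrackerLike`, `StubTracker`, and the integer status codes are hypothetical and are not the actual IRabitTracker definition.

```scala
// Hypothetical trait mirroring the shape described above; the names,
// return types and status codes are assumptions for illustration.
trait TrackerLike extends Thread.UncaughtExceptionHandler {
  // Waits up to the given number of milliseconds for workers to connect.
  def start(workerConnectionTimeoutMs: Long): Boolean

  // Waits up to the given number of milliseconds for the job to finish and
  // returns a status code instead of throwing.
  def waitFor(timeoutMs: Long): Int
}

// Toy implementation that only records whether a worker thread failed.
final class StubTracker extends TrackerLike {
  @volatile private var failed = false

  def start(workerConnectionTimeoutMs: Long): Boolean =
    workerConnectionTimeoutMs >= 0

  // 0 means success and 1 means failure in this sketch.
  def waitFor(timeoutMs: Long): Int = if (failed) 1 else 0

  // Marks the tracker as failed so that waitFor can return a failure code
  // rather than block forever.
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    failed = true
  }
}
```

A driver would register the tracker as the uncaught-exception handler of its worker threads, so a crash in a worker surfaces as a failure status from waitFor.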
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* update methods in test cases to be consistent
* add blank lines
* fix
* add back train method but mark as deprecated
* fix scalastyle error
* change class to object in examples
* fix compilation error
* fix misconfiguration
ml.dmlc.xgboost4j.scala.spark.XGBoost.scala:51
`values` is empty the first time we reach it, so `values(0)` throws an IndexOutOfBoundsException.
It should be `dVector.values(i)` instead of `values(i)`.
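
A hypothetical reconstruction of this kind of bug, for illustration only: `DenseVec` stands in for a dense vector type, and the non-zero filtering loop is an assumption, not the code at the line referenced above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a dense vector; NOT the Spark class.
final case class DenseVec(values: Array[Double])

object CopyNonZero {
  // Collects the indices and values of the non-zero entries of `dVector`.
  def nonZero(dVector: DenseVec): (Array[Int], Array[Double]) = {
    val indices = ArrayBuffer.empty[Int]
    val values = ArrayBuffer.empty[Double] // output buffer, empty at first
    var i = 0
    while (i < dVector.values.length) {
      // Read from the source vector. Writing `values(i)` here would index
      // the still-empty output buffer and throw IndexOutOfBoundsException.
      val v = dVector.values(i)
      if (v != 0.0) {
        indices += i
        values += v
      }
      i += 1
    }
    (indices.toArray, values.toArray)
  }
}
```

The local buffer shadows the name of the vector's field, which is what makes the wrong form easy to write and hard to spot.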
* bump up to scala 2.11
* framework of data frame integration
* test consistency between RDD and DataFrame
* order preservation
* test order preservation
* example code and fix makefile
* improve type checking
* improve APIs
* user docs
* work around travis CI's limitation on log length
* adjust test structure
* integrate with Spark 1.x
* spark 2.x integration
* remove spark 1.x implementation but provide instructions on how to downgrade
* test consistency of prediction functions between DMatrix and RDD
* remove APIs with DMatrix from xgboost-spark
* fix compilation error in xgboost4j-example
* fix test cases
* create dmatrix with specified missing value
* update dmlc-core
* support for predict method in spark package
repartitioning
work around
* add more elements to work around training set empty partition issue