This removes the need for a local histogram space during distributed training, which cuts the cache size by half.