GPU-Optimized GBDT with CubeCL: A Step Towards XGBoost | 25.01.2025

To work towards my own XGBoost implementation, I will start with a GPU-optimized Gradient Boosted Decision Tree (GBDT). This step will help me get a feel for CubeCL's reduce feature and understand how to extend it for efficient GPU operations.

Current Approach

The CubeCL reduce feature currently processes each GPU operation individually. For every primitive operation (+, *, argmax, ...), the CPU allocates memory on the GPU, sends the data over, launches the computation, retrieves the result, and then repeats the whole cycle for the next operation. This is suboptimal because every operation pays for its own memory allocation and host-device data transfer.
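
To make that round trip concrete, here is a minimal Rust sketch of the pattern. The `Gpu` type and its `alloc`, `upload`, and `download` methods are illustrative stand-ins simulated on the CPU, not the actual CubeCL API; the point is only that every reduce call pays for its own allocation and transfers.

```rust
struct Gpu;

impl Gpu {
    // Simulated device allocation: a real backend would allocate GPU memory here.
    fn alloc(&self, len: usize) -> Vec<f32> {
        vec![0.0; len]
    }
    // Simulated host-to-device transfer.
    fn upload(&self, host: &[f32], device: &mut [f32]) {
        device.copy_from_slice(host);
    }
    // Simulated device-to-host transfer.
    fn download(&self, device: &[f32]) -> Vec<f32> {
        device.to_vec()
    }
}

// One reduce call = one full allocate / upload / compute / download cycle.
fn reduce_sum(gpu: &Gpu, host_data: &[f32]) -> f32 {
    let mut device_buf = gpu.alloc(host_data.len()); // 1. allocate memory on the GPU
    gpu.upload(host_data, &mut device_buf);          // 2. send the data
    let partial = device_buf.iter().sum::<f32>();    // 3. run the reduction (simulated)
    gpu.download(&[partial])[0]                      // 4. retrieve the result
}

fn main() {
    let gpu = Gpu;
    let data = vec![1.0_f32, 2.0, 3.0, 4.0];
    let squared: Vec<f32> = data.iter().map(|x| x * x).collect();

    // Two chained operations mean two complete round trips,
    // which is exactly the overhead described above.
    let sum = reduce_sum(&gpu, &data);
    let sum_of_squares = reduce_sum(&gpu, &squared);
    println!("sum = {sum}, sum of squares = {sum_of_squares}");
}
```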

Goals

The primary focus is to implement a GBDT using CubeCL’s reduce feature in its current state. This will provide a deeper understanding of its behavior and limitations. Initially, the implementation will prioritize functionality over optimization to build familiarity with CubeCL-reduce for future development.
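
For reference, the loop the GBDT has to implement looks roughly like the following CPU-side sketch for squared-error loss. The `fit_stump` weak learner is a placeholder of my own for illustration; the real implementation will grow proper decision trees, and the per-node statistics (sums, argmax over split gains) are exactly the kind of work the reduce feature will be asked to do.

```rust
// Depth-0 "tree": predict the mean residual. A real tree would evaluate
// split gains per feature, which is where the reduce operations come in.
fn fit_stump(residuals: &[f32]) -> f32 {
    residuals.iter().sum::<f32>() / residuals.len() as f32
}

// The classic boosting loop for squared-error loss: the negative gradient
// is the residual, each round fits a weak learner to it, and the
// predictions are updated with a shrinkage (learning-rate) step.
fn train_gbdt(targets: &[f32], rounds: usize, learning_rate: f32) -> Vec<f32> {
    let mut predictions = vec![0.0_f32; targets.len()];
    let mut stages = Vec::with_capacity(rounds);
    for _ in 0..rounds {
        let residuals: Vec<f32> = targets
            .iter()
            .zip(&predictions)
            .map(|(y, p)| y - p)
            .collect();
        let leaf = fit_stump(&residuals);
        for p in predictions.iter_mut() {
            *p += learning_rate * leaf;
        }
        stages.push(leaf);
    }
    stages
}

fn main() {
    let targets = vec![3.0_f32, 3.5, 4.0, 4.5];
    let stages = train_gbdt(&targets, 50, 0.1);
    println!("trained {} boosting stages", stages.len());
}
```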

Next Steps

In the future, CubeCL-reduce will be extended to run multiple GPU operations in a single step, which removes the need to re-allocate memory for every operation. In addition, I will work on memory-management strategies that keep data resident on the GPU and minimize transfers between the CPU and GPU. These improvements will form the basis for an efficient GPU-based XGBoost algorithm.
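
As a rough illustration of the direction, the sketch below shows the intended shape of the fused path: upload once, run several operations on data that stays resident, and bring back a single batch of results. `DeviceBuffer` and its methods are hypothetical names simulated on the CPU, not CubeCL API.

```rust
// GPU-resident buffer stand-in: in the real implementation this would be
// device memory managed by CubeCL, not a host-side Vec.
struct DeviceBuffer {
    data: Vec<f32>,
}

impl DeviceBuffer {
    // One allocation and one host-to-device transfer for the whole batch of work.
    fn upload(host: &[f32]) -> Self {
        Self { data: host.to_vec() }
    }
    // Both reductions run against the already-resident buffer: no extra
    // allocation and no round trip back to the host in between.
    fn sum(&self) -> f32 {
        self.data.iter().sum()
    }
    fn max(&self) -> f32 {
        self.data.iter().fold(f32::NEG_INFINITY, |acc, &x| acc.max(x))
    }
}

fn main() {
    let host_data = vec![1.0_f32, 5.0, 2.0, 4.0];
    let buf = DeviceBuffer::upload(&host_data); // single transfer
    let (sum, max) = (buf.sum(), buf.max());    // several operations on resident data
    println!("sum = {sum}, max = {max}");       // one batch of results comes back
}
```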

What’s Next?

Next week, I plan to release the first version of the GBDT implementation using CubeCL reduce. While it will not yet be optimized, this version will provide valuable insights and set the stage for further enhancements.