Overview
Supervised learning models are widely used for feature ranking and classification in quantitative strategies. While deep neural networks receive a large share of academic attention, tree ensembles, and in particular gradient boosted decision trees (GBDT) as implemented in XGBoost, often match or outperform deep learning on structured tabular datasets such as tables of time-series technical indicators.
XGBoost fits sequentially: each new weak learner (a shallow decision tree) is trained to reduce the residual errors left by the trees before it. Its built-in L1/L2 regularization terms penalize large leaf weights, which makes the model more robust to the low signal-to-noise ratios typical of financial markets.
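For reference, the regularized objective can be written as below. The notation follows the XGBoost documentation rather than this article: $T$ is the number of leaves in a tree, $w_j$ the weight of leaf $j$, and $\gamma$, $\lambda$, $\alpha$ correspond to the `gamma`, `reg_lambda` and `reg_alpha` parameters.

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert

% With \alpha = 0, the optimal weight of leaf j is
w_j^{*} = -\frac{G_j}{H_j + \lambda}
% where G_j and H_j are the sums of first and second derivatives of the loss
% over the samples that fall in leaf j.
```

A larger $\lambda$ shrinks every leaf weight toward zero, which is the mechanism that keeps individual trees from reacting strongly to a handful of noisy samples.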
Feature Engineering & Label Generation
For an XGBoost classifier to have a realistic chance of forecasting stock movements, its inputs must capture several distinct dimensions of price and volume dynamics. We calculate rolling measures of volatility (Average True Range, ATR), sector relative strength ratios, moving average extensions, and volume thrust measures.
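A minimal sketch of how such features might be computed with pandas. The column names (`high`, `low`, `close`, `volume`), the window lengths, and the exact definitions are assumptions made for illustration; the article does not specify them. ATR is computed here as a simple rolling mean of true range rather than with Wilder's smoothing.

```python
import pandas as pd


def build_features(ohlcv, sector_close, window=14, ma_window=50):
    """Compute four illustrative features from daily OHLCV bars.

    ohlcv: DataFrame with 'high', 'low', 'close', 'volume' columns (assumed names).
    sector_close: Series of sector index closes on the same index (assumed input).
    """
    high, low, close, volume = (ohlcv[c] for c in ("high", "low", "close", "volume"))
    prev_close = close.shift(1)

    # True range: the largest of the three candidate ranges for each bar.
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    ).max(axis=1)

    out = pd.DataFrame(index=ohlcv.index)
    # Volatility: rolling mean of true range, expressed as a fraction of price.
    out["atr_pct"] = true_range.rolling(window).mean() / close
    # Sector relative strength: stock growth over the window divided by sector growth.
    out["rel_strength"] = (close / close.shift(window)) / (
        sector_close / sector_close.shift(window)
    )
    # Moving average extension: distance of price from its moving average.
    out["ma_ext"] = close / close.rolling(ma_window).mean() - 1.0
    # Volume thrust: today's volume relative to its recent average.
    out["vol_thrust"] = volume / volume.rolling(window).mean()
    return out
```

Every feature uses only the current and earlier bars, so the first rows of each column are NaN until the rolling window fills.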
The label generation process is designed as follows:
Label (Y) = 1 if the stock's maximum forward return over the next 20 trading days exceeds a target threshold (e.g., +15%) without the price dropping below a protective stop level. Otherwise, Y = 0. Because the label looks 20 days ahead, the most recent 20 rows of any dataset cannot be labeled.
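The labeling rule can be sketched as follows. Several details are assumptions, since the text leaves them open: the entry price is the day's close, the target is checked against later highs and the stop against later lows, the stop must not be breached before the target is reached ("first touch"), a bar that hits both counts as a stop, and the 7% stop distance is a hypothetical value.

```python
import numpy as np


def make_labels(close, high, low, horizon=20, target=0.15, stop=0.07):
    """Return 1.0 / 0.0 labels, with NaN where the forward window is incomplete."""
    close, high, low = (np.asarray(a, dtype=float) for a in (close, high, low))
    n = len(close)
    labels = np.full(n, np.nan)

    # Only rows with a full forward window of `horizon` bars get a label.
    for i in range(n - horizon):
        up = close[i] * (1.0 + target)    # profit target level
        down = close[i] * (1.0 - stop)    # protective stop level (hypothetical distance)
        labels[i] = 0.0
        for j in range(i + 1, i + horizon + 1):
            if low[j] < down:             # stop breached first: stays 0
                break
            if high[j] >= up:             # target reached before any stop breach
                labels[i] = 1.0
                break
    return labels
```

Rows left as NaN should be dropped before training rather than filled with 0, since their outcome is simply unknown.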
Curbing Data Leakage via Walk-Forward Validation
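Standard k-fold cross-validation is a common pitfall in financial modeling. Because asset prices and the rolling features built from them are highly autocorrelated, randomly partitioned folds put future observations into the training set, which produces lookahead bias (training on future data to predict past data). To prevent this data leakage, we enforce a strict out-of-time walk-forward validation framework in which each model is trained only on data that precedes its validation window.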
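# One fold of a walk-forward split: train on the past, validate on the year that follows.
# Assumes df has a sorted DatetimeIndex, features is a list of column names,
# and the 'target' column holds the labels.
train_start = "2020-01-01"
train_end = "2024-12-31"
val_start = "2025-01-01"
val_end = "2025-12-31"
# .loc slicing on a DatetimeIndex includes both endpoints.
X_train = df.loc[train_start:train_end, features]
y_train = df.loc[train_start:train_end, 'target']
X_val = df.loc[val_start:val_end, features]
y_val = df.loc[val_start:val_end, 'target']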
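The dates above describe a single train/validation split; a walk-forward scheme repeats that split while rolling the validation window forward. The sketch below generates expanding-window folds as date strings that can be passed to `df.loc`. The function name, the yearly step, and the embargo are assumptions, not part of the original setup. The embargo is included because a label that looks 20 trading days ahead would otherwise let late training rows depend on prices inside the validation year; 35 calendar days is only a rough stand-in for 20 trading days.

```python
from datetime import date, timedelta


def walk_forward_folds(first_year, last_year, min_train_years=3, embargo_days=35):
    """Return a list of (train_start, train_end, val_start, val_end) ISO date strings.

    Each fold trains on everything from first_year up to an embargo gap before
    the validation year, then validates on that full calendar year.
    """
    folds = []
    train_start = date(first_year, 1, 1)
    for val_year in range(first_year + min_train_years, last_year + 1):
        val_start = date(val_year, 1, 1)
        # Stop training early so forward-looking labels do not reach into validation.
        train_end = val_start - timedelta(days=embargo_days)
        folds.append((
            train_start.isoformat(),
            train_end.isoformat(),
            val_start.isoformat(),
            date(val_year, 12, 31).isoformat(),
        ))
    return folds


for fold in walk_forward_folds(2020, 2025):
    print(fold)
```

Each tuple can be unpacked into the four date variables used in the split above, and the model is refit from scratch for every fold.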
Gradient Boosting Implementation with Regularization
Regularization hyperparameters should be tuned, for example with genetic or Bayesian search, so that the trees do not memorize specific market cycles. Essential hyperparameter constraints include:
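- max_depth: Clamped between 3 and 5 so that individual trees cannot fit complex, noisy patterns.
- subsample & colsample_bytree: Set between 0.6 and 0.8 so that each tree is grown on a random subset of rows (subsample) and of feature columns (colsample_bytree).
- min_child_weight: Raised to high values (e.g., 10 to 30) so that every leaf is supported by a substantial number of samples. Strictly, the parameter bounds the sum of instance Hessians in a leaf, which for logistic loss is at most a quarter of the leaf's sample count.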
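These constraints could be expressed as the following configuration. This is a sketch, not the tuned setup: the specific values are picked from inside the ranges above, while `learning_rate`, `n_estimators`, `reg_alpha`, `reg_lambda` and the early-stopping settings are assumptions added to make the example complete. Passing `early_stopping_rounds` and `eval_metric` to the constructor assumes xgboost 1.6 or later.

```python
# Hypothetical starting point; a search procedure would vary these within the ranges.
PARAMS = {
    "objective": "binary:logistic",
    "max_depth": 4,                # within the 3 to 5 range
    "subsample": 0.7,              # fraction of rows sampled per tree
    "colsample_bytree": 0.7,       # fraction of feature columns sampled per tree
    "min_child_weight": 20,        # minimum Hessian sum per leaf
    "reg_alpha": 0.1,              # L1 penalty on leaf weights (assumed value)
    "reg_lambda": 1.0,             # L2 penalty on leaf weights (library default)
    "learning_rate": 0.05,         # assumed value
    "n_estimators": 500,           # upper bound; early stopping usually ends sooner
    "eval_metric": "logloss",
    "early_stopping_rounds": 50,   # assumed value
}


def fit_classifier(X_train, y_train, X_val, y_val, params=PARAMS):
    """Fit an XGBoost classifier and stop early on the validation set."""
    # Imported here so the configuration can be inspected without xgboost installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model
```

After fitting, `model.predict_proba(X_val)[:, 1]` gives the probability of the positive class for each row. Because the validation set also drives early stopping here, a final performance estimate would need a further untouched period.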
By locking down these boundaries, we aim to build classifiers that generalize better to out-of-sample data and produce probability scores that help filter out weak trading setups.