Model performance improvement tips
Created on: January 10, 2025
Triaging areas for improvement
I’m mapping out options to enhance the trading bot’s performance, such as refining data pre-processing, addressing class imbalance, tuning hyperparameters, and introducing advanced models like gradient boosting.
Refining threshold strategy
Considering threshold tuning to boost precision, I’m analyzing the high recall and low precision to potentially adjust the threshold from 0.35 to a range like 0.6 or 0.7.
Assessing current performance
I’m noting the low precision and high recall, suggesting the threshold might need adjustment. To improve, focus on feature engineering, data cleaning, and comprehensive hyperparameter tuning.
Advancing model strategy
I'm crafting an advanced approach, focusing on classifiers like XGBoost, time-series cross-validation, and feature selection. Considering label definition, class imbalance, horizon, and calibration for a sophisticated model progression.
Below are some suggestions and ideas you can try to improve model performance (i.e., achieve higher ROC-AUC, precision, and F1 while not sacrificing recall quite so drastically). Some may or may not apply exactly to your scenario, but in practice they often help boost predictive power for trading models.
Since you're working on a trading (time-series) problem, a standard `train_test_split(..., shuffle=False)` may not fully reflect the temporal dependencies in your data. Instead, consider using `TimeSeriesSplit` (or a custom walk-forward split) to respect the order of time:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)  # or however many splits you want

rf_grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=rf_param_grid,
    cv=tscv,
    scoring='roc_auc',
    n_jobs=-1
)
```
TimeSeriesSplit ensures each split trains on past data and tests on future data, preventing look-ahead bias and giving you a more realistic performance estimate.
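To see that ordering concretely, here is a tiny runnable demonstration on a toy series (the indices, not the data, are the point):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 10 "candles" to show how the splits are ordered in time.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every test index -- no look-ahead.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each fold trains on an expanding window of the past and tests on the next block of the future.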
Right now, you’re exploring only a small parameter grid:
```python
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 8],
    'min_samples_split': [2, 5, 10]
}
```
A few ways to widen the search: explore more `max_depth` values (3, 5, 7, 9, 11, None), try `min_samples_leaf`, or incorporate `max_features`. For larger search spaces, `RandomizedSearchCV` or libraries like optuna give more efficient hyperparameter searches. Example:

```python
rf_param_grid = {
    'n_estimators': [50, 100, 200, 300, 500],
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['sqrt', 'log2']
}
```
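A minimal sketch of the `RandomizedSearchCV` route (using a synthetic stand-in for your data — swap in your real `X_train`/`y_train` and grid):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for your feature matrix and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = (rng.random(300) > 0.5).astype(int)

# Distributions are sampled instead of exhaustively enumerated.
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11),
    'max_features': ['sqrt', 'log2'],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,                      # tries 20 random combinations, not the full grid
    cv=TimeSeriesSplit(n_splits=3),
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

The same `cv=TimeSeriesSplit(...)` idea from above carries over, so the random search also respects temporal order.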
You have `add_target(df_all, forecast_horizon=5)`, which presumably labels whether the price moves up/down in the next 5 candles. Sometimes the definition of an up/down move, or the threshold for a "significant" movement, can cause extreme class imbalance or lead to many false positives. Making your target less noisy (i.e., labeling "1" only if a certain price-change threshold is exceeded) often yields a better signal.
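A sketch of such a thresholded target (I'm assuming a `close` column and the `add_target` name from your code; the `min_move` parameter is my addition for illustration):

```python
import pandas as pd

def add_target(df, forecast_horizon=5, min_move=0.002):
    """Label 1 only when the forward return over `forecast_horizon` candles
    exceeds `min_move` (e.g. 0.2%), instead of labeling any up-tick."""
    future_return = df['close'].shift(-forecast_horizon) / df['close'] - 1.0
    df = df.copy()
    df['target'] = (future_return > min_move).astype(int)
    # Drop the tail rows whose future price is unknown.
    return df.iloc[:-forecast_horizon]

# Tiny demonstration with a made-up price series.
prices = pd.DataFrame({'close': [100, 100.1, 100.05, 100.4, 100.2,
                                 100.9, 101.5, 100.8, 101.0, 101.2]})
labelled = add_target(prices, forecast_horizon=3, min_move=0.005)
print(labelled['target'].tolist())
```

Raising `min_move` trades fewer positive labels (more imbalance) for a cleaner signal, so tune it together with your resampling strategy.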
In trading, good features often matter more than the exact choice of model. In addition to standard indicators, consider feature selection with `SelectKBest`, `RFE`, or Boruta on top of RF to prune unhelpful indicators.

You are already using SMOTE on the training set:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
```
Alternatives include setting `class_weight='balanced'` in RandomForest, or tempering the oversampling (SMOTE has a `sampling_strategy` parameter).

Your high recall (0.999) indicates you are predicting almost everything as positive at the chosen threshold (0.35). That kills precision. One approach:
Sometimes the best F1 threshold isn’t the best for actual trading. If you’d rather have fewer signals but more accurate ones, you might want to increase the threshold to push precision up.
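One way to make this concrete: sweep thresholds with `precision_recall_curve` and pick the lowest threshold that reaches a target precision (shown here on synthetic data — in practice, use your model's out-of-sample probabilities):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

# Synthetic imbalanced stand-in for your problem.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
split = 400  # chronological-style split: train on the first 400, test on the rest
clf = RandomForestClassifier(random_state=42).fit(X[:split], y[:split])
proba = clf.predict_proba(X[split:])[:, 1]

precision, recall, thresholds = precision_recall_curve(y[split:], proba)
# Pick the lowest threshold that reaches a target precision, rather than max-F1.
target_precision = 0.8
ok = precision[:-1] >= target_precision   # the last precision value has no threshold
chosen = thresholds[ok][0] if ok.any() else None
print("threshold for >=80% precision:", chosen)
```

Tune `target_precision` to your appetite for missed trades: a higher target means fewer, more confident signals.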
If you act on predicted probabilities, check whether your current calibration (e.g., `CalibratedClassifierCV` with sigmoid) is appropriate. Sometimes isotonic calibration works better, or an ensemble approach might do better than a single RF.

A simple classification metric like F1 or precision/recall is sometimes not enough for a trading system; you might find it useful to backtest the signals on held-out periods and evaluate trading-oriented metrics (e.g., cumulative P&L, Sharpe ratio, drawdown) rather than classification scores alone.
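A minimal sketch comparing raw versus sigmoid-calibrated probabilities via the Brier score (lower is better), on synthetic data standing in for yours:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

raw = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
# method='sigmoid' is Platt scaling; try method='isotonic' as well and compare.
cal = CalibratedClassifierCV(RandomForestClassifier(random_state=42),
                             method='sigmoid', cv=3).fit(X_tr, y_tr)

for name, model in [('raw', raw), ('calibrated', cal)]:
    print(name, brier_score_loss(y_te, model.predict_proba(X_te)[:, 1]))
```

Whichever method scores better on a proper scoring rule like Brier loss is usually the safer choice for probability-driven position sizing.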
By iterating on these points—especially focusing on the quality of your features, time-series cross-validation, and a balanced threshold for your trading goals—you should see improvements in ROC-AUC, precision, and a better overall F1 score (while likely reducing that near-100% recall, which is usually a sign of many false positives). Good luck!