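I am trying to tune the hyperparameters of a gradient boosting regressor.
First, considering only n_estimators, I used the staged_predict method to find the best n_estimators and got RMSE = 4.84.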
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # X and y are the feature matrix and target (not shown in the post).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    gbr_onehot = GradientBoostingRegressor(
        n_estimators=1000, learning_rate=0.1, random_state=214
    )
    gbr_onehot.fit(X_train, y_train)

    # Test-set MSE after each boosting stage.
    errors = [
        mean_squared_error(y_test, y_pred)
        for y_pred in gbr_onehot.staged_predict(X_test)
    ]
    # Note: np.argmin returns a zero-based index; errors[i] belongs to i + 1 trees.
    best_num_trees = np.argmin(errors)

    # Refit with the selected number of trees and evaluate on the test set.
    GBR_best_num_trees_onehot = GradientBoostingRegressor(
        n_estimators=best_num_trees, learning_rate=0.1, random_state=214
    )
    GBR_best_num_trees_onehot.fit(X_train, y_train)
    y_pred = GBR_best_num_trees_onehot.predict(X_test)

    print(best_num_trees)
    print(f'RMSE with label encoding (best_num_trees) = {np.sqrt(mean_squared_error(y_test, y_pred))}')

    >>> 596
    >>> RMSE with label encoding (best_num_trees) = 4.849497587420823
Alternatively, this time using GridSearchCV, I tuned n_estimators, learning_rate, and the max_depth of each tree.
First, I tuned n_estimators and learning_rate:
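    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import GridSearchCV

    def rmse(actual, predict):
        """Root mean squared error between actual and predicted values."""
        predict = np.array(predict)
        actual = np.array(actual)
        distance = predict - actual
        square_distance = distance ** 2
        mean_square_distance = square_distance.mean()
        score = np.sqrt(mean_square_distance)
        return score

    # greater_is_better=False makes the scorer return the negated RMSE.
    rmse_score = make_scorer(rmse, greater_is_better=False)

    p_test = {
        'learning_rate': [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
        'n_estimators': [100, 250, 500, 750, 1000, 1250, 1500, 1750],
    }

    tuning = GridSearchCV(
        estimator=GradientBoostingRegressor(
            max_depth=3, min_samples_split=2, min_samples_leaf=1,
            subsample=1.0, max_features='sqrt', random_state=214,
        ),
        param_grid=p_test,
        scoring=rmse_score,
        n_jobs=4,
        cv=5,  # iid=False dropped: the parameter was removed in scikit-learn 0.24
    )
    tuning.fit(X_train, y_train)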
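As a side note on the scorer defined above, here is a minimal sketch of what greater_is_better=False does. It assumes a toy dataset and a DummyRegressor, neither of which is part of the original post. make_scorer flips the sign of the metric so that GridSearchCV, which always maximizes the score, ends up minimizing RMSE.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def rmse(actual, predict):
    """Root mean squared error between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    return float(np.sqrt(np.mean((predict - actual) ** 2)))


# Same construction as in the question: lower RMSE is better.
rmse_score = make_scorer(rmse, greater_is_better=False)

# Toy data (assumption, for illustration only).
X_toy = np.zeros((4, 1))
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# A trivial model that always predicts the mean of y_toy.
toy_model = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

raw_rmse = rmse(y_toy, toy_model.predict(X_toy))   # plain, non-negative RMSE
scored_rmse = rmse_score(toy_model, X_toy, y_toy)  # what the scorer reports

print(f"raw RMSE    = {raw_rmse}")
print(f"scorer RMSE = {scored_rmse}")  # same magnitude, opposite sign
```

By the same convention, tuning.best_score_ is the negated mean cross-validated RMSE, so -tuning.best_score_ gives the RMSE itself.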
Then I used the values from tuning.best_params_ in a second grid search:
    p_test_2 = {'max_depth': [2, 3, 4, 5, 6, 7]}

    tuning = GridSearchCV(
        estimator=GradientBoostingRegressor(
            learning_rate=0.05, n_estimators=1000, min_samples_split=2,
            min_samples_leaf=1, max_features='sqrt', random_state=214,
        ),
        param_grid=p_test_2,
        scoring=rmse_score,
        n_jobs=4,
        cv=5,  # iid=False dropped: the parameter was removed in scikit-learn 0.24
    )
    tuning.fit(X_train, y_train)

This search is used to find the best max_depth.
After plugging in the parameters obtained above, I tested the model:
    from sklearn import metrics

    model = GradientBoostingRegressor(
        learning_rate=0.1,
        n_estimators=1000,
        min_samples_split=2,
        min_samples_leaf=1,
        max_features='sqrt',
        random_state=214,
        max_depth=3,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print(f'RMSE = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')

    >>> RMSE = 4.876534569535954
This is higher than the RMSE I got using staged_predict alone. Why is that? Also, when I print(tuning.best_score_), why does it return a negative value?
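It's simple. You found the best-fit parameters on the training data, but you are comparing RMSE on the test data. These are different datasets, so the same parameters will not score equally well on both. In your first approach, n_estimators was chosen by minimizing the error on the test set itself, so it naturally scores well there. If you compute RMSE on the training data, the regressor with the best-fit parameters should come out ahead.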
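The point can be illustrated with a small sketch. It assumes synthetic data from make_regression in place of the original X and y, and a validation split (X_val, y_val) carved out of the training data; neither appears in the question. Because the tree count picked on the test set is the minimum of the test-error curve, its test RMSE can never be worse than that of a count picked on other data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the original X, y are not available).
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
# Carve a validation set out of the training data only.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=0
)

gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, random_state=214
)
gbr.fit(X_fit, y_fit)

# RMSE after each boosting stage; entry i corresponds to i + 1 trees.
val_rmse = [
    np.sqrt(mean_squared_error(y_val, pred)) for pred in gbr.staged_predict(X_val)
]
test_rmse = [
    np.sqrt(mean_squared_error(y_test, pred)) for pred in gbr.staged_predict(X_test)
]

n_by_val = int(np.argmin(val_rmse)) + 1    # chosen without looking at the test set
n_by_test = int(np.argmin(test_rmse)) + 1  # chosen on the test set itself

rmse_honest = test_rmse[n_by_val - 1]
rmse_optimistic = test_rmse[n_by_test - 1]

print(f"trees picked on validation data: {n_by_val}, test RMSE = {rmse_honest}")
print(f"trees picked on test data:       {n_by_test}, test RMSE = {rmse_optimistic}")
```

The second number is an optimistic estimate, since the test set was used both to choose the tree count and to report the score.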
[Update]