A Simple XGBoost Worked Example

In this article we use the Pima Indians onset-of-diabetes dataset.

The dataset consists of 8 input variables describing each patient's medical details, plus one output variable indicating whether the patient developed diabetes within 5 years.

It is a good dataset for a first XGBoost model: all input variables are numeric, and the task is a simple binary classification problem.

Install XGBoost first:

pip3 install xgboost

1. Basic Usage

Import the required packages

import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load the data

data = pd.read_csv('diabetes.csv')
X = data.iloc[:,0:8]   # the first 8 columns are the input features
Y = data.iloc[:,8]     # the 9th column is the 0/1 outcome
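
As a quick sanity check that the file matches the description above (8 numeric features plus one 0/1 label), the frame can be inspected; a minimal sketch:

print(data.shape)                      # expect (768, 9): 768 patients, 8 features + 1 label
print(data.iloc[:, 8].value_counts())  # class balance of the outcome variable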

Split the data into training and test sets

seed = 5
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

Train the XGBoost model

xgboost ships ready-made classifier and regressor wrappers, so a model can be built directly with XGBClassifier.

XGBClassifier documentation

model = XGBClassifier()
model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

See the documentation for more details on XGBoost's parameters.

Evaluate the trained model on the held-out test set:

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 80.52%

2. Monitoring Performance During Training

XGBoost can evaluate the model on a test set while it trains and report the score at every boosting round:

model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
[0]    validation_0-logloss:0.660999
Will train until validation_0-logloss hasn't improved in 10 rounds.
[1]    validation_0-logloss:0.630583
[2]    validation_0-logloss:0.605378
...
...
[74]    validation_0-logloss:0.421685
Stopping. Best iteration:
[64]    validation_0-logloss:0.417813

The tail of the log shows the early-stopping point: the best iteration was round 64, with validation_0-logloss = 0.417813.
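
The same information is also available programmatically after fitting; a short sketch, assuming the older scikit-learn wrapper used above (where early stopping sets best_iteration, best_score, and best_ntree_limit on the model):

print(model.best_iteration)   # round with the lowest validation logloss, e.g. 64
print(model.best_score)       # e.g. 0.417813

# limit predictions to the trees built up to the best iteration
y_pred = model.predict(X_test, ntree_limit=model.best_ntree_limit)
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100.0))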

3. Plotting Feature Importance

Another advantage of gradient boosting is that a trained model can report the importance of each feature, which makes it possible to tell which variables are worth keeping and which can be dropped.

from xgboost import plot_importance
from matplotlib import pyplot
%matplotlib inline


# refit on the full dataset, then plot the importance scores
model.fit(X, Y)

plot_importance(model)
pyplot.show()

[Figure: feature-importance bar chart produced by plot_importance]
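
To act on these importances rather than just plot them, one option is scikit-learn's SelectFromModel, which keeps only the features whose importance meets a threshold; a sketch, where the 0.1 threshold is an arbitrary assumption:

from sklearn.feature_selection import SelectFromModel

# wrap the already-fitted model and drop low-importance columns
selection = SelectFromModel(model, threshold=0.1, prefit=True)
X_selected = selection.transform(X)
print(X.shape, '->', X_selected.shape)  # fewer columns remain after selection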

4. Hyperparameter Tuning

How should the model be tuned? Below are generally accepted best-practice ranges for three hyperparameters. Set them within these ranges first, plot learning curves (see the sketch after this list), and then adjust to find the best model:

  • learning_rate = 0.1 or smaller; the smaller it is, the more weak learners need to be added;
  • max_depth = 2~8;
  • subsample = 30%~80% of the training set;
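
A simple way to get those learning curves is to track both the training and the test split in eval_set and plot the recorded metric; a sketch reusing the variables and the pyplot import from above:

model = XGBClassifier(learning_rate=0.1)
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric="logloss", eval_set=eval_set, verbose=False)

# evals_result() stores one metric series per entry in eval_set
results = model.evals_result()
epochs = range(len(results['validation_0']['logloss']))
pyplot.plot(epochs, results['validation_0']['logloss'], label='train')
pyplot.plot(epochs, results['validation_1']['logloss'], label='test')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()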


Next, we use GridSearchCV, which makes tuning more convenient:

Hyperparameter combinations worth tuning include:

  1. The number and size of trees (n_estimators and max_depth).
  2. The learning rate and the number of trees (learning_rate and n_estimators).
  3. Row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel); a sketch for these appears at the end of this section.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2]
max_depth = [2, 3, 4, 5, 6]
param_grid = dict(learning_rate=learning_rate, max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: -0.474370 using {'learning_rate': 0.1, 'max_depth': 2}


We can also print the score for every parameter combination with the following code:

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("mean score: %f (std: %f) %r" % (mean, stdev, param))
mean score: -0.690191 (std: 0.000436) {'learning_rate': 0.0001, 'max_depth': 2}
mean score: -0.689811 (std: 0.000475) {'learning_rate': 0.0001, 'max_depth': 3}
...
...
mean score: -0.607988 (std: 0.098076) {'learning_rate': 0.2, 'max_depth': 5}
mean score: -0.647131 (std: 0.098951) {'learning_rate': 0.2, 'max_depth': 6}
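
Following the same pattern, the row and column subsampling rates from item 3 above can be searched as well; a sketch with illustrative candidate values:

model = XGBClassifier()
param_grid = dict(subsample=[0.3, 0.5, 0.8], colsample_bytree=[0.3, 0.5, 0.8])
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))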