How to retrieve the cross-validation AUC for each holdout fold in H2O AutoML
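I am training a binary classification model with H2O AutoML using the default cross-validation (nfolds=5). I need the AUC score for each holdout fold so that I can compute the variability.
This is the code I am using: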
import h2o
from h2o.automl import H2OAutoML
h2o.init()
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# convert columns to factors
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()
# set the predictor and response columns
predictors = ["AGE","RACE","VOL","GLEASON"]
response_col = "CAPSULE"
# split into train and testing sets
train,test = prostate.split_frame(ratios = [0.8],seed = 1234)
aml = H2OAutoML(seed=1,max_runtime_secs=100,exclude_algos=["DeepLearning","GLM"],nfolds=5,keep_cross_validation_predictions=True)
aml.train(predictors,response_col,training_frame=prostate)
leader = aml.leader
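I check that leader is not a StackedEnsemble model (for which the validation metrics are not available). Even so, I am not able to retrieve the five AUC scores.
Any ideas on how to do this?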
Answer
Here is how it's done:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# import prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
# convert columns to factors
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()
# set the predictor and response columns
predictors = ["AGE","RACE","VOL","GLEASON"]
response_col = "CAPSULE"
# split into train and testing sets
train,test = prostate.split_frame(ratios = [0.8],seed = 1234)
# run AutoML for 100 seconds
aml = H2OAutoML(seed=1,max_runtime_secs=100,exclude_algos=["DeepLearning","GLM"],nfolds=5,keep_cross_validation_predictions=True)
aml.train(x=predictors,y=response_col,training_frame=prostate)
# Get the leader model
leader = aml.leader
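A note about cross-validated AUC: H2O currently stores two computations of the CV AUC. One is an aggregated version (the AUC computed on the pooled CV holdout predictions), and the other is the "true" definition of cross-validated AUC (the average of the k AUCs from k-fold cross-validation). The latter is stored in an object that also contains the individual fold AUCs, as well as the standard deviation across the folds.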
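If you're wondering why we do this, there are some historical and technical reasons why we have two versions, and there is an open ticket to report only one of them.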
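To make the difference between the two definitions concrete, here is a minimal pure-Python sketch. It does not call H2O; the two folds and their scores are made-up numbers, and the auc helper is a simple pairwise implementation written for this illustration, not H2O's own computation.

```python
# Toy illustration (made-up numbers, not H2O output):
# pooled CV AUC vs. the mean of the per-fold AUCs.

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Holdout predictions from two hypothetical folds: (true labels, predicted scores).
folds = [
    ([0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4]),
    ([0, 0, 1, 1], [0.5, 0.6, 0.7, 0.8]),
]

# "True" cross-validated AUC: score each fold separately, then average.
fold_aucs = [auc(y, s) for y, s in folds]
mean_of_folds = sum(fold_aucs) / len(fold_aucs)

# Aggregated version: pool all holdout predictions, then compute one AUC.
all_labels = [y for labels, _ in folds for y in labels]
all_scores = [s for _, scores in folds for s in scores]
pooled = auc(all_labels, all_scores)

print("per-fold AUCs:", fold_aucs)
print("mean of fold AUCs:", mean_of_folds)
print("pooled AUC:", pooled)
```

In this toy case each fold ranks its own holdout rows perfectly, so the mean of the fold AUCs is 1.0, but the two fold models score on different scales, so the pooled AUC drops to 0.75. Real data will usually show a much smaller gap.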
The first (the aggregated version) is what you get when you run the following; it is also what appears on the AutoML leaderboard.
# print CV AUC for leader model
print(leader.model_performance(xval=True).auc())
If you want the fold-wise AUCs, so that you can compute or view their mean and variability (standard deviation), you can find them here:
# print CV metrics summary
leader.cross_validation_metrics_summary()
Output:
Cross-Validation Metrics Summary:
             mean        sd           cv_1_valid    cv_2_valid    cv_3_valid    cv_4_valid    cv_5_valid
-----------  ----------  -----------  ------------  ------------  ------------  ------------  ------------
accuracy     0.71842104  0.06419111   0.7631579     0.6447368     0.7368421     0.7894737     0.65789473
auc          0.7767409   0.053587236  0.8206676     0.70905924    0.7982079     0.82538515    0.7303846
aucpr        0.6907578   0.0834025    0.78737605    0.7141305     0.7147677     0.67790955    0.55960524
err          0.28157896  0.06419111   0.23684211    0.35526314    0.2631579     0.21052632    0.34210527
err_count    21.4        4.8785243    18.0          27.0          20.0          16.0          26.0
---          ---         ---          ---           ---           ---           ---           ---
precision    0.61751753  0.08747421   0.675         0.5714286     0.61702126    0.7241379     0.5
r2           0.20118153  0.10781976   0.3014902     0.09386432    0.25050205    0.28393403    0.07611712
recall       0.84506994  0.08513061   0.84375       0.9142857     0.9354839     0.7241379     0.8076923
rmse         0.435928    0.028099842  0.41264254    0.47447023    0.42546       0.41106534    0.4560018
specificity  0.62579334  0.15424488   0.70454544    0.41463414    0.6           0.82978725    0.58
See the whole table with table.as_data_frame()
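If you need the five fold AUCs as plain numbers rather than as a printed table, one option is to read them out of the summary's header and rows. The sketch below works on plain Python lists so it runs without an H2O cluster; the values are copied from the output above. The helper name fold_metric is hypothetical, and the assumption that the table object returned by cross_validation_metrics_summary() exposes col_header and cell_values in this shape should be checked against your H2O version.

```python
from statistics import mean, stdev


def fold_metric(col_header, cell_values, metric="auc"):
    """Return the per-fold values of one metric from a CV summary table.

    col_header  -- list of column names; fold columns are assumed to start with "cv_"
    cell_values -- list of rows; the first cell of each row is the metric name
    """
    fold_cols = [i for i, name in enumerate(col_header) if name.startswith("cv_")]
    for row in cell_values:
        if row[0] == metric:
            # float() also covers versions that return the cells as strings
            return [float(row[i]) for i in fold_cols]
    raise KeyError(f"metric {metric!r} not found in summary")


# Header and two rows copied from the printed summary above.
# With a live model this might come from (assumption, verify for your version):
#   table = leader.cross_validation_metrics_summary()
#   col_header, cell_values = table.col_header, table.cell_values
col_header = ["", "mean", "sd", "cv_1_valid", "cv_2_valid",
              "cv_3_valid", "cv_4_valid", "cv_5_valid"]
cell_values = [
    ["accuracy", 0.71842104, 0.06419111,
     0.7631579, 0.6447368, 0.7368421, 0.7894737, 0.65789473],
    ["auc", 0.7767409, 0.053587236,
     0.8206676, 0.70905924, 0.7982079, 0.82538515, 0.7303846],
]

fold_aucs = fold_metric(col_header, cell_values, "auc")
print("fold AUCs:", fold_aucs)
print("mean:", mean(fold_aucs))
print("sample sd:", stdev(fold_aucs))
```

The mean and sample standard deviation computed this way agree with the mean and sd columns of the auc row, which suggests the reported sd is the sample (n - 1) standard deviation across folds.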
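Here is what the leaderboard looks like (it stores the aggregated CV AUC). In this case, because the data is so small (380 rows), there is a noticeable difference between the two reported CV AUC values (0.7767 for the mean of the fold AUCs versus 0.7697 on the leaderboard); for larger datasets the two estimates should be closer.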
# print the whole Leaderboard (all CV metrics for all models)
lb = aml.leaderboard
print(lb)
This shows the top of the leaderboard:
model_id                                             auc       logloss    aucpr     mean_per_class_error    rmse      mse
---------------------------------------------------  --------  ---------  --------  ----------------------  --------  --------
XGBoost_grid__1_AutoML_20200924_200634_model_2       0.769716  0.565326   0.668827  0.290806                0.436652  0.190665
GBM_grid__1_AutoML_20200924_200634_model_4           0.762993  0.56685    0.666984  0.279145                0.437634  0.191524
XGBoost_grid__1_AutoML_20200924_200634_model_9       0.762417  0.570041   0.645664  0.300121                0.440255  0.193824
GBM_grid__1_AutoML_20200924_200634_model_6           0.759912  0.572651   0.636713  0.30097                 0.440755  0.194265
StackedEnsemble_BestOfFamily_AutoML_20200924_200634  0.756486  0.574461   0.646087  0.294002                0.441413  0.194845
GBM_grid__1_AutoML_20200924_200634_model_7           0.754153  0.576821   0.641462  0.286041                0.442533  0.195836
XGBoost_1_AutoML_20200924_200634                     0.75411   0.584216   0.626074  0.289237                0.443911  0.197057
XGBoost_grid__1_AutoML_20200924_200634_model_3       0.753347  0.57999    0.629876  0.312056                0.4428    0.196072
GBM_grid__1_AutoML_20200924_200634_model_1           0.751706  0.577175   0.628564  0.273603                0.442751  0.196029
XGBoost_grid__1_AutoML_20200924_200634_model_8       0.749446  0.576686   0.610544  0.27844                 0.442314  0.195642
[28 rows x 7 columns]