Unable to generate predictions with a weighted multi-class dataset in XGBoost

Problem description

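I have an imbalanced dataset with 4 classes that I am trying to classify with XGBoost. So far, training seems to go well: I train the model with dMatrix = xgb.DMatrix(X,y,weight = weights), using weights defined by class frequency, which I believe keeps the classifier from being biased by the imbalance. The only problem is that I am struggling to generate predictions from the trained model. It seems that my input still needs weights, and I don't quite understand why.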

Current implementation

X = n samples * k features
y = n classifications

Getting the weights

import copy
import xgboost as xgb

# class weights: the rarest class gets 1.0, more frequent classes get less
percents = y.value_counts(normalize=True)
classWeights = percents.min() / percents

classWeights:

1    0.065769
0    0.080583
3    1.000000
2    1.000000
dtype: float64

Creating the element-wise weight vector

#copying the class vector
weights = copy.deepcopy(y)

#mapping the weights onto each class
weights = weights.replace(classWeights.to_dict())

weights:

0       0.080583
1       0.080583
2       0.080583
3       0.080583
4       0.080583
          ...   
2453    0.065769
2454    0.065769
2455    0.065769
2456    0.065769
2457    0.065769
Length: 2458, dtype: float64
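
As a possible alternative to copying y and calling replace, the per-sample weights can be looked up directly with Series.map. The sketch below uses a small, made-up label vector in place of the real y (the toy data and the snake_case names are assumptions, not part of the original code):

```python
import pandas as pd

# Hypothetical toy labels standing in for the real y
y = pd.Series([0, 0, 0, 0, 1, 1, 2, 3])

# Same class-weight formula as above: the rarest class gets weight 1.0
percents = y.value_counts(normalize=True)
class_weights = percents.min() / percents

# map() looks up each label in class_weights, which is indexed by class label,
# and returns one weight per sample, aligned with y
weights = y.map(class_weights)
print(weights.tolist())
```

This avoids the deep copy and keeps the label vector itself untouched.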

Training on X, y and the weights

#defining the input data for training
dMatrix = xgb.DMatrix(X, label=y, weight=weights)

param = {'max_depth': 4,'objective': 'multi:softprob','num_class': len(classWeights)}
param['nthread'] = 4
param['eval_metric'] = 'auc'

num_round = 10
bst = xgb.train(param,dMatrix,num_round)

The problem

This all works fine, but I can't figure out how to get predictions. For example, I tried

print(bst.predict(xgb.DMatrix(X.iloc[0])))

and got the following error:

ValueError: ('Expecting 2 dimensional numpy.ndarray,got: ',(486,))

Going on intuition, when I pass in my training matrix instead,

print(bst.predict(dMatrix))

I do seem to get some results

[[0.6696256  0.10667633 0.18783426 0.0358638 ]
 [0.6733066  0.11801444 0.15131903 0.05735996]
 [0.66467386 0.14872473 0.14806929 0.03853212]
 ...
 [0.36380327 0.4005142  0.06459591 0.17108661]
 [0.12315215 0.60588646 0.0351514  0.23581001]
 [0.19565345 0.7347611  0.03038597 0.03919945]]

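although

a) I'm not sure why I need to include the training weights when predicting

b) this DMatrix contains y, which seems strange to me, since the whole point is to predict y

c) I'm not even entirely sure how to interpret the output.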
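
On point c): with the multi:softprob objective, each output row holds one probability per class, and column j corresponds to class label j. A small sketch using three of the rows printed above (only NumPy is needed; the variable names are assumptions) shows how to turn the probabilities into predicted labels:

```python
import numpy as np

# Three rows copied from the bst.predict(dMatrix) output shown above
probs = np.array([
    [0.6696256, 0.10667633, 0.18783426, 0.0358638],
    [0.36380327, 0.4005142, 0.06459591, 0.17108661],
    [0.19565345, 0.7347611, 0.03038597, 0.03919945],
])

# Each row is a probability distribution over the 4 classes
row_sums = probs.sum(axis=1)

# The predicted class is the column with the highest probability
predicted = probs.argmax(axis=1)
print(predicted)
```

Using the objective multi:softmax instead would make predict return the class labels directly, at the cost of losing the per-class probabilities.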

Question

How do I train on a skewed dataset using weights, and how do I then use that model to make predictions?

Update

If I don't use weights, I can get it up and running quite easily:

from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X,y)
for val in model.predict(X):
    print(val)