How do I use XGBoost with Dask?
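I am trying to use Dask on the Kaggle credit card fraud detection classification problem. However, when I build the model, it predicts every value as 1.
This really surprises me: the test data contains roughly 56,000 zeros and only 92 ones, yet the model somehow predicts all values as 1.
I am obviously doing something wrong. How do I use the model correctly?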
MWE
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask_ml
from dask_ml.xgboost import XGBClassifier
import collections
from dask_ml.model_selection import train_test_split
from dask.distributed import Client
# set up cluster
client = Client(n_workers=4)
# load the data
ifile = "https://github.com/vermaji333/MLProject/blob/master/creditcard.zip?raw=true"
#!wget https://github.com/vermaji333/MLProject/blob/master/creditcard.zip?raw=true
#ifile = 'creditcard.zip'
ddf = dd.read_csv(ifile, compression='zip', blocksize=None, assume_missing=True)
# train-test split
target = 'Class'
Xtr, Xtx, ytr, ytx = train_test_split(
    ddf.drop(target, axis=1), ddf[target],
    test_size=0.2, random_state=100, shuffle=True
)
# modelling
model = XGBClassifier(
    n_jobs=-1,
    scale_pos_weight=1,  # default
    objective='binary:logistic',
)
model.fit(Xtr,ytr)
ypreds = model.predict(Xtx)
ytx = ytx.compute()
ypreds = ypreds.compute()
# model evaluation
print(collections.Counter(ytx))     # Counter({0.0: 56607, 1.0: 92})
print(collections.Counter(ypreds))  # this gives all 1's
Update
I tried various values of scale_pos_weight.
collections.Counter(ytr)
Counter({0.0: 227708, 1.0: 400})
scale_pos_weight = 227708/400
scale_pos_weight = 400/227708
scale_pos_weight = other values
But for every value I tried, all predictions were still 1:
print(collections.Counter(ytx))     # Counter({0.0: 56607, 1.0: 92})
print(collections.Counter(ypreds))  # this gives all 1's
Counter({0.0: 56607,1.0: 92})
Counter({1: 56699})
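To put numbers on the output above, here is a minimal sketch of the arithmetic, using only the class counts reported in this post. The helper names (`neg_pos_ratio`, `all_ones_metrics`) are made up for illustration and are not part of dask-ml or XGBoost. It shows the negative-to-positive ratio that is commonly used as a starting value for scale_pos_weight, and the accuracy, precision and recall implied by predicting 1 for every test row.

```python
from collections import Counter

# Class counts copied from the post (training split and test split).
train_counts = Counter({0.0: 227708, 1.0: 400})
test_counts = Counter({0.0: 56607, 1.0: 92})


def neg_pos_ratio(counts):
    """Hypothetical helper: n_negative / n_positive for a binary target."""
    return counts[0.0] / counts[1.0]


def all_ones_metrics(counts):
    """Hypothetical helper: metrics when every row is predicted as class 1."""
    tp = counts[1.0]   # every true positive is caught
    fp = counts[0.0]   # every negative becomes a false positive
    total = tp + fp
    return {
        "accuracy": tp / total,
        "precision": tp / total,  # tp / (tp + fp)
        "recall": 1.0,            # no positives are missed
    }


ratio = neg_pos_ratio(train_counts)
metrics = all_ones_metrics(test_counts)
print(round(ratio, 2))  # 569.27
print(metrics)
```

An accuracy this close to zero is far worse than a constant "predict 0" baseline would give on the same counts, which suggests the problem is not just the class imbalance.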