Do features have to be floats for multiclass classification with a decision tree?

I am trying to use a decision tree for named entity recognition (NER). My feature DataFrame and label DataFrame look like this:

X_train

  | bias | word.lower | word[-3:] | word.isupper | word.isdigit | POS | BOS   | EOS   |
0 | 1.0  | headache,  | HE,       | True         | False        | NNP | True  | False |
1 | 1.0  | mostly     | tly       | False        | False        | NNP | False | False |
2 | 1.0  | but        | BUT       | True         | False        | NNP | False | False |
...

y_train

  | OBI   |
0 | B-ADR |
1 | O     |
2 | O     |
...

When I run my code, it returns:

ValueError: could not convert string to float: 'headache,'

Is my data format correct (I am following this tutorial)? For multiclass classification with a decision tree, do the features have to be floats? If so, given that most, if not all, of the token features are strings or booleans, how should I go about OBI tagging?

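Solution

Yes, they must be numeric (though not necessarily floats). So if a column contains 4 distinct text labels, you need to convert them to 4 numbers. To do that, use sklearn's LabelEncoder. If your data is in a pandas DataFrame df: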

import pandas as pd  # used as pd in the one-hot step
from collections import defaultdict
from sklearn import preprocessing

# Select the text (object-dtype) columns
cat_cols = df.select_dtypes(include='object').columns

# Keep one LabelEncoder per categorical column in the dict d;
# defaultdict creates the encoder the first time a column name is looked up
d = defaultdict(preprocessing.LabelEncoder)

# Convert every text column to integer codes
df[cat_cols] = df[cat_cols].apply(lambda x: d[x.name].fit_transform(x.astype(str)))

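After converting all columns to numbers, you may also want to one-hot encode them, because label encoding assigns arbitrary integers that the tree would otherwise treat as ordered. Do this for the categorical and boolean columns (here I show it only for your categorical columns).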

# You should probably also one-hot encode the categorical columns
df = pd.get_dummies(df, columns=cat_cols)
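
To show how the encoded features then feed into a tree, here is a minimal sketch that one-hot encodes a toy frame modelled on the X_train and y_train from the question and fits a DecisionTreeClassifier on it. The toy values, the explicit column list, and the names X_encoded and clf are assumptions made for illustration; they are not taken from the original code.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data modelled on the question's X_train / y_train (illustrative values only)
X_train = pd.DataFrame({
    "bias": [1.0, 1.0, 1.0],
    "word.lower": ["headache,", "mostly", "but"],
    "word[-3:]": ["HE,", "tly", "BUT"],
    "word.isupper": [True, False, True],
    "POS": ["NNP", "NNP", "NNP"],
    "BOS": [True, False, False],
})
y_train = pd.Series(["B-ADR", "O", "O"], name="OBI")

# Text columns to one-hot encode (listed explicitly here for clarity)
cat_cols = ["word.lower", "word[-3:]", "POS"]

# One-hot encode the text columns, then cast everything (including booleans) to float
X_encoded = pd.get_dummies(X_train, columns=cat_cols, dtype=float).astype(float)

# The string labels in y_train can be passed as-is; only the features must be numeric
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_encoded, y_train)
predictions = clf.predict(X_encoded)
```

Note that the target column does not need encoding: scikit-learn classifiers accept string class labels, so the ValueError in the question comes from the feature columns only.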

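Afterwards, you can use the dictionary of label encoders d to look up the original name behind an encoded value.

d[col_name].inverse_transform([value])  # expects an array-like, so wrap a single value in a list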
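
As a small sketch of that round trip, the snippet below fits one encoder for a hypothetical POS column and maps codes back to their names. The tag values VB and JJ are made up for illustration, and a plain dict stands in for the defaultdict used above.

```python
from sklearn.preprocessing import LabelEncoder

# One encoder per column name, as in the dict d above (plain dict for brevity)
d = {"POS": LabelEncoder()}

# Illustrative POS tags; LabelEncoder numbers the classes in sorted order
codes = d["POS"].fit_transform(["NNP", "VB", "NNP", "JJ"])
print(list(codes))  # [1, 2, 1, 0]  (JJ=0, NNP=1, VB=2)

# inverse_transform takes an array-like, even for a single value
single = d["POS"].inverse_transform([2])
print(list(single))  # ['VB']

# Map the whole encoded column back to the original strings
restored = d["POS"].inverse_transform(codes)
```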

This tutorial is especially useful for understanding these concepts.
