How to one-hot encode two columns of string data
I am trying to predict "Full_Time_Home_Goals".
My code is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
import os
import xlrd
import datetime
import numpy as np
# Set option to display all the rows and columns in the dataset. If there are more rows,adjust number accordingly.
pd.set_option('display.max_rows',5000)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
# Pandas needs you to define the column as date before its imported and then call the column and define as a date
# hence this step.
date_col = ['Date']
df = pd.read_csv(
r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\Historical Data\Concat_Cleaned.csv',parse_dates=date_col,skiprows=0,low_memory=False)
# Converting/defining the columns
# Before you define column types,you need to fill all NaN with a value. We will be reconverting them later
df = df.fillna(101)
# Defining column types
convert_dict = {'League_Division': str,'HomeTeam': str,'AwayTeam': str,'Full_Time_Home_Goals': int,'Full_Time_Away_Goals': int,'Full_Time_Result': str,'Half_Time_Home_Goals': int,'Half_Time_Away_Goals': int,'Half_Time_Result': str,'Attendance': int,'Referee': str,'Home_Team_Shots': int,'Away_Team_Shots': int,'Home_Team_Shots_on_Target': int,'Away_Team_Shots_on_Target': int,'Home_Team_Hit_Woodwork': int,'Away_Team_Hit_Woodwork': int,'Home_Team_Corners': int,'Away_Team_Corners': int,'Home_Team_Fouls': int,'Away_Team_Fouls': int,'Home_Offsides': int,'Away_Offsides': int,'Home_Team_Yellow_Cards': int,'Away_Team_Yellow_Cards': int,'Home_Team_Red_Cards': int,'Away_Team_Red_Cards': int,'Home_Team_Bookings_Points': float,'Away_Team_Bookings_Points': float,}
df = df.astype(convert_dict)
# Reverting the replace values step to get original dataframe and with the defined filetypes
df = df.replace('101',np.nan,regex=True)
df = df.replace(101,np.nan)
# Exploration
print(df.dtypes)
print(df)
# Clean dataset by dropping null rows
df = df.dropna(axis=0)
# Column that you want to predict = y
y = df.Full_Time_Home_Goals
# Columns that are fed into the model to make predictions (features); must not include column y
features = ['HomeTeam','AwayTeam','Full_Time_Away_Goals','Full_Time_Result']
# Create X
X = df[features]
# Split into validation and training data
train_X,val_X,train_y,val_y = train_test_split(X,y,random_state=1)
# Specify Model
soccer_model = DecisionTreeRegressor(random_state=1)
# Fit Model
soccer_model.fit(train_X,train_y)
I run into an error when fitting the model:
# Fit Model
soccer_model.fit(train_X,train_y)
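This raises the error:
ValueError: could not convert string to float: "Nott'm Forest"
How can I fix this and run the model to get an output? I have tried following a few examples but could not make progress.
A sample concat_cleaned file is available here
Solution
You have to convert the categorical data to numeric data. You can use OneHotEncoder for this: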
import os
import xlrd
import datetime
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder
# Set option to display all the rows and columns in the dataset. If there are more rows,adjust number accordingly.
pd.set_option('display.max_rows',5000)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
# Pandas needs you to define the column as date before its imported and then call the column and define as a date
# hence this step.
date_col = ['Date']
df = pd.read_csv(
r'Concat_Cleaned_Example.csv',parse_dates=date_col,skiprows=0,low_memory=False)
# Converting/defining the columns
# Before you define column types,you need to fill all NaN with a value. We will be reconverting them later
df = df.fillna(101)
# Defining column types
convert_dict = {'League_Division': str,'HomeTeam': str,'AwayTeam': str,'Full_Time_Home_Goals': int,'Full_Time_Away_Goals': int,'Full_Time_Result': str,'Half_Time_Home_Goals': int,'Half_Time_Away_Goals': int,'Half_Time_Result': str,'Attendance': int,'Referee': str,'Home_Team_Shots': int,'Away_Team_Shots': int,'Home_Team_Shots_on_Target': int,'Away_Team_Shots_on_Target': int,'Home_Team_Hit_Woodwork': int,'Away_Team_Hit_Woodwork': int,'Home_Team_Corners': int,'Away_Team_Corners': int,'Home_Team_Fouls': int,'Away_Team_Fouls': int,'Home_Offsides': int,'Away_Offsides': int,'Home_Team_Yellow_Cards': int,'Away_Team_Yellow_Cards': int,'Home_Team_Red_Cards': int,'Away_Team_Red_Cards': int,'Home_Team_Bookings_Points': float,'Away_Team_Bookings_Points': float,}
df = df.astype(convert_dict)
# Reverting the replace values step to get original dataframe and with the defined filetypes
df = df.replace('101',np.nan,regex=True)
df = df.replace(101,np.nan)
# Clean dataset by dropping null rows
df = df.dropna(axis=0)
# Column that you want to predict = y
y = df.Full_Time_Home_Goals
# Columns that are fed into the model to make predictions (features); must not include column y
features = ['HomeTeam','AwayTeam','Full_Time_Away_Goals','Full_Time_Result']
# Create X
X = df[features]
# Split into validation and training data
train_X,val_X,train_y,val_y = train_test_split(X,y,random_state=1)
# Specify Model
soccer_model = DecisionTreeRegressor(random_state=1)
# Define and fit a OneHotEncoder to transform the categorical data into a numeric array
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_X)
transformed_train_X = enc.transform(train_X)
# Fit Model
soccer_model.fit(transformed_train_X,train_y)
With this encoding, a row of your data such as (Man United, Newcastle, 0, H)
is encoded as
(0,14) 1.0
(0,35) 1.0
(0,43) 1.0
(0,50) 1.0
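The (row, column) value pairs above are how a SciPy sparse matrix prints: each pair marks the position of a 1.0, and every other cell is zero. As a minimal sketch on a hypothetical three-row toy frame (the team values below are illustrative and not taken from the Concat_Cleaned file), converting to a dense array makes the one-hot layout visible:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two string columns (not the real dataset)
toy = pd.DataFrame({
    "HomeTeam": ["Man United", "Arsenal", "Man United"],
    "AwayTeam": ["Newcastle", "Chelsea", "Arsenal"],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy)   # SciPy sparse matrix by default

# One output column per distinct category, categories sorted within each input column
print(enc.categories_)

# Dense view: exactly one 1.0 per input column in every row
dense = encoded.toarray()
print(dense)
```

Here the first two output columns belong to HomeTeam (Arsenal, Man United) and the last three to AwayTeam (Arsenal, Chelsea, Newcastle), so the first row has ones in positions 1 and 4.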
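You can inspect any data point as follows to verify that it is encoded correctly:
entry_id = 1
print(transformed_train_X[entry_id])
for i in range(0,transformed_train_X[0].shape[1]):
    if(transformed_train_X[entry_id,i]==1.0):
        # Older scikit-learn releases used enc.get_feature_names(), which gives the x0_/x1_ prefixes shown in the output
        print(enc.get_feature_names_out()[i])
Output: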
(0,14) 1.0
(0,35) 1.0
(0,43) 1.0
(0,50) 1.0
x0_Man United
x1_Newcastle
x2_0
x3_H
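To get predictions out of the model, the validation data has to go through the same fitted encoder before calling predict. One way to keep the two steps together is a scikit-learn Pipeline with a ColumnTransformer. The sketch below is an illustration, not the answer's exact method: the toy rows are hypothetical, and unlike the code above it one-hot encodes only the string columns and passes Full_Time_Away_Goals through as a number.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy matches (illustrative values, not the real dataset)
matches = pd.DataFrame({
    "HomeTeam": ["Man United", "Arsenal", "Chelsea", "Newcastle",
                 "Nott'm Forest", "Man United", "Arsenal", "Chelsea"],
    "AwayTeam": ["Newcastle", "Chelsea", "Man United", "Arsenal",
                 "Man United", "Arsenal", "Nott'm Forest", "Newcastle"],
    "Full_Time_Away_Goals": [0, 1, 2, 1, 3, 1, 0, 2],
    "Full_Time_Result": ["H", "D", "A", "H", "A", "H", "H", "D"],
    "Full_Time_Home_Goals": [2, 1, 0, 3, 1, 2, 4, 2],
})

features = ["HomeTeam", "AwayTeam", "Full_Time_Away_Goals", "Full_Time_Result"]
X = matches[features]
y = matches["Full_Time_Home_Goals"]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# One-hot encode the string columns; leave the numeric column untouched
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["HomeTeam", "AwayTeam", "Full_Time_Result"]),
    ],
    remainder="passthrough",
)

soccer_model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=1)),
])

# fit() learns the categories from the training rows only
soccer_model.fit(train_X, train_y)

# predict() reuses the fitted encoder; unseen teams become all-zero columns
val_predictions = soccer_model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
print(val_predictions)
print(mae)
```

With handle_unknown='ignore', a team that appears only in the validation rows does not raise an error; it is simply encoded as all zeros for that column group.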