Stratifying a dataset while avoiding contamination by index?

As a reproducible example, I have the following dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300,5))
df = pd.DataFrame(data,columns=['ID','A','B','C','D'])
df = df.set_index(['ID'])

df.head()
Out:
     A   B   C   D
ID
12   3  14   4   7
9    5   9   8   4
12  18  17   3  14
1    0  10   1   0
9   10   5  11   9

I need to perform a 70%-30% stratified split (on y), which I know should look like this:

# Train/Test Split
X = df.iloc[:,0:-1] # Columns A,B,and C
y = df.iloc[:,-1] # Column D
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.70,test_size = 0.30,stratify = y)

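However, while I want the training and test sets to have the same (or a sufficiently similar) distribution of "D", I do not want the same "ID" to appear in both the training and the test set.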
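To make the concern concrete, here is a minimal sketch that counts how many IDs land on both sides of a plain stratified split. It assumes the same kind of random data as above; the fixed seed and `random_state` are additions for reproducibility only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same shape of data as in the question; the seed is an assumption added here.
rng = np.random.default_rng(0)
data = rng.integers(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D']).set_index('ID')

X = df.iloc[:, 0:-1]  # columns A, B and C
y = df.iloc[:, -1]    # column D

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, test_size=0.30, stratify=y, random_state=0
)

# IDs that occur in both index sets are shared between train and test.
shared_ids = set(X_train.index) & set(X_test.index)
print(f'{len(shared_ids)} of {df.index.nunique()} IDs appear in both sets')
```

With only about 20 distinct IDs spread over 300 rows, a row-level split will almost always put most IDs on both sides.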

How can I do this?

Solution

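EDIT: One (similar) approach is to store the IDs by class, then take 70% of the IDs of each class and put the samples with those IDs into train, and the samples with the remaining IDs into test.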

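Note that this still does not guarantee identical distributions if the IDs occur a different number of times each. Moreover, since each ID can belong to several classes in D and must not be shared between the train and test sets, finding identical distributions becomes a complex optimization problem. The reason is that each ID can go into either train or test only, and it brings a variable number of rows per class into whichever set it is assigned to, depending on which classes occur across all rows of that ID.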
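The claim that a single ID can span several classes is easy to check. A small sketch, assuming ID is kept as a regular column and using a fixed seed that is not part of the original example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed added for reproducibility (an assumption)
df = pd.DataFrame(rng.integers(0, 20, size=(300, 5)),
                  columns=['ID', 'A', 'B', 'C', 'D'])

# For every ID: how many rows it has and how many distinct classes of D it touches.
per_id = df.groupby('ID')['D'].agg(n_rows='size', n_classes='nunique')
print(per_id.head())

# An ID with n_classes > 1 brings rows of several classes into whichever set it lands in.
n_multi = int((per_id['n_classes'] > 1).sum())
print(f'{n_multi} of {len(per_id)} IDs span more than one class')
```

Because the row counts and class counts differ from ID to ID, moving one ID between the sets shifts several class proportions at once, which is why an exact match is hard to reach.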

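A fairly simple way to split the data while approximately balancing the distribution is to iterate over the classes in random order and to consider each ID only for the first class in which it appears. The ID is then assigned to train or test together with all of its rows, and it is removed from consideration for the later classes.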

I find that keeping ID as a column helps with this task, so I changed the code you provided as follows:

# Given snippet (modified)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300,5))
df = pd.DataFrame(data,columns=['ID','A','B','C','D'])

Proposed solution:

import random
from collections import defaultdict

classes = df.D.unique().tolist() # get unique classes
random.shuffle(classes)          # shuffle to eliminate positional biases
ids_by_class = defaultdict(list)


# iterate over classes
temp_df = df.copy()
for c in classes:
    c_rows = temp_df.loc[temp_df['D'] == c] # rows with given class
    ids = c_rows.ID.unique().tolist()       # IDs in these rows
    ids_by_class[c].extend(ids)

    # remove ids so they cannot be taken into account for other classes
    temp_df = temp_df[~temp_df.ID.isin(ids)]


# now construct ids split,class by class
train_ids,test_ids = [],[]
for c,ids in ids_by_class.items():
    random.shuffle(ids) # shuffling can eliminate positional biases

    # split the IDs
    split = int(len(ids)*0.7) # split at 70%

    train_ids.extend(ids[:split])
    test_ids.extend(ids[split:])

# finally use the ids in train and test to get the
# data split from the original df
train = df.loc[df['ID'].isin(train_ids)]
test = df.loc[df['ID'].isin(test_ids)]

Let's check that the split ratio is roughly 70/30, that no data is lost, and that no IDs are shared between the train and test dataframes:

# 1) check that elements in Train are roughly 70% and Test 30% of original df
print(f'Numbers of elements in train: {len(train)},test: {len(test)}| Perfect split would be train: {int(len(df)*0.7)},test: {int(len(df)*0.3)}')

# 2) check that concatenating Train and Test gives back the original df
train_test = pd.concat([train,test]).sort_values(by=['ID','D']) # concatenate dataframes into one,and sort
sorted_df = df.sort_values(by=['ID','D']) # sort original df
assert train_test.equals(sorted_df) # check equality

# 3) check that the IDs are not shared between train/test sets
train_id_set = set(train.ID.unique().tolist())
test_id_set = set(test.ID.unique().tolist())
assert len(train_id_set.intersection(test_id_set)) == 0

Sample output:

Numbers of elements in train: 209,test: 91| Perfect split would be train: 210,test: 90
Numbers of elements in train: 210,test: 90| Perfect split would be train: 210,test: 90
Numbers of elements in train: 227,test: 73| Perfect split would be train: 210,test: 90
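
The three checks above do not look at how well the distribution of D is preserved, which was the original goal. One way to inspect it is sketched below, under some assumptions: `compare_class_distribution` is a hypothetical helper (not part of pandas or scikit-learn), and the split in the demo is a plain 70/30 shuffle of the unique IDs, used only to have two frames to compare. In practice the `train` and `test` frames produced by the solution above would be passed in instead.

```python
import numpy as np
import pandas as pd


def compare_class_distribution(train, test, col='D'):
    """Per-class proportions in train and test, plus their absolute gap.

    Hypothetical helper, written for illustration only.
    """
    dist = pd.DataFrame({
        'train': train[col].value_counts(normalize=True),
        'test': test[col].value_counts(normalize=True),
    }).fillna(0.0)  # a class missing from one side gets proportion 0
    dist['abs_gap'] = (dist['train'] - dist['test']).abs()
    return dist.sort_index()


# Demo data: same shape as in the question, seeded for reproducibility (an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(300, 5)),
                  columns=['ID', 'A', 'B', 'C', 'D'])

# Stand-in split: shuffle the unique IDs and send 70% of them to train.
ids = rng.permutation(df['ID'].unique())
cut = int(len(ids) * 0.7)
train = df[df['ID'].isin(ids[:cut])]
test = df[df['ID'].isin(ids[cut:])]

dist = compare_class_distribution(train, test)
print(dist)
print('largest gap:', dist['abs_gap'].max())
```

Some gap is to be expected, since IDs differ in how many rows and classes they carry; the table mainly helps to spot classes that are badly under-represented on one side.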
