Pandas DataFrame: change values in a column based on a condition


I have a large DataFrame, shown below.

The data used as an example here (edu_val.csv) can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

I load it as follows:
import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in Education are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

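I would like to replace the values in the Education column in the following way:

  1. If an ID has the value higher education in the Education column in some year, then all future years for that ID should also have higher education in the Education column.

  2. If an ID has the value intermediate qualifications in some year, then all future years for that ID should have intermediate qualifications in the corresponding Education column. However, if the value higher education occurs in any subsequent year for this ID, then higher education replaces intermediate qualifications in the years that follow, regardless of whether Other or No qualifications occur.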

For example, in the DataFrame below, ID 22445 has the value higher education in the year 1991. All subsequent values of Education for 22445 should be replaced with higher education in the later years, up to and including 2017.

edu.loc[edu['ID'] == 22445]

ID     Year    Education
22445  1991    higher education
22445  1992    higher education
22445  1993    higher education
22445  1994    higher education
22445  1995    higher education
22445  1996    intermediate qualifications
22445  1997    intermediate qualifications
22445  1998    Other
22445  1999    No qualifications
22445  2000    intermediate qualifications
22445  2001    intermediate qualifications
22445  2002    intermediate qualifications
22445  2003    intermediate qualifications
22445  2004    intermediate qualifications
22445  2005    intermediate qualifications
22445  2006    intermediate qualifications
22445  2007    intermediate qualifications
22445  2008    intermediate qualifications
22445  2010    intermediate qualifications
22445  2011    intermediate qualifications
22445  2012    intermediate qualifications
22445  2013    intermediate qualifications
22445  2014    intermediate qualifications
22445  2015    intermediate qualifications
22445  2016    intermediate qualifications
22445  2017    intermediate qualifications

Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in the year 1991, which changes to higher education in 1993. All subsequent values in the Education column for 1587125 (from 1993 onwards) should be higher education.

edu.loc[edu['ID'] == 1587125]

ID       Year    Education
1587125  1991    intermediate qualifications
1587125  1992    intermediate qualifications
1587125  1993    higher education
1587125  1994    higher education
1587125  1995    higher education
1587125  1996    higher education
1587125  1997    higher education
1587125  1998    higher education
1587125  1999    higher education
1587125  2000    higher education
1587125  2001    higher education
1587125  2002    higher education
1587125  2003    higher education
1587125  2004    Other
1587125  2005    No qualifications
1587125  2006    intermediate qualifications
1587125  2007    intermediate qualifications
1587125  2008    intermediate qualifications
1587125  2010    intermediate qualifications
1587125  2011    higher education
1587125  2012    higher education
1587125  2013    higher education
1587125  2014    higher education
1587125  2015    higher education
1587125  2016    higher education
1587125  2017    higher education

There are 12,057 unique IDs in the data, and the Year column ranges from 1991 to 2017. How can I change the values of Education for all 12,057 IDs according to the conditions above? I am not sure how to do this in a uniform way for all the unique IDs. The sample data used as an example here is available at the GitHub link above. Many thanks in advance.

Solution

You can do this using categorical data:

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'], ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])

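It is broken out explicitly so you can see the data manipulation I am using.

  1. Create an ordered categorical dtype for the education levels
  2. Next, change the dtype of the Education column to that categorical dtype (EducationCat)
  3. Use the categorical codes to perform a cummax calculation
  4. Index back into the categories to return the category defined by the cummax calculation (EduMax)

Output: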

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education
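
To see what each of the four steps does, here is a minimal sketch for a single person. The history values are made up for illustration, and the variable names (`history`, `running`, `edu_max`) are not from the answer above.

```python
import pandas as pd

levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
eddtype = pd.CategoricalDtype(levels, ordered=True)

# One person's history in chronological order (made-up values)
history = pd.Series(['No qualifications', 'intermediate qualifications',
                     ' Other', 'higher education', 'No qualifications'])

# Steps 1-2: strip stray whitespace and cast to the ordered dtype
cat = history.str.strip().astype(eddtype)

# Step 3: the integer codes follow the category order, so a running
# maximum of the codes is the highest level reached so far
codes = cat.cat.codes
running = codes.cummax()

# Step 4: index back into the categories to recover the labels
edu_max = pd.Series(eddtype.categories[running.to_numpy()], index=history.index)
print(edu_max.tolist())
```

Note that with this ordering a later No qualifications after Other is also raised to Other, which goes slightly beyond the two rules stated in the question.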

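Education levels clearly have an order. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has attained by a given year?

Try this: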

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp

Test:

edu[edu['ID'] == 1587125]

    ID  Year                    Education
1587125  1991  intermediate qualifications
1587125  1992  intermediate qualifications
1587125  1993             higher education
1587125  1994             higher education
1587125  1995             higher education
1587125  1996             higher education
1587125  1997             higher education
1587125  1998             higher education
1587125  1999             higher education
1587125  2000             higher education
1587125  2001             higher education
1587125  2002             higher education
1587125  2003             higher education
1587125  2004             higher education
1587125  2005             higher education
1587125  2006             higher education
1587125  2007             higher education
1587125  2008             higher education
1587125  2010             higher education
1587125  2011             higher education
1587125  2012             higher education
1587125  2013             higher education
1587125  2014             higher education
1587125  2015             higher education
1587125  2016             higher education
1587125  2017             higher education
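
A possible variant of the same idea, sketched here on made-up toy data, uses the group-wise `cummax()` instead of `expanding().max()`. The result then keeps the original index, so no `droplevel` is needed, and the original Education column is left untouched. The column names `Rank` and `EduMax` are assumptions chosen for this example.

```python
import pandas as pd

# Toy data (made up for illustration); rows are deliberately not in Year order
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'Year': [1992, 1991, 1993, 1991, 1992, 1993],
    'Education': ['higher education', 'intermediate qualifications', 'Other',
                  'No qualifications', 'intermediate qualifications', 'Other'],
})

levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
rank_of = {label: i for i, label in enumerate(levels)}

# Attach the rank of each label, then sort chronologically within each ID
ordered = toy.assign(Rank=toy['Education'].map(rank_of)).sort_values(['ID', 'Year'])

# Running maximum rank per ID; the result keeps the original row labels
running = ordered.groupby('ID')['Rank'].cummax()

# Map ranks back to labels; assignment aligns on the index of toy
toy['EduMax'] = running.map(dict(enumerate(levels)))
print(toy.sort_values(['ID', 'Year']))
```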
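
You can iterate over the IDs, and then over the years. The DataFrame is in chronological order, so if a person has "higher education" or "intermediate qualifications" in some cell, you can save this knowledge and apply it to the subsequent cells: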

# Assumes the index labels are unique (e.g. the default index from read_csv)
# and that rows are in chronological order within each ID
for _, rows in edu.groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    for idx, value in rows['Education'].items():
        # check for intermediate qualifications
        if value == 'intermediate qualifications':
            inter_qual = True
        if inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'

        # check for higher education; written last so that it takes precedence
        if value == 'higher education':
            higher_ed = True
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'

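We may overwrite a cell more than once. If a person has both "intermediate qualifications" and "higher education", we just need to make sure that "higher education" is written last.

I generally would not recommend using for loops to process a DataFrame, but here each cell value may depend on the values above it, and the DataFrame is not too large for this to be feasible.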
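
Whichever approach is used, the result can be sanity-checked against the two rules from the question. The sketch below is one way to do that. The helper name `satisfies_rules` and the toy frames are hypothetical, and Other and No qualifications share the lowest rank because the question only asks for the two higher levels to be carried forward.

```python
import pandas as pd

def satisfies_rules(frame: pd.DataFrame) -> bool:
    """Return True if no ID drops below a level it has already reached."""
    rank = {
        'No qualifications': 0,
        'Other': 0,
        'intermediate qualifications': 1,
        'higher education': 2,
    }
    ordered = frame.sort_values(['ID', 'Year'])
    ranks = ordered['Education'].str.strip().map(rank)
    # Year-on-year change in rank within each ID (NaN for each first year)
    steps = ranks.groupby(ordered['ID']).diff()
    return bool((steps.dropna() >= 0).all())

# Made-up example: the raw history drops from higher education to Other
before = pd.DataFrame({
    'ID': [1, 1, 1],
    'Year': [1991, 1992, 1993],
    'Education': ['higher education', 'Other', 'intermediate qualifications'],
})
after = before.assign(Education='higher education')

print(satisfies_rules(before), satisfies_rules(after))
```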
