Using the last valid index of one DataFrame to select data in another DataFrame


I want to find the last valid index of each column in one DataFrame and use it to index a second DataFrame.

So, suppose I have the following DataFrame (df1, which the code below refers to as df):

            Site 1  Site 2  Site 3  Site 4  Site 5  Site 6
Date                                                      
2000-01-01    13.0    28.0    76.0      45    90.0    58.0
2001-01-01    77.0    75.0    57.0       3    41.0    24.0
2002-01-01    50.0    29.0     2.0      65    48.0    21.0
2003-01-01     7.0    48.0    14.0      63    12.0    66.0
2004-01-01    11.0    90.0    11.0       5    47.0     6.0
2005-01-01    50.0     4.0    31.0       1    40.0    79.0
2006-01-01    30.0    98.0    91.0      96    43.0    39.0
2007-01-01    50.0    20.0    54.0      65     NaN    47.0
2008-01-01    24.0    84.0    52.0      84     NaN    81.0
2009-01-01    56.0    61.0    57.0      25     NaN    36.0
2010-01-01    87.0    45.0    68.0      65     NaN    71.0
2011-01-01    22.0    50.0    92.0      91     NaN    48.0
2012-01-01    12.0    44.0    79.0      77     NaN    25.0
2013-01-01     1.0    22.0    34.0      57     NaN    25.0
2014-01-01    94.0     NaN    86.0      97     NaN    91.0
2015-01-01     2.0     NaN    98.0      44     NaN    79.0
2016-01-01    81.0     NaN    35.0      87     NaN    32.0
2017-01-01    59.0     NaN    95.0      32     NaN    58.0
2018-01-01     NaN     NaN     3.0      14     NaN     NaN
2019-01-01     NaN     NaN    48.0       9     NaN     NaN
2020-01-01     NaN     NaN     NaN      49     NaN     NaN

Now, I can use last_valid_index() to find the last valid index of each column:

lvi = df.apply(lambda series: series.last_valid_index())

which yields:

Site 1   2017-01-01
Site 2   2013-01-01
Site 3   2019-01-01
Site 4   2020-01-01
Site 5   2006-01-01
Site 6   2017-01-01

How can I apply this to another DataFrame, using these index values to slice its time series? A second example DataFrame can be created with the following:

import numpy as np
import pandas as pd

np.random.seed(30)  # fix the seed so the random values are reproducible

df2 = pd.DataFrame({
    "Site 1": np.random.rand(21), "Site 2": np.random.rand(21),
    "Site 3": np.random.rand(21), "Site 4": np.random.rand(21),
    "Site 5": np.random.rand(21), "Site 6": np.random.rand(21),
})

# 21 year-start dates, 2000-01-01 through 2020-01-01
# ('YS' is the year-start alias; newer pandas versions deprecate 'AS')
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='YS')
df2 = df2.set_index(idx)

How can I use the lvi variable to index into df2?

To do this manually, I could use:

df_s1 = df['Site 1'].loc['2000-01-01':'2017-01-01']

to get something like:

2000-01-01    13.0
2001-01-01    77.0
2002-01-01    50.0
2003-01-01     7.0
2004-01-01    11.0
2005-01-01    50.0
2006-01-01    30.0
2007-01-01    50.0
2008-01-01    24.0
2009-01-01    56.0
2010-01-01    87.0
2011-01-01    22.0
2012-01-01    12.0
2013-01-01     1.0
2014-01-01    94.0
2015-01-01     2.0
2016-01-01    81.0
2017-01-01    59.0

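Is there a better way to approach this? Also, does each column essentially have to be its own DataFrame for this to work? Any help would be greatly appreciated!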

Solutions

This may be more idiomatic:

df2[df.notna()]

or even

df2.where(df.notna())

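Note that in these cases (and with df1*0 + df2), the operation aligns on the indexes of df and df2. For example, df2[df.reset_index(drop=True).notna()] returns all NaN, because the two frames have no index values in common.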
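To make the alignment behaviour concrete, here is a minimal sketch on a small made-up pair of frames (the values and the four-row size are assumptions for illustration, not the data from the question). It shows the mask keeping df2 only where df has data, and the all-NaN result when the mask's index shares no labels with df2.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the values are invented for this sketch.
idx = pd.date_range("2000-01-01", periods=4, freq="YS")
df = pd.DataFrame({"Site 1": [1.0, 2.0, np.nan, np.nan],
                   "Site 2": [5.0, 6.0, 7.0, np.nan]}, index=idx)
df2 = pd.DataFrame({"Site 1": [0.1, 0.2, 0.3, 0.4],
                    "Site 2": [0.5, 0.6, 0.7, 0.8]}, index=idx)

# Keep df2 where df has data; every other cell becomes NaN.
masked = df2.where(df.notna())

# A mask with a reset (integer) index has no labels in common with df2,
# so no cell is kept.
misaligned = df2.where(df.reset_index(drop=True).notna())

print(masked)
print(misaligned)
```

Because the match is done on labels rather than position, the two frames do not need the same row order, but they do need the same index and column labels.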


This seems to work well:

In [34]: d
Out[34]: 
              x    y
Date                
2020-01-01  1.0  2.0
2020-01-02  1.0  2.0
2020-01-03  1.0  2.0
2020-01-04  1.0  2.0
2020-01-05  1.0  2.0
2020-01-06  1.0  NaN
2020-01-07  1.0  NaN
2020-01-08  1.0  NaN
2020-01-09  1.0  NaN
2020-01-10  1.0  NaN
2020-01-11  NaN  NaN
2020-01-12  NaN  NaN
2020-01-13  NaN  NaN
2020-01-14  NaN  NaN
2020-01-15  NaN  NaN
2020-01-16  NaN  NaN
2020-01-17  NaN  NaN
2020-01-18  NaN  NaN
2020-01-19  NaN  NaN
2020-01-20  NaN  NaN

In [35]: d.apply(lambda col: col.last_valid_index())
Out[35]: 
x   2020-01-10
y   2020-01-05
dtype: datetime64[ns]

Then:

In [15]: d.apply(lambda col: col.last_valid_index()).apply(lambda date: df2.loc[date])
Out[15]: 
          z
x  0.940396
y  0.564007
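The lookup above picks a single row per column. If the goal is instead the whole series up to each column's last valid date, as in the manual slice in the question, one possible sketch is a per-column label slice. The frames below are small made-up stand-ins, and the names slices and trimmed are hypothetical.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the values are invented for this sketch.
idx = pd.date_range("2000-01-01", periods=5, freq="YS")
df = pd.DataFrame({"Site 1": [13.0, 77.0, 50.0, np.nan, np.nan],
                   "Site 2": [28.0, 75.0, np.nan, np.nan, np.nan]}, index=idx)
df2 = pd.DataFrame({"Site 1": [0.1, 0.2, 0.3, 0.4, 0.5],
                    "Site 2": [0.6, 0.7, 0.8, 0.9, 1.0]}, index=idx)

# Last valid date of each column in df.
lvi = df.apply(lambda col: col.last_valid_index())

# One Series per site, cut off (inclusively) at that site's last valid date.
slices = {col: df2.loc[:lvi[col], col] for col in df2.columns}

# concat re-aligns the slices on the union of their dates, padding with NaN.
trimmed = pd.concat(slices, axis=1)
print(trimmed)
```

So each column does not need to be its own DataFrame; a dict of Series is enough. Unlike the notna() mask, this only trims the tail, so values in df2 are kept even where df has a gap in the middle of a column.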


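OK, so after thinking about this for a while and trying to come up with an elaborate procedure involving for loops and so on, I concluded that a simple arithmetic operation does the trick. Basically, I took advantage of how pandas handles arithmetic between DataFrames.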

output = df1*0 + df2

This gives the values of df2 with the NaN positions of df1 carried over, as shown below:

              Site 1    Site 2    Site 3    Site 4    Site 5    Site 6
Date                                                                  
2000-01-01  0.690597  0.443933  0.787931  0.659639  0.363606  0.922373
2001-01-01  0.388669  0.577734  0.450225  0.021592  0.554249  0.305546
2002-01-01  0.578212  0.927848  0.361426  0.840541  0.626881  0.545491
2003-01-01  0.431668  0.128282  0.893351  0.783488  0.122182  0.666194
2004-01-01  0.151491  0.928584  0.834474  0.945401  0.590830  0.802648
2005-01-01  0.113477  0.398326  0.649955  0.202538  0.485927  0.127925
2006-01-01  0.521906  0.458672  0.923632  0.948696  0.638754  0.552753
2007-01-01  0.266599  0.839047  0.099069  0.000928       NaN  0.018146
2008-01-01  0.819810  0.809779  0.706223  0.247780       NaN  0.759691
2009-01-01  0.441574  0.020291  0.702551  0.468862       NaN  0.341191
2010-01-01  0.277030  0.130573  0.906697  0.589474       NaN  0.819986
2011-01-01  0.795344  0.103121  0.846405  0.589916       NaN  0.564411
2012-01-01  0.697255  0.599767  0.206482  0.718980       NaN  0.731366
2013-01-01  0.891771  0.001944  0.703132  0.751986       NaN  0.845933
2014-01-01  0.672579       NaN  0.466981  0.466770       NaN  0.618069
2015-01-01  0.767219       NaN  0.702156  0.370905       NaN  0.481971
2016-01-01  0.315264       NaN  0.793531  0.754920       NaN  0.091432
2017-01-01  0.431651       NaN  0.974520  0.708074       NaN  0.870077
2018-01-01       NaN       NaN  0.408743  0.430576       NaN       NaN
2019-01-01       NaN       NaN  0.751509  0.755521       NaN       NaN
2020-01-01       NaN       NaN       NaN  0.518533       NaN       NaN

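I basically wanted to stamp the NaN values from one DataFrame onto the other. I can't believe how hard I made this. As long as my DataFrames are the same size (with matching index and column labels), this does what I need.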

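From here, I should be able to calculate the percentage change relative to each column's last valid data point. Thanks, everyone, for your input!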
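As a hedged aside on the arithmetic trick: it relies on NaN propagating through df1*0, so it should match the notna() mask whenever df1 holds only finite numbers and NaN. The sketch below uses a made-up one-column frame with an infinite value, which is an assumption added purely to show where the two approaches can differ.

```python
import numpy as np
import pandas as pd

# Illustrative data; the inf entry is added only to show the edge case.
idx = pd.date_range("2000-01-01", periods=3, freq="YS")
df1 = pd.DataFrame({"Site 1": [13.0, np.inf, np.nan]}, index=idx)
df2 = pd.DataFrame({"Site 1": [0.1, 0.2, 0.3]}, index=idx)

# inf * 0 evaluates to NaN, so the inf row is blanked as well.
by_arithmetic = df1 * 0 + df2

# The mask blanks only the positions that are actually NaN in df1.
by_mask = df2.where(df1.notna())

print(by_arithmetic)
print(by_mask)
```

The arithmetic version also needs df1 to be numeric, whereas the mask works for any dtype, which may be a reason to prefer df2.where(df1.notna()) in general code.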

Edit:

Just to show everyone what I ultimately wanted to accomplish, here is the final code I produced with everyone's help and suggestions!

The original df initially looked like this:

            Site 1  Site 2  Site 3  Site 4  Site 5  Site 6
Date                                                      
2000-01-01    13.0    28.0    76.0      45    90.0    58.0
2001-01-01    77.0    75.0    57.0       3    41.0    24.0
2002-01-01    50.0    29.0     2.0      65    48.0    21.0
2003-01-01     7.0    48.0    14.0      63    12.0    66.0
2004-01-01    11.0    90.0    11.0       5    47.0     6.0
2005-01-01    50.0     4.0    31.0       1    40.0    79.0
2006-01-01    30.0    98.0    91.0      96    43.0    39.0
2007-01-01    50.0    20.0    54.0      65     NaN    47.0
2008-01-01    24.0    84.0    52.0      84     NaN    81.0
2009-01-01    56.0    61.0    57.0      25     NaN    36.0
2010-01-01    87.0    45.0    68.0      65     NaN    71.0
2011-01-01    22.0    50.0    92.0      91     NaN    48.0
2012-01-01    12.0    44.0    79.0      77     NaN    25.0
2013-01-01     1.0    22.0    34.0      57     NaN    25.0
2014-01-01    94.0     NaN    86.0      97     NaN    91.0
2015-01-01     2.0     NaN    98.0      44     NaN    79.0
2016-01-01    81.0     NaN    35.0      87     NaN    32.0
2017-01-01    59.0     NaN    95.0      32     NaN    58.0
2018-01-01     NaN     NaN     3.0      14     NaN     NaN
2019-01-01     NaN     NaN    48.0       9     NaN     NaN
2020-01-01     NaN     NaN     NaN      49     NaN     NaN

Then I came up with a second, complete DataFrame (df2):

df2 = pd.DataFrame({
    "Site 1": np.random.rand(21), "Site 2": np.random.rand(21),
    "Site 3": np.random.rand(21), "Site 4": np.random.rand(21),
    "Site 5": np.random.rand(21), "Site 6": np.random.rand(21),
})
idx = pd.date_range(start='2000-01-01', end='2020-01-01', freq='YS')
df2 = df2.set_index(idx)

Now I replace the values in df2 with NaN wherever df has NaN:

dfr = df2[df.notna()]

Then I reverse the DataFrame and find the first and last valid index of each column:

dfr = dfr[::-1]

valid_first = dfr.apply(lambda col: col.first_valid_index())
valid_last = dfr.apply(lambda col: col.last_valid_index())

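Now I want to calculate the percentage change from the last valid data point, which is fixed for each column. This gives me the percentage change from the present back into the past, relative to the most recent (last valid) data point.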

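new = []
for j in dfr:
    # In the reversed frame, valid_first[j] is the most recent valid date.
    m = dfr[j].loc[valid_first[j]:valid_last[j]]
    # Change relative to the most recent valid value.
    pc = m / m.iloc[0] - 1
    new.append(pc)

# Re-assemble the per-site Series into one DataFrame, aligned on date.
final = pd.concat(new, axis=1)
print(final)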

which gives me:

              Site 1    Site 2    Site 3    Site 4     Site 5    Site 6
2000-01-01  0.270209 -0.728445 -0.636105  0.380330  41.339081 -0.462147
2001-01-01  0.854952 -0.827804 -0.703568 -0.787391  40.588791 -0.884806
2002-01-01 -0.677757 -0.120482 -0.208255 -0.982097  54.348094 -0.483415
2003-01-01 -0.322010 -0.061277 -0.382602  1.025088   5.440808 -0.602661
2004-01-01  1.574451 -0.768251 -0.543260  1.210434  50.494788 -0.859331
2005-01-01 -0.412226 -0.866441 -0.055027 -0.168267   1.346869 -0.385080
2006-01-01  1.280867 -0.640899  0.354513  1.086703   0.000000  0.108504
2007-01-01  1.121585 -0.741675 -0.735990 -0.768578        NaN -0.119436
2008-01-01 -0.210467 -0.376884 -0.575106 -0.779147        NaN  0.055949
2009-01-01  1.864107 -0.966827  0.566590  1.003121        NaN -0.214482
2010-01-01  0.571762 -0.311459 -0.518113  1.036950        NaN -0.513911
2011-01-01 -0.122525 -0.178137 -0.641642  0.197481        NaN  0.033141
2012-01-01  0.403578 -0.829402  0.161753 -0.438578        NaN -0.996595
2013-01-01  0.383481  0.000000 -0.305824  0.602079        NaN -0.057711
2014-01-01 -0.699708       NaN -0.515074 -0.277157        NaN -0.840873
2015-01-01  0.422364       NaN -0.759708  1.230037        NaN -0.663253
2016-01-01 -0.418945       NaN  0.197396 -0.445260        NaN -0.299741
2017-01-01  0.000000       NaN -0.897428  0.669791        NaN  0.000000
2018-01-01       NaN       NaN  0.138997  0.486961        NaN       NaN
2019-01-01       NaN       NaN  0.000000  0.200771        NaN       NaN
2020-01-01       NaN       NaN       NaN  0.000000        NaN       NaN 
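For comparison, the same percentage change could probably be computed without reversing the frame or looping, by dividing the masked frame by each column's value at its own last valid date. This is a sketch on small made-up frames, not the code used above, and the names masked, base and pct are hypothetical.

```python
import numpy as np
import pandas as pd

# Small illustrative frames; the values are invented for this sketch.
idx = pd.date_range("2000-01-01", periods=4, freq="YS")
df = pd.DataFrame({"Site 1": [1.0, 2.0, np.nan, np.nan],
                   "Site 2": [5.0, 6.0, 7.0, 8.0]}, index=idx)
df2 = pd.DataFrame({"Site 1": [2.0, 4.0, 8.0, 16.0],
                    "Site 2": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Stamp df's NaN pattern onto df2.
masked = df2.where(df.notna())

# Value of each column at its own last valid date (a Series keyed by column).
base = masked.apply(lambda col: col.loc[col.last_valid_index()])

# DataFrame / Series matches the Series labels against the column names.
pct = masked / base - 1
print(pct)
```

This assumes every column has at least one valid value; a column that is entirely NaN would make last_valid_index() return None and would need separate handling.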

I know these questions often lack context, so this is the final output I got with your input. Thanks again, everyone, for the help!
