Hourly mean of multiple variables in an R data.frame?

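I have the following code and am trying to find the hourly mean of each of the variables (i.e., X, Y, and Z). My output should be a data.frame with an hourlyDate column and the hourly mean of all the variables. Any way forward would be appreciated.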

library(lubridate)

set.seed(123)

T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to = ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
                X = runif(104833, 5, 10), Y = runif(104833, 5, 10), Z = runif(104833, 5, 10))
T$Date <- format(T$Datetime,format="%Y-%m-%d")
T$Hour <- format(T$Datetime,format = "%H")
T$Mints <- format(T$Datetime,format = "%M")

Solutions

Try:

library(lubridate)
library(dplyr)

set.seed(123)

T <- data.frame(Datetime = seq(ymd_hms("2011-01-01 00:00:00"), to = ymd_hms("2011-12-31 00:00:00"), by = "5 min"),
                X = runif(104833, 5, 10), Y = runif(104833, 5, 10), Z = runif(104833, 5, 10))

T %>% mutate(hourlyDate = floor_date(Datetime,unit='hour')) %>%
      select(-Datetime) %>% group_by(hourlyDate) %>% 
      summarize(across(everything(),mean)) %>%
      ungroup()
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 8,737 x 4
#>    hourlyDate              X     Y     Z
#>    <dttm>              <dbl> <dbl> <dbl>
#>  1 2011-01-01 00:00:00  8.00  7.90  6.90
#>  2 2011-01-01 01:00:00  7.93  7.47  7.90
#>  3 2011-01-01 02:00:00  7.83  6.89  7.67
#>  4 2011-01-01 03:00:00  6.61  7.92  7.18
#>  5 2011-01-01 04:00:00  7.27  7.20  6.48
#>  6 2011-01-01 05:00:00  7.88  6.80  7.69
#>  7 2011-01-01 06:00:00  7.07  8.05  7.52
#>  8 2011-01-01 07:00:00  7.40  7.92  6.99
#>  9 2011-01-01 08:00:00  7.97  7.76  7.26
#> 10 2011-01-01 09:00:00  7.57  7.47  6.94
#> # ... with 8,727 more rows

Created on 2020-08-20 by the reprex package (v0.3.0)


Here is a tidy way to do it, using the Date and Hour columns created in the question:

library(dplyr)

group_by(T,Date,Hour) %>% 
  summarize(X = mean(X),Y = mean(Y),Z = mean(Z)) %>%
  transmute(Date = as.POSIXct(paste0(Date," ",Hour,":00:00")),X,Y,Z)

#> # A tibble: 8,737 x 4
#> # Groups:   Date [8,714]
#>    Date                    X     Y     Z
#>    <dttm>              <dbl> <dbl> <dbl>
#>  1 2011-01-01 00:00:00  8.00  7.90  6.90
#>  2 2011-01-01 01:00:00  7.93  7.47  7.90
#>  3 2011-01-01 02:00:00  7.83  6.89  7.67
#>  4 2011-01-01 03:00:00  6.61  7.92  7.18
#>  5 2011-01-01 04:00:00  7.27  7.20  6.48
#>  6 2011-01-01 05:00:00  7.88  6.80  7.69
#>  7 2011-01-01 06:00:00  7.07  8.05  7.52
#>  8 2011-01-01 07:00:00  7.40  7.92  6.99
#>  9 2011-01-01 08:00:00  7.97  7.76  7.26
#> 10 2011-01-01 09:00:00  7.57  7.47  6.94
#> # ... with 8,727 more rows

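lubridate has the floor_date function, which rounds the Datetime column down to the specified unit.

Then summarise the variables you want by the hourly timestamp: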

library(dplyr)
library(lubridate)

T %>%
    group_by(hourlyDate = lubridate::floor_date(Datetime,unit = 'hours')) %>%
    summarise(across(.cols = c(X, Y, Z), .fns = ~mean(.x, na.rm = TRUE), .names = "meanHourlyData_{.col}"))
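
To see what floor_date does on its own, here is a minimal sketch on three made-up timestamps (the `stamps` and `floored` names are only for illustration). Two of them fall in the same hour, so after flooring they share a single grouping key.

```r
library(lubridate)

# Three hypothetical timestamps, two of which fall in the same hour
stamps <- ymd_hms(c("2011-01-01 00:05:00",
                    "2011-01-01 00:55:00",
                    "2011-01-01 01:10:00"))

# Round each timestamp down to the start of its hour
floored <- floor_date(stamps, unit = "hour")
floored

# The first two values now share one key, so group_by() would pool them
length(unique(floored))
```
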

By the way, I would advise against using T as a variable name, because it is also shorthand for TRUE and may lead to some unexpected behaviour...
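
A small sketch of the kind of surprise this can cause. The reassignment is done inside local() so that it does not touch any T you already have in your workspace; the value 0 is just a hypothetical stand-in.

```r
# TRUE is a reserved word; T is only an ordinary variable in base R
# whose default value is TRUE.
isTRUE(base::T)

# Inside local(), rebind T the way the question's code does at top level.
# Any later code that writes T for TRUE now sees the new value instead.
branch <- local({
  T <- 0                          # hypothetical reassignment
  if (T) "runs" else "skipped"    # 0 is treated as FALSE
})
branch
```

Spelling out TRUE and FALSE in full avoids the problem, since those cannot be reassigned.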


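Three base R solutions are to use split, tapply, or rowsum combined with table. The last one is particularly fast (roughly 8.5 times faster, by median time, than one of the dplyr answers).

The tl;dr is that you get the following computation times: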

#R> Unit: milliseconds
#R>            expr   min    lq  mean median    uq   max neval
#R>  split + sapply 563.9 577.4 636.1  649.8 680.7 697.1    10
#R> tapply + sapply 108.0 117.3 134.0  120.2 124.4 205.1    10
#R>  rowsum + table  21.3  21.3  21.5   21.3  21.6  21.9    10
#R>           dplyr 172.4 176.6 182.3  180.9 185.9 203.4    10

Here are the solutions:

# create date-hour column
T$DateH <-  format(T$Datetime,format="%Y-%m-%d-%H")

# using split + sapply
options(digits = 3)
out_1 <- sapply(split(T[,c("X","Y","Z")],T$DateH),colMeans) 
head(t(out_1),5)
#R>                  X    Y    Z
#R> 2011-01-01-00 8.00 7.90 6.90
#R> 2011-01-01-01 7.93 7.47 7.90
#R> 2011-01-01-02 7.83 6.89 7.67
#R> 2011-01-01-03 6.61 7.92 7.18
#R> 2011-01-01-04 7.27 7.20 6.48

# using tapply + sapply
out_2 <- sapply(c("X", "Y", "Z"), function(var) c(tapply(T[[var]], T$DateH, mean)))
head(out_2, 5)
#R>                  X    Y    Z
#R> 2011-01-01-00 8.00 7.90 6.90
#R> 2011-01-01-01 7.93 7.47 7.90
#R> 2011-01-01-02 7.83 6.89 7.67
#R> 2011-01-01-03 6.61 7.92 7.18
#R> 2011-01-01-04 7.27 7.20 6.48

# check that we get the same
all.equal(t(out_1),out_2,check.attributes = FALSE)
#R> [1] TRUE

# with rowsum + table
# rowsum gives per-group column sums; table gives per-group row counts
out_3 <- as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) / 
  rep(table(T$DateH), 3)

# check that we get the same
all.equal(out_2,out_3)
#R> [1] TRUE

# compare with dplyr solution
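library(dplyr)
out_4 <- group_by(T, Date, Hour) %>% 
  summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
  transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z)

# check that we get the same
all.equal(out_2, as.matrix(out_4[, c("X", "Y", "Z")]), check.attributes = FALSE)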
#R> [1] TRUE

# check computation time
library(microbenchmark)
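microbenchmark(
  `split + sapply` = 
    sapply(split(T[, c("X", "Y", "Z")], T$DateH), colMeans),
  `tapply + sapply` = 
    sapply(c("X", "Y", "Z"), function(var) c(tapply(T[[var]], T$DateH, mean))),
  `rowsum + table` = 
    as.matrix(rowsum(T[, c("X", "Y", "Z")], group = T$DateH)) / 
    rep(table(T$DateH), 3),
  `dplyr` = 
    group_by(T, Date, Hour) %>% 
    summarize(X = mean(X), Y = mean(Y), Z = mean(Z)) %>%
    transmute(Date = as.POSIXct(paste0(Date, " ", Hour, ":00:00")), X, Y, Z),
  times = 10)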
#R> Unit: milliseconds
#R>            expr   min    lq  mean median    uq   max neval
#R>  split + sapply 563.9 577.4 636.1  649.8 680.7 697.1    10
#R> tapply + sapply 108.0 117.3 134.0  120.2 124.4 205.1    10
#R>  rowsum + table  21.3  21.3  21.5   21.3  21.6  21.9    10
#R>           dplyr 172.4 176.6 182.3  180.9 185.9 203.4    10
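
If it is not obvious why dividing rowsum by table gives group means, here is a toy sketch on made-up data where the means are easy to check by hand (the `toy`, `sums`, `counts`, and `means` names are illustrative only).

```r
# Hypothetical toy data: two groups with known means
toy <- data.frame(g = c("a", "a", "b", "b", "b"),
                  X = c(1, 3, 2, 4, 6),
                  Y = c(10, 20, 30, 30, 30))

# Per-group column sums, one row per group
sums <- as.matrix(rowsum(toy[, c("X", "Y")], group = toy$g))

# Per-group row counts, in the same sorted group order as rowsum
counts <- as.vector(table(toy$g))

# Repeat the counts once per column so each sum is divided by its group's size
means <- sums / rep(counts, ncol(sums))
means
```

Both rowsum and table sort the groups the same way by default, which is what makes the element-wise division line up.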

I think you may also be able to get the result quickly with data.table. Finally, do not use T as a variable name. T is shorthand for TRUE!
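
For reference, a data.table version might look like the sketch below. It was not part of the benchmark above, so no timing is claimed for it. It runs on a small made-up table called `dat` (3 hours of 5-minute readings) rather than the full data, and it uses base R's trunc to get the hourly timestamp.

```r
library(data.table)

set.seed(123)
# Small hypothetical stand-in for the question's data: 3 hours of 5-minute readings
dat <- data.table(
  Datetime = seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
                 by = "5 min", length.out = 36),
  X = runif(36, 5, 10),
  Y = runif(36, 5, 10),
  Z = runif(36, 5, 10)
)

# Truncate each timestamp to the start of its hour
dat[, hourlyDate := as.POSIXct(trunc(Datetime, units = "hours"))]

# Average X, Y, and Z within each hour
hourly <- dat[, lapply(.SD, mean), by = hourlyDate, .SDcols = c("X", "Y", "Z")]
hourly
```
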
