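How to find rows whose datetime intervals overlap
I have a data frame like this, where min and max are the start and end of an interval, core_name is the instance name, and length_mins is the interval length.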
min max length_mins core_name
1 2020-07-28 03:05:30 2020-07-28 05:45:15 159.75 mins 0,1
2 2020-07-14 14:29:30 2020-07-14 16:36:45 127.25 mins 0,10
3 2020-07-16 15:32:45 2020-07-16 16:16:00 43.25 mins 0,11
4 2020-07-17 02:37:30 2020-07-17 05:27:30 170.00 mins 0,11
5 2020-07-18 02:42:00 2020-07-18 05:24:30 162.50 mins 0,11
6 2020-07-25 02:21:15 2020-07-25 04:59:15 158.00 mins 0,12
7 2020-07-16 15:40:15 2020-07-16 16:13:45 33.50 mins 0,13
8 2020-07-16 13:18:30 2020-07-16 16:13:30 175.00 mins 0,15
9 2020-07-16 14:43:00 2020-07-16 15:49:30 66.50 mins 0,2
10 2020-07-14 14:29:30 2020-07-14 16:55:15 145.75 mins 0,4
11 2020-07-16 13:32:45 2020-07-16 17:21:00 228.25 mins 0,6
12 2020-07-27 02:15:30 2020-07-27 05:04:15 168.75 mins 0,6
13 2020-07-14 14:29:30 2020-07-14 16:53:30 144.00 mins 0,8
14 2020-07-16 16:40:30 2020-07-16 21:19:45 279.25 mins 1,0
15 2020-07-14 21:03:15 2020-07-14 22:49:45 106.50 mins 1,1
16 2020-07-15 03:32:45 2020-07-15 06:15:15 162.50 mins 1,10
17 2020-07-16 15:58:15 2020-07-16 21:18:30 320.25 mins 1,10
18 2020-07-14 18:44:00 2020-07-14 20:00:15 76.25 mins 1,11
19 2020-07-14 21:12:00 2020-07-15 00:56:00 224.00 mins 1,11
20 2020-07-16 16:32:30 2020-07-16 19:30:15 177.75 mins 1,12
21 2020-07-14 15:39:15 2020-07-15 00:35:15 536.00 mins 1,13
22 2020-07-16 15:14:15 2020-07-16 21:14:00 359.75 mins 1,14
23 2020-07-14 14:29:30 2020-07-15 00:48:45 619.25 mins 1,15
24 2020-07-16 16:34:00 2020-07-16 20:58:15 264.25 mins 1,16
25 2020-07-14 20:19:15 2020-07-15 00:54:30 275.25 mins 1,17
26 2020-07-16 16:35:00 2020-07-16 21:18:00 283.00 mins 1,18
27 2020-07-14 14:29:30 2020-07-14 19:20:45 291.25 mins 1,19
28 2020-07-14 20:13:00 2020-07-15 01:00:45 287.75 mins 1,19
29 2020-07-16 16:27:45 2020-07-16 21:07:15 279.50 mins 1,2
30 2020-07-14 14:29:30 2020-07-15 00:57:30 628.00 mins 1,3
31 2020-07-16 16:32:30 2020-07-16 21:15:45 283.25 mins 1,4
32 2020-07-14 20:42:15 2020-07-15 00:44:45 242.50 mins 1,5
33 2020-07-16 16:25:00 2020-07-16 21:16:45 291.75 mins 1,6
34 2020-07-14 18:24:00 2020-07-14 23:08:15 284.25 mins 1,7
35 2020-07-16 02:29:30 2020-07-16 05:11:00 161.50 mins 1,7
36 2020-07-16 16:37:45 2020-07-16 21:16:30 278.75 mins 1,8
37 2020-07-14 14:29:30 2020-07-15 00:59:15 629.75 mins 1,9
I need to:
- find the rows that overlap each other,
- count the overlaps,
- get, for each core, the list of cores it overlaps with.
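All three requirements rest on one pairwise test. As a minimal sketch, assuming the endpoints count as part of each interval: two intervals overlap when each one starts no later than the other ends. The helper names `intervals_overlap` and `ts_utc` are hypothetical and used only for illustration.

```r
# Hypothetical helper: TRUE when [a_start, a_end] and [b_start, b_end] share
# at least one instant (endpoints are treated as inclusive, an assumption).
intervals_overlap <- function(a_start, a_end, b_start, b_end) {
  a_start <= b_end & b_start <= a_end
}

# Hypothetical shorthand for building timestamps in a fixed time zone
ts_utc <- function(x) as.POSIXct(x, tz = "UTC")

# Rows 3 and 7 of the table above (cores 0,11 and 0,13): same afternoon
rows_3_7 <- intervals_overlap(
  ts_utc("2020-07-16 15:32:45"), ts_utc("2020-07-16 16:16:00"),
  ts_utc("2020-07-16 15:40:15"), ts_utc("2020-07-16 16:13:45")
)

# Rows 3 and 4 (both core 0,11): row 4 starts the following night
rows_3_4 <- intervals_overlap(
  ts_utc("2020-07-16 15:32:45"), ts_utc("2020-07-16 16:16:00"),
  ts_utc("2020-07-17 02:37:30"), ts_utc("2020-07-17 05:27:30")
)

print(c(rows_3_7 = rows_3_7, rows_3_4 = rows_3_4))
```

Because `&` and the comparison operators are vectorised, the same helper can compare one interval against whole columns at once.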
This is the result I get:
min max length_mins core_name overlaps
1 2020-07-14 14:29:30 2020-07-15 00:59:15 629.75 mins 1,9 15
2 2020-07-14 14:29:30 2020-07-15 00:57:30 628.00 mins 1,3 15
3 2020-07-14 14:29:30 2020-07-15 00:48:45 619.25 mins 1,15 15
4 2020-07-14 15:39:15 2020-07-15 00:35:15 536.00 mins 1,13 15
5 2020-07-16 15:14:15 2020-07-16 21:14:00 359.75 mins 1,14 15
6 2020-07-16 13:32:45 2020-07-16 17:21:00 228.25 mins 0,6 15
7 2020-07-16 15:58:15 2020-07-16 21:18:30 320.25 mins 1,10 14
8 2020-07-14 18:24:00 2020-07-14 23:08:15 284.25 mins 1,7 12
9 2020-07-16 16:25:00 2020-07-16 21:16:45 291.75 mins 1,6 11
10 2020-07-16 16:32:30 2020-07-16 21:15:45 283.25 mins 1,4 11
11 2020-07-16 16:35:00 2020-07-16 21:18:00 283.00 mins 1,18 11
12 2020-07-16 16:27:45 2020-07-16 21:07:15 279.50 mins 1,2 11
13 2020-07-16 16:40:30 2020-07-16 21:19:45 279.25 mins 1,0 11
14 2020-07-16 16:37:45 2020-07-16 21:16:30 278.75 mins 1,8 11
15 2020-07-16 16:34:00 2020-07-16 20:58:15 264.25 mins 1,16 11
16 2020-07-16 16:32:30 2020-07-16 19:30:15 177.75 mins 1,12 11
17 2020-07-14 14:29:30 2020-07-14 19:20:45 291.25 mins 1,19 10
18 2020-07-14 20:13:00 2020-07-15 01:00:45 287.75 mins 1,19 10
19 2020-07-14 20:19:15 2020-07-15 00:54:30 275.25 mins 1,17 10
20 2020-07-14 20:42:15 2020-07-15 00:44:45 242.50 mins 1,5 10
21 2020-07-14 21:12:00 2020-07-15 00:56:00 224.00 mins 1,11 10
22 2020-07-14 21:03:15 2020-07-14 22:49:45 106.50 mins 1,1 10
23 2020-07-14 14:29:30 2020-07-14 16:55:15 145.75 mins 0,4 8
24 2020-07-14 14:29:30 2020-07-14 16:53:30 144.00 mins 0,8 8
25 2020-07-14 14:29:30 2020-07-14 16:36:45 127.25 mins 0,10 8
26 2020-07-16 13:18:30 2020-07-16 16:13:30 175.00 mins 0,15 7
27 2020-07-14 18:44:00 2020-07-14 20:00:15 76.25 mins 1,11 7
28 2020-07-16 15:32:45 2020-07-16 16:16:00 43.25 mins 0,11 7
29 2020-07-16 15:40:15 2020-07-16 16:13:45 33.50 mins 0,13 7
30 2020-07-16 14:43:00 2020-07-16 15:49:30 66.50 mins 0,2 6
31 2020-07-17 02:37:30 2020-07-17 05:27:30 170.00 mins 0,11 1
32 2020-07-27 02:15:30 2020-07-27 05:04:15 168.75 mins 0,6 1
33 2020-07-18 02:42:00 2020-07-18 05:24:30 162.50 mins 0,11 1
34 2020-07-15 03:32:45 2020-07-15 06:15:15 162.50 mins 1,10 1
35 2020-07-16 02:29:30 2020-07-16 05:11:00 161.50 mins 1,7 1
36 2020-07-28 03:05:30 2020-07-28 05:45:15 159.75 mins 0,1 1
37 2020-07-25 02:21:15 2020-07-25 04:59:15 158.00 mins 0,12 1
cores_list
1 1,9;0,10;0,4;0,8;1,1;1,11;1,13;1,15;1,17;1,19;1,3;1,5;1,7
2 1,3;0,7;1,9
3 1,15;0,9
4 1,13;0,9
5 1,14;0,11;0,2;0,6;1,0;1,10;1,12;1,16;1,18;1,2;1,4;1,8
6 0,6;0,14;1,8
7 1,8
8 1,9
9 1,8
10 1,8
11 1,18;0,8
12 1,8
13 1,0;0,8
14 1,8;0,6
15 1,16;0,8
16 1,12;0,8
17 1,19;0,9
18 1,9
19 1,9
20 1,9
21 1,9
22 1,9
23 0,9
24 0,9
25 0,9
26 0,14
27 1,9
28 0,14
29 0,14
30 0,14
31 0,11
32 0,6
33 0,11
34 1,10
35 1,7
36 0,1
37 0,12
Here is my code with example data:
# find overlaps
library(dplyr)
library(lubridate)

# example data (37 rows, same values as the table above)
data.example <-
  structure(
    list(
      min = structure(
        c(
          1595894730, 1594726170, 1594902765, 1594942650, 1595029320, 1595632875,
          1594903215, 1594894710, 1594899780, 1594726170, 1594895565, 1595805330,
          1594726170, 1594906830, 1594749795, 1594773165, 1594904295, 1594741440,
          1594750320, 1594906350, 1594730355, 1594901655, 1594726170, 1594906440,
          1594747155, 1594906500, 1594726170, 1594746780, 1594906065, 1594726170,
          1594906350, 1594748535, 1594905900, 1594740240, 1594855770, 1594906665,
          1594726170
        ),
        tzone = "", class = c("POSIXct", "POSIXt")
      ),
      max = structure(
        c(
          1595904315, 1594733805, 1594905360, 1594952850, 1595039070, 1595642355,
          1594905225, 1594905210, 1594903770, 1594734915, 1594909260, 1595815455,
          1594734810, 1594923585, 1594756185, 1594782915, 1594923510, 1594746015,
          1594763760, 1594917015, 1594762515, 1594923240, 1594763325, 1594922295,
          1594763670, 1594923480, 1594743645, 1594764045, 1594922835, 1594763850,
          1594923345, 1594763085, 1594923405, 1594757295, 1594865460, 1594923390,
          1594763955
        ),
        tzone = "", class = c("POSIXct", "POSIXt")
      ),
      length_mins = structure(
        c(
          159.75, 127.25, 43.25, 170, 162.5, 158, 33.5, 175, 66.5, 145.75,
          228.25, 168.75, 144, 279.25, 106.5, 162.5, 320.25, 76.25, 224, 177.75,
          536, 359.75, 619.25, 264.25, 275.25, 283, 291.25, 287.75, 279.5, 628,
          283.25, 242.5, 291.75, 284.25, 161.5, 278.75, 629.75
        ),
        class = "difftime", units = "mins"
      ),
      core_name = c(
        "0,1", "0,10", "0,11", "0,11", "0,11", "0,12", "0,13", "0,15", "0,2", "0,4",
        "0,6", "0,6", "0,8", "1,0", "1,1", "1,10", "1,10", "1,11", "1,11", "1,12",
        "1,13", "1,14", "1,15", "1,16", "1,17", "1,18", "1,19", "1,19", "1,2", "1,3",
        "1,4", "1,5", "1,6", "1,7", "1,7", "1,8", "1,9"
      )
    ),
    row.names = c(NA, -37L), class = "data.frame"
  )
print(data.example)
# every row starts with a count of 1 (itself) and its own core name
data.example <- data.example %>% mutate(overlaps = 1, cores_list = core_name)
print("Calculating rows overlaps")
for (i in 1:(nrow(data.example) - 1)) {
  min_el1 <- data.example[i, ]$min
  max_el1 <- data.example[i, ]$max
  for (k in (i + 1):nrow(data.example)) {
    min_el2 <- data.example[k, ]$min
    max_el2 <- data.example[k, ]$max
    el1_interval <- interval(min_el1, max_el1)
    el2_interval <- interval(min_el2, max_el2)
    overlaps <- int_overlaps(el1_interval, el2_interval)
    if (overlaps) {
      print(paste("row", i, "overlaps with row", k))
      data.example[k, ]$overlaps <- data.example[k, ]$overlaps + 1
      data.example[i, ]$overlaps <- data.example[i, ]$overlaps + 1
      # append the other row's core name unless it is already in the list
      if (!grepl(data.example[k, ]$core_name, data.example[i, ]$cores_list, fixed = TRUE)) {
        data.example[i, ]$cores_list <- paste(data.example[i, ]$cores_list, data.example[k, ]$core_name, sep = ";")
      }
      if (!grepl(data.example[i, ]$core_name, data.example[k, ]$cores_list, fixed = TRUE)) {
        data.example[k, ]$cores_list <- paste(data.example[k, ]$cores_list, data.example[i, ]$core_name, sep = ";")
      }
    }
  }
}
# most-overlapped rows first, ties broken by interval length
data.example <- data.example %>% arrange(desc(overlaps), desc(length_mins))
print(data.example)
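I am happy with the result, but my code is very slow: with a few hundred rows it takes several minutes to run. I am sure the nested loops can be avoided and the code sped up significantly. Any help would be appreciated.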
Answer
This should work; it seems to match the desired output.
library(data.table)
# make it a data.table
setDT(data.example)
# create a temp id column and set it as key (for use with .EACHI later on)
data.example[, id := .I]
setkey(data.example, id)
# self join, evaluated once per row
data.example[data.example, c("overlaps", "cores_list") := {
  # all rows whose interval overlaps the current row's interval (itself included)
  temp <- data.example[min <= i.max & max >= i.min, ]
  list(nrow(temp), paste0(temp$core_name, collapse = ";"))
}, by = .EACHI]
# if desired, you can drop the id column using: data.example[, id := NULL]
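For comparison, the same inclusive test (`min <= i.max & max >= i.min`) can be sketched in base R without any package, by building an n-by-n logical matrix with `outer()`. This is only an illustration, not part of the answer above: the function name `overlap_summary` and the toy data are hypothetical, and it assumes the table is small enough for an n-by-n matrix to fit in memory. It lists core names in row order and lists each name once, so the strings may be ordered differently from the loop's output.

```r
# Hypothetical helper: adds `overlaps` (count, including the row itself) and
# `cores_list` (overlapping core names joined by ";") to a data frame that
# has POSIXct columns `min` and `max` and a character column `core_name`.
overlap_summary <- function(df) {
  starts <- as.numeric(df$min)
  ends <- as.numeric(df$max)
  # hit[i, j] is TRUE when row i starts before row j ends
  # and row i ends after row j starts (endpoints inclusive)
  hit <- outer(starts, ends, "<=") & outer(ends, starts, ">=")
  df$overlaps <- rowSums(hit)
  df$cores_list <- apply(hit, 1, function(h) {
    paste(unique(df$core_name[h]), collapse = ";")
  })
  df
}

# Toy data: the first two intervals overlap, the third stands alone
toy <- data.frame(
  min = as.POSIXct(c("2020-07-14 10:00:00", "2020-07-14 11:00:00",
                     "2020-07-15 09:00:00"), tz = "UTC"),
  max = as.POSIXct(c("2020-07-14 12:00:00", "2020-07-14 13:00:00",
                     "2020-07-15 10:00:00"), tz = "UTC"),
  core_name = c("0,1", "0,2", "1,0"),
  stringsAsFactors = FALSE
)
res <- overlap_summary(toy)
print(res)
```

The matrix grows quadratically with the number of rows, so for large tables a keyed join such as the data.table approach is likely the better fit.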