Group by 1-minute intervals for a chain of actions in SQL BigQuery


I need to group data into 1-minute intervals for a chain of actions. My data looks like this:

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   b         2020-09-01-12:36:54   First        www.stackoverflow/a12345
111   f         2020-09-01-12:36:56   First        www.stackoverflow/xxxx
111   b         2020-09-01-12:36:58   Midpoint     www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:03   Third        www.stackoverflow/a12345
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   d         2020-09-01-15:17:44   First        www.stackoverflow/a2222
222   d         2020-09-01-15:17:48   Midpoint     www.stackoverflow/a2222
222   d         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

I need to get the data as follows: if, for a given x_id and x_url, there is a row whose action_name is Complete, take that row. If there is no Complete, take Third, and so on.

SELECT
  ARRAY_AGG(current_query_result 
    ORDER BY CASE ActionName
      WHEN 'Complete' THEN 1
      WHEN 'Third' THEN 2
      WHEN 'Midpoint' THEN 3
      WHEN 'First' THEN 4
    END
    LIMIT 1
  )[OFFSET(0)]
FROM
    (
        SELECT d.id,c.Time,c.ActionName,c.refererurl,c.MetroId
        FROM
            `bq_query_table_c` c
            INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
        WHERE
            c.refererurl LIKE "https://www.stackoverflow/%"
            AND c.ActionName in ('First','Midpoint','Third','Complete')
    ) current_query_result
GROUP BY
    id,refererurl,MetroId,
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S',time),
      INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S',time)),60) SECOND
    )

Desired output:

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   d         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

Solution

This sounds like a gaps-and-islands problem, where the gaps are longer than 1 minute and the islands represent the "chains of actions".

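I would start by building groups that represent the islands. For this, you can use lag() to retrieve the time of the previous action, and a cumulative sum that increments each time two consecutive actions are more than 1 minute apart: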

-- Assumes time is a TIMESTAMP; if it is stored as a string, parse it first
-- with parse_timestamp('%Y-%m-%d-%H:%M:%S', time).
select t.*,
    -- running count of gaps longer than 1 minute = island number
    sum(case when time > timestamp_add(lag_time,interval 1 minute) then 1 else 0 end)
        over(partition by id,refererurl order by time) grp
from (
    select d.id,c.time,c.actionname,c.refererurl,
        lag(c.time) over(partition by d.id,c.refererurl order by c.time) lag_time
    from `bq_query_table_c` c
    inner join `bq_table_d` d on d.id = c.creativeid
    where c.refererurl like "https://www.stackoverflow/%"
        and c.actionname in ('First','Midpoint','Third','Complete')
) t

grp is the island identifier.
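To make the lag-plus-running-sum idea concrete, here is a minimal Python sketch of the same island numbering, applied to the timestamps of one id/refererurl pair from the question. The helper name `assign_islands` is hypothetical and only illustrates the logic; it is not part of the BigQuery query.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H:%M:%S"

def assign_islands(times, gap=timedelta(minutes=1)):
    """Return one island number per timestamp (input must be sorted ascending)."""
    grp, out, prev = 0, [], None
    for t in times:
        # Same test as the SQL: time > lag_time + 1 minute starts a new island.
        # The first row has no previous time, so it stays in island 0.
        if prev is not None and t > prev + gap:
            grp += 1
        out.append(grp)
        prev = t
    return out

times = [datetime.strptime(s, FMT) for s in [
    "2020-09-01-09:19:00",
    "2020-09-01-12:36:54",
    "2020-09-01-12:36:58",
    "2020-09-01-12:37:03",
    "2020-09-01-12:37:09",
]]
islands = assign_islands(times)
print(islands)  # [0, 1, 1, 1, 1]
```

The 09:19:00 action forms its own island, while the four actions between 12:36:54 and 12:37:09 share one, even though they span two clock minutes.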

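From there, we can use your original logic to pick the preferred action in each group. We no longer need to aggregate by 1-minute buckets; we can use grp instead: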

select
    array_agg(t order by case actionname
        when 'Complete' then 1 
        when 'Third'    then 2
        when 'Midpoint' then 3
        when 'First'    then 4
    end limit 1)[offset(0)]
from (
    select t.*,
        sum(case when time > timestamp_add(lag_time,interval 1 minute) then 1 else 0 end)
            over(partition by id,refererurl order by time) grp
    from (
        select d.id,c.time,c.actionname,c.refererurl,
            lag(c.time) over(partition by d.id,c.refererurl order by c.time) lag_time
        from `bq_query_table_c` c
        inner join `bq_table_d` d on d.id = c.creativeid
        where c.refererurl like "https://www.stackoverflow/%"
            and c.actionname in ('First','Midpoint','Third','Complete')
    ) t
) t
group by id,refererurl,grp

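Note that if a single island contains two 'Complete' actions, it is undefined which one will be picked (your original query has much the same flaw). To make the result deterministic, add another sort criterion to ARRAY_AGG(), such as time: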

    array_agg(t order by case actionname
        when 'Complete' then 1 
        when 'Third'    then 2
        when 'Midpoint' then 3
        when 'First'    then 4
    end,time limit 1)[offset(0)]
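As a rough illustration of why the extra sort key matters, the Python sketch below picks one row per island using the pair (priority, time) as the sort key. The `PRIORITY` mapping, the `pick_preferred` helper and the second 'Complete' row at 12:37:20 are hypothetical, added only to show the tie-break.

```python
# Lower number = more preferred, mirroring the CASE expression.
PRIORITY = {"Complete": 1, "Third": 2, "Midpoint": 3, "First": 4}

def pick_preferred(rows):
    # (priority, time) mirrors ORDER BY CASE ... END, time LIMIT 1.
    # The fixed-width time strings sort chronologically as plain strings.
    return min(rows, key=lambda r: (PRIORITY[r["actionname"]], r["time"]))

island = [
    {"actionname": "Third", "time": "2020-09-01-12:37:03"},
    {"actionname": "Complete", "time": "2020-09-01-12:37:20"},  # hypothetical duplicate
    {"actionname": "Complete", "time": "2020-09-01-12:37:09"},
]
chosen = pick_preferred(island)
print(chosen["time"])  # 2020-09-01-12:37:09
```

With the time key in place, the earliest 'Complete' is returned regardless of the order in which the rows arrive.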

Below is for BigQuery Standard SQL:

#standardSQL
WITH temp AS (
  SELECT *,PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S',time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts,time_lag) FROM (
  SELECT *,TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts),ts,SECOND) time_lag
  FROM (
    SELECT AS VALUE ARRAY_AGG(t 
        ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC 
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First','Midpoint','Third','Complete')
    GROUP BY id,url,TIMESTAMP_SUB(ts,INTERVAL MOD(UNIX_SECONDS(ts),60) SECOND)
  )
)
WHERE NOT IFNULL(time_lag,777) < 60    
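The GROUP BY expression floors each timestamp to the start of its minute. The Python sketch below mirrors that arithmetic on two rows from the question; the `minute_bucket` helper is a hypothetical name, and UTC is assumed, which is the default time zone of PARSE_TIMESTAMP.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d-%H:%M:%S"

def minute_bucket(s):
    """Floor a timestamp string to its minute, like
    TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)."""
    ts = datetime.strptime(s, FMT).replace(tzinfo=timezone.utc)
    secs = int(ts.timestamp())
    return datetime.fromtimestamp(secs - secs % 60, tz=timezone.utc)

a = minute_bucket("2020-09-01-12:36:58")
b = minute_bucket("2020-09-01-12:37:03")
print(a.strftime(FMT), b.strftime(FMT))
# 2020-09-01-12:36:00 2020-09-01-12:37:00
```

Two actions only 5 seconds apart land in different buckets, which appears to be why the query adds the LEAD-based time_lag filter: a bucket's row is dropped when the next surviving row for the same id follows within 60 seconds.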

You can test and play with the query above using the sample data from the question:

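#standardSQL
WITH `project.dataset.bq_table` AS (
  SELECT 111 id,'2020-09-01-09:19:00' time,'First' action_name,'www.stackoverflow/a12345' url UNION ALL
  SELECT 111,'2020-09-01-12:36:54','First','www.stackoverflow/a12345' UNION ALL
  SELECT 111,'2020-09-01-12:36:58','Midpoint','www.stackoverflow/a12345' UNION ALL
  SELECT 111,'2020-09-01-12:37:03','Third','www.stackoverflow/a12345' UNION ALL
  SELECT 111,'2020-09-01-12:37:09','Complete','www.stackoverflow/a12345' UNION ALL
  SELECT 222,'2020-09-01-15:17:44','First','www.stackoverflow/a2222' UNION ALL
  SELECT 222,'2020-09-01-15:17:48','Midpoint','www.stackoverflow/a2222' UNION ALL
  SELECT 222,'2020-09-01-15:18:05','Third','www.stackoverflow/a2222' 
),temp AS (
  SELECT *,PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S',time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts,time_lag) FROM (
  SELECT *,TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts),ts,SECOND) time_lag
  FROM (
    SELECT AS VALUE ARRAY_AGG(t ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC LIMIT 1)[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First','Midpoint','Third','Complete')
    GROUP BY id,url,TIMESTAMP_SUB(ts,INTERVAL MOD(UNIX_SECONDS(ts),60) SECOND)
  )
)
WHERE NOT IFNULL(time_lag,777) < 60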

with the result:

Row     id      time                    action_name     url  
1       111     2020-09-01-09:19:00     First           www.stackoverflow/a12345     
2       111     2020-09-01-12:37:09     Complete        www.stackoverflow/a12345     
3       222     2020-09-01-15:18:05     Third           www.stackoverflow/a2222    

Note: I am still not 100% sure about your use case, but the above is based on what has been discussed in the comments so far.
