Redshift: Building a cumulative sum over a variable date range

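I'm struggling to build a query that dynamically builds a cumulative sum over a date range.

To put the problem as an analogy: I want to calculate the average number of room-service plates ordered per guest per day. Take the following sample dataset: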

guest_id	most_recent_plate_ordered_date	cumulative_plates_ordered
1	10/1/2020	1
1	10/2/2020	2
1	10/4/2020	3
2	10/1/2020	1
2	10/2/2020	1
3	10/3/2020	1
3	10/4/2020	2

This is the output I want to achieve:

date	cumulative_plates_ordered	number_of_people
10/1/2020	2	2
10/2/2020	3	2
10/3/2020	4	3
10/4/2020	6	3

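Essentially, I need to build two numbers: the sum of the maximum number of plates ordered per person, and the number of people per day. I have already generated the number of people per day; that part was easy. What I'm struggling with is a query that sums dynamically as the date range expands.

I'm able to write a query that gives me the number I need for a given maximum date. My problem is turning that into something that produces this number for every possible date in a single query. Here is a sample query for the range 10/1 to 10/1: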

select sum(max_cumulative_plates_ordered) from (
  select guest_id,max(cumulative_plates_ordered) as max_cumulative_plates_ordered
  from raw_data
  where most_recent_plate_ordered_date <= '2020-10-01'
  group by 1
)

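Any ideas? It feels like this should be an easy problem to solve.

Solution

"I'm able to write a query that gives me the number I need for a given maximum date. My problem is turning that into something that produces this number for every possible date in a single query."

Don't you just want the date in the group by clause?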

select dt,sum(cumulative_plates_ordered) as cumulative_plates_ordered,count(*) as number_of_people
from (
    select guest_id,most_recent_plate_ordered_date::date as dt,max(cumulative_plates_ordered) as cumulative_plates_ordered
    from raw_data
    group by 1,2
) t
group by dt

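Edit

If you want to take the "missing" dates into account, it is a bit different. You can use a cross join to generate all possible combinations of days and guests, then use a window function to fill the gaps: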

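select dt,
       sum(cumulative_plates_ordered) as cumulative_plates_ordered,
       -- count() skips nulls, so guests with no order on or before dt are excluded
       count(cumulative_plates_ordered) as number_of_people
from (
    select g.guest_id, d.dt,
           -- running maximum per guest; carries the last known value across missing dates
           -- (Redshift needs an explicit frame when an aggregate window has order by)
           max(max(t.cumulative_plates_ordered)) over(
               partition by g.guest_id
               order by d.dt
               rows between unbounded preceding and current row
           ) as cumulative_plates_ordered
    from (select distinct most_recent_plate_ordered_date::date as dt from raw_data) d
    cross join (select distinct guest_id from raw_data) g
    left join raw_data t
        on  t.guest_id = g.guest_id
        and t.most_recent_plate_ordered_date >= d.dt
        and t.most_recent_plate_ordered_date <  d.dt + interval '1 day'
    group by g.guest_id, d.dt
) t
group by dt
order by dt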
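
To try the queries above against the question's sample rows, a throwaway table along these lines could be used. This is only a sketch: the question does not state the column types, so the int and date types here are my assumption.

```sql
-- Assumed schema: column types are not given in the question.
create temp table raw_data (
    guest_id                       int,
    most_recent_plate_ordered_date date,
    cumulative_plates_ordered      int
);

-- The seven sample rows from the question, with dates in ISO format.
insert into raw_data values
    (1, '2020-10-01', 1),
    (1, '2020-10-02', 2),
    (1, '2020-10-04', 3),
    (2, '2020-10-01', 1),
    (2, '2020-10-02', 1),
    (3, '2020-10-03', 1),
    (3, '2020-10-04', 2);
```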

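If I understand correctly, you want:

The number of distinct people who have ordered up to a given date.
The sum of each guest's maximum cumulative_plates_ordered as of that date.

However, that suggests the value for 2020-10-03 is actually 4 rather than 5.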
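
As a quick check of that figure, the question's single-date query can be rerun with a 2020-10-03 cutoff. This is a sketch against the question's raw_data table; the per_guest alias is my addition.

```sql
-- Sum of each guest's highest cumulative count on or before 2020-10-03.
select sum(max_cumulative_plates_ordered) as plates_through_oct_3
from (
    select guest_id,
           max(cumulative_plates_ordered) as max_cumulative_plates_ordered
    from raw_data
    where most_recent_plate_ordered_date <= '2020-10-03'
    group by guest_id
) per_guest;
```

On the sample rows the per-guest maxima through 10/3 are 2, 1, and 1 for guests 1, 2, and 3, so the sum is 4.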

One approach is correlated subqueries:

-- t and most_recent_place_ordered are this answer's names; they appear to
-- correspond to raw_data and most_recent_plate_ordered_date in the question
select gs.dte::date,
       (select count(distinct t.guest_id)
        from t
        where t.most_recent_place_ordered <= gs.dte
       ) as num_guests,
       (select sum(plates)
        from (select t.guest_id, max(t.cumulative_plates_ordered) as plates
              from t
              where t.most_recent_place_ordered <= gs.dte
              group by t.guest_id
             ) t
       ) as num_plates
from (select distinct most_recent_place_ordered as dte from t) gs;

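What makes your data challenging is that the plate counts are already cumulative sums. You can use lag() to get the change on a particular date. With that, getting the result you want using window functions and aggregation is much simpler: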

with net as (
    select t.*,
           row_number() over (partition by guest_id order by most_recent_place_ordered) as seqnum,
           cumulative_plates_ordered
             - coalesce(lag(cumulative_plates_ordered) over (partition by guest_id order by most_recent_place_ordered), 0) as new_plates
    from t
)
select most_recent_place_ordered,
       -- running count of guests whose first order falls on or before this date
       sum(sum((seqnum = 1)::int)) over (order by most_recent_place_ordered rows between unbounded preceding and current row) as num_guests,
       -- running total of the per-date plate increases
       sum(sum(new_plates)) over (order by most_recent_place_ordered rows between unbounded preceding and current row) as num_plates
from net
group by most_recent_place_ordered
order by most_recent_place_ordered;

Here is a dbfiddle.
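
To see what the lag() step contributes, it can help to look at the per-date deltas before any running sum is applied. The sketch below reuses the table and column names from the lag() query (t, most_recent_place_ordered), which are that answer's names rather than the question's; the new_guests alias is mine.

```sql
with net as (
    select guest_id,
           most_recent_place_ordered,
           -- 1 marks a guest's first order date
           row_number() over (partition by guest_id
                              order by most_recent_place_ordered) as seqnum,
           -- increase since the guest's previous row (0 baseline for the first row)
           cumulative_plates_ordered
             - coalesce(lag(cumulative_plates_ordered) over (
                            partition by guest_id
                            order by most_recent_place_ordered), 0) as new_plates
    from t
)
select most_recent_place_ordered,
       sum((seqnum = 1)::int) as new_guests,  -- guests ordering for the first time
       sum(new_plates)        as new_plates   -- plates added on this date
from net
group by most_recent_place_ordered
order by most_recent_place_ordered;
```

On the question's sample data this should give (2, 2), (0, 1), (1, 1), (0, 2) for 10/1 through 10/4. Running totals of those two columns are 2, 2, 3, 3 guests and 2, 3, 4, 6 plates, which matches the desired output.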
