How to fix a PySpark rolling time window
I am trying to implement a rolling window with a 30-minute time range, grouped by source_ip. The idea is to get the average packets for each source_ip. I'm not sure this is the right approach. The problem I'm seeing is with ip 192.168.1.3: it seems to be averaging over more than the 30-minute window, since packet 25 arrives several days later.
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([
    ('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
    ('192.168.1.2', 1,  "2017-03-15T12:27:18+00:00"),
    ('192.168.1.2', 2,  "2017-03-15T12:28:18+00:00"),
    ('192.168.1.2', 3,  "2017-03-15T12:29:18+00:00"),
    ('192.168.1.3', 4,  "2017-03-15T12:28:18+00:00"),
    ('192.168.1.3', 5,  "2017-03-15T12:29:18+00:00"),
    ('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")
], ["source_ip", "packets", "timestampGMT"])

w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-1800, 0))
df = df.withColumn('rolling_average', F.avg("packets").over(w))
df.show(100, False)
This is the result I get. I would have expected the first two entries to be around 4.5 and the third entry to be 25?
+-----------+-------+-------------------------+------------------+
|source_ip |packets|timestampGMT |rolling_average |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|2.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|2.0 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|17.0 |
+-----------+-------+-------------------------+------------------+
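What is going wrong here (my reading of the output, not something Spark reports explicitly): casting the ISO-8601 string `"2017-03-15T12:28:18+00:00"` directly to `long` yields NULL, so the range frame has nothing meaningful to order by and every row ends up seeing its whole partition. That is why all three 192.168.1.3 rows show the plain partition average:

```python
# The value 11.333333333333334 in the output above is just the
# average of all of 192.168.1.3's packets, ignoring the window.
packets = [4, 5, 25]
partition_avg = sum(packets) / len(packets)
print(partition_avg)  # 11.333333333333334
```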
Solution
First convert the string to a Unix timestamp, then order by that column.
import pyspark.sql.functions as F
from pyspark.sql import Window
w = (Window()
.partitionBy("source_ip")
.orderBy(F.col("timestamp"))
.rangeBetween(-1800,0))
df = df.withColumn("timestamp",F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
.withColumn('rolling_average',F.avg("packets").over(w))
df.printSchema()
df.show(100,False)
root
|-- source_ip: string (nullable = true)
|-- packets: long (nullable = true)
|-- timestampGMT: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- rolling_average: double (nullable = true)
+-----------+-------+-------------------------+----------+---------------+
|source_ip |packets|timestampGMT |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|1489580838|1.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|1489580898|1.5 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|1489580958|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|1489159638|17.0 |
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|1489580898|4.0 |
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|1489580958|4.5 |
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|1489836438|25.0 |
+-----------+-------+-------------------------+----------+---------------+