PySpark rolling time window per source_ip

I am trying to implement a rolling window with a 30-minute time range, grouped by source_ip. The idea is to get the average of packets for each source_ip. I am not sure this is the right approach. The problem I am having is with ip 192.168.1.3: it seems to be averaged over more than the 30-minute window, since packet 25 arrives several days later.

df = sqlContext.createDataFrame([
    ('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
    ('192.168.1.2', 1, "2017-03-15T12:27:18+00:00"),
    ('192.168.1.2', 2, "2017-03-15T12:28:18+00:00"),
    ('192.168.1.2', 3, "2017-03-15T12:29:18+00:00"),
    ('192.168.1.3', 4, "2017-03-15T12:28:18+00:00"),
    ('192.168.1.3', 5, "2017-03-15T12:29:18+00:00"),
    ('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")
], ["source_ip", "packets", "timestampGMT"])

import pyspark.sql.functions as F
from pyspark.sql import Window

w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-1800, 0))

df = df.withColumn('rolling_average', F.avg("packets").over(w))

df.show(100, False)

This is the result I get. I would expect the first two entries for 192.168.1.3 to average out to 4.5 and the third entry to be 25?

+-----------+-------+-------------------------+------------------+
|source_ip  |packets|timestampGMT             |rolling_average   |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4      |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5      |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25     |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1      |2017-03-15T12:27:18+00:00|2.0               |
|192.168.1.2|2      |2017-03-15T12:28:18+00:00|2.0               |
|192.168.1.2|3      |2017-03-15T12:29:18+00:00|2.0               |
|192.168.1.1|17     |2017-03-10T15:27:18+00:00|17.0              |
+-----------+-------+-------------------------+------------------+
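As a side note, not part of the original question: the reason every row gets the whole-partition average is that casting an ISO-8601 string such as "2017-03-15T12:27:18+00:00" to long yields NULL in Spark, so all rows in a partition become ordering peers and the range frame covers the entire partition. A minimal diagnostic sketch, reusing the df and imports above:

# Diagnostic sketch: the ISO-8601 string is not numeric, so casting it
# to long produces NULL for every row; rangeBetween(-1800, 0) then has
# nothing meaningful to order by.
df.select(
    F.col("timestampGMT"),
    F.col("timestampGMT").cast('long').alias("ts_as_long")  # NULL everywhere
).show(truncate=False)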

Solution

First convert the string to an actual timestamp and order the window by its Unix-time value, instead of the string cast that produces NULL. With a numeric ordering column, rangeBetween(-1800, 0) behaves as a true 30-minute window.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Order by the numeric Unix timestamp so that rangeBetween(-1800, 0)
# really means "the preceding 30 minutes".
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-1800, 0))

# Parse the ISO-8601 string into a timestamp, convert it to Unix
# seconds, and compute the rolling average over that frame.
df = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
    .withColumn('rolling_average', F.avg("packets").over(w))

df.printSchema()
df.show(100, False)


root
 |-- source_ip: string (nullable = true)
 |-- packets: long (nullable = true)
 |-- timestampGMT: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- rolling_average: double (nullable = true)

+-----------+-------+-------------------------+----------+---------------+
|source_ip  |packets|timestampGMT             |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1      |2017-03-15T12:27:18+00:00|1489580838|1.0            |
|192.168.1.2|2      |2017-03-15T12:28:18+00:00|1489580898|1.5            |
|192.168.1.2|3      |2017-03-15T12:29:18+00:00|1489580958|2.0            |
|192.168.1.1|17     |2017-03-10T15:27:18+00:00|1489159638|17.0           |
|192.168.1.3|4      |2017-03-15T12:28:18+00:00|1489580898|4.0            |
|192.168.1.3|5      |2017-03-15T12:29:18+00:00|1489580958|4.5            |
|192.168.1.3|25     |2017-03-18T11:27:18+00:00|1489836438|25.0           |
+-----------+-------+-------------------------+----------+---------------+
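As a quick sanity check, an illustrative addition not part of the original answer, you can count how many rows fall into each frame alongside the average:

# Sketch: expose the frame size next to the rolling average. The count
# is 1 for an isolated packet and grows only when neighbouring rows fall
# within the preceding 1800 seconds.
df = df.withColumn("rows_in_window", F.count("packets").over(w))
df.show(100, False)

For 192.168.1.3 this yields counts of 1, 2, 1, confirming that the packet from 2017-03-18 no longer leaks into the 30-minute window.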
