How to group and merge these rows of a Spark DataFrame
Suppose I have a table like this:
A | B | C | D | E | F
x1 | 5 | 20200115 | 15 | 4.5 | 1
x1 | 10 | 20200825 | 15 | 5.6 | 19
x2 | 10 | 20200115 | 15 | 4.1 | 1
x2 | 10 | 20200430 | 15 | 9.1 | 1
I want to merge these rows on column A and produce a dataframe like this:
A | B | C | D | E | F
x1 | 15 | 20200825 | 15 | 5.6 | 19
x2 | 10 | 20200115 | 15 | 4.1 | 1
x2 | 10 | 20200430 | 15 | 9.1 | 1
Basically, if the sum of column B within a group of column A equals the value of column D, then:
- the new value of column B is that sum
- columns C, E and F are taken from the row with the latest value of column C (a date in YYYYMMDD format)
Since that condition does not hold for group x2 (the sum of column B is 20, which does not equal the 15 in column D), I want to keep both records in the target.
Assumption: in my data, column D is the same for all rows of a given group (15 in this example).
I have looked at a lot of grouping and windowing (partitioning) examples, but this seems different to me, so I have not been able to narrow it down.
Can I pipe the grouped data into a UDF and do something with each group there?
PS: I am building this in pyspark, so it would be great if your example could be in pyspark.
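For reference, this is roughly what I had in mind by piping each group through a UDF. It is only an untested sketch, assuming Spark 3.x (applyInPandas) and the df shown above; merge_group is just a placeholder name, and I am not sure it is the right approach:

import pandas as pd

def merge_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # if the group's B-sum equals D, collapse to the latest-C row with B = sum(B)
    if pdf["B"].sum() == pdf["D"].iloc[0]:
        latest = pdf.sort_values("C").tail(1).copy()
        latest["B"] = pdf["B"].sum()
        return latest
    # otherwise keep every row of the group unchanged
    return pdf

result = df.groupBy("A").applyInPandas(merge_group, schema=df.schema)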
Solution
Try this: use sum + max together with window functions.
df.show(false)
df.printSchema()
/**
* +---+---+--------+---+---+---+
* |A |B |C |D |E |F |
* +---+---+--------+---+---+---+
* |x1 |5 |20200115|15 |4.5|1 |
* |x1 |10 |20200825|15 |5.6|19 |
* |x2 |10 |20200115|15 |4.1|1 |
* |x2 |10 |20200430|15 |9.1|1 |
* +---+---+--------+---+---+---+
*
* root
* |-- A: string (nullable = true)
* |-- B: integer (nullable = true)
* |-- C: integer (nullable = true)
* |-- D: integer (nullable = true)
* |-- E: double (nullable = true)
* |-- F: integer (nullable = true)
*/
val w = Window.partitionBy("A")
df.withColumn("sum",sum("B").over(w))
.withColumn("latestC",max("C").over(w))
.withColumn("retain",when($"sum" === $"D",when($"latestC" === $"C",true).otherwise(false) )
.otherwise(true) )
.where($"retain" === true)
.withColumn("B",$"sum").otherwise($"B") )
.otherwise($"B"))
.show(false)
/**
* +---+---+--------+---+---+---+---+--------+------+
* |A |B |C |D |E |F |sum|latestC |retain|
* +---+---+--------+---+---+---+---+--------+------+
* |x1 |15 |20200825|15 |5.6|19 |15 |20200825|true |
* |x2 |10 |20200115|15 |4.1|1 |20 |20200430|true |
* |x2 |10 |20200430|15 |9.1|1 |20 |20200430|true |
* +---+---+--------+---+---+---+---+--------+------+
*/
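Since you asked for pyspark, a rough equivalent of the same sum + max window logic (an untested sketch, assuming the same column names and types as above) would be:

from pyspark.sql import functions as F, Window as W

w = W.partitionBy("A")
result = (
    df.withColumn("sum", F.sum("B").over(w))
    .withColumn("latestC", F.max("C").over(w))
    # keep a row when its group does not satisfy sum(B) == D, or when it is the latest-C row
    .withColumn(
        "retain",
        F.when(F.col("sum") == F.col("D"), F.col("latestC") == F.col("C")).otherwise(F.lit(True)),
    )
    .where(F.col("retain"))
    # overwrite B with the group sum only for the groups where the condition held
    .withColumn("B", F.when(F.col("sum") == F.col("D"), F.col("sum")).otherwise(F.col("B")))
    .drop("sum", "latestC", "retain")
)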
In pyspark, I would do it like this:
from pyspark.sql import functions as F, Window as W

b = ["A", "B", "C", "D", "E", "F"]
a = [
    ("x1", 5, "20200115", 15, 4.5, 1),
    ("x1", 10, "20200825", 15, 5.6, 19),
    ("x2", 10, "20200115", 15, 4.1, 1),
    ("x2", 10, "20200430", 15, 9.1, 1),
]
df = spark.createDataFrame(a, b)

# group sum of B for each value of A
df = df.withColumn("B_sum", F.sum("B").over(W.partitionBy("A")))

# groups where the sum condition holds get collapsed, the others are kept as-is
process_df = df.where("D >= B_sum")
no_process_df = df.where("D < B_sum").drop("B_sum")

process_df = (
    process_df.withColumn(
        "rng", F.row_number().over(W.partitionBy("A").orderBy(F.col("C").desc()))
    )
    .where("rng = 1")  # keep only the latest-C row of each group
    .select("A", F.col("B_sum").alias("B"), "C", "D", "E", "F")
)

final_output = process_df.unionByName(no_process_df)
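The two branches are matched up by column name in unionByName; showing the result (with an optional orderBy, since show() alone does not guarantee row order) gives:

final_output.orderBy("A", "C").show()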
+---+---+--------+---+---+---+
| A| B| C| D| E| F|
+---+---+--------+---+---+---+
| x1| 15|20200825| 15|5.6| 19|
| x2| 10|20200115| 15|4.1| 1|
| x2| 10|20200430| 15|9.1| 1|
+---+---+--------+---+---+---+