如何解决有条件计算总和
以下是我的数据,我正在使用parcel_id进行分组,如果需要 imprv_det_type_cd以MA开头
输入:
+------------+----+-----+-----------------+
| parcel_id|year| sqft|imprv_det_type_cd|
+------------+----+-----+-----------------+
|000000100010|2014| 4272| MA|
|000000100010|2014| 800| 60P|
|000000100010|2014| 3200| MA2|
|000000100010|2014| 1620| 49R|
|000000100010|2014| 1446| 46R|
|000000100010|2014|40140| 45B|
|000000100010|2014| 1800| 45C|
|000000100010|2014| 864| 49C|
|000000100010|2014| 1| 48S|
+------------+----+-----+-----------------+
在这种情况下,从上方仅考虑两行。
预期输出:
+---------+-----------------+--------------------+----------+
|parcel_id|imprv_det_type_cd|structure_total_sqft|year_built|
+---------+-----------------+--------------------+----------+
|100010 |MA |7472 |2014 |
+---------+-----------------+--------------------+----------+
代码:
# read APPRAISAL_IMPROVEMENT_DETAIL.TXT
def _transfrom_imp_detail():
w_impr = Window.partitionBy("parcel_id")
return(
(spark.read.text(path_ade_imp_info)
.select(
F.trim(F.col("value").substr(1,12)).alias("parcel_id"),F.trim(F.col("value").substr(86,4)).cast("integer").alias("year"),F.trim(F.col("value").substr(94,15)).cast("integer").alias("sqft"),F.trim(F.col("value").substr(41,10)).alias("imprv_det_type_cd"),)
.withColumn(
"parcel_id",F.regexp_replace('parcel_id',r'^[0]*','')
)
.withColumn("structure_total_sqft",F.sum("sqft").over(w_impr))
.withColumn("year_built",F.min("year").over(w_impr))
).drop("sqft","year").drop_duplicates(["parcel_id"])
)
我知道此代码中的.withColumn(“ structure_total_sqft”,F.sum(“ sqft”)。over(w_impr))有所更改,但不确定我必须做什么。我尝试了功能,但仍然无法正常工作。
先谢谢您
解决方法
我不知道您为什么要groupBy
,但您不想这么做。
df.withColumn('parcel_id',f.regexp_replace('parcel_id',r'^[0]*','')) \
.filter("imprv_det_type_cd like 'MA%'") \
.groupBy('parcel_id','year') \
.agg(f.sum('sqft').alias('sqft'),f.first(f.substring('imprv_det_type_cd',2)).alias('imprv_det_type_cd')) \
.show(10,False)
+---------+----+------+-----------------+
|parcel_id|year|sqft |imprv_det_type_cd|
+---------+----+------+-----------------+
|100010 |2014|7472.0|MA |
+---------+----+------+-----------------+
,
使用 sum(when(..))
df2.show(false)
df2.printSchema()
/**
* +------------+----+-----+-----------------+
* |parcel_id |year|sqft |imprv_det_type_cd|
* +------------+----+-----+-----------------+
* |000000100010|2014|4272 |MA |
* |000000100010|2014|800 |60P |
* |000000100010|2014|3200 |MA2 |
* |000000100010|2014|1620 |49R |
* |000000100010|2014|1446 |46R |
* |000000100010|2014|40140|45B |
* |000000100010|2014|1800 |45C |
* |000000100010|2014|864 |49C |
* |000000100010|2014|1 |48S |
* +------------+----+-----+-----------------+
*
* root
* |-- parcel_id: string (nullable = true)
* |-- year: string (nullable = true)
* |-- sqft: string (nullable = true)
* |-- imprv_det_type_cd: string (nullable = true)
*/
val p = df2.groupBy(expr("cast(parcel_id as integer) as parcel_id"))
.agg(
sum(when($"imprv_det_type_cd".startsWith("MA"),$"sqft")).as("structure_total_sqft"),first("imprv_det_type_cd").as("imprv_det_type_cd"),first($"year").as("year_built")
)
p.show(false)
p.explain()
/**
* +---------+--------------------+-----------------+----------+
* |parcel_id|structure_total_sqft|imprv_det_type_cd|year_built|
* +---------+--------------------+-----------------+----------+
* |100010 |7472.0 |MA |2014 |
* +---------+--------------------+-----------------+----------+
*/
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。