How to update a column value in a Spark DataFrame based on another column
I have a Spark DataFrame as described below.
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (1, "", "SNACKS", "BISCUITS - AMBIENT", "BISCUITS - AMBIENT", "", "REFLETS DE FRANCE CROQUANT", "UNCOATED BISCUIT", "NO PROMOTION", "", "", "400G", "", ""),
  (2, "GROCERY", "BISCUITS", "SWEET BISCUITS ", "BISCUITS - AMBIENT", "", "", "AMBIENT BISCUIT", "NO PROMOTION", "", "", "400G", "", "CHOCOS")
)).toDF("id", "c4", "c1001", "c1002", "c1003", "c1008", "c1008_unmasked", "c1009", "c1011", "c1012", "c1013", "c1015", "c1016", "c1016_unmasked")
data.show(false)
data.show(false)
Sample input:
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4 |c1001 |c1002 |c1003 |c1008|c1008_unmasked |c1009 |c1011 |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1 | |SNACKS |BISCUITS - AMBIENT|BISCUITS - AMBIENT| |REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT|NO PROMOTION| | |400G | | |
|2 |GROCERY|BISCUITS|SWEET BISCUITS |BISCUITS - AMBIENT| | |AMBIENT BISCUIT |NO PROMOTION| | |400G | |CHOCOS |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
The column cXXXX should be filled with the value "MASKED" only when the corresponding cXXXX_unmasked column holds a value. Please check the sample output below for a better understanding.
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4 |c1001 |c1002 |c1003 |c1008 |c1008_unmasked |c1009 |c1011 |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1 | |SNACKS |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED BISCUIT|NO PROMOTION| | |400G | | |
|2 |GROCERY|BISCUITS|SWEET BISCUITS |BISCUITS - AMBIENT| | |AMBIENT BISCUIT |NO PROMOTION| | |400G |MASKED|CHOCOS |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
Thanks in advance.
Solution
Here is my attempt.
import org.apache.spark.sql.functions.{col, lit, when}

// Collect every *_unmasked column, then thread the DataFrame through
// foldLeft, rewriting the matching target column at each step.
val cols = data.columns.filter(_.endsWith("_unmasked"))
val new_data = cols.foldLeft(data) { (df, c) =>
  val target = c.stripSuffix("_unmasked")
  // Set the target to "MASKED" only when the *_unmasked column holds a
  // non-empty value; otherwise keep the target column's original value.
  df.withColumn(target,
    when(col(c).isNotNull && col(c) =!= "", lit("MASKED")).otherwise(col(target)))
}
new_data.show
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
| id| c4| c1001| c1002| c1003| c1008| c1008_unmasked| c1009| c1011|c1012|c1013|c1015| c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
| 1| | SNACKS|BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE...|UNCOATED BISCUIT|NO PROMOTION| | | 400G| | |
| 2|GROCERY|BISCUITS| SWEET BISCUITS |BISCUITS - AMBIENT| | | AMBIENT BISCUIT|NO PROMOTION| | | 400G|MASKED| CHOCOS|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
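The foldLeft pattern above is the key idea: the DataFrame is the accumulator, and each `*_unmasked` column contributes one `withColumn` rewrite. A minimal pure-Scala sketch of the same logic, using a `Map` to stand in for a single row (hypothetical values, for illustration only, no Spark required):

```scala
// One row, represented as column-name -> value (illustrative data only).
val row = Map(
  "c1008" -> "", "c1008_unmasked" -> "REFLETS DE FRANCE CROQUANT",
  "c1016" -> "", "c1016_unmasked" -> ""
)

val unmaskedCols = row.keys.filter(_.endsWith("_unmasked"))

// Thread the row through each *_unmasked column, exactly as the DataFrame
// is threaded through withColumn: mask only when a value is present.
val masked = unmaskedCols.foldLeft(row) { (r, c) =>
  val target = c.stripSuffix("_unmasked")
  if (r(c).nonEmpty) r.updated(target, "MASKED") else r
}

println(masked("c1008")) // MASKED — c1008_unmasked has a value
println(masked("c1016")) // ""     — c1016_unmasked is empty, left as-is
```

Because each step only touches its own target column, the order in which the `*_unmasked` columns are folded does not affect the result.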