How to explode a PySpark column containing nested lists of strings with different lengths
I have a PySpark DataFrame that contains nested lists of different lengths. I need to explode it so that the ID is kept for each nested list, producing the following columns:
ID       BioID   Pvalue  Significance
Sample1  "AATC"  0.01    1
Sample2  "AATC"  0.01    1
Sample2  "AATG"  0.02    0
Sample2  "AAAA"  0.50    0
Sample3  "TGCC"  0.04    0
I tried explode, but it only gives me more lists:
df.select("ID",F.explode("results")).show(5)
ID       col
Sample1  ["AATC","AATC","AATG","AAAA","TGCC"]
Sample2  [0.01,0.02,0.50,0.04]
Sample2  [1,1,0]
Sample2  ["AATC","TGCC"]
Sample3  [0.01,0.04]
Edit: adding the schema, as suggested:
root
 |-- ID: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: string (containsNull = true)
Solution
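If you have a nested list with the schema shown below (Array -> Array -> string), use the higher-order function transform (available in Spark SQL 2.4 and later) to turn each inner array into a struct holding the required columns, then use inline to explode the resulting array of structs into the desired output.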
df.show(truncate=False)
#+-------+------------------------------------------+
#|ID     |Features                                  |
#+-------+------------------------------------------+
#|Sample1|[[AATC,0.01,1]]                           |
#|Sample2|[[AATC,0.01,1],[AATG,0.02,0],[AAAA,0.5,0]]|
#|Sample3|[[TGCC,0.04,0]]                           |
#+-------+------------------------------------------+
df.printSchema()
#root
# |-- ID: string (nullable = true)
# |-- Features: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: string (containsNull = true)
from pyspark.sql import functions as F
df.withColumn("Features",F.expr("""transform(Features,x-> struct(x[0] as BioID,x[1] as Pvalue,x[2] as Significance))"""))\
.select("ID",F.expr("""inline(Features)""")).show()
#+-------+-----+------+------------+
#|     ID|BioID|Pvalue|Significance|
#+-------+-----+------+------------+
#|Sample1| AATC|  0.01|           1|
#|Sample2| AATC|  0.01|           1|
#|Sample2| AATG|  0.02|           0|
#|Sample2| AAAA|   0.5|           0|
#|Sample3| TGCC|  0.04|           0|
#+-------+-----+------+------------+
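To see what the two steps do without starting a Spark session, the same logic can be mimicked on plain Python lists. This is only a sketch of the semantics, not Spark code: the helper names to_struct and inline_rows are hypothetical, and the None for a missing position assumes Spark's behaviour with ANSI mode turned off, where an out-of-range array index yields null.

```python
# Plain-Python illustration of transform + inline (no Spark required).
# `to_struct` and `inline_rows` are hypothetical names used only here.

FIELDS = ("BioID", "Pvalue", "Significance")


def to_struct(inner):
    """Mimic struct(x[0] as BioID, x[1] as Pvalue, x[2] as Significance).

    A position that does not exist in a shorter inner list becomes None
    (assumption: this mirrors Spark SQL with ANSI mode off).
    """
    return {name: (inner[i] if i < len(inner) else None)
            for i, name in enumerate(FIELDS)}


def inline_rows(rows):
    """Mimic select("ID", inline(Features)): one output row per struct."""
    out = []
    for row_id, features in rows:
        for inner in features:
            out.append({"ID": row_id, **to_struct(inner)})
    return out


# Same sample data as the Features column above; every value is a string
# because the schema is Array -> Array -> string.
data = [
    ("Sample1", [["AATC", "0.01", "1"]]),
    ("Sample2", [["AATC", "0.01", "1"],
                 ["AATG", "0.02", "0"],
                 ["AAAA", "0.5", "0"]]),
    ("Sample3", [["TGCC", "0.04", "0"]]),
]

flat = inline_rows(data)
for r in flat:
    print(r["ID"], r["BioID"], r["Pvalue"], r["Significance"])
```

Because the inner arrays hold strings, Pvalue and Significance are still strings after the explode. In Spark they would usually be cast afterwards, for example with cast("double") and cast("int") on the respective columns.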