How to fix a PySpark error caused by aggregation
I can print the DataFrame just fine before the aggregation:
(Pdb) df_interesting.printSchema()
root
|-- userId: long (nullable = true)
|-- screen_index: integer (nullable = true)
|-- type: string (nullable = true)
|-- time_delta: float (nullable = true)
|-- app_open_index: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
(Pdb) df_interesting.show(n=2)
+------+------------+------+----------+--------------+--------------------+
|userId|screen_index| type|time_delta|app_open_index| timestamp|
+------+------------+------+----------+--------------+--------------------+
|214431| 7|screen| 60.0| 13|2020-07-31 07:52:...|
|398910| 3|screen| 60.0| 2|2020-07-29 11:43:...|
+------+------------+------+----------+--------------+--------------------+
However, after the aggregation, show() raises an error:
(Pdb) df_interesting.groupBy('app_open_index').agg(F.max("screen_index").alias("max_screen_index")).show(n=2)
[Stage 1:> (0 + 2) / 2]20/08/13 18:07:26 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: The value (Buffer()) of the type (scala.collection.convert.Wrappers.JListWrapper) cannot be converted to the string type
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:290)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:285)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:248)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
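For what it's worth, show(n=2) only needs a couple of documents, whereas the groupBy/agg has to scan the whole index, so the failure presumably only surfaces once a document containing the problematic value is read. Below is a minimal sketch of the work-around I was aiming at: restricting the job to just the two columns the aggregation needs. df_interesting and the column names are the ones from the schema above; whether the connector actually skips the other fields is part of what I am unsure about.
from pyspark.sql import functions as F

# Keep only the columns the aggregation needs before grouping, hoping the
# connector then avoids deserializing the other (possibly list-valued) fields.
df_small = df_interesting.select("app_open_index", "screen_index")

(df_small
    .groupBy("app_open_index")
    .agg(F.max("screen_index").alias("max_screen_index"))
    .show(n=2))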
- Edit
I tried selecting just a single column, and that made some progress:
(Pdb) df_interesting = df_interesting.select(col('data.userId').alias('userId'))
(Pdb) df_interesting.count()
[Stage 0:> (0 + 2) / 2]20/08/13 18:59:12 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'data.properties.priceObj' not found; typically this occurs with arrays which are not mapped as single value
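For reference, the DataFrame is read through the elasticsearch-hadoop connector, and the message about "arrays which are not mapped as single value" seems to point at the read options. Below is a minimal sketch of the kind of configuration I have been experimenting with; the host and index name ("localhost:9200", "app-events") are placeholders, while es.read.field.as.array.include and es.read.field.exclude are documented connector settings for list-valued fields like data.properties.priceObj.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-aggregation-debug").getOrCreate()

# Hypothetical host and index name; the real ones are not shown above.
df_raw = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    # Tell the connector this field holds arrays, so it does not try to
    # squeeze a list into a single (string) value...
    .option("es.read.field.as.array.include", "data.properties.priceObj")
    # ...or simply drop the field if it is not needed downstream:
    # .option("es.read.field.exclude", "data.properties.priceObj")
    .load("app-events")
)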