How to handle StringIndexer and OneHotEncoder in PySpark pipeline stages
The following code raises this error:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

stage_string = [StringIndexer(inputCol=c, outputCol=c + "_string_encoded") for c in categorical_columns]
stage_one_hot = [OneHotEncoder(inputCol=c + "_string_encoded", outputCol=c + "_one_hot") for c in categorical_columns]
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
rf = RandomForestClassifier(labelCol="output", featuresCol="features")
pipeline = Pipeline(stages=[stage_string, stage_one_hot, assembler, rf])
pipeline.fit(df)
Cannot recognize a pipeline stage of type <class 'list'>.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 97, in _fit
    "Cannot recognize a pipeline stage of type %s." % type(stage))
TypeError: Cannot recognize a pipeline stage of type <class 'list'>.
Solution
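The problem is in this statement:
pipeline = Pipeline(stages=[stage_string, stage_one_hot, assembler, rf])
stage_string and stage_one_hot are lists of PipelineStage objects, while assembler and rf are individual pipeline stages. Pipeline expects a flat list of stages, so when it encounters a list as one of the elements it raises the TypeError shown in the traceback.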
Change your statement as follows, so that the two lists are concatenated into a single flat list of stages:
stages = stage_string + stage_one_hot + [assembler, rf]
pipeline = Pipeline(stages=stages)
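If stage lists are built in several places, the same flattening can be done with a small helper instead of concatenating by hand. The sketch below is only an illustration: flatten_stages is a hypothetical helper, not part of PySpark, and plain strings stand in for the real StringIndexer, OneHotEncoder, VectorAssembler and RandomForestClassifier objects so that it runs without a Spark session.

```python
from typing import Any, Iterable, List


def flatten_stages(items: Iterable[Any]) -> List[Any]:
    """Expand one level of nesting: lists/tuples are unpacked, other items kept as-is.

    Hypothetical helper for illustration; it is not a PySpark API.
    """
    flat: List[Any] = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(item)   # a list of stages contributes each of its stages
        else:
            flat.append(item)   # a single stage is kept unchanged
    return flat


# Strings stand in for real pipeline stage objects (assumption for this sketch).
stage_string = ["indexer_a", "indexer_b"]
stage_one_hot = ["encoder_a", "encoder_b"]
nested = [stage_string, stage_one_hot, "assembler", "rf"]

stages = flatten_stages(nested)
print(stages)
```

With real stage objects, the resulting flat list would be passed as Pipeline(stages=stages), which is equivalent to the concatenation shown above.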