F.regexp_extract inside a UDF raises AttributeError: 'NoneType' object has no attribute '_jvm'

How do I fix F.regexp_extract inside a UDF raising AttributeError: 'NoneType' object has no attribute '_jvm'?

I am a complete beginner with Spark and PySpark. I have a huge dataset and a set of keywords that I need to check for and extract from a column.

My code is as follows:

temp_skills = ['sales', 'it', 'c']

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf
def lookhere(z) -> str:
    strng = ' '
    for skill in temp_skills:
        # attempt to extract each keyword match from the column value
        strng += F.regexp_extract(z, skill, 0)
    return strng

spark.udf.register("lookhere", lambda z: lookhere(z), returnType=StringType())

DF.withColumn('temp', lookhere(DF.dept_name)).show(truncate=False)

Original DF:

+------------------+-------+
|         dept_name|dept_id|
+------------------+-------+
|  finance sales it|     10|
|marketing it sales|     20|
|             sales|     30|
|                it|     40|
+------------------+-------+

Desired DF:

+------------------+-------+----------+
|         dept_name|dept_id|      temp|
+------------------+-------+----------+
|  finance sales it|     10|sales it c|
|marketing it sales|     20| sales it |
|             sales|     30|   sales  |
|                it|     40|       it |
+------------------+-------+----------+

Error:

---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
<ipython-input-80-0c11f7327f77> in <module>()
      1 DF.withColumn('temp2',
      2             lookintothis(DF.dept_name)
----> 3             ).show(truncate = False)

/content/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    440             print(self._jdf.showString(n, 20, vertical))
    441         else:
--> 442             print(self._jdf.showString(n, int(truncate), vertical))
    443 
    444     def __repr__(self):

/content/spark-3.0.0-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/content/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
    135                 # Hide where the exception came from that shows a non-Pythonic
    136                 # JVM exception message.
--> 137                 raise_from(converted)
    138             else:
    139                 raise

/content/spark-3.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py in raise_from(e)

PythonException: 
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-75-31ef5eea3b75>", line 7, in lookintothis
  File "/content/spark-3.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1811, in regexp_extract
    jc = sc._jvm.functions.regexp_extract(_to_java_column(str), pattern, idx)
AttributeError: 'NoneType' object has no attribute '_jvm'

Environment: Google Colab, Windows 10, Spark 3.0.0, PySpark 3.0.0

Is my approach wrong, or is it my syntax? Please help me understand this!

Solution

The root cause: functions in pyspark.sql.functions such as F.regexp_extract are not plain Python functions. They build JVM column expressions through the active SparkContext, and a UDF executes inside a Python worker process on the executors, where no SparkContext exists. sc is therefore None, and sc._jvm raises the AttributeError you see at the bottom of the stacktrace.
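As an aside, for a single pattern you don't need a UDF at all: regexp_extract works directly as a driver-side column expression. A minimal sketch against the question's DF (sales_hit is just an illustrative column name):

from pyspark.sql import functions as F

# on the driver, regexp_extract builds a JVM column expression as intended
DF.withColumn('sales_hit', F.regexp_extract('dept_name', 'sales', 0)).show(truncate=False)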

For the whole keyword list, I simply checked whether each skill appears in dept_name with Python's in operator inside the UDF; I don't think you need regex extraction at all. Since the UDF returns a list, it is declared with an array return type:

temp_skills = ['sales', 'it', 'c']

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def lookhere(z):
    # collect every keyword that occurs as a substring of dept_name
    strings = []
    for skill in temp_skills:
        if skill in z:
            strings.append(skill)
    return strings

# registering is only needed if you also want to call lookhere from SQL
spark.udf.register("lookhere", lookhere)

df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
df.withColumn('temp', lookhere('dept_name')).show(4, False)

+------------------+-------+--------------+
|dept_name         |dept_id|temp          |
+------------------+-------+--------------+
|finance sales it  |10     |[sales, it, c]|
|marketing it sales|20     |[sales, it]   |
|sales             |30     |[sales]       |
|it                |40     |[it]          |
+------------------+-------+--------------+
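If you want the exact space-joined string from the question's desired DF rather than an array, the regex matching can be done with Python's standard re module, which runs inside the Python worker without touching the JVM. A sketch assuming the skills are plain substrings (lookhere_str is a hypothetical name, used to avoid clashing with the UDF above):

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def lookhere_str(z):
    # re.search runs locally in the worker, so no SparkContext is needed
    strng = ' '
    for skill in temp_skills:
        m = re.search(skill, z)
        if m:
            strng += m.group(0) + ' '
    return strng

df.withColumn('temp', lookhere_str('dept_name')).show(4, False)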

An alternative, pure-DataFrame approach: splitting dept_name into tokens makes the keyword comparison exact, so 'c' no longer matches inside 'finance' the way the substring check above does.

temp_skills = ['sales', 'it', 'c']

from pyspark.sql.functions import split, array, array_intersect, lit

df = spark.read.option("header", "true").csv("test.csv")
df.withColumn('dept_names', split('dept_name', ' ')) \
  .withColumn('skills', array(*[lit(s) for s in temp_skills])) \
  .withColumn('temp', array_intersect('dept_names', 'skills')) \
  .drop('dept_names', 'skills').show(4, False)

+------------------+-------+-----------+
|dept_name         |dept_id|temp       |
+------------------+-------+-----------+
|finance sales it  |10     |[sales, it]|
|marketing it sales|20     |[it, sales]|
|sales             |30     |[sales]    |
|it                |40     |[it]       |
+------------------+-------+-----------+
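If the single space-separated string from the question is preferred here as well, the intersection can be flattened with array_join; a sketch reusing the imports and columns above:

from pyspark.sql.functions import array_join

df.withColumn('dept_names', split('dept_name', ' ')) \
  .withColumn('skills', array(*[lit(s) for s in temp_skills])) \
  .withColumn('temp', array_join(array_intersect('dept_names', 'skills'), ' ')) \
  .drop('dept_names', 'skills').show(4, False)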
