Error running a Spark job in Python on nested JSON with a dynamic field

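I have a nested JSON input file on Amazon S3. The JSON contains a list of JSON objects, and each of those objects has a dynamic field, "Extension", whose content is sometimes a list and sometimes a map. I need to ignore this field and create a schema from the other fields. At the moment I cannot do this, and I get an error when I apply flatten to the DataFrame records.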

Once I have the correct data, I need to ingest it into AWS Elastic so that it can be queried.

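My question: is there a way to ignore the dynamic field and create the DataFrame from the relevant fields only? In Jackson, we can annotate such fields with @JsonIgnore so that they are not read during serialization/deserialization.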
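One possible way to get an effect similar to @JsonIgnore, sketched under assumptions: when spark.read.json is given an explicit schema, it parses only the fields that the schema names, so leaving Extension out of the schema means it is never read. The sketch below builds that schema as a DDL string. The names LEAF, build_schema_ddl and read_without_extension are hypothetical, only LEI and a few Registration fields are listed for brevity, and every leaf is read as a string rather than a timestamp.

```python
# Every leaf in the input has the shape {"$": "..."}; read it as a string.
LEAF = "STRUCT<`$`: STRING>"


def build_schema_ddl():
    """Return a DDL schema string that names only the fields to keep."""
    registration = (
        "STRUCT<"
        f"InitialRegistrationDate: {LEAF}, "
        f"LastUpdateDate: {LEAF}, "
        f"RegistrationStatus: {LEAF}, "
        f"NextRenewalDate: {LEAF}"
        ">"
    )
    record = f"STRUCT<LEI: {LEAF}, Registration: {registration}>"
    # "Extension" is deliberately absent, so the reader does not parse it.
    return f"records ARRAY<{record}>"


def read_without_extension(spark, path):
    """Read the file with the explicit schema.

    spark is assumed to be an existing SparkSession and path the input file.
    """
    return spark.read.json(path, schema=build_schema_ddl(), multiLine=True)
```

With the session and path from the question this would be called as read_without_extension(spark, path). Entity could be added to the record struct in the same way, field by field.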

I tried creating a new DataFrame from only 3 fields, but the result I got was a single row:

# show() returns None, so keep the DataFrame and display it in separate steps
ndf = df.select("Records.LEI","Records.Entity","Records.Registration")
ndf.show(truncate=False)

Result:

+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|LEI                                             |Entity                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |Registration                                                                                                                                                                                                                                                                                                       |
+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[001GPB6A9XPE8XJICC14],[004L5FPTUREIWK9T2N63]]|[[[FUND],[ACTIVE],[,[Boston],[US],[245 Summer Street],[02210],[US-MA]],[BOSTON],[245 SUMMER STREET],[02110],[[8888],[OTHER]],[US-MA],[FIDELITY ADVISOR SERIES I - Fidelity Advisor Leveraged Company Stock Fund],[[S000005113],[RA000665]]],[[888 7th Avenue],[New York],[22nd Floor],[10106],[US-NY]],[[[2711 Centerville Road],[Suite 400]],[Wilmington],[C/O Corporation Service Company],[19808],[US-DE]],[[T91T],[LIMITED PARTNERSHIP]],[US-DE],[Hutchin Hill Capital,LP],[[4386463],[RA000602]]]]|[[[2012-11-29 22:03:00],[2020-06-03 20:03:00],[EVK05KS7XY1DEII3R011],[2021-05-29 13:20:00],[ISSUED],[RA000665]],[FULLY_CORROBORATED]],[[2012-06-06 21:26:00],[2020-07-17 18:10:00],[2018-05-08 19:16:00],[LAPSED],[RA000602]],[FULLY_CORROBORATED]]]|
+------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
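The single row is consistent with the input: the file holds one top-level JSON object, so the DataFrame has one row whose records column is an array, and selecting Records.LEI on an array of structs returns an array of all LEI values. A minimal sketch of one way to get one row per record, using explode through selectExpr. The function name one_row_per_record and the output column names are hypothetical, and df is assumed to be the DataFrame read from the input file.

```python
# Hypothetical output column names mapped to paths inside each record.
COLUMNS = {
    "lei": "record.LEI.`$`",
    "legal_name": "record.Entity.LegalName.`$`",
    "registration_status": "record.Registration.RegistrationStatus.`$`",
}


def one_row_per_record(df):
    """Return one row per element of the records array, with scalar columns."""
    # explode() emits one output row for each element of the array column.
    exploded = df.selectExpr("explode(records) AS record")
    # Pull the "$" leaf out of each nested struct and give it a flat name.
    return exploded.selectExpr(
        *[f"{path} AS {name}" for name, path in COLUMNS.items()]
    )
```

Calling one_row_per_record(df).show(truncate=False) should then print one row for each entry in records rather than a single row of arrays.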

Full code snippet:


from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import flatten

appName = "PySpark - JSON file to Spark Data Frame"
master = "local"
path = "C:\\spark\\lei_OrigData.txt"

# Initialize the Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Log starting time
dt_start = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Start time:",dt_start)

# Read the multi-line JSON file; the schema is inferred from the data
df = spark.read.json(path,multiLine=True)
# This call raises the AnalysisException: records is an array of structs, not an array of arrays
df.select(flatten(df.records)).show(truncate=False)

# The line below tries to build a new DataFrame from a subset of the fields, but the result was a single row
# df.select("Records.LEI","Records.Registration").show(truncate=False)

# Log end time
dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

print("End time:",dt_end)

Running the code above produces the following error:

pyspark.sql.utils.AnalysisException: cannot resolve 'flatten(`records`)' due to data type mismatch: The argument should be an array of arrays,but '`records`' is of array<struct<Entity:struct<EntityCategory:struct<$:string>,EntityStatus:struct<$:string>,HeadquartersAddress:struct<AdditionalAddressLine:struct<$:string>,City:struct<$:string>,Country:struct<$:string>,FirstAddressLine:struct<$:string>,PostalCode:struct<$:string>,Region:struct<$:string>>,LegalAddress:struct<AdditionalAddressLine:array<struct<$:string>>,LegalForm:struct<EntityLegalFormCode:struct<$:string>,OtherLegalForm:struct<$:string>>,LegalJurisdiction:struct<$:string>,LegalName:struct<$:string>,RegistrationAuthority:struct<RegistrationAuthorityEntityID:struct<$:string>,RegistrationAuthorityID:struct<$:string>>>,Extension:struct<gleif:Geocoding:string>,LEI:struct<$:string>,Registration:struct<InitialRegistrationDate:struct<$:timestamp>,LastUpdateDate:struct<$:timestamp>,ManagingLOU:struct<$:string>,NextRenewalDate:struct<$:timestamp>,RegistrationStatus:struct<$:string>,ValidationAuthority:struct<ValidationAuthorityEntityID:struct<$:string>,ValidationAuthorityID:struct<$:string>>,ValidationSources:struct<$:string>>>> type.;;                                   'Project [flatten(records#0) AS flatten(records)#2] 
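The message says that flatten expects an array of arrays, while records is an array of structs, so flatten does not apply to this column. If the goal is to turn the nested structs into top-level columns, one option is to explode records and then walk the schema to select every leaf field. The sketch below does the walk on the dictionary returned by df.schema.jsonValue(). The names leaf_paths and flatten_structs are hypothetical, and the alias scheme (joining field names with "_" and dropping "$") is an assumption that may need adjusting if two paths collide.

```python
def leaf_paths(struct_json, prefix=()):
    """Yield the path (tuple of field names) of every non-struct field."""
    for field in struct_json["fields"]:
        path = prefix + (field["name"],)
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            # Nested struct: keep descending.
            yield from leaf_paths(field_type, path)
        else:
            # Strings, timestamps and arrays are kept as they are.
            yield path


def flatten_structs(df, skip=("Extension",)):
    """Select every leaf of df as a top-level column.

    Top-level fields named in skip are left out entirely.
    """
    exprs = []
    for path in leaf_paths(df.schema.jsonValue()):
        if path[0] in skip:
            continue
        source = ".".join(f"`{name}`" for name in path)
        alias = "_".join(name for name in path if name != "$")
        exprs.append(f"{source} AS `{alias}`")
    return df.selectExpr(*exprs)
```

A possible call, assuming df is the DataFrame from the question: flatten_structs(df.selectExpr("explode(records) AS record").select("record.*")). The explode step gives one row per record, and select("record.*") lifts LEI, Entity, Registration and Extension to the top level so that skip can drop Extension.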

Input JSON:

{
  "records": [
    {
      "LEI": {
        "$": "001GPB6A9XPE8XJICC14"
      },"Entity": {
        "LegalName": {
          "$": "FIDELITY ADVISOR SERIES I - Fidelity Advisor Leveraged Company Stock Fund"
        },"LegalAddress": {
          "FirstAddressLine": {
            "$": "245 SUMMER STREET"
          },"City": {
            "$": "BOSTON"
          },"Region": {
            "$": "US-MA"
          },"Country": {
            "$": "US"
          },"PostalCode": {
            "$": "02110"
          }
        },"HeadquartersAddress": {
          "FirstAddressLine": {
            "$": "245 Summer Street"
          },"City": {
            "$": "Boston"
          },"PostalCode": {
            "$": "02210"
          }
        },"RegistrationAuthority": {
          "RegistrationAuthorityID": {
            "$": "RA000665"
          },"RegistrationAuthorityEntityID": {
            "$": "S000005113"
          }
        },"LegalJurisdiction": {
          "$": "US-MA"
        },"EntityCategory": {
          "$": "FUND"
        },"LegalForm": {
          "EntityLegalFormCode": {
            "$": "8888"
          },"OtherLegalForm": {
            "$": "OTHER"
          }
        },"EntityStatus": {
          "$": "ACTIVE"
        }
      },"Registration": {
        "InitialRegistrationDate": {
          "$": "2012-11-29T16:33:00.000Z"
        },"LastUpdateDate": {
          "$": "2020-06-03T14:33:00.000Z"
        },"RegistrationStatus": {
          "$": "ISSUED"
        },"NextRenewalDate": {
          "$": "2021-05-29T07:50:00.000Z"
        },"ManagingLOU": {
          "$": "EVK05KS7XY1DEII3R011"
        },"ValidationSources": {
          "$": "FULLY_CORROBORATED"
        },"ValidationAuthority": {
          "ValidationAuthorityID": {
            "$": "RA000665"
          },"ValidationAuthorityEntityID": {
            "$": "S000005113"
          }
        }
      },"Extension": {
        "gleif:Geocoding": {
          "gleif:original_address": {
            "$": "245 Summer Street,02210,Boston,US-MA,US"
          },"gleif:relevance": {
            "$": "0.92"
          },"gleif:match_type": {
            "$": "pointAddress"
          },"gleif:lat": {
            "$": "42.3514"
          },"gleif:lng": {
            "$": "-71.05385"
          },"gleif:geocoding_date": {
            "$": "2017-10-23T19:14:11"
          },"gleif:bounding_box": {
            "$": "TopLeft.Latitude: 42.3525242,TopLeft.Longitude: -71.0553711,BottomRight.Latitude: 42.3502758,BottomRight.Longitude: -71.0523289"
          },"gleif:match_level": {
            "$": "houseNumber"
          },"gleif:formatted_address": {
            "$": "245 Summer St,MA 02210,United States"
          },"gleif:mapped_location_id": {
            "$": "NT_PYMT6GOD3rrAC9q2Al5jZB_yQTN"
          },"gleif:mapped_street": {
            "$": "Summer St"
          },"gleif:mapped_housenumber": {
            "$": "245"
          },"gleif:mapped_postalcode": {
            "$": "02210"
          },"gleif:mapped_city": {
            "$": "Boston"
          },"gleif:mapped_district": {
            "$": "Downtown Boston"
          },"gleif:mapped_state": {
            "$": "MA"
          },"gleif:mapped_country": {
            "$": "USA"
          }
        }
      }
    },{
      "LEI": {
        "$": "004L5FPTUREIWK9T2N63"
      },"Entity": {
        "LegalName": {
          "$": "Hutchin Hill Capital,LP"
        },"LegalAddress": {
          "FirstAddressLine": {
            "$": "C/O Corporation Service Company"
          },"AdditionalAddressLine": [
            {
              "$": "2711 Centerville Road"
            },{
              "$": "Suite 400"
            }
          ],"City": {
            "$": "Wilmington"
          },"Region": {
            "$": "US-DE"
          },"PostalCode": {
            "$": "19808"
          }
        },"HeadquartersAddress": {
          "FirstAddressLine": {
            "$": "22nd Floor"
          },"AdditionalAddressLine": {
            "$": "888 7th Avenue"
          },"City": {
            "$": "New York"
          },"Region": {
            "$": "US-NY"
          },"PostalCode": {
            "$": "10106"
          }
        },"RegistrationAuthority": {
          "RegistrationAuthorityID": {
            "$": "RA000602"
          },"RegistrationAuthorityEntityID": {
            "$": "4386463"
          }
        },"LegalJurisdiction": {
          "$": "US-DE"
        },"LegalForm": {
          "EntityLegalFormCode": {
            "$": "T91T"
          },"OtherLegalForm": {
            "$": "LIMITED PARTNERSHIP"
          }
          }
      },"Registration": {
        "InitialRegistrationDate": {
          "$": "2012-06-06T15:56:00.000Z"
        },"LastUpdateDate": {
          "$": "2020-07-17T12:40:00.000Z"
        },"RegistrationStatus": {
          "$": "LAPSED"
        },"NextRenewalDate": {
          "$": "2018-05-08T13:46:00.000Z"
        },"ValidationAuthority": {
          "ValidationAuthorityID": {
            "$": "RA000602"
          },"ValidationAuthorityEntityID": {
            "$": "4386463"
          }
        }
      },"Extension": {
        "gleif:Geocoding": [
          {
            "gleif:original_address": {
              "$": "22nd Floor,888 7th Avenue,10106,New York,US-NY,US"
            },"gleif:relevance": {
              "$": "0.94"
            },"gleif:match_type": {
              "$": "pointAddress"
            },"gleif:lat": {
              "$": "40.76537"
            },"gleif:lng": {
              "$": "-73.98088"
            },"gleif:geocoding_date": {
              "$": "2017-10-25T06:53:52"
            },"gleif:bounding_box": {
              "$": "TopLeft.Latitude: 40.7664942,TopLeft.Longitude: -73.9823642,BottomRight.Latitude: 40.7642458,BottomRight.Longitude: -73.9793958"
            },"gleif:match_level": {
              "$": "houseNumber"
            },"gleif:formatted_address": {
              "$": "888 7th Ave,NY 10106,United States"
            },"gleif:mapped_location_id": {
              "$": "NT_42almrnte4m8ALt9ONHN2C_4gDO"
            },"gleif:mapped_street": {
              "$": "7th Ave"
            },"gleif:mapped_housenumber": {
              "$": "888"
            },"gleif:mapped_postalcode": {
              "$": "10106"
            },"gleif:mapped_city": {
              "$": "New York"
            },"gleif:mapped_district": {
              "$": "Clinton"
            },"gleif:mapped_state": {
              "$": "NY"
            },"gleif:mapped_country": {
              "$": "USA"
            }
          },{
            "gleif:original_address": {
              "$": "C/O Corporation Service Company,2711 Centerville Road,Suite 400,null,US,US-DE,19808,Wilmington"
            },"gleif:relevance": {
              "$": "0.93"
            },"gleif:lat": {
              "$": "39.75411"
            },"gleif:lng": {
              "$": "-75.62652"
            },"gleif:geocoding_date": {
              "$": "2016-08-16T03:54:45"
            },"gleif:bounding_box": {
              "$": "TopLeft.Latitude: 39.7552342,TopLeft.Longitude: -75.6279822,BottomRight.Latitude: 39.7529858,BottomRight.Longitude: -75.6250578"
            },"gleif:formatted_address": {
              "$": "2711 Centerville Rd,Wilmington,DE 19808,"
            },"gleif:mapped_location_id": {
              "$": "NT_8wi0yH62lxXql.LtXORq-C_ycTMxA"
            },"gleif:mapped_street": {
              "$": "Centerville Rd"
            },"gleif:mapped_housenumber": {
              "$": "2711"
            },"gleif:mapped_postalcode": {
              "$": "19808"
            },"gleif:mapped_city": {
              "$": "Wilmington"
            },"gleif:mapped_state": {
              "$": "DE"
            },"gleif:mapped_country": {
              "$": "USA"
            }
          }
        ]
      }
    }
  
]
}
[enter image description here][1]


  [1]: https://i.stack.imgur.com/ajOJM.png
