How to troubleshoot Spark failing to read some files from an S3 bucket
I have a job running on GCP Dataproc with Spark 2.4.7 that reads some files from AWS S3.
At cluster creation time I append the AWS credentials (access key and secret access key) to the files /etc/profile.d/spark_config.sh, /etc/*bashrc, and /usr/lib/spark/conf/spark-env.sh. With this setup I can read some of the files in my S3 bucket, but when reading all of them I get:
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 274, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o202.json.
: java.nio.file.AccessDeniedException: s3a://my_bucket/part-00000-a5adf948-1f65-4068-ae68-76c3708b4cf1-c000.txt.lzo: getFileStatus on s3a://my_bucket/part-00000-a5adf948-1f65-4068-ae68-76c3708b4cf1-c000.txt.lzo: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: Q3TR40YDDCWFKESZ; S3 Extended Request ID: MeZc+rmfDXDL2DEXWS8zpfrn1s3dbthI31pErLpVT14kTGjpaxUYEmBG53D9f1cWwVRVL1oCzeM=), S3 Extended Request ID: MeZc+rmfDXDL2DEXWS8zpfrn1s3dbthI31pErLpVT14kTGjpaxUYEmBG53D9f1cWwVRVL1oCzeM=
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:174)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1923)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1877)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1812)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1632)
at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2631)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:575)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:559)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:411)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
The code I use to read the files:
from pyspark.sql.types import StructType, StringType

spark.read \
    .option("io.compression.codecs", "com.hadoop.compression.lzo.LzoCodec") \
    .schema(StructType().add("id", StringType())) \
    .json("s3a://my_bucket/*.lzo")
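For context, the credentials mentioned above are appended to those files, presumably as environment variables. A sketch of what lands in /usr/lib/spark/conf/spark-env.sh (the values are placeholders):

```
# Appended at cluster creation time; values below are placeholders.
export AWS_ACCESS_KEY_ID="AKIA_PLACEHOLDER"
export AWS_SECRET_ACCESS_KEY="PLACEHOLDER_SECRET"
```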
The files in that location were written by the same mechanism, at different times. Checking one of the "broken" files with the AWS CLI, I see the error:
aws s3 cp s3://my_bucket/part-00000-499c7394-c5cf-4045-8a54-0e1d5bc87897-c000.txt.lzo - | head
download failed: s3://my_bucket/part-00000-499c7394-c5cf-4045-8a54-0e1d5bc87897-c000.txt.lzo to - An error occurred (403) when calling the HeadObject operation: Forbidden
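A 403 on HeadObject for only some objects in a bucket often comes down to per-object differences: a different object owner (e.g. a cross-account write without bucket-owner-full-control) or SSE-KMS encryption with a key the reading principal cannot use. One way to narrow it down is to fetch the metadata of a working object and a failing object (e.g. with `aws s3api head-object` or boto3's `head_object`, run as a principal that can read both) and compare them. A minimal sketch of that comparison; the two response dicts below are hypothetical examples of head-object output, not real data:

```python
def diff_object_metadata(good: dict, bad: dict) -> dict:
    """Compare two head-object responses and return the fields that differ.

    ServerSideEncryption / SSEKMSKeyId differing between a readable and an
    unreadable object points at a KMS key-permission gap.
    """
    fields = ("ServerSideEncryption", "SSEKMSKeyId", "StorageClass", "ContentLength")
    return {
        f: (good.get(f), bad.get(f))
        for f in fields
        if good.get(f) != bad.get(f)
    }

# Hypothetical responses: the readable object uses SSE-S3, the failing one SSE-KMS.
good = {"ServerSideEncryption": "AES256", "ContentLength": 41943040}
bad = {"ServerSideEncryption": "aws:kms",
       "SSEKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/example",
       "ContentLength": 4096}
print(diff_object_metadata(good, bad))
```

If the encryption fields differ, the fix is on the KMS key policy rather than the bucket policy.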
My question is how to trace further why reading some of the files fails. Thanks!
UPD:
For some reason the problematic files are much smaller than the others: around 4 KiB, while the other objects are 30-50 MiB.
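Given that size difference, one way to isolate the suspects is to list the bucket and flag objects far smaller than the rest, then check each flagged key against the AWS CLI error above. A sketch over a hypothetical listing; in practice the entries would come from `aws s3api list-objects-v2` or boto3's `list_objects_v2`:

```python
def flag_small_objects(objects, threshold_bytes=1024 * 1024):
    """Return the keys of objects smaller than threshold_bytes (default 1 MiB)."""
    return [o["Key"] for o in objects if o["Size"] < threshold_bytes]

# Hypothetical listing: two normal ~40-50 MiB parts and one suspicious 4 KiB part.
listing = [
    {"Key": "part-00000-aaaa-c000.txt.lzo", "Size": 41943040},
    {"Key": "part-00001-bbbb-c000.txt.lzo", "Size": 52428800},
    {"Key": "part-00002-cccc-c000.txt.lzo", "Size": 4096},
]
print(flag_small_objects(listing))
```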