无法使用 Spark 从 s3 存储桶读取部分文件

如何解决无法使用 Spark 从 s3 存储桶读取部分文件

我有一个在 GCP Dataproc 上运行的 Spark 2.4.7 和从 AWS S3 读取一些文件的任务。

在创建集群时将 AWS 凭证，即访问密钥和秘密访问密钥附加到文件 /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh ，我可以从我的 S3 存储桶中读取部分文件，同时读取所有文件

Traceback (most recent call last):
  File "<stdin>",line 7,in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py",line 274,in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",line 1257,in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py",line 63,in deco
    return f(*a,**kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",line 328,in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o202.json.
: java.nio.file.AccessDeniedException: s3a://my_bucket/part-00000-a5adf948-1f65-4068-ae68-76c3708b4cf1-c000.txt.lzo: getFileStatus on s3a://my_bucket/part-00000-a5adf948-1f65-4068-ae68-76c3708b4cf1-c000.txt.lzo: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: Q3TR40YDDCWFKESZ; S3 Extended Request ID: MeZc+rmfDXDL2DEXWS8zpfrn1s3dbthI31pErLpVT14kTGjpaxUYEmBG53D9f1cWwVRVL1oCzeM=),S3 Extended Request ID: MeZc+rmfDXDL2DEXWS8zpfrn1s3dbthI31pErLpVT14kTGjpaxUYEmBG53D9f1cWwVRVL1oCzeM=
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:174)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:117)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1923)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1877)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1812)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1632)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:2631)
        at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:575)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:559)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
        at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:411)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

我用来阅读的代码：

spark.read \
 .option("io.compression.codecs","com.hadoop.compression.lzo.LzoCodec") \
 .schema(StructType().add("id",StringType())) \
 .json("s3a://my_bucket/*.lzo")

使用相同机制在不同时间插入该位置中的文件。使用 AWS CLI 检查“损坏的”文件之一，我看到错误：

aws s3 cp s3://my_bucket/part-00000-499c7394-c5cf-4045-8a54-0e1d5bc87897-c000.txt.lzo - | head
download failed: s3://my_bucket/part-00000-499c7394-c5cf-4045-8a54-0e1d5bc87897-c000.txt.lzo to - An error occurred (403) when calling the HeadObject operation: Forbidden

我的问题是如何进一步追踪某些文件的读取失败。谢谢！

UPD：

不知何故，与其他文件相比，有问题的文件非常小：大约 4KiB，而其他项目为 30-50MiB。

无法使用 Spark 从 s3 存储桶读取部分文件

如何解决无法使用 Spark 从 s3 存储桶读取部分文件

相关推荐