GCP AI Platform: error when creating a custom predictor model version (trained PyTorch model + torchvision.transform)

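I am currently trying to deploy a custom model to AI Platform by following https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1. It is based on a combination of a pretrained 'Pytorch' model and 'torchvision.transform'. I keep getting the error below, which appears to be related to the 500 MB limit on custom prediction.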

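ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.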

setup.py

from pathlib import Path
from setuptools import setup

base = Path(__file__).parent
with open(base / "requirements.txt") as f:
    REQUIRED_PACKAGES = [line.strip() for line in f if line.strip()]
print(f"\nPackages: {REQUIRED_PACKAGES}\n\n")

# [torch==1.3.0, torchvision==0.4.1, ImageHash==4.2.0,
#  Pillow==6.2.1, pyvis==0.1.8.2] installs 800 MB worth of files

setup(
    name="test", version="0.1",
    description="Extract features of an image",
    author="Amrit",
    install_requires=REQUIRED_PACKAGES,
    project_urls={
        "Documentation": "https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines#tensorflow",
        "Deploy": "https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1",
        "Ai_platform troubleshooting": "https://cloud.google.com/ai-platform/training/docs/troubleshooting",
        "Say Thanks!": "https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43",
        "google Torch wheels": "http://storage.googleapis.com/cloud-ai-pytorch/readme.txt",
        "Torch & torchvision wheels": "https://download.pytorch.org/whl/torch_stable.html",
    },
    python_requires="~=3.7",
    scripts=["predictor.py", "preproc.py"],
)

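Steps taken: I tried adding 'torch' and torchvision directly to the 'REQUIRED_PACKAGES' list in setup.py so that PyTorch + torchvision are provided as dependencies to install at deployment time. My guess is that AI Platform internally downloads the PyPI package for PyTorch, which is roughly 500 MB, and that this causes the deployment to fail. If I deploy the model with only 'torch', it seems to work (it then raises an error, of course, because the 'torchvision' library cannot be found).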

File sizes

PyTorch (torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl, about 111 MB)
torchvision (torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl, about 46 MB), downloaded from https://download.pytorch.org/whl/torch_stable.html and stored on GCS
Zipped predictor model file (.tar.gz format), which is the output of setup.py (5 KB)
Trained PyTorch model (44 MB)

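In total, the model dependencies should come to less than 250 MB, yet I keep getting this error. I also tried the torch and torchvision wheels from the Google mirror at http://storage.googleapis.com/cloud-ai-pytorch/readme.txt, but the same memory issue persists. AI Platform is completely new to us, and we would appreciate input from people with experience.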
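
As a rough check on the arithmetic above, the listed sizes can be summed and compared with the 500 MB limit mentioned in the question. This is only a sketch using the approximate figures quoted above; the names `ARTIFACT_SIZES_MB` and `LIMIT_MB` are made up for illustration. Note that these are download sizes: the comment in setup.py reports about 800 MB once the packages are installed, so the installed footprint may be what actually trips the limit.

```python
# Approximate download sizes quoted in the question, in megabytes.
ARTIFACT_SIZES_MB = {
    "torch wheel": 111,
    "torchvision wheel": 46,
    "custom code tarball": 0.005,  # about 5 KB
    "trained model": 44,
}
LIMIT_MB = 500  # custom-prediction limit cited in the question


def total_size_mb(sizes):
    """Sum the artifact sizes in megabytes."""
    return sum(sizes.values())


total = total_size_mb(ARTIFACT_SIZES_MB)
print(f"total download size: {total:.3f} MB (limit {LIMIT_MB} MB)")
print("under limit" if total < LIMIT_MB else "over limit")
```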

More info:

GCP CLI input:

My environment variables:

BUCKET_NAME="something"
MODEL_DIR="gs://$BUCKET_NAME/"
VERSION_NAME='v6'
MODEL_NAME="something_model"
STAGING_BUCKET=$MODEL_DIR"staging_area/"
# TORCH_PACKAGE=$MODEL_DIR"package/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
# TORCHVISION_PACKAGE=$MODEL_DIR"package/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCH_PACKAGE="gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCHVISION_PACKAGE="gs://cloud-ai-pytorch/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
CUSTOM_CODE_PATH=$STAGING_BUCKET"imt_ai_predict-0.1.tar.gz"
PREDICTOR_CLASS="predictor.MyPredictor"
REGION='europe-west1'
MACHINE_TYPE='mls1-c4-m2'
 
gcloud beta ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --origin=$MODEL_DIR \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --machine-type=$MACHINE_TYPE \
  --package-uris=$CUSTOM_CODE_PATH,$TORCH_PACKAGE,$TORCHVISION_PACKAGE \
  --prediction-class=$PREDICTOR_CLASS

GCP CLI output:

 [1] global
 [2] asia-east1
 [3] asia-northeast1
 [4] asia-southeast1
 [5] australia-southeast1
 [6] europe-west1
 [7] europe-west2
 [8] europe-west3
 [9] europe-west4
 [10] northamerica-northeast1
 [11] us-central1
 [12] us-east1
 [13] us-east4
 [14] us-west1
 [15] cancel
Please enter your numeric choice:  1
 
To make this the default region, run `gcloud config set ai_platform/region global`.

Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.

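My findings: I found a few articles from people who struggled with the PyTorch packages in the same way and got it working by installing the torch wheels from GCS (https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43). I tried the same approach with torch and torchvision, but no luck so far, and I am waiting for a reply from cloudml-feedback@google.com. Any help getting a custom torch + torchvision based predictor working on AI Platform would be great.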

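Solution

I solved this through a combination of a few things. I stuck with the 4 GB CPU MLS1 machine and the custom predictor routine (

Install the libraries through the setup.py parameter but, instead of passing only the package name and its version, add the correct torch wheel (ideally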

REQUIRED_PACKAGES = [line.strip() for line in open(base / "requirements.txt")] + [
    "torchvision==0.5.0",
    "torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl",
]
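
The `torch @ https://...` entry is a PEP 508 direct reference, which points the installer at a specific wheel instead of the default PyPI package. A minimal sketch of building such a list is shown below; the helper name `build_requirements` and the throwaway requirements file are assumptions for illustration, and only the wheel URL comes from the answer.

```python
import tempfile
from pathlib import Path

# Wheel URL taken from the answer; the constant name is hypothetical.
TORCH_WHEEL = (
    "https://download.pytorch.org/whl/cpu/"
    "torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl"
)


def build_requirements(requirements_file, extra):
    """Read pinned requirements (one per line) and append extra entries."""
    lines = Path(requirements_file).read_text().splitlines()
    pinned = [
        ln.strip()
        for ln in lines
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    return pinned + list(extra)


# Demo with a temporary requirements file standing in for the real one.
with tempfile.TemporaryDirectory() as tmp:
    req = Path(tmp) / "requirements.txt"
    req.write_text("ImageHash==4.2.0\n\nPillow==6.2.1\n")
    packages = build_requirements(
        req, ["torchvision==0.5.0", f"torch @ {TORCH_WHEEL}"]
    )

print(packages)
```

The resulting list could then be passed as `install_requires` in `setup()`. Whether the direct reference is honored depends on the pip version used at deployment time.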

I reduced the number of preprocessing steps. I could not fit all of them in, so jsonify your SEND response and GET one from both preproc.py and predictor.py

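import json
data = {"example": "payload"}  # placeholder: your data to send to the predictor class
payload = json.dumps(data)  # json.dumps returns a string; json.dump would need a file object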
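
The JSON hand-off between preprocessing and the predictor might look roughly like the sketch below. The function names and the `"features"` key are hypothetical; the only real API used is the standard `json` module, where `json.dumps` returns a string and `json.loads` parses it back.

```python
import json


def to_payload(features):
    """Serialize preprocessed features into a JSON string."""
    return json.dumps({"features": features})


def from_payload(payload):
    """Decode the JSON string back into a Python list."""
    return json.loads(payload)["features"]


payload = to_payload([0.1, 0.2, 0.3])
restored = from_payload(payload)
print(restored)
```

Values passed this way need to be JSON-serializable, so tensors or arrays would have to be converted to plain lists first.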

Import only the functions you need from the required libraries.

from torch import zeros, load
# your code
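
Putting the last two points together, a predictor could be sketched as below. It follows the custom prediction routine interface (`__init__`, `predict`, `from_path`), but it is not the author's actual class: the file name `model.pt`, the JSON-string instances, and the stand-in model are all assumptions. Also note that `from torch import load` still imports the torch package as a whole, so any saving probably comes from skipping other libraries.

```python
import json
import os


class MyPredictor(object):
    """Sketch of a custom prediction routine; not the author's actual class."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Assumption: each instance is a JSON string produced by preproc.py.
        inputs = [json.loads(item) for item in instances]
        # A real torch model would need these lists converted to tensors first.
        return [self._model(x) for x in inputs]

    @classmethod
    def from_path(cls, model_dir):
        # Import only the name that is needed, and only when it is needed.
        from torch import load

        # "model.pt" is a hypothetical file name.
        model = load(os.path.join(model_dir, "model.pt"))
        return cls(model)


# Usage with a stand-in model so the sketch runs without torch installed.
predictor = MyPredictor(model=lambda x: sum(x))
print(predictor.predict(["[1, 2, 3]", "[4, 5]"]))
```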

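[Important]

I have not yet tested different serialization formats for the trained model; which one is used (torch.save, pickle, joblib, etc.) might make a difference in memory.
For those whose organization is a GCP partner, I found this link; they may be able to request more quota (I am guessing from 500 MB to about 2 GB). I did not have to go in this direction, as my issue was resolved and another one popped up, lol. https://cloud.google.com/ai-platform/training/docs/quotas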
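
On the untested serialization point, one way to start comparing formats is to measure the serialized size of the same object under each of them. The sketch below uses only the standard library and a made-up dictionary in place of real weights; it does not cover torch.save or joblib, and file size is only a proxy for what the service counts as memory.

```python
import gzip
import pickle

# Stand-in for trained weights (a real state_dict would hold tensors).
weights = {f"layer{i}": [float(j) for j in range(1000)] for i in range(10)}


def serialized_sizes(obj):
    """Return the byte size of obj under a few stdlib serializations."""
    sizes = {}
    for protocol in (2, pickle.HIGHEST_PROTOCOL):
        data = pickle.dumps(obj, protocol=protocol)
        sizes[f"pickle protocol {protocol}"] = len(data)
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sizes["pickle + gzip"] = len(gzip.compress(raw))
    return sizes


for name, size in serialized_sizes(weights).items():
    print(f"{name}: {size} bytes")
```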
