How to fix SageMaker in a local Jupyter notebook: cannot use the AWS-hosted XGBoost container ("KeyError: 'S3DistributionType'" and "Failed to run: ['docker-compose'")
Running SageMaker in a local Jupyter notebook (using VS Code) works fine, except that training an XGBoost model with the AWS-hosted container fails (container name: 246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3).
Jupyter notebook
import os

import sagemaker

session = sagemaker.LocalSession()

# Load and prepare the training and validation data
...

# Upload the training and validation data to S3
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

region = session.boto_region_name
instance_type = 'ml.m4.xlarge'
container = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1', 'py3', instance_type=instance_type)
role = 'arn:aws:iam::<USER ID #>:role/service-role/AmazonSageMaker-ExecutionRole-<ROLE ID #>'

xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output',
    sagemaker_session=session)
xgb_estimator.set_hyperparameters(max_depth=5, eta=0.2, gamma=4, min_child_weight=6, subsample=0.8,
                                  objective='reg:squarederror', early_stopping_rounds=10, num_round=200)

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')
xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})
Docker Container KeyError
algo-1-tfcvc_1 | ERROR:sagemaker-containers:Reporting training FAILURE
algo-1-tfcvc_1 | ERROR:sagemaker-containers:framework error:
algo-1-tfcvc_1 | Traceback (most recent call last):
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_trainer.py", line 84, in train
algo-1-tfcvc_1 |     entrypoint()
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
algo-1-tfcvc_1 |     train(framework.training_env())
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
algo-1-tfcvc_1 |     run_algorithm_mode()
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
algo-1-tfcvc_1 |     checkpoint_config=checkpoint_config
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 115, in sagemaker_train
algo-1-tfcvc_1 |     validated_data_config = channels.validate(data_config)
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 106, in validate
algo-1-tfcvc_1 |     channel_obj.validate(value)
algo-1-tfcvc_1 |   File "/miniconda3/lib/python3.6/site-packages/sagemaker_algorithm_toolkit/channel_validation.py", line 52, in validate
algo-1-tfcvc_1 |     if (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE]) not in self.supported:
algo-1-tfcvc_1 | KeyError: 'S3DistributionType'
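The last frame shows the mechanism: the container's channel validation builds a tuple from three keys of each channel's config and indexes the dict directly, so a channel entry that lacks `S3DistributionType` raises `KeyError` before the supported-combination check even runs. The channel config this container received evidently had no such key. The sketch below reproduces that failure in plain Python; the constant values, the `SUPPORTED` set, and the shape of the channel dicts are assumptions for illustration, not the container's real definitions.

```python
# Hypothetical re-creation of the failing check in channel_validation.py.
# Key names mirror the traceback; the supported set below is an assumption.
CONTENT_TYPE = "ContentType"
TRAINING_INPUT_MODE = "TrainingInputMode"
S3_DIST_TYPE = "S3DistributionType"

SUPPORTED = {
    ("csv", "File", "FullyReplicated"),
    ("csv", "File", "ShardedByS3Key"),
}


def validate_channel(value):
    """Raise if the (content type, input mode, distribution) combo is unsupported.

    Like line 52 in the traceback, this indexes the dict directly, so a missing
    key surfaces as KeyError instead of a readable validation message.
    """
    combo = (value[CONTENT_TYPE], value[TRAINING_INPUT_MODE], value[S3_DIST_TYPE])
    if combo not in SUPPORTED:
        raise ValueError(f"Unsupported channel configuration: {combo}")
    return combo


# Assumed shape of a channel entry without the distribution key.
channel_missing_key = {"ContentType": "csv", "TrainingInputMode": "File"}
# Assumed shape of a complete channel entry.
channel_complete = {**channel_missing_key, "S3DistributionType": "FullyReplicated"}

print(validate_channel(channel_complete))  # prints ('csv', 'File', 'FullyReplicated')
try:
    validate_channel(channel_missing_key)
except KeyError as err:
    print("KeyError:", err)  # prints KeyError: 'S3DistributionType'
```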
Runtime error on the local PC
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp71tx0fop/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
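This PC-side error carries no detail of its own: local mode starts the training container through `docker-compose`, and when that process exits with a non-zero code the notebook only sees a RuntimeError naming the command and the exit code. The real cause is in the container log (the KeyError above). The sketch below imitates that wrapping with `subprocess`; the helper name `run_or_raise` is hypothetical and the message format is copied from the error above, not from the SDK source.

```python
import subprocess
import sys


def run_or_raise(cmd):
    """Run cmd and raise RuntimeError if it exits with a non-zero code.

    Hypothetical helper: it mirrors only the shape of the local-mode error.
    The child's output goes straight to this process's stdout/stderr.
    """
    proc = subprocess.run(cmd)
    if proc.returncode != 0:
        raise RuntimeError(
            f"Failed to run: {cmd}, Process exited with code: {proc.returncode}"
        )
    return proc.returncode


# Stand-in for a failing training container: a Python process that exits with 1.
failing_cmd = [sys.executable, "-c", "import sys; print('training failed'); sys.exit(1)"]
try:
    run_or_raise(failing_cmd)
except RuntimeError as err:
    print(err)
```

The useful lesson is diagnostic: when this RuntimeError appears, scroll up to the container's own output rather than debugging the `docker-compose` call.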
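If the Jupyter notebook runs in the Amazon cloud SageMaker environment (rather than on the local PC), there is no error. Note that when running in a cloud notebook, the session is initialized as: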
session = sagemaker.Session()
It appears there is an issue with how LocalSession() works with hosted Docker containers.
Solution
When SageMaker runs in a local Jupyter notebook with a LocalSession, it expects the Docker container to run on the local machine as well.
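The key to getting SageMaker (running in a local notebook) to use the AWS-hosted Docker container is to omit the LocalSession object when initializing the Estimator. Without it, the estimator creates its own session from the default AWS configuration (see the `sagemaker_session` docstring below), so the training job runs in AWS instead of in local Docker.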
Wrong
xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output',
    sagemaker_session=session)
Correct
xgb_estimator = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1, train_instance_type=instance_type,
    output_path=f's3://{session.default_bucket()}/{prefix}/output')
Additional information
The SageMaker Python SDK source code provides the following helpful hints:
File: sagemaker/local/local_session.py
class LocalSagemakerClient(object):
"""A SageMakerClient that implements the API calls locally.
Used for doing local training and hosting local endpoints. It still needs access to
a boto client to interact with S3 but it won't perform any SageMaker call.
...
File: sagemaker/estimator.py
class EstimatorBase(with_metaclass(ABCMeta, object)):
"""Handle end-to-end Amazon SageMaker training and deployment tasks.
For introduction to model training and deployment, see
http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
Subclasses must define a way to determine what image to use for training, what hyperparameters to use, and how to create an appropriate predictor instance.
"""
def __init__(self, train_instance_count, train_instance_type, train_volume_size=30, train_max_run=24 * 60 * 60, input_mode='File', output_path=None, output_kms_key=None, base_job_name=None, sagemaker_session=None, tags=None):
"""Initialize an ``EstimatorBase`` instance.
Args:
role (str): An AWS IAM role (either name or full ARN). ...
...
sagemaker_session (sagemaker.session.Session): Session object which manages interactions with
Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one
using the default AWS configuration chain.
"""