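How to troubleshoot a Dask worker node that silently fails to join the cluster
I'm running into a very strange error with Dask.distributed. I have an unmanaged cluster of 4 VMs that I'm trying to use Dask on. I'm initializing the cluster with an SSHCluster object: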
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ['localhost',       # scheduler
     'localhost',       # worker 0
     '192.168.80.18',   # worker 1
     '192.168.80.14',   # worker 2
     '192.168.80.12'])  # worker 3
client = Client(cluster)
All four workers appear to start without errors:
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Scheduler at: tcp://192.168.80.13:8786
distributed.deploy.ssh - INFO - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.80.13:34395'
distributed.deploy.ssh - INFO - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.80.14:45773'
distributed.deploy.ssh - INFO - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.80.12:45709'
distributed.deploy.ssh - INFO - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.80.18:43979'
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.14:39597
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.18:34763
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.12:37999
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.13:33627
However, '192.168.80.18' never becomes part of the cluster. This is what the client object reports:
Client
Scheduler: tcp://192.168.80.13:8786
Dashboard: http://192.168.80.13:8787/status
Cluster
Workers: 3
Cores: 12
Memory: 101.19 GB
Digging into the scheduler logs, we can see that the problem node is never registered:
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://192.168.80.13:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.14:39597', name: 2, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.14:39597
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.12:37999', name: 3, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.12:37999
distributed.scheduler - INFO - Receive client connection: Client-4b17a4cc-f83f-11ea-b0ac-fa163e0984d7
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.13:33627', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.80.13:33627
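For reference, here is a minimal sketch of how the two logs can be compared mechanically to pinpoint which started worker never registered. The function name `unregistered_workers` and both regular expressions are illustrative assumptions based only on the log lines shown in this post, not part of the Dask API, and the sample logs are abridged.

```python
import re

# Matches "Start worker at: tcp://host:port" lines from the SSHCluster output.
STARTED = re.compile(r"Start worker at:\s*(tcp://[\d.]+:\d+)")
# Matches "Register worker <Worker 'tcp://host:port'" lines from the scheduler log.
REGISTERED = re.compile(r"Register worker <Worker '(tcp://[\d.]+:\d+)'")


def unregistered_workers(launch_log, scheduler_log):
    """Return worker addresses that started but never registered with the scheduler."""
    started = STARTED.findall(launch_log)
    registered = set(REGISTERED.findall(scheduler_log))
    # Preserve the order in which workers appear in the launch log.
    return [addr for addr in started if addr not in registered]


# Abridged copies of the logs quoted in this post.
LAUNCH_LOG = """
distributed.deploy.ssh - INFO - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.80.18:43979'
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.14:39597
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.18:34763
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.12:37999
distributed.deploy.ssh - INFO - distributed.worker - INFO - Start worker at: tcp://192.168.80.13:33627
"""

SCHEDULER_LOG = """
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.14:39597', name: 2, ...>
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.12:37999', name: 3, ...>
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.80.13:33627', name: 0, ...>
"""

print(unregistered_workers(LAUNCH_LOG, SCHEDULER_LOG))
# → ['tcp://192.168.80.18:34763']
```

On a live cluster, the keys of `client.scheduler_info()["workers"]` are the registered worker addresses, so they could probably stand in for the parsed scheduler log.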
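Furthermore, the problem node doesn't show up anywhere in the client logs.
I've spent several days debugging this to no avail, and I've done everything I can to make sure the environments on these VMs are identical. I don't understand how this node can simply exclude itself from the cluster without any error.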
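One check that may be worth ruling out, as an assumption rather than anything the logs above confirm, is whether the problem VM can open a TCP connection back to the scheduler at all. Below is a minimal standard-library sketch; `can_connect` is a hypothetical helper name, and the address and port are taken from the "Scheduler at" log line.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, to be run on 192.168.80.18 (the worker that never registers):
#     print(can_connect("192.168.80.13", 8786))
```

A False result from that VM, together with True from the other workers, would point toward a network or firewall difference rather than a Dask configuration issue. A True result would not prove the worker can register, only that the scheduler port is reachable.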
Any help? Thanks in advance.