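How to troubleshoot a Compute Engine VM that lost network connectivity
We have a VM running CentOS 8 in Compute Engine. It had been running for a long time and had never been rebooted. Last night we suddenly lost connectivity to the VM on both its internal and external IPs. SSH was not possible either.
On the serial console we observed the following logs: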
Apr 29 15:53:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:05:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3120] dhcp4 (eth0): option dhcp_lease_time => '3600'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_name => 'us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_name_servers => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option domain_search => 'us-central1-a.c.<project-name>.internal c.<project-name>.internal google.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option expiry => '1619716266'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option host_name => '<vm-name>.us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option interface_mtu => '1460'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option ip_address => '10.128.0.4'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option next_server => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option ntp_servers => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option requested_broadcast_address => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3124] dhcp4 (eth0): option requested_domain_name => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_domain_name_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_domain_search => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_host_name => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_interface_mtu => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_ms_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_nis_domain => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_nis_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_ntp_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_rfc3442_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3125] dhcp4 (eth0): option requested_root_path => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_routers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_subnet_mask => '1'
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=907 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_time_offset => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option requested_wpad => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option rfc3442_classless_static_routes => '10.128.0.1/32 0.0.0.0 0.0.0.0/0 10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option routers => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): option subnet_mask => '255.255.255.255'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3126] dhcp4 (eth0): state changed extended -> extended
Apr 29 16:11:06 <vm-name> systemd[1]: Starting Network Manager Script Dispatcher Service...
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Apr 29 16:11:06 <vm-name> systemd[1]: Started Network Manager Script Dispatcher Service.
Apr 29 16:15:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:29:30 <vm-name> GCEGuestAgent[1269]: 2021-04-29T16:28:35.1302Z GCEGuestAgent Error main.go:181: Error watching metadata: Get http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=4ac15b8179731d72: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Apr 29 16:29:31 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:29:03.9464Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 16:29:44 <vm-name> google_osconfig_agent[1215]: default
[16281276.978048] google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:11:14 <vm-name> google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 16:42:42 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:03:31 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:41:09.0306Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> NetworkManager[907]: <info> [1619717130.6935] dhcp4 (eth0): option dhcp_lease_time => '3600'
Apr 29 17:27:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:26:00 logging client: context deadline exceeded
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:07:12.7039Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:28:55 <vm-name> NetworkManager[907]: <info> [1619717199.5681] dhcp4 (eth0): option domain_name => 'us-central1-a.c.<project-name>.internal'
Apr 29 17:34:48 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:27:54.8816Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:36:13 logging client: context deadline exceeded
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:55:16 <vm-name> NetworkManager[907]: <info> [1619717287.9321] dhcp4 (eth0): option domain_name_servers => '169.254.169.254'
Apr 29 17:58:36 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:30:47.0872Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: default
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:46:22 logging client: context deadline exceeded
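Rebooting the machine restored the network. We could not find anything else in the logs, and monitoring did not show anything suspicious either. What could have caused this?
Solution
This OSConfigAgent error appears to point to the cause of the problem: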
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
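Google Cloud runs a local metadata server alongside every instance at 169.254.169.254. This server is essential to the operation of the instance, so the instance can reach it regardless of any firewall rules you configure.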
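To make the metadata-server point concrete, here is a minimal sketch of a reachability probe that could be run from inside the VM (for example over the serial console). It assumes Python 3 is available on the guest; the function name `metadata_reachable` and the choice of the `instance/id` key are illustrative, not something from the original logs. The `Metadata-Flavor: Google` header is the one the metadata server expects.

```python
import urllib.error
import urllib.request

# The "instance/id" key is only used here as a small, cheap value to read.
METADATA_URL = "http://169.254.169.254/computeMetadata/v1/instance/id"


def metadata_reachable(url=METADATA_URL, timeout=3.0, opener=urllib.request.urlopen):
    """Return (True, body) if the metadata server answers successfully, else (False, reason)."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with opener(request, timeout=timeout) as response:
            return True, response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # Timeouts and "network is unreachable" both end up here.
        return False, str(exc)
```

Calling `metadata_reachable()` on a healthy instance should return `True` together with the instance ID. A `False` result with a timeout or "network is unreachable" reason would match the errors the agents logged above. The `opener` parameter exists only so the function can be exercised without a network.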
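Given that, there must be something interesting worth looking at in /var/log/messages. You may find entries related to the network adapter.
From the following log excerpt, I infer that this most likely happened because network connectivity was lost: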
OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
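From the logs, I noticed:
At 16:11:06 the DHCP lease was renewed with a lease time of 3600 seconds. The next renewal should therefore have happened at around 17:11:06. According to the logs, however, it took place at 17:26:39, roughly 15 minutes late. The renewal after that took place at 17:55:07, about 28 minutes later.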
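The 15-minute figure can be reproduced from the epoch timestamps that NetworkManager prints in square brackets, which are more reliable here than the syslog prefix (the two visibly drift apart in the excerpt). The following is a rough sketch, not a general log parser: the helper names are made up for illustration, and the two sample lines are copied from the log above.

```python
import re
from datetime import datetime, timezone

# NetworkManager logs its own epoch timestamp in brackets, e.g. "[1619712666.3120]".
NM_DHCP_RE = re.compile(r"<info>\s+\[(\d+(?:\.\d+)?)\]\s+dhcp4 \((\w+)\)")


def dhcp_timestamps(lines):
    """Return the NetworkManager timestamps (as UTC datetimes) of all dhcp4 lines."""
    stamps = []
    for line in lines:
        match = NM_DHCP_RE.search(line)
        if match:
            stamps.append(datetime.fromtimestamp(float(match.group(1)), tz=timezone.utc))
    return stamps


def seconds_past_lease(previous, current, lease_seconds=3600):
    """How long after the previous lease's expiry the current DHCP activity was logged."""
    return (current - previous).total_seconds() - lease_seconds


sample = [
    "Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info> [1619712666.3120] dhcp4 (eth0): option dhcp_lease_time => '3600'",
    "Apr 29 17:28:55 <vm-name> NetworkManager[907]: <info> [1619717199.5681] dhcp4 (eth0): option domain_name => 'us-central1-a.c.<project-name>.internal'",
]
first, later = dhcp_timestamps(sample)
print(first.strftime("%H:%M:%S"), later.strftime("%H:%M:%S"))  # 16:11:06 17:26:39
print(round(seconds_past_lease(first, later) / 60, 1))         # 15.6
```

The second sample line carries a syslog prefix of 17:28:55 but a NetworkManager timestamp of 17:26:39, so the message itself was written to the log more than two minutes after it was generated, which is consistent with a system that was struggling to keep up.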
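The whole network stack, including DHCP renewal, was most likely delayed by CPU or memory overload. When a host is under high CPU load, network performance degrades.
To check whether the overload was in CPU or memory, open the Cloud Console -> Compute Engine -> VM instances -> click the three-dot menu of the instance -> View monitoring -> CPU and memory utilization charts (check the time range in which the problem occurred).
Make sure your instance has enough memory to sustain the load.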
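As a complement to the Console charts, the same two signals can be read from inside the guest. This is only a sketch under the assumption of a Linux guest whose /proc/meminfo exposes `MemAvailable` (true for the CentOS 8 kernel). The function names are illustrative, and no particular warning threshold is implied.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo_text):
    """Percentage of memory that is not available for new workloads."""
    info = parse_meminfo(meminfo_text)
    return 100.0 * (1 - info["MemAvailable"] / info["MemTotal"])


def load_per_cpu(loadavg_text, cpu_count):
    """1-minute load average (first field of /proc/loadavg) divided by the CPU count."""
    return float(loadavg_text.split()[0]) / cpu_count
```

On the VM these would be fed with `open("/proc/meminfo").read()`, `open("/proc/loadavg").read()` and `os.cpu_count()`. A memory figure close to 100%, or a per-CPU load that stays well above 1 for a long time, would support the overload explanation given above.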