Compute Engine VM lost network connectivity


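We have a VM running CentOS 8 on Compute Engine. The VM had been running for a long time and had never been restarted. Last night we suddenly lost connectivity to the VM over both its internal and external IP addresses. SSH was not possible either.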

On the serial console, we observed the following logs:

Apr 29 15:53:18 <vm-name> google_osconfig_agent[1215]: default

Apr 29 16:05:18 <vm-name> google_osconfig_agent[1215]: default

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3120] dhcp4 (eth0): option dhcp_lease_time      => '3600'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_name          => 'us-central1-a.c.<project-name>.internal'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_name_servers  => '169.254.169.254'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_search        => 'us-central1-a.c.<project-name>.internal c.<project-name>.internal google.internal'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option expiry               => '1619716266'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option host_name            => '<vm-name>.us-central1-a.c.<project-name>.internal'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option interface_mtu        => '1460'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option ip_address           => '10.128.0.4'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option next_server          => '10.128.0.1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option ntp_servers          => '169.254.169.254'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option requested_broadcast_address => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option requested_domain_name => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_domain_name_servers => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_domain_search => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_host_name  => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_interface_mtu => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_ms_classless_static_routes => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_nis_domain => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_nis_servers => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_ntp_servers => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_rfc3442_classless_static_routes => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_root_path  => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_routers    => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_static_routes => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_subnet_mask => '1'

Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=907 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_time_offset => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_wpad       => '1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option rfc3442_classless_static_routes => '10.128.0.1/32 0.0.0.0 0.0.0.0/0 10.128.0.1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option routers              => '10.128.0.1'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option subnet_mask          => '255.255.255.255'

Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): state changed extended -> extended

Apr 29 16:11:06 <vm-name> systemd[1]: Starting Network Manager Script Dispatcher Service...

Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'

Apr 29 16:11:06 <vm-name> systemd[1]: Started Network Manager Script Dispatcher Service.

Apr 29 16:15:18 <vm-name> google_osconfig_agent[1215]: default

Apr 29 16:29:30 <vm-name> GCEGuestAgent[1269]: 2021-04-29T16:28:35.1302Z GCEGuestAgent Error main.go:181: Error watching metadata: Get http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=4ac15b8179731d72: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Apr 29 16:29:31 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:29:03.9464Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout

Apr 29 16:29:44 <vm-name> google_osconfig_agent[1215]: default

[16281276.978048] google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded

Apr 29 17:11:14 <vm-name> google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded

Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 16:42:42 logging client: context deadline exceeded

Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default

Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:03:31 logging client: context deadline exceeded

Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default

Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:41:09.0306Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout

Apr 29 17:27:20 <vm-name> NetworkManager[907]: <info>  [1619717130.6935] dhcp4 (eth0): option dhcp_lease_time      => '3600'

Apr 29 17:27:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:26:00 logging client: context deadline exceeded

Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:07:12.7039Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout

Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

Apr 29 17:28:55 <vm-name> NetworkManager[907]: <info>  [1619717199.5681] dhcp4 (eth0): option domain_name          => 'us-central1-a.c.<project-name>.internal'

Apr 29 17:34:48 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:27:54.8816Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default

Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:36:13 logging client: context deadline exceeded

Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default

Apr 29 17:55:16 <vm-name> NetworkManager[907]: <info>  [1619717287.9321] dhcp4 (eth0): option domain_name_servers  => '169.254.169.254'

Apr 29 17:58:36 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:30:47.0872Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: default

Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:46:22 logging client: context deadline exceeded

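Restarting the machine restored networking. We could not find anything else in the logs, and monitoring showed nothing suspicious either. What could have caused this?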

Solution

This OSConfigAgent error seems to point to the cause of the problem:

Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

According to the GCP documentation:

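Google Cloud runs a local metadata server alongside each instance at 169.254.169.254. This server is essential to the operation of the instance, so the instance can access it regardless of any firewall rules that you configure.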
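If you can still get a shell on the VM (for example through the interactive serial console), a quick probe of the metadata server tells you whether the guest's network path is alive. The following is only a minimal sketch: the function names are made up for illustration, and the `instance/id` path and 2-second timeout are my choices, not something taken from the question.

```python
import urllib.error
import urllib.request

# Any small metadata key works; instance/id is used here as an example.
METADATA_URL = "http://169.254.169.254/computeMetadata/v1/instance/id"


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build a request carrying the header the metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def metadata_reachable(timeout: float = 2.0) -> bool:
    """Return True if the metadata server answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers timeouts and "network is unreachable", as seen in the log above.
        return False
```

Calling `metadata_reachable()` on a healthy instance should return `True`; a `False` result while the VM is otherwise responsive suggests the problem is inside the guest's network configuration rather than in firewall rules.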

Given this, there should be something interesting in /var/log/messages worth looking at. You may find entries related to the network adapter.
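To narrow /var/log/messages down to network-related entries, something like the sketch below may help. The keyword list is a guess on my part (adjust it to the driver and services on your VM), and `network_related` is just an illustrative helper name.

```python
import re
from typing import Iterable, List

# Hypothetical keyword list: network services, the interface name, and
# typical failure words. Extend it for your own setup.
PATTERN = re.compile(
    r"NetworkManager|eth0|virtio|link is (up|down)|unreachable|timeout",
    re.IGNORECASE,
)


def network_related(lines: Iterable[str]) -> List[str]:
    """Keep only the log lines that match one of the network keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

For example, `network_related(open("/var/log/messages", errors="replace"))` returns the matching lines in order, so gaps in their timestamps are easy to spot.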


Based on the following log excerpt, I infer that this most likely happened because network connectivity was interrupted:

OSConfigAgent Error main.go:189: network error when requesting metadata,make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

From the logs, I noticed:

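At 16:11:06 the DHCP lease was renewed with a lease time of 3600 seconds, so the next renewal should have happened at around 17:11:06. According to the logs, however, that renewal took place at 17:26:39, roughly 15 minutes late. The renewal after that happened at 17:55:07, roughly 27 minutes later.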
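The arithmetic can be checked against the epoch timestamps that NetworkManager prints in square brackets. This sketch assumes that 1619717199 (the `domain_name` line logged at 17:28:55) belongs to the late renewal; that is my reading of the log, since it converts to 17:26:39 UTC.

```python
from datetime import datetime, timedelta, timezone

LEASE_SECONDS = 3600              # dhcp_lease_time option in the log
LEASE_GRANTED_EPOCH = 1619712666  # NetworkManager timestamp of the 16:11:06 renewal
NEXT_RENEWAL_EPOCH = 1619717199   # assumed timestamp of the late renewal


def utc(epoch: int) -> datetime:
    """Convert a Unix epoch value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


granted = utc(LEASE_GRANTED_EPOCH)
expected = granted + timedelta(seconds=LEASE_SECONDS)  # lease expiry
observed = utc(NEXT_RENEWAL_EPOCH)
delay = observed - expected

print(expected.strftime("%H:%M:%S"))  # 17:11:06
print(observed.strftime("%H:%M:%S"))  # 17:26:39
print(delay)                          # 0:15:33
```

The expected value also agrees with the `expiry => '1619716266'` option in the log, which is exactly 3600 seconds after the grant.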

The whole network stack, including DHCP renewal, was most likely delayed by CPU or memory overload. When a host is under high CPU load, network performance degrades.

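To check whether the overload was in CPU or memory, open Cloud Console -> Compute Engine -> VM instances -> click the three dots for the instance -> View monitoring -> CPU and memory utilization charts (check the time range in which the issue occurred).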

Make sure your instance has enough memory to sustain the load.
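If you prefer to check memory pressure from inside the guest as well, a rough sketch is to read /proc/meminfo. This is not part of the Cloud Console steps; the helper names below are illustrative, and it assumes a Linux kernel that reports `MemAvailable`.

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo: dict) -> float:
    """Fraction of RAM that is not available for new allocations."""
    return 1.0 - meminfo["MemAvailable"] / meminfo["MemTotal"]
```

On the VM you could run `memory_used_fraction(parse_meminfo(open("/proc/meminfo").read()))`; a value that stays close to 1.0 would support the overload theory.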
