Elasticsearch unassigned shards CircuitBreakingException [parent] Data too large

How do I resolve Elasticsearch unassigned shards with CircuitBreakingException [parent] Data too large?

I received a warning stating that Elasticsearch has 2 unassigned shards. I made the API call below to gather more details:

    curl -s http://localhost:9200/_cluster/allocation/explain | python -m json.tool

Output below:

    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes","can_allocate": "no","current_state": "unassigned","index": "docs_0_1603929645264","node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "max_retry","decision": "NO","explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry,[unassigned_info[[reason=ALLOCATION_FAILED],at[2020-10-30T06:10:16.305Z],failed_attempts[5],delayed=false,details[failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery,failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{IP}{IP:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested: 
CircuitBreakingException[[parent] Data too large,data for [<transport_request>] would be [1972835086/1.8gb],which is larger than the limit of [1972122419/1.8gb],real usage: [1972833976/1.8gb],new bytes reserved: [1110/1kb]]; ],allocation_status[no_attempt]]]"
                }
            ],"node_decision": "no","node_id": "1XEXS92jTK-asdfasdfasdf","node_name": "elasticsearch-data-2","transport_address": "IP1:9300"
        },{
            "deciders": [
                {
                    "decider": "max_retry",failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP1:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{IP2}{IP2:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP1:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested: 
CircuitBreakingException[[parent] Data too large,allocation_status[no_attempt]]]"
                },{
                    "decider": "same_shard","explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[docs_0_1603929645264][0],node[fIaSuZsNTwODgZnt90f7kQ],[P],s[STARTED],a[id=stHnyqjLQ7OwFbaqs5vWqA]]"
                }
            ],"node_id": "fIaSuZsNTwODgZnt90f7kQ","node_name": "elasticsearch-data-1",failure RecoveryFailedException[[docs_0_1603929645264][0]: Recovery failed from {elasticsearch-data-1}{fIaSuZsNTwODgZnt90f7kQ}{Qxl9iPacQVS-tN_t4YJqrw}{IP1}{IP1:9300} into {elasticsearch-data-0}{o_9jyrmOSca9T12J4bY0Nw}{1w5mgwy0RYqBQ9c-qA_6Hw}{Ip2}{IP2:9300}]; nested: RemoteTransportException[[elasticsearch-data-1][IP1:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [129] files with total size of [4.4gb]]; nested: RemoteTransportException[[elasticsearch-data-0][IP2:9300][internal:index/shard/recovery/file_chunk]]; nested: 
CircuitBreakingException[[parent] Data too large,"node_id": "o_9jyrmOSca9T12J4bY0Nw","node_name": "elasticsearch-data-0","transport_address": "IP2:9300"
        }
    ],"primary": false,"shard": 0,"unassigned_info": {
        "at": "2020-10-30T06:10:16.305Z","details": "failed shard on node [o_9jyrmOSca9T12J4bY0Nw]: failed recovery,new bytes reserved: [1110/1kb]]; ","failed_allocation_attempts": 5,"last_allocation_status": "no_attempt","reason": "ALLOCATION_FAILED"
    }
}

I queried the circuit breaker configuration:

    curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"

and I can see that the parent limit_size_in_bytes for the 3 nodes (elasticsearch-data-0, elasticsearch-data-1 and elasticsearch-data-2) is as follows:

"parent" : {
          "limit_size_in_bytes" : 1972122419,"limit_size" : "1.8gb","estimated_size_in_bytes" : 1648057776,"estimated_size" : "1.5gb","overhead" : 1.0,"tripped" : 139
        }
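As a side note, the same stats API can be narrowed down to just the node name and the parent breaker section using Elasticsearch's response filtering (filter_path), which makes comparing the three nodes easier:

    curl -s "localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent&pretty"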

I referred to this answer https://stackoverflow.com/a/61954408 and plan to increase either the circuit breaker's memory percentage or the JVM heap as a whole.
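For context, raising the breaker percentage would be a dynamic cluster settings update along these lines (the 98% is only a placeholder to illustrate the call; it buys headroom, whereas raising the heap addresses the underlying memory pressure):

    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "indices.breaker.total.limit": "98%"
      }
    }'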

This is a k8s environment, and elasticsearch-data is deployed as a StatefulSet with 3 replicas. When I describe the StatefulSet, I can see the ENV variables defined below:

Containers:
   elasticsearch:
    Image:      custom/elasticsearch-oss-s3:7.0.0
    Port:       9300/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     10500m
      memory:  21Gi
    Requests:
      cpu:      10
      memory:   20Gi
    Environment:
      DISCOVERY_SERVICE:     elasticsearch-discovery
      NODE_MASTER:           false
      PROCESSORS:            11 (limits.cpu)
      ES_JAVA_OPTS:          -Djava.net.preferIPv4Stack=true -Xms2048m -Xmx2048m

Going by this, the heap size appears to be 2048m.

I logged into the elasticsearch-data pod and saw the following files under the Elasticsearch config directory:

elasticsearch.keystore  elasticsearch.yml  jvm.options  log4j2.properties  repository-s3

elasticsearch.yml doesn't have any heap configuration. It just has the names of the master nodes and so on.

Below is the jvm.options file:


## JVM configuration

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms1g
-Xmx1g


## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly


## DNS cache policy
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10


# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow


# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT

From the above, the total heap size appears to be 1g.

But from the env variable defined in this pod's StatefulSet, it appears to be 2048m.

Which one is correct?
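One way to settle this is to ask the running cluster itself; the cat nodes API reports the heap the JVM actually started with under heap.max:

    curl -s "localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max"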

Now, according to the link below,

Circuit breaker settings | Elasticsearch

the parent-level circuit breaker can be configured with the following settings:

indices.breaker.total.use_real_memory (Static) Determines whether the parent breaker should take real memory usage into account (true) or only consider the amount reserved by the child breakers (false). Defaults to true.

indices.breaker.total.limit (Dynamic) Starting limit for the overall parent breaker. Defaults to 70% of the JVM heap if indices.breaker.total.use_real_memory is false. If indices.breaker.total.use_real_memory is true, defaults to 95% of the JVM heap.

But the limit value in the error, and in the circuit breaker stats I queried, is 1972122419 bytes (1.8gb). That doesn't appear to be 95% of either 2048m or 1g.
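Back-of-the-envelope arithmetic on those numbers (my own calculation): 95% of a 1g heap is about 1020054732 bytes and 95% of a 2048m heap about 2040109465 bytes, while 1972122419 / 0.95 ≈ 2075918336 bytes ≈ 1.93gb. So the limit corresponds to a JVM-reported max heap just under 2gb, which lines up with the 2048m from ES_JAVA_OPTS rather than the 1g in jvm.options (the JVM usually reports a usable max heap slightly below the configured -Xmx).

    python -c 'print(0.95 * 1 * 1024**3, 0.95 * 2 * 1024**3, 1972122419 / 0.95)'
    # -> 1020054732.8  2040109465.6  ~2075918335.8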

So how do I increase the heap, or the memory limit of the parent circuit breaker, to get rid of this error?

Solution

There are two things going on here: the shard allocation exception and the circuit breaker exception (which appears to be a nested exception).

Please re-trigger the allocation in your cluster with the command below, since all previous retries have failed; if you look closely, the exception message suggests this very command. More details about it are in this related GitHub issue comment.

    curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
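Once the reroute has been triggered, you can watch whether the two shards actually leave the UNASSIGNED state, for example with:

    curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED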

If it still doesn't work, you will have to fix the parent circuit breaker exception: use the http://localhost:9200/_nodes/stats API to find out the exact heap of your ES nodes and increase it accordingly.
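As a concrete sketch of that check, the heap each node is actually running with is visible under jvm.mem.heap_max_in_bytes in the nodes stats; if it turns out to be the small ~2gb heap, the usual way to raise it in this StatefulSet setup is to bump -Xms/-Xmx in the ES_JAVA_OPTS env variable (illustrative values below; the pods request 20Gi of memory, so there appears to be room) and restart the pods:

    curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_max_in_bytes&pretty"

    # illustrative StatefulSet change, then restart the pods:
    #   - name: ES_JAVA_OPTS
    #     value: "-Djava.net.preferIPv4Stack=true -Xms4096m -Xmx4096m"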
