How to Persist TKE/EKS Cluster Event Logs

On Tencent Cloud, TKE and EKS clusters keep event logs for only one hour by default. When a service has problems, you often need historical events to troubleshoot, and a one-hour window makes that very inconvenient. Tencent Cloud can ship cluster events to CLS out of the box, but CLS is a paid service, and many people are used to querying logs in Elasticsearch. Below we use the open-source eventrouter to collect event logs into Elasticsearch and then query them with Kibana. eventrouter documentation: https://github.com/heptiolabs/eventrouter
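For reference, the events a cluster currently holds can be listed with kubectl; anything past the retention window has already been dropped:

# List recent events in all namespaces, oldest first;
# events older than the retention window no longer appear
kubectl get events -A --sort-by=.metadata.creationTimestamp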

eventrouter uses the List-Watch mechanism to receive the cluster's events in real time and push them to different sinks. The persistence scheme here: eventrouter writes the events it receives to a log file, a filebeat sidecar container in the same pod collects that log file and writes it to ES, and finally Kibana is used to search the logs stored in ES.

Let's walk through the deployment. This walkthrough uses a TKE cluster; an EKS cluster is deployed the same way.

1. Deploy Elasticsearch

Create the ES cluster from the following YAML:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: weixnie-es-test
    meta.helm.sh/release-namespace: weixnie
  labels:
    app: elasticsearch-master
    app.kubernetes.io/managed-by: Helm
    chart: elasticsearch
    heritage: Helm
    release: weixnie-es-test
  name: elasticsearch-master
  namespace: weixnie
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch-master
  serviceName: elasticsearch-master-headless
  template:
    metadata:
      labels:
        app: elasticsearch-master
        chart: elasticsearch
        heritage: Helm
        release: weixnie-es-test
      name: elasticsearch-master
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - elasticsearch-master
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: node.name
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,
        - name: discovery.seed_hosts
          value: elasticsearch-master-headless
        - name: cluster.name
          value: elasticsearch
        - name: network.host
          value: 0.0.0.0
        - name: ES_JAVA_OPTS
          value: -Xmx1g -Xms1g
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file

              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }

              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                  http "/_cluster/health?timeout=0s"
              else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-master
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: configure-sysctl
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      terminationGracePeriodSeconds: 120
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      volumeMode: Filesystem
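Once the three pods are Ready, you can sanity-check the cluster. Since the readiness probe above already relies on curl inside the ES image, the simplest check is an exec (namespace and pod name per the manifest above):

# Wait for all three ES pods to become Ready
kubectl -n weixnie get pods -l app=elasticsearch-master

# Query cluster health from inside one of the ES pods
kubectl -n weixnie exec elasticsearch-master-0 -- curl -s "http://localhost:9200/_cluster/health?pretty"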

2. Deploy eventrouter

Create eventrouter and configure filebeat. Here filebeat ships directly to ES; if you want to ship to Kafka first and then forward to ES, you can add a Logstash to do the forwarding.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: eventrouter 
  namespace: weixnie
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eventrouter 
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eventrouter 
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eventrouter
subjects:
- kind: ServiceAccount
  name: eventrouter
  namespace: weixnie
---
apiVersion: v1
data:
  config.json: |- 
    {
      "sink": "glog"
    }
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: weixnie
  labels:
    app: eventrouter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
        tier: control-plane-addons
    spec:
      containers:
        - name: kube-eventrouter
          image: baiyongjie/eventrouter:v0.2
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/eventrouter -v 3 -log_dir /data/log/eventrouter"
          volumeMounts:
          - name: config-volume
            mountPath: /etc/eventrouter
          - name: log-path
            mountPath: /data/log/eventrouter
        - name: filebeat
          image: elastic/filebeat:7.6.2
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "filebeat -c /etc/filebeat/filebeat.yml"
          volumeMounts:
          - name: filebeat-config
            mountPath: /etc/filebeat/
          - name: log-path
            mountPath: /data/log/eventrouter
      serviceAccount: eventrouter
      volumes:
        - name: config-volume
          configMap:
            name: eventrouter-cm
        - name: filebeat-config
          configMap:
            name: filebeat-config
        - name: log-path
          emptyDir: {}

---
apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - "/data/log/eventrouter/*"

    setup.template.name: "tke-event"      # name of the custom index template
    setup.template.pattern: "tke-event-*" # indices the template applies to, i.e. everything starting with tke-event
    setup.template.enabled: false         # disable the default template configuration
    setup.template.overwrite: true        # overwrite any existing template with this one
    setup.ilm.enabled: false              # ILM is on by default and forces index names to filebeat-*; disable it to use a custom index name

    output.elasticsearch:
      hosts: ['elasticsearch-master:9200']
      index: "tke-event-%{+yyyy.MM.dd}"
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: weixnie

To check whether log collection is working, look at whether the ES indices are created properly; if the tke-event index is created, collection is working:

[root@VM-55-14-tlinux ~]# curl 10.55.254.57:9200/_cat/indices
green open .kibana_task_manager_1           31GLIGOZRSWaLvCD9Qi6pw 1 1    2 0    68kb    34kb
green open .apm-agent-configuration         kWHztrKkRJG0QNAQuNc5_A 1 1    0 0    566b    283b
green open ilm-history-1-000001             rAcye5j4SCqp_mcL3r3q2g 1 1   18 0  50.6kb  25.3kb
green open tke-event-2022.04.30             R4R1MOJiSuGCczWsSu2bVA 1 1  390 0 590.3kb 281.3kb
green open .kibana_1                        NveB_wCWTkqKVqadI2DNjw 1 1   10 1 351.9kb 175.9kb

3. Deploy Kibana

To make searching easier, create a Kibana instance to query the event logs:

apiVersion: v1
data:
  kibana.yml: |
    elasticsearch.hosts: http://elasticsearch-master:9200
    server.host: "0"
    server.name: kibana
kind: ConfigMap
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - image: kibana:7.6.2
        imagePullPolicy: IfNotPresent
        name: kibana
        ports:
        - containerPort: 5601
          name: kibana
          protocol: TCP
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/kibana/config/kibana.yml
          name: kibana
          subPath: kibana.yml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kibana
        name: kibana
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  ports:
  - name: 5601-5601-tcp
    port: 5601
    protocol: TCP
    targetPort: 5601
  selector:
    app: kibana
  sessionAffinity: None
  type: ClusterIP
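If you just want a quick look without exposing the Service, a port-forward also works:

# Forward local port 5601 to the kibana Service, then open http://127.0.0.1:5601
kubectl -n weixnie port-forward svc/kibana 5601:5601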

If nginx-ingress is installed in the cluster, you can expose Kibana on a domain through an Ingress for access:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx-intranet
  name: kibana-ingress
  namespace: weixnie
spec:
  rules:
  - host: kibana.tke.niewx.cn
    http:
      paths:
      - backend:
          serviceName: kibana
          servicePort: 5601
        path: /
        pathType: ImplementationSpecific

4. Test searching events

Log in to Kibana.

Then create an index pattern. Filebeat writes all indices with the tke-event prefix, so creating a tke-event-* index pattern in Kibana is enough.

Now delete a test pod to generate some events and see whether they can be found in Kibana:

[niewx@VM-0-4-centos ~]$ k delete pod nginx-6ccd9d7969-f4rfj
pod "nginx-6ccd9d7969-f4rfj" deleted
[niewx@VM-0-4-centos ~]$ k get pod | grep nginx
nginx-6ccd9d7969-fbz9d            1/1     Running       0          23s
[niewx@VM-0-4-centos ~]$ k describe pod nginx-6ccd9d7969-fbz9d | grep -A 10 Events
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23
  Normal  Pulling    58s   kubelet            Pulling image "nginx:latest"
  Normal  Pulled     55s   kubelet            Successfully pulled image "nginx:latest"
  Normal  Created    55s   kubelet            Created container nginx
  Normal  Started    55s   kubelet            Started container nginx

These events show up in the Kibana search, which means the event logs are being persisted to ES successfully.
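You can also confirm directly against ES without going through Kibana, for example by searching the tke-event indices for the deleted pod's name. This reuses the ES endpoint from the _cat/indices check above; adjust the address and query string to your environment:

# URI search across the tke-event indices for the deleted pod's name
curl -s "http://10.55.254.57:9200/tke-event-*/_search?q=nginx-6ccd9d7969-f4rfj&size=1&pretty"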

5. Clean up ES indices on a schedule

The event logs are stored in ES, with each day's events written to a separate index. If the cluster generates many events, keeping them too long can easily fill up the disk. We can write a small script and run it from a CronJob to clean up old indices periodically.

The cleanup script clean-es-indices.sh takes two arguments: the first is how many days back the indices to delete are, the second is the ES host address. Also note the date format in the script: my index names use the +%Y.%m.%d date format, which is what the script assumes; if your format differs, adjust the script and rebuild the image.

#!/bin/bash

# Usage: clean-es-indices.sh <days> <es_host>
day=$1
es_host=$2

# Date suffix of the indices to delete; must match the +%Y.%m.%d index naming
DATA=$(date -d "${day} days ago" +%Y.%m.%d)

echo "Cleaning indices for ${DATA}"

# Current time, for logging
time=$(date)

# Delete the indices from ${day} days ago, if any exist
if curl -s -XGET "http://${es_host}:9200/_cat/indices/?v" | grep "${DATA}"; then
  curl -XDELETE "http://${es_host}:9200/*-${DATA}"
  echo "Cleaned ${DATA} indices at ${time}"
else
  echo "No indices from ${DATA} to clean"
fi
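You can test the script locally before baking it into an image, e.g. to clean indices from 3 days ago (run it somewhere that can reach the ES host):

bash clean-es-indices.sh 3 elasticsearch-master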

Write a Dockerfile to bake the script into an image; the Dockerfile is as follows:

FROM centos:7
COPY clean-es-indices.sh /

If you don't have a Docker environment to build with, you can also use the image I have already built: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
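If you do build your own, it is the usual build and push; your-registry below is a placeholder for your own repository:

docker build -t your-registry/clean-es-indices:latest .
docker push your-registry/clean-es-indices:latest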

Next, create a CronJob from this image:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    k8s-app: clean-es-indices
    qcloud-app: clean-es-indices
  name: clean-es-indices
  namespace: weixnie
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          labels:
            k8s-app: clean-es-indices
            qcloud-app: clean-es-indices
        spec:
          containers:
          - args:
            - sh -x /clean-es-indices.sh 3 elasticsearch-master
            command:
            - sh
            - -c
            image: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
            imagePullPolicy: Always
            name: clean-es-indices
            resources: {}
            securityContext:
              privileged: false
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: qcloudregistrykey
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0 */23 * * *
  successfulJobsHistoryLimit: 3
  suspend: false

The CronJob's schedule 0 */23 * * * fires at minute 0 of the hours matching */23, which are 00:00 and 23:00, so it actually runs twice a day rather than strictly every 23 hours; if you want it exactly once a day, 0 0 * * * is simpler. The arguments in the start command here are 3 and elasticsearch-master: indices older than 3 days are cleaned, and because ES and the CronJob are in the same namespace, the ES service can be reached directly by its service name.
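To verify the job without waiting for the schedule, you can trigger a one-off run from the CronJob:

# Create a Job from the CronJob template and follow its log
kubectl -n weixnie create job clean-es-indices-manual --from=cronjob/clean-es-indices
kubectl -n weixnie logs -l job-name=clean-es-indices-manual -f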

Original article: https://cloud.tencent.com/developer/article/1990549
