## Requirements

Collect Elasticsearch (ES) metrics, visualize them, and alert on them.
## Current state

- ES is installed via Docker Compose.
- The K8S cluster in the same environment already runs Prometheus, Alertmanager, and Grafana.
## Solution

Reuse the existing monitoring stack and monitor ES with Prometheus.

Concretely:
### Collection: elasticsearch_exporter

The metrics that can be monitored include:
Name | Type | Cardinality | Help |
---|---|---|---|
elasticsearch_breakers_estimated_size_bytes | gauge | 4 | Estimated size in bytes of breaker |
elasticsearch_breakers_limit_size_bytes | gauge | 4 | Limit size in bytes for breaker |
elasticsearch_breakers_tripped | counter | 4 | tripped for breaker |
elasticsearch_cluster_health_active_primary_shards | gauge | 1 | The number of primary shards in your cluster. This is an aggregate total across all indices. |
elasticsearch_cluster_health_active_shards | gauge | 1 | Aggregate total of all shards across all indices, which includes replica shards. |
elasticsearch_cluster_health_delayed_unassigned_shards | gauge | 1 | Shards delayed to reduce reallocation overhead |
elasticsearch_cluster_health_initializing_shards | gauge | 1 | Count of shards that are being freshly created. |
elasticsearch_cluster_health_number_of_data_nodes | gauge | 1 | Number of data nodes in the cluster. |
elasticsearch_cluster_health_number_of_in_flight_fetch | gauge | 1 | The number of ongoing shard info requests. |
elasticsearch_cluster_health_number_of_nodes | gauge | 1 | Number of nodes in the cluster. |
...
The complete list can be found directly on GitHub (see the elasticsearch_exporter repository in the references below).
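As a quick way to see which of these metrics are actually exposed in your environment, you can query the exporter's metrics endpoint once it is running (see the implementation steps below). A minimal sketch, assuming the exporter listens on its default port 9114 on localhost:

```bash
# Assumption: elasticsearch_exporter is already running and listening on its
# default port 9114 on this host.
# Print the names and help strings of the Elasticsearch metrics it exposes.
curl -s http://localhost:9114/metrics | grep '^# HELP elasticsearch_'
```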
### Visualization: Grafana

📚️ Reference: see *ElasticSearch dashboard for Grafana* in the references below.
### Alerting: Prometheus Alertmanager

📚️ Reference:

ElasticSearch: https://awesome-prometheus-alerts.grep.to/rules.html#elasticsearch-1
## Implementation steps

The manual implementation steps are as follows.
### Docker Compose
```bash
docker pull quay.io/prometheuscommunity/elasticsearch-exporter:v1.3.0
```
#### docker-compose.yml

Examples:
🐾 Warning:

The exporter fetches information from the Elasticsearch cluster on every scrape, so a scrape interval that is too short puts load on the ES master nodes, particularly when you run it with `--es.all` and `--es.indices`. We recommend measuring how long `/_nodes/stats` and `/_all/_stats` take on your ES cluster to determine whether your scrape interval is too short (see the timing example after this warning).
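A quick way to do this measurement is with `curl` timing output. A minimal sketch, assuming ES is reachable at http://localhost:9200 without authentication; adjust the URL and add credentials (`-u user:pass`) as needed:

```bash
# Assumption: Elasticsearch is reachable at http://localhost:9200 without auth.
# %{time_total} prints the total request time in seconds for each endpoint.
curl -s -o /dev/null -w '/_nodes/stats took %{time_total}s\n' http://localhost:9200/_nodes/stats
curl -s -o /dev/null -w '/_all/_stats  took %{time_total}s\n' http://localhost:9200/_all/_stats
```

If these calls already take a noticeable fraction of your intended scrape interval, lengthen the interval or drop some of the `--es.*` flags.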
The original `docker-compose.yml` for ES looks like this:
```yaml
version: '3'
services:
  elasticsearch:
    image: elasticsearch-plugins:6.8.18
    ...
    ports:
      - 9200:9200
      - 9300:9300
    restart: always
```
With `elasticsearch_exporter` added, the YAML becomes:
```yaml
version: '3'
services:
  elasticsearch:
    image: elasticsearch-plugins:6.8.18
    ...
    ports:
      - 9200:9200
      - 9300:9300
    restart: always
  elasticsearch_exporter:
    image: quay.io/prometheuscommunity/elasticsearch-exporter:v1.3.0
    command:
      - '--es.uri=http://elasticsearch:9200'
      - '--es.all'
      - '--es.indices'
      - '--es.indices_settings'
      - '--es.indices_mappings'
      - '--es.shards'
      - '--es.snapshots'
      - '--es.timeout=30s'
    restart: always
    ports:
      - "9114:9114"
```
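To sanity-check the exporter after bringing the stack up, one option is to hit its metrics endpoint on the ES host. A minimal sketch, assuming the compose file above and the default 9114 port mapping:

```bash
# Bring the stack up and confirm both containers are running.
docker-compose up -d
docker-compose ps

# The exporter should report the cluster health status (1 for the active color).
curl -s http://localhost:9114/metrics | grep '^elasticsearch_cluster_health_status'
```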
### Prometheus configuration changes

#### Prometheus configuration

Add a static scrape config to Prometheus:
```yaml
scrape_configs:
  - job_name: "es"
    static_configs:
      - targets: ["x.x.x.x:9114"]
```
Note: `x.x.x.x` is the IP of the ES exporter. Because the exporter is deployed via Docker Compose on the same host as ES, this is also the IP of the ES host.
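After reloading Prometheus, you can confirm the new target is being scraped through the HTTP API. A minimal sketch, assuming Prometheus is reachable at the hypothetical address http://prometheus.example.local:9090:

```bash
# Assumption: http://prometheus.example.local:9090 is a placeholder for your Prometheus address.
# up{job="es"} should return 1 once the exporter target is scraped successfully.
curl -sG 'http://prometheus.example.local:9090/api/v1/query' \
  --data-urlencode 'query=up{job="es"}'
```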
#### Prometheus Rules

Add the ES-related Prometheus rules:
```yaml
groups:
  - name: elasticsearch
    rules:
      - record: elasticsearch_filesystem_data_used_percent
        expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes) / elasticsearch_filesystem_data_size_bytes
      - record: elasticsearch_filesystem_data_free_percent
        expr: 100 - elasticsearch_filesystem_data_used_percent
      - alert: ElasticsearchTooFewNodesRunning
        expr: elasticsearch_cluster_health_number_of_nodes < 3
        for: 0m
        labels:
          severity: critical
        annotations:
          description: "Missing node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
          summary: ElasticSearch running on less than 3 nodes (instance {{ $labels.instance }}, node {{ $labels.node }})
      - alert: ElasticsearchDiskSpaceLow
        expr: elasticsearch_filesystem_data_free_percent < 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch disk space low (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchDiskOutOfSpace
        expr: elasticsearch_filesystem_data_free_percent < 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch disk out of space (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchHeapUsageWarning
        expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "The heap usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchHeapUsageTooHigh
        expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "The heap usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch Cluster Red (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchClusterYellow
        expr: elasticsearch_cluster_health_status{color="yellow"} == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elastic Cluster Yellow status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchHealthyDataNodes
        expr: elasticsearch_cluster_health_number_of_data_nodes < 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Missing data node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchRelocatingShards
        expr: elasticsearch_cluster_health_relocating_shards > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: Elasticsearch relocating shards (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch is relocating shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchRelocatingShardsTooLong
        expr: elasticsearch_cluster_health_relocating_shards > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch has been relocating shards for 15min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchInitializingShards
        expr: elasticsearch_cluster_health_initializing_shards > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: Elasticsearch initializing shards (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch is initializing shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchInitializingShardsTooLong
        expr: elasticsearch_cluster_health_initializing_shards > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch has been initializing shards for 15 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Elasticsearch unassigned shards (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch has unassigned shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchPendingTasks
        expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch pending tasks (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "Elasticsearch has pending tasks. Cluster works slowly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: ElasticsearchNoNewDocuments
        expr: increase(elasticsearch_indices_docs{es_data_node="true"}[10m]) < 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch no new documents (instance {{ $labels.instance }}, node {{ $labels.node }})
          description: "No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```
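To catch YAML or PromQL mistakes early, the rule file can be validated before it is loaded. A minimal sketch, assuming the rules above are saved as the hypothetical file `es-rules.yml` and that `promtool` is available where Prometheus runs:

```bash
# Validate the syntax of the rule file (es-rules.yml is a placeholder name).
promtool check rules es-rules.yml

# Optional: trigger a reload instead of a full restart
# (only works if Prometheus was started with --web.enable-lifecycle).
curl -X POST http://localhost:9090/-/reload
```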
Then restart Prometheus for the rules to take effect.
🐾 Warning:

The `ElasticsearchTooFewNodesRunning` alert fires when the ES cluster has fewer than 3 nodes, so it produces false positives on a single-node ES. Enable the rule only where it applies, or silence it as needed (see the `amtool` example below). The same applies to `ElasticsearchHealthyDataNodes`.
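If you keep the rules but want to mute them for a single-node cluster, a silence can be created from the command line with `amtool`. A minimal sketch, assuming Alertmanager is reachable at the hypothetical address http://localhost:9093:

```bash
# Assumption: http://localhost:9093 is a placeholder for your Alertmanager address.
# Silence the 3-node alerts for 30 days on a single-node ES cluster.
amtool silence add alertname=ElasticsearchTooFewNodesRunning \
  --alertmanager.url=http://localhost:9093 \
  --comment="single-node ES cluster" --duration=720h
amtool silence add alertname=ElasticsearchHealthyDataNodes \
  --alertmanager.url=http://localhost:9093 \
  --comment="single-node ES cluster" --duration=720h
```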
### AlertManager routing and receiver configuration

Adjust as needed; for example:
```yaml
'global':
  'smtp_smarthost': ''
  'smtp_from': ''
  'smtp_require_tls': false
  'resolve_timeout': '5m'
'receivers':
  - 'name': 'es-email'
    'email_configs':
      - 'to': 'sfw@example.com,sdfwef@example.com'
        'send_resolved': true
'route':
  'group_by':
    - 'job'
  'group_interval': '5m'
  'group_wait': '30s'
  'routes':
    - 'receiver': 'es-email'
      'match':
        'job': 'es'
```
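Before applying it, the configuration can be checked with `amtool`. A minimal sketch, assuming the configuration above is saved as `alertmanager.yml`:

```bash
# Validate the Alertmanager configuration file before restarting.
amtool check-config alertmanager.yml
```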
Then restart Alertmanager for the configuration to take effect.
### Grafana configuration

Import the Grafana dashboard in JSON format:
{ "__inputs": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "5.4.0" }, { "type": "panel", "id": "graph", "name": "Graph", "version": "5.0.0" }, { "type": "datasource", "id": "prometheus", "name": "Prometheus", "version": "5.0.0" }, { "type": "panel", "id": "singlestat", "name": "Singlestat", "version": "5.0.0" } ], ...
The complete dashboard JSON can be found directly on Grafana Labs (see the references below).
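If you prefer not to click through the UI, the dashboard JSON can also be imported through the Grafana HTTP API. A minimal sketch, assuming a hypothetical Grafana address, an API token in `$GRAFANA_TOKEN`, and the full JSON stored in `dashboard.json` wrapped in the import envelope:

```bash
# Assumption: http://grafana.example.local:3000 and $GRAFANA_TOKEN are placeholders.
# dashboard.json must look like: {"dashboard": { ...full dashboard JSON... }, "overwrite": true}
curl -s -X POST 'http://grafana.example.local:3000/api/dashboards/db' \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H 'Content-Type: application/json' \
  -d @dashboard.json
```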
## 📚️ References
- prometheus-community/elasticsearch_exporter: Elasticsearch stats exporter for Prometheus (github.com)
- ElasticSearch dashboard for Grafana | Grafana Labs
- Awesome Prometheus alerts | Collection of alerting rules (grep.to)
When three people walk together, one of them can surely teach me something; knowledge is meant to be shared with everyone. This article was written by the 东风微鸣 technical blog, EWhisper.cn.