自动化监控运维(五) Alertmanager 基于 email 告警配置

一、 安装 Alertmanager

1.1 基于二进制安装

  • 下载
    https://github.com/prometheus/alertmanager/releases

  • 解压
    tar -xzvf alertmanager-0.26.0.linux-amd64.tar.gz

  • 运行
    ./alertmanager --config.file="alertmanager.yml"

1.2 基于 docker 安装

docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

二、 配置 Alertmanager

2.1 配置 Alertmanager 告警方式

  • 编辑配置文件: alertmanager.yml

  • smtp_smarthost 字段需要注明端口

    global: 
    resolve_timeout: 5m
    smtp_smarthost: 'smtp.qq.com:25'
    smtp_from: '[email protected]'
    smtp_auth_username: '[email protected]' #
    smtp_auth_password: 'yourpasswd/authenticCode'
    route:
    group_by: ['alertname']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 1h
    receiver: 'email'
    receivers:
    - name: 'email'
    email_configs:
    - to: '[email protected]'
    inhibit_rules:
    - source_match:
    severity: 'critical'
    target_match:
    severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

2.2 配置 Prometheus

  • 编辑配置文件 prometheus.yml

    alerting:
    alertmanagers:
    - static_configs:
    - targets: ["192.168.1.201:9093"]

    rule_files:
    - "/usr/local/prometheus/rules/*"
  • /usr/local/prometheus/rules/* 用于存放告警规则

2.3 告警示例

vim /usr/local/prometheus/rules/first.rules.yml

groups:
- name: cpuAlertGroup
rules:
- alert: hostCPUUsageTooHigh
expr: (1 - sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) ) * 100 > 50
for: 30s
labels:
biz_type: cpu_usage
annotations:
summary: "Instance {{ $labels.instance }} CPU usgae high"
description: "{{ $labels.instance }} CPU usage above 50% (current : {{ $value }})"

三、 测试

  • 测试脚本
#!/usr/bin/python
# -*- coding: UTF-8 -*-

### 用法: python cpu_mem.py 4 1024
### 解释:
### 占满 4个核心
### 占用1024MB内存

import sys
import time
from multiprocessing import Process

def exec_func(bt):

while True:
for i in range(0, 9600000):
pass
time.sleep(bt)

if __name__ == "__main__":

if len(sys.argv) != 3:
print('Need one argument! ')
sys.exit()
cpu_logical_count = int(sys.argv[1])
cpu_sleep_time = 0.01
memory_used_mb = int(sys.argv[2])

try:
s = ' ' * (memory_used_mb * 1024 * 1024)
except MemoryError:
print("剩余内存不足,内存有溢出......")

try:
p = Process(target=exec_func, args=("bt",))
ps_list = []
for i in range(0, cpu_logical_count):
ps_list.append(Process(target=exec_func, args=(cpu_sleep_time,)))
for p in ps_list:
p.start()
for p in ps_list:
p.join()
except KeyboardInterrupt:
print("资源浪费结束!")

图
图