1. Create the user and group: useradd -m -s /bin/false prometheus
2. Create the directories:
mkdir /etc/prometheus
mkdir /var/lib/prometheus
3. Set ownership: chown prometheus /var/lib/prometheus/
4. Download the release archive:
wget https://github.com/prometheus/prometheus/releases/download/v2.23.0/prometheus-2.23.0.linux-amd64.tar.gz
5. Extract it: tar zxvf prometheus-2.23.0.linux-amd64.tar.gz
6. Enter the extracted directory: cd prometheus-2.23.0.linux-amd64
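7. Copy the binaries into place, plus the console directories that the --web.console flags in the service file point to:
cp prometheus /usr/local/bin
cp promtool /usr/local/bin
cp -r consoles console_libraries /etc/prometheus
8. Edit the configuration: vim /etc/prometheus/prometheus.yml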
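global:
  scrape_interval: 15s      # scrape targets every 15 seconds (default: 1 minute)
  evaluation_interval: 15s  # evaluate rules every 15 seconds (default: 1 minute)
  scrape_timeout: 15s       # per-scrape timeout (default: 10s); must not exceed scrape_interval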
9. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9090 -j ACCEPT
10. Save the rules: service iptables save
11. Restart the firewall: /bin/systemctl restart iptables.service
12. Create the service file: vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
13. Reload systemd: systemctl daemon-reload
14. Start the service and enable it at boot: systemctl start prometheus && systemctl enable prometheus
15. Check the status: systemctl status prometheus
16. Open in a browser: http://ip:9090
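Step 8 only shows the global section of the configuration. As a minimal sketch, the block below writes a complete starter file into a scratch directory so it can be inspected before it is copied to /etc/prometheus/prometheus.yml. The scratch directory and the 'prometheus' job that scrapes localhost:9090 are assumptions added for illustration, and promtool check config only runs when promtool is on the PATH.

```shell
#!/bin/sh
# Write a starter prometheus.yml into a scratch directory (assumed location).
workdir=$(mktemp -d)
cfg="$workdir/prometheus.yml"

cat > "$cfg" <<'EOF'
global:
  scrape_interval: 15s      # scrape targets every 15 seconds
  evaluation_interval: 15s  # evaluate rules every 15 seconds
  scrape_timeout: 15s       # must not exceed scrape_interval

scrape_configs:
  # Assumed job: Prometheus scraping its own metrics endpoint.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
EOF

# Validate the file when promtool is installed; otherwise just report the path.
if command -v promtool >/dev/null 2>&1; then
  promtool check config "$cfg"
else
  echo "promtool not found; wrote $cfg without validation"
fi
```

If the check passes, the file can be copied over /etc/prometheus/prometheus.yml and the service restarted.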
Install node_exporter (the program that collects host metrics):
1. Create the user: useradd -m -s /bin/false node_exporter
2. Download the release archive:
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
3. Extract it: tar zxvf node_exporter-1.0.1.linux-amd64.tar.gz
4. Copy the binary into place: cp node_exporter-1.0.1.linux-amd64/node_exporter /usr/local/bin
5. Set ownership: chown node_exporter:node_exporter /usr/local/bin/node_exporter
6. Create the service file: vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
7. Reload systemd: systemctl daemon-reload
8. Start the service and enable it at boot: systemctl start node_exporter && systemctl enable node_exporter
9. Alternatively, do both in a single command (equivalent to step 8): systemctl enable --now node_exporter.service
10. Check the status: systemctl status node_exporter
11. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT
12. Save the rules: service iptables save
13. Restart the firewall: /bin/systemctl restart iptables.service
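14. Add the job to the Prometheus configuration: vim /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'node_exporter'              # job name
    static_configs:
      - targets: ['localhost:9090']        # Prometheus's own metrics endpoint
      - targets: ['192.168.6.160:9100']    # node_exporter IP and port
15. Restart the service: systemctl restart prometheus
16. Test in a browser: http://ip:9100/metrics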
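The /metrics page returns plain text in the Prometheus exposition format, one sample per line. As a rough sketch of how a single value could be pulled out on the command line, the hypothetical helper metric_value below filters that text with awk. The sample input is made up for illustration and is not real exporter output; the helper also assumes lines carry no trailing timestamp, which is how node_exporter normally emits them.

```shell
#!/bin/sh
# Print the value of every sample whose metric name equals $1.
# Reads Prometheus text exposition format on stdin.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)          # strip the label set, if any
      if (metric == name) print $NF     # last field is the sample value
    }'
}

# Made-up sample resembling node_exporter output (not real data).
sample='# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 8.0e+09
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10'

printf '%s\n' "$sample" | metric_value node_load1
```

Against a live exporter the same helper would be fed from curl, for example: curl -s http://ip:9100/metrics | metric_value node_load1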
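Install mysql_exporter (database monitoring):
Installation from the release archive:
1. Download the release archive:
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
2. Extract it: tar zxvf mysqld_exporter-0.12.1.linux-amd64.tar.gz
3. Move and rename the directory: mv mysqld_exporter-0.12.1.linux-amd64 /usr/local/mysql_exporter
4. Create the credentials file: vim /usr/local/mysql_exporter/.my.cnf
[client]
user=ACCOUNT
password=PASSWORD
host=IP
port=PORT
Note: the database must already have an account that has been granted the privileges the exporter needs.
5. Enter the directory: cd /usr/local/mysql_exporter/
6. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9104 -j ACCEPT
7. Save the rules: service iptables save
8. Restart the firewall: /bin/systemctl restart iptables.service
9. Start the exporter in the background: ./mysqld_exporter --config.my-cnf=.my.cnf &
10. Add the job to the Prometheus configuration (append it to scrape_configs): vim /etc/prometheus/prometheus.yml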
  - job_name: '103.138.75.156'
    static_configs:
      - targets: ['localhost:9090']
      - targets: ['103.138.75.156:9104']
11. Restart the service: systemctl restart prometheus
12. Test in a browser: http://ip:9104/metrics
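The note in step 4 requires a database account with suitable privileges but does not show how to create one. The sketch below prints the statements recommended in the mysqld_exporter README (PROCESS, REPLICATION CLIENT and SELECT). The helper name exporter_grants, the account name exporter, the host localhost and the password CHANGE_ME are placeholders chosen for illustration. It is a dry run: nothing is sent to MySQL.

```shell
#!/bin/sh
# Print SQL that creates a low-privilege account for mysqld_exporter.
# Arguments: account name, client host, password (all hypothetical values).
exporter_grants() {
  user=$1 host=$2 pass=$3
  cat <<EOF
CREATE USER '$user'@'$host' IDENTIFIED BY '$pass' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO '$user'@'$host';
EOF
}

# Dry run: only prints the statements.
exporter_grants exporter localhost 'CHANGE_ME'
```

Once reviewed, the output could be piped into a MySQL client running as an administrative user, and the same account name and password then go into .my.cnf.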
Docker installation:
1. Create a network: docker network create my-mysql-network
2. Create the mysql_exporter container:
docker run -d -p 9104:9104 --network my-mysql-network --restart="always" --name test -e DATA_SOURCE_NAME="test:123456@(103.138.75.103:33007)/" prom/mysqld-exporter
Database account: test
Password: 123456
IP: 103.138.75.103
Port: 33007
3. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9104 -j ACCEPT
4. Save the rules: service iptables save
5. Restart the firewall: /bin/systemctl restart iptables.service
6. Restart Docker: systemctl restart docker
7. Start the container: docker start test
8. Add the job to the Prometheus configuration (append it to scrape_configs): vim /etc/prometheus/prometheus.yml
  - job_name: '103.138.75.156'
    static_configs:
      - targets: ['localhost:9090']
      - targets: ['103.138.75.156:9104']
9. Restart the service: systemctl restart prometheus
10. Test in a browser: http://ip:9104/metrics
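The DATA_SOURCE_NAME value in step 2 packs the account, password, IP and port into one string of the form user:password@(host:port)/. As a small sketch, the hypothetical helper mysqld_dsn assembles it from the four values listed above, and the final echo prints the resulting docker run command without executing it.

```shell
#!/bin/sh
# Build the DATA_SOURCE_NAME string expected by mysqld_exporter.
# Arguments: user, password, host, port.
mysqld_dsn() {
  printf '%s:%s@(%s:%s)/\n' "$1" "$2" "$3" "$4"
}

# Values taken from the steps above.
dsn=$(mysqld_dsn test 123456 103.138.75.103 33007)

# Dry run: show the command instead of starting a container.
echo "docker run -d -p 9104:9104 --network my-mysql-network --restart=always --name test -e DATA_SOURCE_NAME=\"$dsn\" prom/mysqld-exporter"
```

Keeping the four values separate makes it easier to point a second container at another database by changing only the arguments.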
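Install blackbox-exporter (network monitoring; here it is installed with Docker and used for ping checks):
1. Pull the image: docker pull prom/blackbox-exporter
2. Create the container: docker run -d -p 9115:9115 --name blackbox-exporter prom/blackbox-exporter
3. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 9115 -j ACCEPT
4. Save the rules: service iptables save
5. Restart the firewall: /bin/systemctl restart iptables.service
6. Restart Docker: systemctl restart docker
7. Start the container: docker start blackbox-exporter
8. Add the job to the Prometheus configuration (append it to scrape_configs): vim /etc/prometheus/prometheus.yml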
  - job_name: 'blackbox_ping'     # job name
    scrape_interval: 1s           # probe interval
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 103.138.75.156        # address to ping
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 103.138.75.53:9115   # IP and port of the blackbox exporter (the probe source)
9. Restart the service: systemctl restart prometheus
10. Test in a browser: http://103.138.75.53:9115/probe?target=103.138.75.156&module=icmp
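The relabel_configs above make Prometheus send each scrape to the blackbox exporter and pass the original target as a URL parameter, which is the same request as the test URL in step 10. The sketch below rebuilds that URL from its three parts. The helper name probe_url is hypothetical.

```shell
#!/bin/sh
# Build the probe URL Prometheus ends up requesting after relabeling.
# Arguments: exporter host:port, probe target, module name.
probe_url() {
  exporter=$1 target=$2 module=$3
  printf 'http://%s/probe?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# Values taken from the configuration above.
probe_url 103.138.75.53:9115 103.138.75.156 icmp
```

On a host that can reach the exporter, the result could be checked with: curl -s "$(probe_url 103.138.75.53:9115 103.138.75.156 icmp)" | grep probe_success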
Install Grafana:
1. Download the package: wget https://dl.grafana.com/oss/release/grafana-7.2.1-1.x86_64.rpm
2. Install it: yum install grafana-7.2.1-1.x86_64.rpm -y
3. Start the service and enable it at boot: systemctl start grafana-server && systemctl enable grafana-server
4. Open the port in the firewall: iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 3000 -j ACCEPT
5. Save the rules: service iptables save
6. Restart the firewall: /bin/systemctl restart iptables.service
7. Open in a browser: http://ip:3000
8. Default username and password: admin / admin
Dashboard templates can be downloaded from: https://grafana.com/grafana/dashboards
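The same firewall steps recur in every section above, once per port (9090, 9100, 9104, 9115 and 3000). As a sketch, the hypothetical helper open_port_cmds prints the iptables command for one port so that all rules can be reviewed together. It is a dry run and changes nothing on the system.

```shell
#!/bin/sh
# Print (do not run) the iptables command used in this guide for one TCP port.
open_port_cmds() {
  port=$1
  echo "iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport $port -j ACCEPT"
}

# Ports used above: Prometheus, node_exporter, mysqld_exporter, blackbox exporter, Grafana.
for p in 9090 9100 9104 9115 3000; do
  open_port_cmds "$p"
done

# Saving and restarting only needs to happen once, after all rules are added.
echo "service iptables save"
echo "/bin/systemctl restart iptables.service"
```

Only the ports of the components actually installed on a given host need to be opened there.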
Install the pie chart panel:
1. Download it: git clone https://github.com/grafana/piechart-panel.git --branch release-1.3.8
2. Edit the file: vim /etc/grafana/grafana.ini
[plugin.piechart]
path = /home/your/clone/dir/piechart-panel   ## path to the cloned directory
3. Restart Grafana: service grafana-server restart
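Add dashboard templates (provided the exporters above are already configured, the templates work as soon as they are imported):
Host dashboard: https://grafana.com/grafana/dashboards/8307
Database dashboard: https://grafana.com/grafana/dashboards/7362
Database replication (master/slave) dashboard: https://grafana.com/grafana/dashboards/7371
Ping dashboard: https://grafana.com/grafana/dashboards/12275
1. Add a dashboard:
2. Optionally create a folder:
3. Import the template: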
Alerting:
1. Download the DingTalk webhook plugin:
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v0.3.0/prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz
2. Extract it: tar -zxf prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz -C /opt/
3. Rename the directory:
mv /opt/prometheus-webhook-dingtalk-0.3.0.linux-amd64 /opt/prometheus-webhook-dingtalk
4. Create a DingTalk robot:
Note: if other words should also trigger the robot, add them as keywords here as well.
5. Create a service file for easy startup: vim /etc/systemd/system/prometheus-webhook-dingtalk.service
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target
[Service]
Restart=on-failure
ExecStart=/opt/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --ding.profile=ops_dingding=YOUR_DINGTALK_ROBOT_WEBHOOK_URL
[Install]
WantedBy=multi-user.target
6. Reload systemd: systemctl daemon-reload
7. Start the service and enable it at boot:
systemctl start prometheus-webhook-dingtalk && systemctl enable prometheus-webhook-dingtalk
8. Check that it is listening: ss -tnl | grep 8060
9. Test the robot directly (the message text must contain one of the robot's keywords):
curl -H "Content-Type: application/json" -d '{"msgtype":"text","text":{"content":"告警"}}' YOUR_DINGTALK_ROBOT_WEBHOOK_URL
10. Download Alertmanager:
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
11. Extract it: tar xf alertmanager-0.19.0.linux-amd64.tar.gz
12. Move and rename the directory: mv alertmanager-0.19.0.linux-amd64 /opt/alertmanager
13. Edit the file: vim /opt/alertmanager/alertmanager.yml
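global:
  resolve_timeout: 1m          # an alert with no end time is treated as resolved after 1 minute without updates
route:
  group_by: ['alertname']      # label used to group alerts
  group_wait: 10s              # wait 10s after the first alert so that alerts in the same group are sent together
  group_interval: 10s          # wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 1h          # interval before a still-firing alert is sent again
  receiver: 'warning'          # default receiver, usually named after the robot
  routes:
    - receiver: 'warning'      # receiver for the alerts matched below
      group_wait: 10s
      match_re:
        alertname: 内存使用率|CPU使用率|磁盘使用率   # alert names as defined in the rule files (the alert: field)
receivers:
  - name: 'warning'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops_dingding/send'   # ops_dingding is the profile name from the plugin service file
        send_resolved: true    # also notify when an alert is resolved
# Inhibition: while a 'critical' alert is firing, 'warning' alerts are muted,
# provided both alerts carry the same values for the labels listed in equal.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']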
14. Enter the directory: cd /opt/alertmanager/
15. Start Alertmanager in the background:
./alertmanager --config.file=alertmanager.yml --cluster.advertise-address=0.0.0.0:9093 &
16. Check that it is listening: netstat -anput | grep 9093
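17. Create the rules directory next to the main configuration file: mkdir /etc/prometheus/rules
18. Define a rule: vim /etc/prometheus/rules/cpu_usage.yml
groups:
  - name: CPU alert rules        # group name
    rules:
      - alert: CPU使用率          # "CPU usage"; must match the name used in alertmanager.yml
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 80
        for: 1m                  # the condition must hold for 1 minute before the alert fires
        labels:
          user: prometheus       # user label; the root user sometimes causes errors
          severity: warning
        annotations:
          description: "Server CPU usage is above 80% (current: {{ $value }}%)"
19. Edit the Prometheus configuration: vim /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 103.138.75.3:9093   # IP of the Alertmanager host and the Alertmanager port
rule_files:
  - "rules/*.yml"                 # relative paths are resolved from the directory of the main configuration file
20. Restart to apply: systemctl restart prometheus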
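Before waiting for a real rule to fire, the path from Alertmanager to DingTalk can be exercised with a hand-made alert. As a sketch, the hypothetical helper test_alert_json prints a JSON body in the shape accepted by Alertmanager's POST /api/v2/alerts endpoint. The alert name, instance and description used here are made-up test values.

```shell
#!/bin/sh
# Print a JSON body for Alertmanager's POST /api/v2/alerts endpoint.
# Arguments: alertname, instance, severity (hypothetical test values).
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},' "$1" "$2" "$3"
  printf '"annotations":{"description":"manual test alert"}}]\n'
}

# Dry run: only prints the payload.
test_alert_json TestAlert localhost:9100 warning
```

To actually send it, the output could be piped to curl on the Alertmanager host: test_alert_json TestAlert localhost:9100 warning | curl -s -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts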
Alert rule templates: https://awesome-prometheus-alerts.grep.to/rules#prometheus-self-monitoring
Alert when CPU usage exceeds 80%:
groups:
  - name: CPU alert rules
    rules:
      - alert: CPU使用率          # "CPU usage"
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Server CPU usage is above 80% (current: {{ $value }}%)"
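The CPU expression works on the idle counter: irate takes the last two samples in the window, divides the growth in idle seconds by the elapsed time, and the result is subtracted from 100. The sketch below repeats that arithmetic for a single CPU with made-up sample values. The helper name cpu_usage is hypothetical, and the rule itself additionally averages over all cores of an instance.

```shell
#!/bin/sh
# cpu_usage IDLE_T0 IDLE_T1 SECONDS
# Busy percentage for one CPU between two samples of
# node_cpu_seconds_total{mode="idle"} taken SECONDS apart.
cpu_usage() {
  awk -v a="$1" -v b="$2" -v dt="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((b - a) / dt) * 100 }'
}

# Idle time grew by 6 s over a 60 s interval, so the CPU was 90% busy.
cpu_usage 1000 1006 60
```

With these values the result is above the 80% threshold, so the rule would start counting toward its 1m "for" period.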
Alert when memory usage exceeds 75%:
groups:
  - name: memory alert rules
    rules:
      - alert: 内存使用率          # "memory usage"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 75
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Memory usage is above 75% (current: {{ $value }}%)"
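The memory expression counts free, buffer and cache memory as available and reports everything else as used. To cross-check the rule against a host, the same formula can be applied to /proc/meminfo. This is a sketch: the helper name mem_usage is hypothetical and the sample figures are made up.

```shell
#!/bin/sh
# Read /proc/meminfo-style text on stdin and print used memory in percent,
# using the same formula as the rule: total - (free + buffers + cached).
mem_usage() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END { printf "%.1f\n", (total - (free + buffers + cached)) / total * 100 }'
}

# Made-up sample values in kB (not real data).
printf 'MemTotal: 8000 kB\nMemFree: 1000 kB\nBuffers: 500 kB\nCached: 500 kB\n' | mem_usage
```

On a Linux host the live figure would come from: mem_usage < /proc/meminfo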
Alert when disk usage exceeds 90%:
groups:
  - name: disk alert rules
    rules:
      - alert: 磁盘使用率          # "disk usage"
        # less than 10% of the filesystem available means more than 90% used
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Disk partition usage is above 90%"
Alert when ping latency exceeds 300 ms:
groups:
  - name: ping alert rules
    rules:
      - alert: ping告警            # "ping alert"
        expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 0.3
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: Blackbox probe slow ping (instance {{ $labels.instance }})
          description: "Blackbox ping took more than 300ms\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when a host goes down:
groups:
  - name: host alert rules
    rules:
      - alert: 主机失去联系        # "host unreachable"
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Host {{ $labels.instance }} has been unreachable for more than 1 minute"
Database replication (master/slave) alerts:
First rule (IO thread):
groups:
  - name: replication alert rules
    rules:
      - alert: 主从告警            # "replication alert"
        expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
        for: 0m
        labels:
          user: prometheus
          severity: critical
        annotations:
          summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
          description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Second rule (SQL thread):
groups:
  - name: replication alert rules
    rules:
      - alert: 主从告警            # "replication alert"
        expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
        for: 0m
        labels:
          user: prometheus
          severity: critical
        annotations:
          summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})
          description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when the database is down:
groups:
  - name: database alert rules
    rules:
      - alert: 数据库失去联系      # "database unreachable"
        expr: mysql_up == 0
        for: 0m
        labels:
          user: prometheus
          severity: critical
        annotations:
          summary: MySQL down (instance {{ $labels.instance }})
          description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when database connections exceed 70% of the limit:
groups:
  - name: connection alert rules
    rules:
      - alert: 连接告警            # "connection alert"
        # threads_connected is a gauge, so its 1m peak is compared with max_connections
        expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[1m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 70
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: MySQL too many connections (> 70%) (instance {{ $labels.instance }})
          description: "More than 70% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when running threads exceed 60% of the connection limit:
groups:
  - name: high running threads alert rules
    rules:
      - alert: 高线程运行告警      # "high running threads alert"
        # threads_running is a gauge, so its 1m peak is compared with max_connections
        expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[1m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: MySQL high threads running (instance {{ $labels.instance }})
          description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when replication lag exceeds 30 s:
groups:
  - name: replication lag alert rules
    rules:
      - alert: 数据库复制延迟告警  # "replication lag alert"
        expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 30
        for: 1m
        labels:
          user: prometheus
          severity: critical
        annotations:
          summary: MySQL Slave replication lag (instance {{ $labels.instance }})
          description: "MySQL replication lag on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert on slow queries:
groups:
  - name: slow query alert rules
    rules:
      - alert: 慢查询告警          # "slow query alert"
        expr: increase(mysql_global_status_slow_queries[1m]) > 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: MySQL slow queries (instance {{ $labels.instance }})
          description: "MySQL server mysql has some new slow query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Alert when InnoDB log waits exceed 10 per second:
groups:
  - name: InnoDB log wait alert rules
    rules:
      - alert: 数据库日志写入等待告警   # "database log write wait alert"
        expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
        for: 0m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: MySQL InnoDB log waits (instance {{ $labels.instance }})
          description: "MySQL innodb log writes stalling\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
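Each template above is meant to be saved as its own .yml file in the rules directory. An indentation slip in any one of them stops Prometheus from loading its configuration, so it may help to validate the files before restarting. The sketch below loops over a directory and runs promtool check rules on every file. The helper name check_rules and the RULES_DIR variable are hypothetical, and the default path is an assumption: pass whichever directory rule_files points to.

```shell
#!/bin/sh
# Validate every rule file in a directory with promtool, if it is installed.
check_rules() {
  dir=$1
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping validation of $dir"
    return 0
  fi
  status=0
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue              # directory has no .yml files
    promtool check rules "$f" || status=1
  done
  return "$status"
}

# Assumed default location; override with RULES_DIR=/path/to/rules.
check_rules "${RULES_DIR:-/etc/prometheus/rules}"
```

A non-zero exit status means at least one file failed validation, so the restart can be skipped until it is fixed.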



