Prometheus Study Notes

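I. Overview

Prometheus is an open-source monitoring and alerting system built on a time series database (TSDB).

History:
SoundCloud is an online music-sharing platform, similar to what YouTube is for video. As the company moved further toward a microservice architecture, it ended up with hundreds to thousands of services, and its traditional monitoring stack of StatsD and Graphite showed serious limitations. In 2012 SoundCloud therefore started building a new monitoring system. The original author of Prometheus, Matt T. Proud, also joined SoundCloud in 2012. Before that he had worked at Google, and he drew inspiration from Google's cluster manager Borg and its monitoring system Borgmon when developing the open-source monitoring system Prometheus. Like many Google projects, it is written in Go.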

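II. Setup

1. Installing from the release tarball

Environment preparation

JDK 8 (optional: Prometheus itself is a standalone Go binary and does not need a JVM)
Create a working directory. This document uses /data/monitor_cloud/prometheus as the working directory.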

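Preparing the files

Download the Prometheus release package from the official site: https://prometheus.io/download/

Upload prometheus-2.14.0.linux-amd64.tar.gz to the monitor_cloud directory
and extract it.

Extract the archive
tar -zxvf prometheus-2.14.0.linux-amd64.tar.gz
Rename the extracted folder
mv  prometheus-2.14.0.linux-amd64  prometheus

/data/monitor_cloud/prometheus is now the Prometheus installation directory.

Create the log directory

Change into the Prometheus installation directory
cd /data/monitor_cloud/prometheus
Create the log folder
mkdir logs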

Create the start and stop shell scripts

1 Start script

touch promserver.sh

The start script is as follows:

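#!/bin/bash
# Do not name this variable PATH: that would overwrite the shell's command search path.
PROM_HOME=/data/monitor_cloud/prometheus
LOG=$PROM_HOME/logs
PIDFile=/var/run/prometheus.pid
$PROM_HOME/prometheus --log.level=info --web.enable-lifecycle --web.enable-admin-api --query.max-concurrency=20 --query.timeout=2m --storage.tsdb.path=$PROM_HOME/data --storage.tsdb.retention.time=180d --config.file=$PROM_HOME/prometheus.yml --web.listen-address=:9090 >> $LOG/prometheus.log  2>&1 & echo $! > $PIDFile

Notes:

Replace the value of PROM_HOME in the script with your Prometheus installation directory.
--web.listen-address sets the address and port to listen on; the default port is 9090.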

Make the script executable

chmod +x promserver.sh

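2 Hot-reload script (this is not a start script)

When the Prometheus configuration file changes, trigger a hot reload so that the configuration is re-read without a restart. The reload endpoint is only available when Prometheus was started with --web.enable-lifecycle.

touch promwarm_start.sh

Script contents: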

curl -XPOST http://localhost:9090/-/reload

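3 Stopping Prometheus

Avoid force-killing Prometheus (for example with kill -9). If it does not shut down cleanly, data that the TSDB is in the middle of flushing to disk may be corrupted, which can cause obscure bugs and even prevent Prometheus from starting again.

touch promserver_stop.sh

Script contents:

curl -X POST http://localhost:9090/-/quit

When it works, the script returns: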

Requesting termination... Goodbye!

The final directory layout looks like this:

.
├── console_libraries
├── consoles
├── data
├── LICENSE
├── logs
├── NOTICE
├── prometheus
├── prometheus.yml
├── promserver.sh
├── promserver_stop.sh
├── promtool
├── promwarm_start.sh
├── rules
└── tsdb

Verify that it is running

Open http://localhost:9090/

2. Deploying with Docker

Environment preparation

Install Docker and docker-compose.

docker-compose.yml

version: "3.5"
services:
  prometheus:
    image: prom/prometheus:v2.27.1
    restart: always
    container_name: prometheus
    networks:
      - proxy
    ports:
      - 9090:9090
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules

networks:
  proxy:
    external: true

prometheus.yml

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.


  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# InfluxDB configuration; the database must already exist in InfluxDB
remote_write:
  - url: "http://xxx.xxx.xxx.xxx:8086/api/v1/prom/write?db=prometheus&u=root&p=***"

remote_read:
  - url: "http://xxx.xxx.xxx.xxx:8086/api/v1/prom/read?db=prometheus&u=root&p=***"

rule_files:
  - "./rules/*.yml"

# Alerting configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['xxx.xxx.xxx.xxx:9093']

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'agent'
    # Basic auth (the username and password configured when starting node_exporter)
    basic_auth:
      username: xxxx
      password: ****

    static_configs:
      # Servers to monitor
      - targets: ['xxx.xxx.xxx.xxx:9100', 'xxx.xxx.xxx.xxx:9100']

Run

docker-compose up -d

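III. PromQL

PromQL (Prometheus Query Language) is the query DSL that Prometheus developed for its own data.

On the page http://localhost:9090/graph, enter a query like the one below and inspect the result:

e.g.: vm_disk_io_write_bytes{code="200"}
      metric name {label name="label value"}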

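Query result types

A PromQL query mainly produces one of 3 result types:

Instant vector: a set of time series with a single sample per series, e.g. vm_disk_io_write_bytes
Range vector: a set of time series with multiple samples per series, e.g. vm_disk_io_write_bytes[5m]
Scalar: a single number with no time series attached, e.g. scalar(count(vm_disk_io_write_bytes)); count(vm_disk_io_write_bytes) on its own returns a one-element instant vector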

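Query conditions

Prometheus stores time series, and each series is identified by a name plus a set of labels. The name can itself be written as a label: vm_disk_io_write_bytes is equivalent to {__name__="vm_disk_io_write_bytes"}.

A simple query is essentially a filter on labels, for example:

vm_disk_io_write_bytes{code="200"}  # series named vm_disk_io_write_bytes whose code is "200"

Matchers also support negation and regular expressions (regex matches are fully anchored), for example:

vm_disk_io_write_bytes{code!="200"}  # code is not "200"
vm_disk_io_write_bytes{code=~"2.."}  # code matches "2xx"
vm_disk_io_write_bytes{code!~"2.."}  # code does not match "2xx"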
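The four matcher types can be illustrated with a small Python sketch. This only models the semantics and is not how Prometheus implements selection: the `matches` helper and the sample series are hypothetical, and Python's `re` module stands in for the RE2 syntax that Prometheus uses.

```python
import re

def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # regex is fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True

# Hypothetical series, each described by its label set.
series = [
    {"__name__": "vm_disk_io_write_bytes", "code": "200"},
    {"__name__": "vm_disk_io_write_bytes", "code": "204"},
    {"__name__": "vm_disk_io_write_bytes", "code": "500"},
]
selected = [s["code"] for s in series if matches(s, [("code", "=~", "2..")])]
print(selected)  # ['200', '204']
```

Because the regex is anchored on both ends, "2.." matches "200" but not "2000".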

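Operators

PromQL supports the usual expression operators:

Arithmetic operators:

The supported arithmetic operators are +, -, *, /, %, ^. For example, vm_disk_io_write_bytes * 2 multiplies every sample of vm_disk_io_write_bytes by 2.

Comparison operators:

The supported comparison operators are ==, !=, >, <, >=, <=. For example, vm_disk_io_write_bytes >= 100 keeps only the samples of vm_disk_io_write_bytes that are greater than or equal to 100.

Logical operators:

The supported logical (set) operators are and, or, unless. For example, vm_disk_io_write_bytes == 5 or vm_disk_io_write_bytes == 2 returns the samples of vm_disk_io_write_bytes that equal 5 or 2.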
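The filtering behaviour of comparisons and the set behaviour of and, or and unless can be sketched in Python. This is a simplified model under stated assumptions: an instant vector is a dict keyed by label set (metric name left out), and the helper names and sample values are made up for illustration.

```python
def compare(vector, predicate):
    """Filter form of a comparison: keep only samples where predicate(value) holds."""
    return {labels: value for labels, value in vector.items() if predicate(value)}

def v_and(lhs, rhs):
    """Keep left-hand elements whose label set also appears on the right."""
    return {labels: value for labels, value in lhs.items() if labels in rhs}

def v_or(lhs, rhs):
    """All left-hand elements, plus right-hand elements with no match on the left."""
    result = dict(lhs)
    for labels, value in rhs.items():
        result.setdefault(labels, value)
    return result

def v_unless(lhs, rhs):
    """Keep left-hand elements whose label set does not appear on the right."""
    return {labels: value for labels, value in lhs.items() if labels not in rhs}

# Hypothetical samples, keyed by label set.
vector = {
    (("instance", "a"),): 5.0,
    (("instance", "b"),): 2.0,
    (("instance", "c"),): 7.0,
}
five_or_two = v_or(compare(vector, lambda v: v == 5), compare(vector, lambda v: v == 2))
print(sorted(five_or_two.values()))  # [2.0, 5.0]
```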

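Aggregation operators:

The supported aggregation operators are sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, quantile. For example, max(vm_disk_io_write_bytes) returns the largest value among the vm_disk_io_write_bytes samples.

Name	Description
sum	sum
min	minimum
max	maximum
avg	average
group	all values in the resulting vector are 1
stddev	population standard deviation over dimensions
stdvar	population standard variance over dimensions
count	number of elements
count_values	number of elements with the same value
bottomk	smallest k elements by sample value
topk	largest k elements by sample value
quantile	φ-quantile (0 ≤ φ ≤ 1) over dimensions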

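These operators can aggregate over all label dimensions, or keep distinct dimensions by including a without or by clause. The clause may appear either before or after the expression:

<aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)

or

<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]

label list is a list of unquoted labels that may include a trailing comma, so both (label1, label2) and (label1, label2,) are valid syntax.

without removes the listed labels from the result vector and keeps all other labels in the output. by does the opposite and drops every label not listed in the by clause, even if its value is identical across all elements of the vector.

parameter is only required for count_values, quantile, topk and bottomk.

count_values outputs one time series per unique sample value. Each series carries an additional label whose name is given by the aggregation parameter and whose value is the unique sample value. The value of each time series is the number of times that sample value occurred.

topk and bottomk differ from the other aggregators in that a subset of the input samples, including their original labels, is returned in the result vector. by and without are only used to bucket the input vector.

quantile calculates the φ-quantile, the value that ranks at number φ*N among the N metric values of the dimensions aggregated over. φ is given as the aggregation parameter. For example, quantile(0.5, ...) calculates the median and quantile(0.95, ...) the 95th percentile.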
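To make the φ-quantile concrete, here is a minimal Python sketch. It assumes linear interpolation between the two neighbouring ranks, which to my understanding is how Prometheus treats a rank that falls between two samples; the function name and inputs are illustrative only.

```python
import math

def quantile(phi, values):
    """phi-quantile of a list of sample values, interpolating between neighbouring ranks."""
    if not values:
        return math.nan
    if phi < 0:
        return -math.inf
    if phi > 1:
        return math.inf
    ordered = sorted(values)
    rank = phi * (len(ordered) - 1)           # fractional position in the sorted list
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower                     # how far the rank is toward the upper neighbour
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

print(quantile(0.5, [1, 2, 3, 4]))  # 2.5
```

With an even number of samples the median lands halfway between the two middle values, which is why the example yields 2.5 rather than one of the inputs.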

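Examples:
If the metric http_requests_total has time series that fan out by the application, instance and group labels, the total number of HTTP requests seen per application and group over all instances can be calculated with:

sum without (instance) (http_requests_total)

which is equivalent to:

 sum by (application, group) (http_requests_total)

If we are only interested in the total number of HTTP requests seen across all applications, we can simply write:

sum(http_requests_total)

To count the number of binaries running each build version, we can write:

count_values("version", build_version)

To get the 5 largest HTTP request counts across all instances, we can write:

topk(5, http_requests_total)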
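The equivalence of the by and without forms can be checked with a small Python sketch. The sample data and helper names are hypothetical; an instant vector is modelled as a list of (labels, value) pairs, and the metric name is dropped by aggregation as in PromQL.

```python
from collections import defaultdict

def sum_by(vector, by):
    """Group on the listed labels only and add up the values per group."""
    out = defaultdict(float)
    for labels, value in vector:
        group = tuple((name, labels.get(name, "")) for name in sorted(by))
        out[group] += value
    return dict(out)

def sum_without(vector, without):
    """Group on every label except the listed ones (and the metric name)."""
    drop = set(without) | {"__name__"}
    out = defaultdict(float)
    for labels, value in vector:
        group = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[group] += value
    return dict(out)

def topk(k, vector):
    """Return the k samples with the largest values, original labels intact."""
    return sorted(vector, key=lambda sample: sample[1], reverse=True)[:k]

def sample(application, group, instance, value):
    labels = {"__name__": "http_requests_total", "application": application,
              "group": group, "instance": instance}
    return (labels, value)

vector = [
    sample("api", "canary", "a", 10.0),
    sample("api", "canary", "b", 5.0),
    sample("api", "prod", "a", 20.0),
    sample("web", "prod", "a", 7.0),
]
by_result = sum_by(vector, ["application", "group"])
without_result = sum_without(vector, ["instance"])
print(by_result == without_result)  # True
```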

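Binary operator precedence

Note that, as in ordinary arithmetic, PromQL operators have precedence. From highest to lowest: (^) > (*, /, %) > (+, -) > (==, !=, <=, <, >=, >) > (and, unless) > (or). Operators on the same level are left-associative, except ^, which is right-associative.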

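Built-in functions

Prometheus ships with many built-in functions for querying and formatting data, for example floor and ceil, which round a floating-point result to an integer:

floor(avg(vm_disk_io_write_bytes{code="200"}))
ceil(avg(vm_disk_io_write_bytes{code="200"}))

Per-second average rate of increase of vm_disk_io_write_bytes over the last 5 minutes:

rate(vm_disk_io_write_bytes[5m])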
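The idea behind rate() on a counter can be sketched in Python. This is deliberately simplified: it handles counter resets but skips the extrapolation to the window boundaries that the real function performs, so its numbers will differ slightly from Prometheus. The function name and the samples are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            increase += value             # counter reset: it restarted from zero
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# The counter drops from 130 to 10 between the second and third sample.
samples = [(0, 100.0), (60, 130.0), (120, 10.0), (180, 40.0)]
print(simple_rate(samples))  # (30 + 10 + 30) / 180
```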

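abs()

abs(v instant-vector) returns the input vector with all sample values converted to their absolute value.

absent()

absent(v instant-vector) returns an empty vector if the vector passed to it has any elements. If the vector has no elements, it returns a 1-element vector with the value 1, without a metric name and with labels derived from the equality matchers of the argument.

This is useful for alerting when no time series exist for a given metric name and label combination. For example:

# The vector given here has samples
absent(http_requests_total{method="get"})  => no data
absent(sum(http_requests_total{method="get"}))  => no data

# The metric nonexistent does not exist, so a series without a metric name, with labels and with the value 1 is returned
absent(nonexistent{job="myjob"})  => {job="myjob"}  1
# The regex-matched instance label is not part of the returned labels
absent(nonexistent{job="myjob",instance=~".*"})  => {job="myjob"}  1

# sum returns a series without labels, and there is no sample data
absent(sum(nonexistent{job="myjob"}))  => {}  1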

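ceil()

ceil(v instant-vector) rounds the sample values of all elements in v up to the nearest integer. For example:

node_load5{instance="192.168.1.75:9100"} # result: 2.79
ceil(node_load5{instance="192.168.1.75:9100"}) # result: 3

changes()

changes(v range-vector) takes a range vector and, for each series, returns the number of times its sample value changed within the range, as an instant vector. For example:

# A series whose value never changes within the window yields 0; a result of 1 means the value changed once
changes(node_load5{instance="192.168.1.75:9100"}[1m]) # result: 1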
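What changes() counts can be shown with a short Python sketch over the raw sample values of a single series. The function and the sample lists are illustrative, not taken from a real server.

```python
def changes(values):
    """Number of times consecutive sample values differ within the window."""
    return sum(1 for previous, current in zip(values, values[1:]) if current != previous)

print(changes([2.79, 2.79, 2.79]))  # 0: the value never changed
print(changes([2.79, 2.81, 2.81]))  # 1: one transition
```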

clamp_max()

clamp_max(v instant-vector, max scalar) takes an instant vector and an upper limit. Sample values greater than max are set to max; all others are left unchanged. For example:

node_load5{instance="192.168.1.75:9100"} # result: 2.79
clamp_max(node_load5{instance="192.168.1.75:9100"}, 2) # result: 2

clamp_min()

clamp_min(v instant-vector, min scalar) takes an instant vector and a lower limit. Sample values smaller than min are set to min; all others are left unchanged. For example:

node_load5{instance="192.168.1.75:9100"} # result: 2.79
clamp_min(node_load5{instance="192.168.1.75:9100"}, 3) # result: 3
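Both clamp functions reduce to an element-wise min or max, as this small Python sketch shows. The helpers work on plain lists of sample values and are only an illustration of the arithmetic.

```python
def clamp_max(values, max_value):
    """Cap every sample value at max_value."""
    return [min(value, max_value) for value in values]

def clamp_min(values, min_value):
    """Raise every sample value to at least min_value."""
    return [max(value, min_value) for value in values]

print(clamp_max([2.79], 2))  # [2]
print(clamp_min([2.79], 3))  # [3]
```

Chaining the two, for example clamp_min(clamp_max(values, 10), 0), keeps every value inside a fixed range.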

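day_of_month()

day_of_month(v=vector(time()) instant-vector) returns the day of the month for each of the given times in UTC. Returned values range from 1 to 31.

day_of_week()

day_of_week(v=vector(time()) instant-vector) returns the day of the week for each of the given times in UTC. Returned values range from 0 to 6, where 0 means Sunday.

days_in_month()

days_in_month(v=vector(time()) instant-vector) returns the number of days in the month for each of the given times in UTC. Returned values range from 28 to 31.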
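The three date functions above can be reproduced from a Unix timestamp with Python's standard library, which makes the UTC and Sunday-is-0 conventions easy to check. The `date_parts` helper is hypothetical; the timestamp used is 2024-02-29 00:00:00 UTC.

```python
import calendar
from datetime import datetime, timezone

def date_parts(timestamp):
    """Compute day_of_month, day_of_week and days_in_month for a Unix timestamp, in UTC."""
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        "day_of_month": t.day,
        # Python counts Monday as 0; PromQL counts Sunday as 0.
        "day_of_week": (t.weekday() + 1) % 7,
        "days_in_month": calendar.monthrange(t.year, t.month)[1],
    }

print(date_parts(1709164800))
# {'day_of_month': 29, 'day_of_week': 4, 'days_in_month': 29}
```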

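delta()

delta(v range-vector) takes a range vector and returns an instant vector. It calculates the difference between the first and last value of each time series element in v. Because the difference is extrapolated to cover the full time range specified, the result may be a non-integer even if all sample values are integers. The following example returns the difference in CPU temperature between now and two hours ago:

delta(cpu_temp_celsius{host="zeus"}[2h])

This function should generally only be used with gauges.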

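exp()

exp(v instant-vector) takes an instant vector and returns e raised to the power of each sample value N. When N is large enough the result is +Inf. Special cases:

Exp(+Inf) = +Inf
Exp(NaN) = NaN

floor()

floor(v instant-vector) is the opposite of ceil(): it rounds the sample values of all elements in v down to the nearest integer.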

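hour()

hour(v=vector(time()) instant-vector) returns the hour of the day for each of the given times in UTC. Returned values range from 0 to 23.

idelta()

idelta(v range-vector) takes a range vector and returns an instant vector. It calculates the difference between the last two samples. This function should generally only be used with gauges.

minute()

minute(v=vector(time()) instant-vector) returns the minute of the hour for each of the given times in UTC. Returned values range from 0 to 59.

month()

month(v=vector(time()) instant-vector) returns the month of the year for each of the given times in UTC. Returned values range from 1 to 12, where 1 means January.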

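sort()

sort(v instant-vector) returns the vector elements sorted by their sample values in ascending order, i.e. key: value = metric: sample value [ascending].

sort_desc()

sort_desc(v instant-vector) returns the vector elements sorted by their sample values in descending order, i.e. key: value = metric: sample value [descending].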

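sqrt()

sqrt(v instant-vector) calculates the square root of all elements in v.

time()

time() returns the number of seconds since 1970-01-01 UTC. Note that it does not return the actual current time, but the time at which the expression is evaluated.

timestamp()

timestamp(v instant-vector) returns the timestamp of each sample in v as the number of seconds since 1970-01-01 UTC. This function was introduced in Prometheus 2.0.

vector()

vector(s scalar) returns the scalar s as a vector with no labels, i.e. key: value = {}, s.

year()

year(v=vector(time()) instant-vector) returns the year for each of the given times in UTC.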

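_over_time()

The following functions take a range vector, aggregate each series over time, and return an instant vector:

avg_over_time(range-vector) : the average value of all points in the range, per series.

min_over_time(range-vector) : the minimum value of all points in the range, per series.

max_over_time(range-vector) : the maximum value of all points in the range, per series.

sum_over_time(range-vector) : the sum of all values in the range, per series.

count_over_time(range-vector) : the number of samples in the range, per series.

quantile_over_time(scalar, range-vector) : the φ-quantile (0 ≤ φ ≤ 1) of the values in the range, per series.

stddev_over_time(range-vector) : the population standard deviation of the values in the range, per series.

stdvar_over_time(range-vector) : the population standard variance of the values in the range, per series.

Note: all values in the range carry the same weight in the aggregation, even if they are not evenly spaced within the interval.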
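The arithmetic behind the _over_time family can be sketched in Python for the samples of a single series. The `over_time` helper and its input are illustrative; the point to notice is that timestamps play no part, so every sample carries equal weight, and that the deviation and variance are the population versions.

```python
import statistics

def over_time(values):
    """Aggregate the sample values of one series within a range window."""
    return {
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": len(values),
        "stddev": statistics.pstdev(values),     # population standard deviation
        "stdvar": statistics.pvariance(values),  # population variance
    }

result = over_time([1.0, 2.0, 3.0, 4.0])
print(result["avg"], result["stdvar"])  # 2.5 1.25
```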

For more, see the Prometheus Chinese documentation.
