PostgreSQL monitor - check_postgres usage - 2

4 minute read

背景

本例要介绍一下结合nagios使用check_postgres监控数据库的配置.

7. 配置被监控机

7.1 修改本地执行监控命令的用户和组, 因为在监控中会涉及数据库表空间文件系统等监控, 而这些目录可能是700权限的. 所以尽量使用postgres的启动用户. 如下 :

[root@db-172-16-3-39 ~]# vi /etc/xinetd.d/nrpe  
        user            = pg92  
        group           = pg92  

7.2 修改监控项配置, 新增监控项

[root@db-172-16-3-39 ~]# vi /opt/nagios/etc/nrpe.cfg  
command[check_arch]=opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=archive_ready -w 10 -c 20  
command[check_freeze]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=autovac_freeze -w 90% -c 95% --exclude=template1,template0  
command[check_conns]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=backends -w=120 -c=150  
command[check_bloat]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres,digoal_01 -u postgres --action=bloat -w="30% or 100M" -c="40% or 200M" --perflimit=5  
command[check_checkpoint]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=/pgdata1919 --action=checkpoint -w 300s -c 600s  
command[check_commitratio]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=commitratio -w 90% -c 80%  
command[check_dbsize]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=database_size -w 1T -c 2T  
command[check_triggers]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres,digoal_01 -u postgres --action=disabled_triggers -w 1 -c 2  
command[check_hitratio]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres,digoal_01 -u postgres --action=hitratio -w 90% -c 80%  
command[check_relsize]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -db postgres,digoal_01 -u postgres --action=relation_size -w 10G -c 20G --perflimit=5  
command[check_autovacuum]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal_01 --action=last_autovacuum --perflimit=3 -w '1 h'  
command[check_autoanalyze]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres,digoal_01 --action=last_autoanalyze --perflimit=3 -w '1 h'  
command[check_locks]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=locks -w 100 -c total=250:waiting=5:exclusive=20  
command[check_query_time]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=query_time -w '10s'  
command[check_seq1]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db digoal_01 --action=sequence -w 90%  
command[check_seq2]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=sequence -w 90%  
command[check_txnidle]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=txn_idle -w "1 for 100s" -c "3 for 100s"  
command[check_xlogs]=/opt/check_postgres-2.20.0/check_postgres.pl --output=nagios -H 127.0.0.1 -p 1919 -u postgres --db postgres --action=wal_files -w 100  

7.3 重启xinetd

[root@db-172-16-3-39 ~]# service xinetd restart  

8. 配置nagios服务端,

配置服务之前, 可以先使用nrpe远程调用以下远端的命令, 看看是否正常, 正常后再配置服务.

[root@db-172-16-3-33 db_servers]# /opt/nagios-3.5.0/libexec/check_nrpe -H 172.16.3.39 -c check_arch  
ERROR: Could not find a suitable psql executable  

这里报错, 找不到psql命令.

所以需要在被监控机上指定psql的路径. 首先要修改check_postgres.pl脚本, 允许使用环境变量.

[root@db-172-16-3-39 ~]# vi /opt/check_postgres-2.20.0/check_postgres.pl  
## If this is true, $opt{PSQL} and $opt{PGBINDIR} are disabled for security reasons  
our $NO_PSQL_OPTION = 0;  

然后修改nrpe.cfg, 指定环境变量. –PGBINDIR=/opt/pgsql92/bin

[root@db-172-16-3-39 ~]# vi /opt/nagios/etc/nrpe.cfg  
command[check_arch]=opt/check_postgres-2.20.0/check_postgres.pl --PGBINDIR=/opt/pgsql92/bin --output=nagios -H 127.0.0.1 -p 1919 -db postgres -u postgres --action=archive_ready -w 10 -c 20  

其他命令略, 请自行修改.

修改好后重启xinetd服务

再次通过nrpe调用, OK

[root@db-172-16-3-33 db_servers]# /opt/nagios-3.5.0/libexec/check_nrpe -H 172.16.3.39 -c check_arch  
POSTGRES_ARCHIVE_READY OK: DB "postgres" (host:127.0.0.1) (port=1919) WAL ".ready" files found: 0 | time=0.01s files=0;10;20  

本例参考前面讲解nagios配置的环境, 所以以下文件以及配置好了, 只需要添加监控服务即可.

[root@db-172-16-3-33 db_servers]# cd /opt/nagios-3.5.0/etc/db_servers  
[root@db-172-16-3-33 db_servers]# less hosts.cfg   
define host{  
  use linux-box ; Inherit default values from a template  
  host_name db_3_39 ; The name we're giving to this server  
  alias postgresql_3_39 ; A longer name for the server  
  address 172.16.3.39 ; IP address of the server  
}  

增加监控服务 :

同一主机的service_description不能重复.

[root@db-172-16-3-33 db_servers]# vi services.cfg  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_arch  
  check_command check_nrpe!check_arch  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_freeze  
  check_command check_nrpe!check_freeze  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_conns  
  check_command check_nrpe!check_conns  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_bloat  
  check_command check_nrpe!check_bloat  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_checkpoint  
  check_command check_nrpe!check_checkpoint  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_commitratio  
  check_command check_nrpe!check_commitratio  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_dbsize  
  check_command check_nrpe!check_dbsize  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_triggers  
  check_command check_nrpe!check_triggers  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_hitratio  
  check_command check_nrpe!check_hitratio  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_relsize  
  check_command check_nrpe!check_relsize  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_autovacuum  
  check_command check_nrpe!check_autovacuum  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_autoanalyze  
  check_command check_nrpe!check_autoanalyze  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_locks  
  check_command check_nrpe!check_locks  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_query_time  
  check_command check_nrpe!check_query_time  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_seq1  
  check_command check_nrpe!check_seq1  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_seq2  
  check_command check_nrpe!check_seq2  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_txnidle  
  check_command check_nrpe!check_txnidle  
}  
define service{  
  use generic-service  
  host_name db_3_39  
  service_description check_xlogs  
  check_command check_nrpe!check_xlogs  
}  

检查配置文件正确性, 确保无错误.

[root@db-172-16-3-33 db_servers]# /opt/nagios-3.5.0/bin/nagios -v /opt/nagios-3.5.0/etc/nagios.cfg   

重启nagios

[root@db-172-16-3-33 db_servers]# service nagios restart  
Running configuration check...done.  
Stopping nagios: done.  
Starting nagios: done.  

9. nagios web界面截图

pic

其他

1. check_postgres的过滤选项

--exclude  
--include  

先排除后包含. 所以如果同一个值在包含和排除中都存在, 那么最后还是包含的.

2. LANG变量设置.

最好设置成en_US.UTF8

3. export LC_TIME=C , 需要与数据库一致.

pg92@db-172-16-3-39-> psql -h 127.0.0.1 -U postgres  
psql (9.2beta1)  
Type "help" for help.  
  
postgres=# \l  
                             List of databases  
   Name    |   Owner   | Encoding | Collate | Ctype |   Access privileges     
-----------+-----------+----------+---------+-------+-----------------------  
 digoal    | postgres  | UTF8     | C       | C     |   
 digoal_01 | digoal_01 | UTF8     | C       | C     |   
 digoal_02 | digoal_02 | UTF8     | C       | C     |   
 postgres  | postgres  | UTF8     | C       | C     |   
 template0 | postgres  | UTF8     | C       | C     | =c/postgres          +  
           |           |          |         |       | postgres=CTc/postgres  
 template1 | postgres  | UTF8     | C       | C     | =c/postgres          +  
           |           |          |         |       | postgres=CTc/postgres  
(6 rows)  

一致的情况下, 时间比对正常.

pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=$PGDATA --action=checkpoint -w 10s -c 200000s  
POSTGRES_CHECKPOINT WARNING:  Last checkpoint was 1265 seconds ago | age=1265;10;200000  

如果环境变量与数据库不一致, check point会有问题.

pg92@db-172-16-3-39-> export LC_TIME=en_US.UTF8  
pg92@db-172-16-3-39-> /opt/check_postgres-2.20.0/check_postgres.pl --output=nagios --datadir=$PGDATA --action=checkpoint -w 10s -c 200000s  
POSTGRES_CHECKPOINT OK:  Last checkpoint was -49101 seconds ago | age=-49101;10;200000   

4. 修改LC_TIME后在本地调用正常, 但是在nagios服务端使用check_nrpe调用还是异常.

[root@db-172-16-3-33 db_servers]# /opt/nagios-3.5.0/libexec/check_nrpe -H 172.16.3.39 -c check_checkpoint  
POSTGRES_CHECKPOINT OK:  Last checkpoint was -38070 seconds ago | age=-38070;300;600   

即使将LC_TIME=C写在nrpe.cfg中也不能正常使用.

command[check_checkpoint]=LC_TIME=C /opt/check_postgres-2.20.0/check_postgres.pl --PGBINDIR=/opt/pgsql92/bin --output=nagios --datadir=/pgdata1919 --action=checkpoint -w 300s -c 600s  
  
[root@db-172-16-3-39 check_postgres-2.20.0]# service xinetd restart  
Stopping xinetd: [  OK  ]  
Starting xinetd: [  OK  ]  
[root@db-172-16-3-33 db_servers]# /opt/nagios-3.5.0/libexec/check_nrpe -H 172.16.3.39 -c check_checkpoint  
POSTGRES_CHECKPOINT OK:  Last checkpoint was -37864 seconds ago | age=-37864;300;600   

参考

1. http://bucardo.org/check_postgres/check_postgres.pl.html

2. http://blog.163.com/digoal@126/blog/static/16387704020135313354383/

3. http://blog.163.com/digoal@126/blog/static/16387704020135334157531/

4. http://blog.163.com/digoal@126/blog/static/1638770402013531131023/

5. http://search.cpan.org/

6. http://www.postgresql.org/docs/9.3/static/runtime-config-autovacuum.html

7. http://blog.163.com/digoal@126/blog/static/1638770402013548284410/

Flag Counter

digoal’s 大量PostgreSQL文章入口