PostgreSQL nagios monitor script (archive, vacuum, age, conn, rollback, standby, lock, xact, seq, index…)

2 minute read

背景

bucardo提供的check_postgres脚本过于庞大, 我把其中一些比较常用的监控项摘取出来, 加上一些经验值做成了一个shell的nagios监控脚本, 包含了大部分的常用监控项.

1. 是否打开归档

2. 是否打开vacuum

3. 数据库年龄

4. 连接数

5. 提交和回滚的比例(问题修复后,手工清除统计信息pg_stat_reset(), 连接到对应的库执行)

6. standby延迟

7. 锁等待

8. 长事务/空闲事务

9. prepared事务

10. 序列剩余量(每个库查询)

11. 未使用的索引(每个库查询)

以PostgreSQL 9.3为例, 脚本如下 :

vi nagios.sh   
#!/bin/bash  
  
# exp1 : $prog arch  
# exp2 : $prog idx $dbname  
# please use database superuser  
  
export LANG=en_US.utf8  
export PGPORT=5432  
export PGDATA=/data01/pgdata/pg_root  
export PGHOME=/opt/pgsql  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export PGUSER=postgres  
export PGHOST=$PGDATA  
export PGDATABASE=postgres  
  
  
# monitor parameter  
# cluster monitor  
arch="select case setting when 'on' then 'Normal ' else 'Warning 'end||name||','||setting from pg_settings where name='archive_mode';"  
  
autovacuum="select case setting when 'on' then 'Normal 'else 'Warning ' end||name||','||setting from pg_settings where name='autovacuum';"  
  
age="select case when age>600000000 then 'Warning ' else 'Normal ' end||datname||' age '||age from (select datname,age(datfrozenxid) as age from pg_database) t order by age desc;"  
  
conn="select case when t2.setting-t1.cnt < 200 then 'Warning ' else 'Normal ' end||'max '||t2.setting||' now '||t1.cnt from (select count(*) as cnt from pg_stat_activity) t1, (select setting::int8 from pg_settings where name = 'max_connections') t2;"  
  
rollback="select case when xact_rollback/xact_commit::numeric <0.1 then 'Normal ' else 'Warning  ' end||datname||' c '||xact_commit||' r '||xact_rollback  from pg_stat_database where xact_commit>100 order by 1 desc;"  
  
standby="select case when abs>10240000 then 'Warning ' else 'Normal ' end||'ins '||ins||' sent '||sent||' diff '||abs from (select pg_current_xlog_insert_location() as ins,sent_location as sent,abs(pg_xlog_location_diff(pg_current_xlog_insert_location(),sent_location)) as abs from pg_stat_replication) t;"  
  
lock="select 'Warning '||client_addr||' '||usename||' '||datname||' '||query_start||' '||now()||' '||query from pg_stat_activity where waiting and now()-query_start > interval '1 min';"  
  
xact="select 'Warning '||client_addr||' '||usename||' '||datname||' '||xact_start||' '||now()||' '||query from pg_stat_activity where now()-xact_start > interval '1 min' and query !~ '^COPY';"  
  
pre_xact="select 'Warning '||transaction||' '||gid||' '||prepared||' '||owner||' '||database from pg_prepared_xacts where now()-prepared > interval '1 min';"  
  
# per database monitor  
seq="do language plpgsql \$\$  
declare  
  v_seq name;   
  v_max int8 := 0;   
  v_last int8 := 0;  
begin  
  for v_seq in   
    select quote_ident(t2.nspname) || '.' || quote_ident(t1.relname) from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relkind='S'   
  loop  
    execute 'select max_value,last_value from '||v_seq into v_max, v_last;   
    if v_max-v_last<500000000 then   
      raise notice 'Warning seq % last % max %', v_seq, v_last, v_max ;   
    -- else  
    --   raise notice 'Normal seq % last % max %', v_seq, v_last, v_max ;   
    end if;  
  end loop;  
end;  
\$\$;"  
  
idx="select 'Warning '||schemaname||' '||relname||' '||indexrelname||' '||idx_scan from pg_stat_all_indexes where idx_scan<10 and schemaname not in ('pg_catalog','pg_toast') and indexrelid not in (select conindid from pg_constraint where contype in ('p','u')) and pg_relation_size(indexrelid)>65536 and not pg_is_in_recovery() order by idx_scan;"  
  
  
eval SQL="$"$1  
# echo "$SQL"  
DB=$2  
  
res=`psql -q -A -n -t $DB -c "$SQL" 2>&1`  
if [ $? -eq 0 ]; then  
  (echo "$res"|grep "Critical" >/dev/null) && echo "$res"|grep "Critical" && exit 2  
  (echo "$res"|grep "Warning" >/dev/null) && echo "$res"|grep "Warning" && exit 1  
  echo "$res"; exit 0  
else  
  echo "$res"; exit 3  
fi  
chmod 500 nagios.sh  

使用举例 :

postgres@db-digoal-> ./nagios.sh arch  
Normal archive_mode,on  
  
postgres@db-digoal-> echo $?  
0  
  
postgres@db-digoal-> ./nagios.sh autovacuum  
Normal autovacuum,on  
  
postgres@db-digoal-> ./nagios.sh age  
Normal template1 age 212895  
Normal template0 age 212895  
Normal postgres age 212895  
Normal sky_pg_cluster age 212895  
Normal digoal age 212895  
  
postgres@db-digoal-> ./nagios.sh conn  
Normal max 6000 now 42  
  
postgres@db-digoal-> ./nagios.sh rollback  
Warning  digoal c 10813 r 2849  
Normal sky_pg_cluster c 133130 r 0  
Normal postgres c 625 r 2  
  
postgres@db-digoal-> ./nagios.sh standby  
Normal ins 1/747657D8 sent 1/747657D8 diff 0  
  
postgres@db-digoal-> ./nagios.sh lock  
  
postgres@db-digoal-> echo $?  
0  
  
postgres@db-digoal-> ./nagios.sh xact  
  
postgres@db-digoal-> ./nagios.sh pre_xact  
  
postgres@db-digoal-> ./nagios.sh seq digoal  
  
postgres@db-digoal-> ./nagios.sh idx digoal  
Warning digoal tbl_access_log idx_tbl_access_log_1_ntime 0  

部署到数据库服务器的nagios监控agent即可.

部署时nrpe请使用postgres用户, 因为使用unix sock连接数据库.

如果使用IP地址连接的话, 则不需要postgres用户.

参考

1. http://blog.163.com/digoal@126/blog/static/16387704020135313354383/

2. http://blog.163.com/digoal@126/blog/static/16387704020135334157531/

3. http://blog.163.com/digoal@126/blog/static/16387704020135562545536/

4. http://bucardo.org/check_postgres/check_postgres.pl.html

5. https://raw.github.com/bucardo/check_postgres/master/check_postgres.pl

Flag Counter

digoal’s 大量PostgreSQL文章入口