PostgreSQL nagios monitor script (archive, vacuum, age, conn, rollback, standby, lock, xact, seq, index…)
背景
bucardo提供的check_postgres脚本过于庞大, 我把其中一些比较常用的监控项摘取出来, 加上一些经验值做成了一个shell的nagios监控脚本, 包含了大部分的常用监控项.
1. 是否打开归档
2. 是否打开vacuum
3. 数据库年龄
4. 连接数
5. 提交和回滚的比例(问题修复后,手工清除统计信息pg_stat_reset(), 连接到对应的库执行)
6. standby延迟
7. 锁等待
8. 长事务/空闲事务
9. prepared事务
10. 序列剩余量(每个库查询)
11. 未使用的索引(每个库查询)
以PostgreSQL 9.3为例, 脚本如下 :
vi nagios.sh
#!/bin/bash
# exp1 : $prog arch
# exp2 : $prog idx $dbname
# please use database superuser
export LANG=en_US.utf8
export PGPORT=5432
export PGDATA=/data01/pgdata/pg_root
export PGHOME=/opt/pgsql
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export PGUSER=postgres
export PGHOST=$PGDATA
export PGDATABASE=postgres
# monitor parameter
# cluster monitor
arch="select case setting when 'on' then 'Normal ' else 'Warning 'end||name||','||setting from pg_settings where name='archive_mode';"
autovacuum="select case setting when 'on' then 'Normal 'else 'Warning ' end||name||','||setting from pg_settings where name='autovacuum';"
age="select case when age>600000000 then 'Warning ' else 'Normal ' end||datname||' age '||age from (select datname,age(datfrozenxid) as age from pg_database) t order by age desc;"
conn="select case when t2.setting-t1.cnt < 200 then 'Warning ' else 'Normal ' end||'max '||t2.setting||' now '||t1.cnt from (select count(*) as cnt from pg_stat_activity) t1, (select setting::int8 from pg_settings where name = 'max_connections') t2;"
rollback="select case when xact_rollback/xact_commit::numeric <0.1 then 'Normal ' else 'Warning ' end||datname||' c '||xact_commit||' r '||xact_rollback from pg_stat_database where xact_commit>100 order by 1 desc;"
standby="select case when abs>10240000 then 'Warning ' else 'Normal ' end||'ins '||ins||' sent '||sent||' diff '||abs from (select pg_current_xlog_insert_location() as ins,sent_location as sent,abs(pg_xlog_location_diff(pg_current_xlog_insert_location(),sent_location)) as abs from pg_stat_replication) t;"
lock="select 'Warning '||client_addr||' '||usename||' '||datname||' '||query_start||' '||now()||' '||query from pg_stat_activity where waiting and now()-query_start > interval '1 min';"
xact="select 'Warning '||client_addr||' '||usename||' '||datname||' '||xact_start||' '||now()||' '||query from pg_stat_activity where now()-xact_start > interval '1 min' and query !~ '^COPY';"
pre_xact="select 'Warning '||transaction||' '||gid||' '||prepared||' '||owner||' '||database from pg_prepared_xacts where now()-prepared > interval '1 min';"
# per database monitor
seq="do language plpgsql \$\$
declare
v_seq name;
v_max int8 := 0;
v_last int8 := 0;
begin
for v_seq in
select quote_ident(t2.nspname) || '.' || quote_ident(t1.relname) from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relkind='S'
loop
execute 'select max_value,last_value from '||v_seq into v_max, v_last;
if v_max-v_last<500000000 then
raise notice 'Warning seq % last % max %', v_seq, v_last, v_max ;
-- else
-- raise notice 'Normal seq % last % max %', v_seq, v_last, v_max ;
end if;
end loop;
end;
\$\$;"
idx="select 'Warning '||schemaname||' '||relname||' '||indexrelname||' '||idx_scan from pg_stat_all_indexes where idx_scan<10 and schemaname not in ('pg_catalog','pg_toast') and indexrelid not in (select conindid from pg_constraint where contype in ('p','u')) and pg_relation_size(indexrelid)>65536 and not pg_is_in_recovery() order by idx_scan;"
eval SQL="$"$1
# echo "$SQL"
DB=$2
res=`psql -q -A -n -t $DB -c "$SQL" 2>&1`
if [ $? -eq 0 ]; then
(echo "$res"|grep "Critical" >/dev/null) && echo "$res"|grep "Critical" && exit 2
(echo "$res"|grep "Warning" >/dev/null) && echo "$res"|grep "Warning" && exit 1
echo "$res"; exit 0
else
echo "$res"; exit 3
fi
chmod 500 nagios.sh
使用举例 :
postgres@db-digoal-> ./nagios.sh arch
Normal archive_mode,on
postgres@db-digoal-> echo $?
0
postgres@db-digoal-> ./nagios.sh autovacuum
Normal autovacuum,on
postgres@db-digoal-> ./nagios.sh age
Normal template1 age 212895
Normal template0 age 212895
Normal postgres age 212895
Normal sky_pg_cluster age 212895
Normal digoal age 212895
postgres@db-digoal-> ./nagios.sh conn
Normal max 6000 now 42
postgres@db-digoal-> ./nagios.sh rollback
Warning digoal c 10813 r 2849
Normal sky_pg_cluster c 133130 r 0
Normal postgres c 625 r 2
postgres@db-digoal-> ./nagios.sh standby
Normal ins 1/747657D8 sent 1/747657D8 diff 0
postgres@db-digoal-> ./nagios.sh lock
postgres@db-digoal-> echo $?
0
postgres@db-digoal-> ./nagios.sh xact
postgres@db-digoal-> ./nagios.sh pre_xact
postgres@db-digoal-> ./nagios.sh seq digoal
postgres@db-digoal-> ./nagios.sh idx digoal
Warning digoal tbl_access_log idx_tbl_access_log_1_ntime 0
部署到数据库服务器的nagios监控agent即可.
部署时nrpe请使用postgres用户, 因为使用unix sock连接数据库.
如果使用IP地址连接的话, 则不需要postgres用户.
参考
1. http://blog.163.com/digoal@126/blog/static/16387704020135313354383/
2. http://blog.163.com/digoal@126/blog/static/16387704020135334157531/
3. http://blog.163.com/digoal@126/blog/static/16387704020135562545536/
4. http://bucardo.org/check_postgres/check_postgres.pl.html
5. https://raw.github.com/bucardo/check_postgres/master/check_postgres.pl