[未完待续] PostgreSQL 监控、诊断功能设计
背景
采集(数据库内部、数据库日志、操作系统dmesg,mcelog、CGROUP),存储,处理,告警,产生报告,API
GUC开关,控制某些采集指标()
采集什么东西、频率
存储结构设计
API(生成报告、查看历史告警、清除历史存储、调整采集频率、设置GUC开关、增加告警、调整告警阈值)
报告内容设计
监控指标 诊断报告
每项指标对应的版本 每项指标需要采集的数据 指标是否需要内核改进 指标依赖的插件
用户开关项
http://pgstatsinfo.sourceforge.net/
http://pgstatsinfo.sourceforge.net/documents/statsinfo_old/files/report_sample.txt
http://pgstatsinfo.sourceforge.net/documents/reporter10/html/files/report_sample.html#alerts
https://github.com/digoal/blog/blob/master/201709/20170913_01.md
https://raw.githubusercontent.com/digoal/pgsql_admin_script/master/generate_report.sh
https://github.com/digoal/pgsql_admin_script/issues
https://github.com/digoal/blog/blob/master/201606/20160628_01.md
https://github.com/digoal/blog/blob/master/201611/20161123_01.md
指标1(当前值),活跃会话、空闲在事务中会话、总会话、坐标顶层(max_connection)
全局
select state,count() from pg_stat_activity group by 1;
明细
select usename,datname,state,count() from pg_stat_activity group by 1,2,3;
区分: 活跃, state=’active’ 空闲在事务中, state=’idle in transaction’ 总会话,sum(count(*))
最大连接数 select current_setting(‘max_connections’);
状态 switch (beentry->st_state) { case STATE_IDLE: values[4] = CStringGetTextDatum(“idle”); break; case STATE_RUNNING: values[4] = CStringGetTextDatum(“active”); break; case STATE_IDLEINTRANSACTION: values[4] = CStringGetTextDatum(“idle in transaction”); break; case STATE_FASTPATH: values[4] = CStringGetTextDatum(“fastpath function call”); break; case STATE_IDLEINTRANSACTION_ABORTED: values[4] = CStringGetTextDatum(“idle in transaction (aborted)”); break; case STATE_DISABLED: values[4] = CStringGetTextDatum(“disabled”); break; case STATE_UNDEFINED: nulls[4] = true; break; }
指标2(当前值),等待状态会话
全局 select count() from pg_stat_activity where wait_event_type is not null and wait_event_type<>’Activity’; 明细,区分不同的等待类型 select wait_event_type,wait_event,count() from pg_stat_activity where wait_event_type is not null and wait_event_type<>’Activity’ group by 1,2;
指标2(当前值),长事务(包括2PC) 的最早时间。(打印最早的时间)
select min(xact_start) from pg_stat_activity where backend_xid is not null or backend_xmin is not null; select min(prepared) from pg_prepared_xacts ; 忽略空值,取最小值
指标3(区间统计值), pg_stat_database
postgres=# \d pg_stat_database
View “pg_catalog.pg_stat_database”
Column | Type | Default
—————-+————————–+———-
datid | oid |
datname | name | – 库名
numbackends | integer |
xact_commit | bigint | – 事务提交数
xact_rollback | bigint | – 事务回滚数
blks_read | bigint | – 未命中读数据块个数(乘以current_setting(‘block_size’)得到字节数)
blks_hit | bigint | – SHARED BUFFER命中读数据块个数
tup_returned | bigint | – 使用全表扫描方法扫描的记录数
tup_fetched | bigint | – 使用索引扫描方法fetch heap tuple的记录数(index only scan如果从vm判断不需要FETCH HEAP的话,不会被计数,只记录实际fetch heap tuple的数目)
tup_inserted | bigint | – 插入了多少行
tup_updated | bigint | – 更新了多少行
tup_deleted | bigint | – 删除了多少行
conflicts | bigint |
temp_files | bigint |
temp_bytes | bigint | – 使用了多少临时文件(字节)
deadlocks | bigint | – 死锁
blk_read_time | double precision | – 读数据块耗时
blk_write_time | double precision | – 写数据块耗时
stats_reset | timestamp with time zone |
全局 select sum(xact_commit), sum(xact_rollback), sum(blks_read), sum(blks_hit), sum(tup_returned), sum(tup_fetched), sum(tup_inserted), sum(tup_updated), sum(tup_deleted), sum(temp_bytes), sum(deadlocks), sum(blk_read_time), sum(blk_write_time) from pg_stat_database; 明细(区分DB) select datname, xact_commit, xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted, temp_bytes, deadlocks, blk_read_time, blk_write_time from pg_stat_database;
指标4(区间统计值),内核新增功能, SQL计数器, slow qps (暂时缺失) 计数器区分update,delete,select,insert,copy,DDL , 区分 database, user
指标5服务器指标(当前值),cpu,iops,mem,network
指标6(区间统计值),pg_stat_statements , top sql
CREATE EXTENSION is not exists pg_stat_statements;