[未完待续] PostgreSQL 监控、诊断功能设计

1 minute read

背景

采集(数据库内部、数据库日志、操作系统dmesg,mcelog、CGROUP),存储,处理,告警,产生报告,API

GUC开关,控制某些采集指标()

采集什么东西、频率

存储结构设计

API(生成报告、查看历史告警、清除历史存储、调整采集频率、设置GUC开关、增加告警、调整告警阈值)

报告内容设计

监控指标 诊断报告

每项指标对应的版本 每项指标需要采集的数据 指标是否需要内核改进 指标依赖的插件

用户开关项

http://pgstatsinfo.sourceforge.net/

http://pgstatsinfo.sourceforge.net/documents/statsinfo_old/files/report_sample.txt

http://pgstatsinfo.sourceforge.net/documents/reporter10/html/files/report_sample.html#alerts

https://github.com/digoal/blog/blob/master/201709/20170913_01.md

https://raw.githubusercontent.com/digoal/pgsql_admin_script/master/generate_report.sh

https://github.com/digoal/pgsql_admin_script/issues

https://github.com/digoal/blog/blob/master/201606/20160628_01.md

https://github.com/digoal/blog/blob/master/201611/20161123_01.md


指标1(当前值),活跃会话、空闲在事务中会话、总会话、坐标顶层(max_connection)
全局 select state,count() from pg_stat_activity group by 1; 明细 select usename,datname,state,count() from pg_stat_activity group by 1,2,3;

区分: 活跃, state=’active’ 空闲在事务中, state=’idle in transaction’ 总会话,sum(count(*))

最大连接数 select current_setting(‘max_connections’);

状态 switch (beentry->st_state) { case STATE_IDLE: values[4] = CStringGetTextDatum(“idle”); break; case STATE_RUNNING: values[4] = CStringGetTextDatum(“active”); break; case STATE_IDLEINTRANSACTION: values[4] = CStringGetTextDatum(“idle in transaction”); break; case STATE_FASTPATH: values[4] = CStringGetTextDatum(“fastpath function call”); break; case STATE_IDLEINTRANSACTION_ABORTED: values[4] = CStringGetTextDatum(“idle in transaction (aborted)”); break; case STATE_DISABLED: values[4] = CStringGetTextDatum(“disabled”); break; case STATE_UNDEFINED: nulls[4] = true; break; }

指标2(当前值),等待状态会话

全局 select count() from pg_stat_activity where wait_event_type is not null and wait_event_type<>’Activity’; 明细,区分不同的等待类型 select wait_event_type,wait_event,count() from pg_stat_activity where wait_event_type is not null and wait_event_type<>’Activity’ group by 1,2;

指标2(当前值),长事务(包括2PC) 的最早时间。(打印最早的时间)

select min(xact_start) from pg_stat_activity where backend_xid is not null or backend_xmin is not null; select min(prepared) from pg_prepared_xacts ; 忽略空值,取最小值

指标3(区间统计值), pg_stat_database

postgres=# \d pg_stat_database View “pg_catalog.pg_stat_database” Column | Type | Default —————-+————————–+———- datid | oid |
datname | name | – 库名 numbackends | integer |
xact_commit | bigint | – 事务提交数 xact_rollback | bigint | – 事务回滚数 blks_read | bigint | – 未命中读数据块个数(乘以current_setting(‘block_size’)得到字节数) blks_hit | bigint | – SHARED BUFFER命中读数据块个数 tup_returned | bigint | – 使用全表扫描方法扫描的记录数 tup_fetched | bigint | – 使用索引扫描方法fetch heap tuple的记录数(index only scan如果从vm判断不需要FETCH HEAP的话,不会被计数,只记录实际fetch heap tuple的数目) tup_inserted | bigint | – 插入了多少行 tup_updated | bigint | – 更新了多少行 tup_deleted | bigint | – 删除了多少行 conflicts | bigint |
temp_files | bigint |
temp_bytes | bigint | – 使用了多少临时文件(字节) deadlocks | bigint | – 死锁 blk_read_time | double precision | – 读数据块耗时 blk_write_time | double precision | – 写数据块耗时 stats_reset | timestamp with time zone |

全局 select sum(xact_commit), sum(xact_rollback), sum(blks_read), sum(blks_hit), sum(tup_returned), sum(tup_fetched), sum(tup_inserted), sum(tup_updated), sum(tup_deleted), sum(temp_bytes), sum(deadlocks), sum(blk_read_time), sum(blk_write_time) from pg_stat_database; 明细(区分DB) select datname, xact_commit, xact_rollback, blks_read, blks_hit, tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted, temp_bytes, deadlocks, blk_read_time, blk_write_time from pg_stat_database;

指标4(区间统计值),内核新增功能, SQL计数器, slow qps (暂时缺失) 计数器区分update,delete,select,insert,copy,DDL , 区分 database, user

指标5服务器指标(当前值),cpu,iops,mem,network

指标6(区间统计值),pg_stat_statements , top sql
CREATE EXTENSION is not exists pg_stat_statements;

Flag Counter

digoal’s 大量PostgreSQL文章入口