注意PostgreSQL 隐式锁请求

4 minute read

背景

我们在使用数据库时,经常遇到锁等待的事情。例如一个用户在更新一条记录时,另一个会话要更新同一条记录就需要等待。

这种锁等待是比较显而易见的,有一些锁等待你可能会觉得匪夷所思。

例如:

会话A:

postgres=# begin;  
BEGIN  
postgres=# alter table test add column c1 int;  -- test表被加了 一个排它锁, 对test表的任何操作都会等待  
ALTER TABLE  

会话B:

select * from pg_get_indexdef('test_pkey'::regclass);  -- 查询索引的定义  

如果test_pkey是test的一个索引,它会进入等待状态。

postgres=# \d+ test  
                        Table "postgres.test"  
 Column |  Type   | Modifiers | Storage  | Stats target | Description   
--------+---------+-----------+----------+--------------+-------------  
 id     | integer | not null  | plain    |              |   
 info   | text    |           | extended |              |   
 c1     | integer |           | plain    |              |   
Indexes:  
    "test_pkey" PRIMARY KEY, btree (id)  

我们来跟踪一下

首先请按照 http://blog.163.com/digoal@126/blog/static/163877040201422083228624

提供的方法开启LOCK DEBUG

会话A

postgres=# begin;  
BEGIN  
postgres=# alter table test add column c1 int;  
ALTER TABLE  

会话B

postgres=# set client_min_messages=debug;  
postgres=# set trace_locks=on;  
postgres=# select * from pg_get_indexdef('test_pkey'::regclass);  
LOG:  statement: select * from pg_get_indexdef('test_pkey'::regclass);  
LOG:  LockAcquire: lock [13003,42430] AccessShareLock  
LOG:  LockAcquire: found: lock(0x7f030ad3a750) id(13003,42430,0,0,0,1) grantMask(100) req(0,0,0,0,0,0,0)=1 grant(0,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockAcquire: new: proclock(0x7f030add4cd0) lock(0x7f030ad3a750) method(1) proc(0x7f030b029900) hold(0)  
LOG:  LockCheckConflicts: conflicting: proclock(0x7f030add4cd0) lock(0x7f030ad3a750) method(1) proc(0x7f030b029900) hold(0)  
LOG:  WaitOnLock: sleeping on lock: lock(0x7f030ad3a750) id(13003,42430,0,0,0,1) grantMask(100) req(1,0,0,0,0,0,0)=2 grant(0,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  

等待中。

使用 https://github.com/digoal/pgsql_admin_script/blob/master/generate_report.sh 提供的方法可以查询到锁等待如下:

postgres=# with t_wait as                       
(select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a,transactionid,b.query,b.xact_start,b.query_start,b.usename,b.datname from pg_locks a,pg_stat_activity b where a.pid=b.pid and not a.granted),  
t_run as   
(select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a,transactionid,b.query,b.xact_start,b.query_start,b.usename,b.datname from pg_locks a,pg_stat_activity b where a.pid=b.pid and a.granted)   
select r.locktype,r.mode r_mode,r.usename r_user,r.datname r_db,r.relation::regclass,r.pid r_pid,r.xact_start r_xact_start,r.query_start r_query_start,now()-r.query_start r_locktime,r.query r_query,  
w.mode w_mode,w.pid w_pid,w.xact_start w_xact_start,w.query_start w_query_start,now()-w.query_start w_locktime,w.query w_query    
from t_wait w,t_run r where  
  r.locktype is not distinct from w.locktype and  
  r.database is not distinct from w.database and  
  r.relation is not distinct from w.relation and  
  r.page is not distinct from w.page and  
  r.tuple is not distinct from w.tuple and  
  r.classid is not distinct from w.classid and  
  r.objid is not distinct from w.objid and  
  r.objsubid is not distinct from w.objsubid and  
  r.transactionid is not distinct from w.transactionid and  
  r.pid <> w.pid  
  order by f_lock_level(w.mode)+f_lock_level(r.mode) desc,r.xact_start;  
-[ RECORD 1 ]-+------------------------------------------------------  
locktype      | relation  
r_mode        | AccessExclusiveLock  
r_user        | postgres  
r_db          | postgres  
relation      | test  
r_pid         | 22718  
r_xact_start  | 2015-11-05 10:22:38.080657+08  
r_query_start | 2015-11-05 10:22:40.538386+08  
r_locktime    | 00:00:37.099566  
r_query       | alter table test add column c1 int;  
w_mode        | AccessShareLock  
w_pid         | 22642  
w_xact_start  | 2015-11-05 10:07:18.0088+08  
w_query_start | 2015-11-05 10:22:43.915212+08  
w_locktime    | 00:00:33.72274  
w_query       | select * from pg_get_indexdef('test_pkey'::regclass);  

此时提交或回归会话A,然后会话B可以继续:

LOG:  WaitOnLock: wakeup on lock: lock(0x7f030ad3a600) id(13003,42430,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockAcquire: granted: proclock(0x7f030add4928) lock(0x7f030ad3a600) method(1) proc(0x7f030b029900) hold(2)  
LOG:  LockAcquire: granted: lock(0x7f030ad3a600) id(13003,42430,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockRelease: lock [13003,42430] AccessShareLock  
LOG:  LockRelease: found: lock(0x7f030ad3a600) id(13003,42430,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockRelease: found: proclock(0x7f030add4928) lock(0x7f030ad3a600) method(1) proc(0x7f030b029900) hold(2)  
LOG:  UnGrantLock: updated: lock(0x7f030ad3a600) id(13003,42430,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(AccessShareLock)  
LOG:  UnGrantLock: updated: proclock(0x7f030add4928) lock(0x7f030ad3a600) method(1) proc(0x7f030b029900) hold(0)  
LOG:  CleanUpLock: deleting: proclock(0x7f030add4928) lock(0x7f030ad3a600) method(1) proc(0x7f030b029900) hold(0)  
LOG:  CleanUpLock: deleting: lock(0x7f030ad3a600) id(13003,42430,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(INVALID)  

正常情况下,查询索引定义如果没有被堵塞,观察到的锁是这样的:

LOG:  statement: select * from pg_get_indexdef('test_pkey'::regclass);  
LOG:  LockAcquire: lock [13003,42430] AccessShareLock  
LOG:  LockAcquire: new: lock(0x7f030ad3aa98) id(13003,42430,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(AccessShareLock)  
LOG:  LockAcquire: new: proclock(0x7f030add4df0) lock(0x7f030ad3aa98) method(1) proc(0x7f030b029900) hold(0)  
LOG:  LockCheckConflicts: no conflict: proclock(0x7f030add4df0) lock(0x7f030ad3aa98) method(1) proc(0x7f030b029900) hold(0)  

注意这里请求的锁如下13003,42430分别表示数据库OID和对象OID,对象OID对应的是表,而不是索引。

LOG:  GrantLock: lock(0x7f030ad3aa98) id(13003,42430,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockRelease: lock [13003,42430] AccessShareLock  
LOG:  LockRelease: found: lock(0x7f030ad3aa98) id(13003,42430,0,0,0,1) grantMask(2) req(1,0,0,0,0,0,0)=1 grant(1,0,0,0,0,0,0)=1 wait(0) type(AccessShareLock)  
LOG:  LockRelease: found: proclock(0x7f030add4df0) lock(0x7f030ad3aa98) method(1) proc(0x7f030b029900) hold(2)  
LOG:  UnGrantLock: updated: lock(0x7f030ad3aa98) id(13003,42430,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(AccessShareLock)  
LOG:  UnGrantLock: updated: proclock(0x7f030add4df0) lock(0x7f030ad3aa98) method(1) proc(0x7f030b029900) hold(0)  
LOG:  CleanUpLock: deleting: proclock(0x7f030add4df0) lock(0x7f030ad3aa98) method(1) proc(0x7f030b029900) hold(0)  
LOG:  CleanUpLock: deleting: lock(0x7f030ad3aa98) id(13003,42430,0,0,0,1) grantMask(0) req(0,0,0,0,0,0,0)=0 grant(0,0,0,0,0,0,0)=0 wait(0) type(INVALID)  

所以,这里堵塞的根本原因是获取索引定义需要请求表的AccessShareLock,所以和表上DDL请求的排他锁冲突了。

还有很多获得对象定义的函数,使用时需要注意。

 pg_catalog | pg_get_constraintdef        | text             | oid                   | normal  
 pg_catalog | pg_get_constraintdef        | text             | oid, boolean          | normal  
 pg_catalog | pg_get_function_arg_default | text             | oid, integer          | normal  
 pg_catalog | pg_get_functiondef          | text             | oid                   | normal  
 pg_catalog | pg_get_indexdef             | text             | oid                   | normal  
 pg_catalog | pg_get_indexdef             | text             | oid, integer, boolean | normal  
 pg_catalog | pg_get_ruledef              | text             | oid                   | normal  
 pg_catalog | pg_get_ruledef              | text             | oid, boolean          | normal  
 pg_catalog | pg_get_triggerdef           | text             | oid                   | normal  
 pg_catalog | pg_get_triggerdef           | text             | oid, boolean          | normal  
 pg_catalog | pg_get_viewdef              | text             | oid                   | normal  
 pg_catalog | pg_get_viewdef              | text             | oid, boolean          | normal  
 pg_catalog | pg_get_viewdef              | text             | oid, integer          | normal  
 pg_catalog | pg_get_viewdef              | text             | text                  | normal  
 pg_catalog | pg_get_viewdef              | text             | text, boolean         | normal  

例如:

获取规则定义也要请求表的AccessShareLock。

postgres=# create rule r1 as on delete to u1 do also select 1;  
CREATE RULE  

会话A:

postgres=# begin;  
postgres=# alter table u1 add column c2 int;  
ALTER TABLE  

会话B:

postgres=# select pg_get_ruledef(42458);  
LOG:  statement: select pg_get_ruledef(42458);  
LOG:  LockAcquire: lock [13003,25572] AccessShareLock  
LOG:  LockAcquire: found: lock(0x7f030ad3a0c0) id(13003,25572,0,0,0,1) grantMask(102) req(1,0,0,0,0,0,0)=2 grant(1,0,0,0,0,0,0)=2 wait(0) type(AccessShareLock)  
LOG:  LockAcquire: new: proclock(0x7f030add42b0) lock(0x7f030ad3a0c0) method(1) proc(0x7f030b029340) hold(0)  
LOG:  LockCheckConflicts: conflicting: proclock(0x7f030add42b0) lock(0x7f030ad3a0c0) method(1) proc(0x7f030b029340) hold(0)  

这里的25572对应的是u1.

LOG:  WaitOnLock: sleeping on lock: lock(0x7f030ad3a0c0) id(13003,25572,0,0,0,1) grantMask(102) req(2,0,0,0,0,0,0)=3 grant(1,0,0,0,0,0,0)=2 wait(0) type(AccessShareLock)  

pg_get_viewdef也有这个问题,需要获得视图引用的表的AccessShareLock。

LOG:  LockAcquire: lock [13003,42446] AccessShareLock  
LOG:  LockAcquire: lock [13003,42446] AccessShareLock  
LOG:  LockAcquire: lock [13003,42446] AccessShareLock  
LOG:  LockAcquire: lock [13003,17229] AccessShareLock  
LOG:  LockAcquire: found: lock(0x7f030ad3b5c0) id(13003,17229,0,0,0,1) grantMask(100) req(1,0,0,0,0,0,0)=2 grant(0,0,0,0,0,0,0)=1 wait(1) type(AccessShareLock)  
LOG:  LockAcquire: new: proclock(0x7f030add42b0) lock(0x7f030ad3b5c0) method(1) proc(0x7f030b029340) hold(0)  
LOG:  LockCheckConflicts: conflicting: proclock(0x7f030add42b0) lock(0x7f030ad3b5c0) method(1) proc(0x7f030b029340) hold(0)  
LOG:  WaitOnLock: sleeping on lock: lock(0x7f030ad3b5c0) id(13003,17229,0,0,0,1) grantMask(100) req(2,0,0,0,0,0,0)=3 grant(0,0,0,0,0,0,0)=1 wait(1) type(AccessShareLock)  

另外再提供一个需要注意的点,PG对未获得,但是在等待中的锁也在冲突列表中。

例如用户1对A表在做一个比较大的查询,另一个用户2需要对A表执行DDL,那么显然用户2的DDL无法获得排它锁在等待用户1的状态。此后,用户3或其他用户发起对A的查询请求也会被用户2堵塞,这是非常危险的,如果用户1不释放锁,那么用户2就会一直等待,同时会堵塞所有对A表的任何请求。

所以建议用户在执行DDL操作时,加一个锁超时的参数,防止出现以上情况。

set lock_timeout='1s';  

还有一种autocommit的场景下,建议对需要持有大锁的SQL在执行前加上语句超时,防止长时间持锁或等待锁,在某些场景甚至可能造成拥塞,例如短连接的场景,用户可能不断发起请求,把数据库连接用完。

set statement_timeout ='1s';  

以上锁DEBUG信息都可以参考src/include/storage/lock.h获得详细的描述。

参考

1. http://blog.163.com/digoal@126/blog/static/163877040201422083228624

2. http://www.postgresql.org/docs/9.5/static/runtime-config-developer.html

3. https://github.com/digoal/pgsql_admin_script/blob/master/generate_report.sh

4. src/include/storage/lock.h

Flag Counter

digoal’s 大量PostgreSQL文章入口