fast random data query & delete use ctid in postgresql

14 minute read

背景

朋友公司有一个需求, 需要生成一笔随机的唯一数据, 随机取出来使用, 用完下次不能重复使用.

使用PostgreSQL来实现的话, 有以下方法, 性能OK.

创建测试表, 存放一堆唯一值.

postgres=# create table tbl (id int);  
CREATE TABLE  

唯一值随机插入, 取数据时按照数据块倒序取出, 这么做的好处是vacuum时可以直接回收这部分空间.

postgres=# select generate_series(1,10) order by random();  
 generate_series   
-----------------  
               1  
               9  
               4  
               7  
               3  
               6  
               8  
               2  
              10  
               5  
(10 rows)  
postgres=# \timing  
Timing is on.  

随机的插入数据

postgres=# insert into tbl select generate_series(1,10000000) order by random();  
INSERT 0 10000000  
Time: 42204.425 ms  

从数据来看 , 已经随机插入了.

postgres=# select * from tbl limit 10;  
   id      
---------  
 9318426  
 4366165  
 4661718  
 8491396  
 9413591  
 9845650  
 8830805  
  999712  
 7944907  
 2487468  
(10 rows)  

在ctid上创建索引, 取数据时使用这个索引, 倒序从最后的数据块开始取数据.

postgres=# create index idx_tbl_ctid on tbl(ctid);  
CREATE INDEX  
Time: 18824.496 ms  

例如:

postgres=# select ctid,* from tbl order by ctid desc limit 5;  
    ctid    |   id      
------------+---------  
 (11001,91) | 1121211  
 (11001,90) | 1157632  
 (11001,89) | 3024993  
 (11001,88) | 6147265  
 (11001,87) | 3057019  
(5 rows)  
Time: 0.675 ms  

为了防止多个进程重复取数据, 使用这种方法.

postgres=# with t as(select ctid from tbl order by ctid desc limit 5) delete from tbl where ctid in (select ctid from t) returning *;  
   id      
---------  
 3057019  
 6147265  
 3024993  
 1157632  
 1121211  
(5 rows)  
DELETE 5  
Time: 1.136 ms  

测试并行取数据.

测试方法, 将数据插入另一张表

postgres=# with t as(select ctid from tbl order by ctid desc limit 5),t1 as (delete from tbl where ctid in (select ctid from t) returning *) insert into test select * from t1 ;  
INSERT 0 5  
Time: 1.225 ms  
postgres=# select * from test;  
   id      
---------  
 9879376  
 8194987  
 4095242  
 4486826  
 5762760  
(5 rows)  
Time: 0.400 ms  

使用pgbench 测试, 16个并行取数据进程, 每次取5条.

postgres@localhost-> vi test.sql  
with t as(select ctid from tbl order by ctid desc limit 5),t1 as (delete from tbl where ctid in (select ctid from t) returning *) insert into test select * from t1;  

测试完成后, 查询test表, 看看有没有重复数据就知道这种方法是否靠谱了.

性能见下 :

postgres@localhost-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 30  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 30 s  
number of transactions actually processed: 225225  
tps = 7505.450512 (including connections establishing)  
tps = 7508.798021 (excluding connections establishing)  
statement latencies in milliseconds:  
        2.128700        with t as(select ctid from tbl order by ctid desc limit 5),t1 as (delete from tbl where ctid in (select ctid from t) returning *) insert into test select * from t1;  

经查没有重复数据, 方法靠谱

postgres=# select count(*),count(distinct id) from test;  
 count  | count    
--------+--------  
 127625 | 127625  
(1 row)  

使用这种方式, 垃圾回收将回收文件.

postgres=# select relpages from pg_class where relname='tbl';  
 relpages   
----------  
    10442  
(1 row)  
  
postgres=# \q  
postgres@localhost-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 30  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 30 s  
number of transactions actually processed: 221771  
tps = 7390.878883 (including connections establishing)  
tps = 7393.894430 (excluding connections establishing)  
statement latencies in milliseconds:  
        2.161958        with t as(select ctid from tbl order by ctid desc limit 5) delete from tbl where ctid in (select ctid from t) returning *;  
postgres=# vacuum analyze tbl;  
VACUUM  
postgres=# select relpages from pg_class where relname='tbl';  
 relpages   
----------  
    10301  
(1 row)  

使用多个表的话, 性能还有提升.

如下 :

postgres=# create table tbl0(id int);  
CREATE TABLE  
postgres=# create table tbl1(id int);  
CREATE TABLE  
postgres=# create table tbl2(id int);  
CREATE TABLE  
postgres=# create table tbl3(id int);  
CREATE TABLE  
postgres=# create table tbl4(id int);  
CREATE TABLE  
postgres=# create table tbl5(id int);  
CREATE TABLE  
postgres=# create table tbl6(id int);  
CREATE TABLE  
postgres=# create table tbl7(id int);  
CREATE TABLE  
postgres=# create table tbl8(id int);  
CREATE TABLE  
postgres=# create table tbl9(id int);  
CREATE TABLE  
postgres=# create table tbl10(id int);  
CREATE TABLE  
postgres=# create table tbl11(id int);  
CREATE TABLE  
postgres=# create table tbl12(id int);  
CREATE TABLE  
postgres=# create table tbl13(id int);  
CREATE TABLE  
postgres=# create table tbl14(id int);  
CREATE TABLE  
postgres=# create table tbl15(id int);  
CREATE TABLE  
  
insert into tbl0 select generate_series(1,1000000) order by random();  
insert into tbl1 select generate_series(1000001,2000000) order by random();  
insert into tbl2 select generate_series(2000001,3000000) order by random();  
insert into tbl3 select generate_series(3000001,4000000) order by random();  
insert into tbl4 select generate_series(4000001,5000000) order by random();  
insert into tbl5 select generate_series(5000001,6000000) order by random();  
insert into tbl6 select generate_series(6000001,7000000) order by random();  
insert into tbl7 select generate_series(7000001,8000000) order by random();  
insert into tbl8 select generate_series(8000001,9000000) order by random();  
insert into tbl9 select generate_series(9000001,10000000) order by random();  
insert into tbl10 select generate_series(10000001,11000000) order by random();  
insert into tbl11 select generate_series(11000001,12000000) order by random();  
insert into tbl12 select generate_series(12000001,13000000) order by random();  
insert into tbl13 select generate_series(13000001,14000000) order by random();  
insert into tbl14 select generate_series(14000001,15000000) order by random();  
insert into tbl15 select generate_series(15000001,16000000) order by random();  
  
create index idx_tbl0_ctid on tbl0(ctid desc);  
create index idx_tbl1_ctid on tbl1(ctid desc);  
create index idx_tbl2_ctid on tbl2(ctid desc);  
create index idx_tbl3_ctid on tbl3(ctid desc);  
create index idx_tbl4_ctid on tbl4(ctid desc);  
create index idx_tbl5_ctid on tbl5(ctid desc);  
create index idx_tbl6_ctid on tbl6(ctid desc);  
create index idx_tbl7_ctid on tbl7(ctid desc);  
create index idx_tbl8_ctid on tbl8(ctid desc);  
create index idx_tbl9_ctid on tbl9(ctid desc);  
create index idx_tbl10_ctid on tbl10(ctid desc);  
create index idx_tbl11_ctid on tbl11(ctid desc);  
create index idx_tbl12_ctid on tbl12(ctid desc);  
create index idx_tbl13_ctid on tbl13(ctid desc);  
create index idx_tbl14_ctid on tbl14(ctid desc);  
create index idx_tbl15_ctid on tbl15(ctid desc);  
  
postgres@localhost-> ll tbl*.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:28 tbl0.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:32 tbl10.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:32 tbl11.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:32 tbl12.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:32 tbl13.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:33 tbl14.sql  
-rw-rw-r-- 1 postgres postgres 127 Jan 30 03:33 tbl15.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:29 tbl1.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:29 tbl2.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl3.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl4.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl5.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl6.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl7.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl8.sql  
-rw-rw-r-- 1 postgres postgres 125 Jan 30 03:32 tbl9.sql  
  
postgres@localhost-> cat tbl*.sql  
with t as(select ctid from tbl0 order by ctid desc limit 5) delete from tbl0 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl10 order by ctid desc limit 5) delete from tbl10 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl11 order by ctid desc limit 5) delete from tbl11 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl12 order by ctid desc limit 5) delete from tbl12 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl13 order by ctid desc limit 5) delete from tbl13 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl14 order by ctid desc limit 5) delete from tbl14 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl15 order by ctid desc limit 5) delete from tbl15 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl1 order by ctid desc limit 5) delete from tbl1 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl2 order by ctid desc limit 5) delete from tbl2 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl3 order by ctid desc limit 5) delete from tbl3 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl4 order by ctid desc limit 5) delete from tbl4 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl5 order by ctid desc limit 5) delete from tbl5 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl6 order by ctid desc limit 5) delete from tbl6 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl7 order by ctid desc limit 5) delete from tbl7 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl8 order by ctid desc limit 5) delete from tbl8 where ctid in (select ctid from t) returning *;  
with t as(select ctid from tbl9 order by ctid desc limit 5) delete from tbl9 where ctid in (select ctid from t) returning *;  
  
  
postgres@localhost-> cat bench.sh   
#!/bin/bash  
  
pgbench -M prepared -n -r -f ./tbl0.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl1.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl2.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl3.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl4.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl5.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl6.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl7.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl8.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl9.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl10.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl11.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl12.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl13.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl14.sql -c 1 -j 1 -T 30 &  
pgbench -M prepared -n -r -f ./tbl15.sql -c 1 -j 1 -T 30 &  
  
  
postgres@localhost-> . ./bench.sh   
postgres@localhost-> jobs  
[1]   Running                 pgbench -M prepared -n -r -f ./tbl0.sql -c 1 -j 1 -T 30 &  
[2]   Running                 pgbench -M prepared -n -r -f ./tbl1.sql -c 1 -j 1 -T 30 &  
[3]   Running                 pgbench -M prepared -n -r -f ./tbl2.sql -c 1 -j 1 -T 30 &  
[4]   Running                 pgbench -M prepared -n -r -f ./tbl3.sql -c 1 -j 1 -T 30 &  
[5]   Running                 pgbench -M prepared -n -r -f ./tbl4.sql -c 1 -j 1 -T 30 &  
[6]   Running                 pgbench -M prepared -n -r -f ./tbl5.sql -c 1 -j 1 -T 30 &  
[7]   Running                 pgbench -M prepared -n -r -f ./tbl6.sql -c 1 -j 1 -T 30 &  
[8]   Running                 pgbench -M prepared -n -r -f ./tbl7.sql -c 1 -j 1 -T 30 &  
[9]   Running                 pgbench -M prepared -n -r -f ./tbl8.sql -c 1 -j 1 -T 30 &  
[10]   Running                 pgbench -M prepared -n -r -f ./tbl9.sql -c 1 -j 1 -T 30 &  
[11]   Running                 pgbench -M prepared -n -r -f ./tbl10.sql -c 1 -j 1 -T 30 &  
[12]   Running                 pgbench -M prepared -n -r -f ./tbl11.sql -c 1 -j 1 -T 30 &  
[13]   Running                 pgbench -M prepared -n -r -f ./tbl12.sql -c 1 -j 1 -T 30 &  
[14]   Running                 pgbench -M prepared -n -r -f ./tbl13.sql -c 1 -j 1 -T 30 &  
[15]-  Running                 pgbench -M prepared -n -r -f ./tbl14.sql -c 1 -j 1 -T 30 &  
[16]+  Running                 pgbench -M prepared -n -r -f ./tbl15.sql -c 1 -j 1 -T 30 &  
  
postgres@localhost-> transaction type: Custom query  
transaction type: Custom query  
scaling factor: 1  
scaling factor: 1  
query mode: prepared  
query mode: prepared  
number of clients: 1  
number of clients: 1  
number of threads: 1  
number of threads: 1  
duration: 30 s  
duration: 30 s  
number of transactions actually processed: 18323  
number of transactions actually processed: 18744  
tps = 610.744476 (including connections establishing)  
tps = 624.776154 (including connections establishing)  
tps = 610.882388 (excluding connections establishing)  
tps = 624.974076 (excluding connections establishing)  
statement latencies in milliseconds:  
statement latencies in milliseconds:  
        1.635022        with t as(select ctid from tbl5 order by ctid desc limit 5) delete from tbl5 where ctid in (select ctid from t) returning *;  
        1.598122        with t as(select ctid from tbl0 order by ctid desc limit 5) delete from tbl0 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18660  
tps = 621.969606 (including connections establishing)  
tps = 622.153610 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.605455        with t as(select ctid from tbl7 order by ctid desc limit 5) delete from tbl7 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18625  
tps = 620.807135 (including connections establishing)  
tps = 621.057183 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.608170        with t as(select ctid from tbl3 order by ctid desc limit 5) delete from tbl3 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of transactions actually processed: 18679  
tps = 622.613140 (including connections establishing)  
tps = 622.763449 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.603809        with t as(select ctid from tbl4 order by ctid desc limit 5) delete from tbl4 where ctid in (select ctid from t) returning *;  
duration: 30 s  
number of transactions actually processed: 18480  
tps = 615.974889 (including connections establishing)  
tps = 616.107203 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.621196        with t as(select ctid from tbl11 order by ctid desc limit 5) delete from tbl11 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18521  
tps = 617.330182 (including connections establishing)  
tps = 617.443250 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.617626        with t as(select ctid from tbl10 order by ctid desc limit 5) delete from tbl10 where ctid in (select ctid from t) returning *;  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18366  
tps = 612.146172 (including connections establishing)  
tps = 612.431254 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.631034        with t as(select ctid from tbl9 order by ctid desc limit 5) delete from tbl9 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18738  
tps = 624.574018 (including connections establishing)  
tps = 624.715406 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.598772        with t as(select ctid from tbl6 order by ctid desc limit 5) delete from tbl6 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18448  
tps = 614.888405 (including connections establishing)  
tps = 615.075231 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.623808        with t as(select ctid from tbl12 order by ctid desc limit 5) delete from tbl12 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18875  
tps = 629.134497 (including connections establishing)  
tps = 629.315227 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.587065        with t as(select ctid from tbl2 order by ctid desc limit 5) delete from tbl2 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18449  
tps = 614.934670 (including connections establishing)  
tps = 615.066184 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.623882        with t as(select ctid from tbl8 order by ctid desc limit 5) delete from tbl8 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18633  
tps = 621.077579 (including connections establishing)  
tps = 621.244771 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.607780        with t as(select ctid from tbl13 order by ctid desc limit 5) delete from tbl13 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18573  
tps = 619.008531 (including connections establishing)  
tps = 619.148479 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.613143        with t as(select ctid from tbl1 order by ctid desc limit 5) delete from tbl1 where ctid in (select ctid from t) returning *;  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18717  
tps = 623.839093 (including connections establishing)  
tps = 623.982969 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.600710        with t as(select ctid from tbl14 order by ctid desc limit 5) delete from tbl14 where ctid in (select ctid from t) returning *;  
number of clients: 1  
number of threads: 1  
duration: 30 s  
number of transactions actually processed: 18710  
tps = 623.627170 (including connections establishing)  
tps = 623.762851 (excluding connections establishing)  
statement latencies in milliseconds:  
        1.601275        with t as(select ctid from tbl15 order by ctid desc limit 5) delete from tbl15 where ctid in (select ctid from t) returning *;  

TPS可以达到将近1W.

postgres=# select 623*16;  
 ?column?   
----------  
     9968  
(1 row)  

为了达到更好的吞吐量, 建议程序一次多取一些, 例如一次取20条, 用完再来取, 那么可以减少和数据库的交互次数, 提高整体性能.

PS:

PostgreSQL 支持ctid扫描方法,因此本例可以不需要对ctid建立索引,方法例子请参考

《块级(ctid)扫描在IoT(物联网)极限写和消费读并存场景的应用》

[《在PostgreSQL中实现update delete limit - CTID扫描实践》](../201608/20160827_01.md)

Flag Counter

digoal’s 大量PostgreSQL文章入口