行为、审计日志 (实时索引/实时搜索)建模 - 最佳实践 1

12 minute read

背景

在很多系统中会记录用户的行为日志,行为日志包括浏览行为、社交行为、操作行为等。

典型的应用例如:数据库的SQL审计、企业内部的堡垒机(行为审计)等。

行为、审计日志的量与业务量或者操作量有关,为了满足企业实时查询的需求,通常需要构建搜索引擎,比如使用ES或者使用PostgreSQL的全文检索功能来实现。

如果使用PostgreSQL来构建,有几个优势,可以满足多个需求:

1. 明细存储的需求,除了需要建立索引的字段,明细字段也可以存储在PostgreSQL中。

2. 索引的需求,即建立日志行为字段的全文索引。

3. 多维度索引的需求,除了日志行为字段的索引,还可以建立其他字段的索引,例如时间维度,属性维度的索引。这些索引可以组合使用,满足多个维度的搜索需求。

4. 不需要同步到搜索引擎,满足了实时搜索的需求。

例子

测试机

磁盘,使用空间大、廉价的SATA盘,使用一块SSD作为BCACHE写缓存。

12 * N TB          

目录规划,每块盘一个目录

/disk[1-12]          

操作系统配置

编译PostgreSQL

wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2          
          
tar -jxvf postgresql-snapshot.tar.bz2          
cd postgresql-10beta1          
          
export USE_NAMED_POSIX_SEMAPHORES=1          
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10_8k --with-blocksize=8 --with-wal-segsize=1024          
LIBS=-lpthread CFLAGS="-O3" make world -j 128           
LIBS=-lpthread CFLAGS="-O3" make install-world           

环境变量

vi env_pg10.sh           
          
export PS1="$USER@`/bin/hostname -s`-> "          
export PGPORT=$(($1+1920))          
export PGDATA=/disk$1/digoal/pgdata/pg_root$(($1+1920))          
export LANG=en_US.utf8          
export PGHOME=/home/digoal/pgsql10_8k          
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH          
export LD_RUN_PATH=$LD_LIBRARY_PATH          
export DATE=`date +"%Y%m%d%H%M"`          
export PATH=$PGHOME/bin:$PATH:.          
export MANPATH=$PGHOME/share/man:$MANPATH          
export PGHOST=127.0.0.1          
export PGUSER=postgres          
export PGDATABASE=postgres          
alias rm='rm -i'          
alias ll='ls -lh'          
unalias vi          

初始化集群

创建12个数据库集群,对应到每一块磁盘。可以充分利用磁盘的IO。

vi init.sh          
          
for ((i=1;i<13;i++))          
do          
  . ~/env_pg10.sh $i          
  initdb -D $PGDATA -E SQL_ASCII --locale=C -U postgres          
  echo "local   all             all                                     trust" > $PGDATA/pg_hba.conf          
  echo "host    all             all             127.0.0.1/32            trust" >> $PGDATA/pg_hba.conf          
  echo "host    all             all             ::1/128                 trust" >> $PGDATA/pg_hba.conf          
  echo "host    all             all             0.0.0.0/0               trust" >> $PGDATA/pg_hba.conf          
done          
. ./init.sh        

配置参数

postgresql.auto.conf          
          
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
listen_addresses = '0.0.0.0'
max_connections = 400
unix_socket_directories = '.'
shared_buffers = 32GB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 500
bgwriter_lru_multiplier = 5.0
bgwriter_flush_after = 0  
max_parallel_workers_per_gather = 0
fsync=on
synchronous_commit = off
full_page_writes = off
wal_buffers = 128MB
wal_writer_delay = 2s
wal_writer_flush_after = 0  
checkpoint_timeout = 10min
max_wal_size = 32GB
min_wal_size = 4GB
checkpoint_completion_target = 0.5
checkpoint_flush_after = 128kB  
max_wal_senders = 0
random_page_cost = 1.5
effective_cache_size = 48GB
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 500000000
autovacuum_multixact_freeze_max_age = 600000000
vacuum_freeze_table_age = 400000000
vacuum_multixact_freeze_table_age = 400000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
autovacuum_work_mem = 2GB
autovacuum_max_workers = 6 
autovacuum_naptime = 20s
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.2
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 0
zero_damaged_pages = on  
cp postgresql.auto.conf /disk1/digoal/pgdata/pg_root1921/          
...          
...          
cp postgresql.auto.conf /disk12/digoal/pgdata/pg_root1932/          

启动集群

将数据库实例绑定到不同的CPU核

vi start.sh          
          
for ((i=1;i<13;i++))          
do          
  . /home/digoal/env_pg10.sh $i          
          
  if [ $i -eq 12 ]; then          
  numactl --physcpubind=56-63 pg_ctl start -o "-c port=$PGPORT"          
  else          
  numactl --physcpubind="$((($i-1)*(5)+(1)))"-"$(($i*5))" pg_ctl start -o "-c port=$PGPORT"          
  fi          
          
done          
. ./start.sh        

停止集群

vi stop.sh           
          
for ((i=1;i<13;i++))          
do          
  . /home/digoal/env_pg10.sh $i          
          
  pg_ctl stop -m fast          
          
done          
. ./stop.sh        

建模

表结构

4个字段,分别存储PK(对应原始明细数据的PK),时间,用户ID,用户行为(tsvector字段)。

检索需求

检索时可能按照时间区间,用户ID,以及分词条件进行检索。

保留时长

日志保留一段时间(例如1个月)后清除。

分区

每个集群中,创建若干个分区表,例如本例使用了12个分区表。

如果条件允许,建议每个小时一个分区表,这样的话可以不建时间索引,查询时间区间的数据使用分区即可。

如果单个用户的数据量很庞大,那么建议按UID再建立哈希或LIST分区,这样的话,按照UID查询,不需要使用索引(可以省去在UID建立索引,甚至省去存储UID这个字段)。

索引

行为字段,全文索引。

用户ID,B-TREE索引。

时间字段,brin块级索引。

模拟数据

时间,时序产生。

用户ID,在一个范围内随机产生。

用户行为数据,长约512字符的字符串,拆分成若干个TOKEN,例如本例为40个长度不等的TOKEN。

初始化数据表

vi init.sql          
          
drop table bptest cascade;          
do language plpgsql $$          
declare          
begin          
  for i in 1..12 loop          
    execute 'drop table bptest'||i||' cascade';          
  end loop;          
end;          
$$;          
          
create table bptest(pk serial8, uid int, ts int8, content tsvector);          
create index idx_bptest_content on bptest using gin (content) with (fastupdate=on,gin_pending_list_limit=2048);          
create index idx_bptest_uid on bptest(uid);          
create index idx_bptest_ts on bptest using brin(ts);          
          
do language plpgsql $$          
declare          
begin          
  for i in 1..12 loop          
    execute 'create table bptest'||i||'(like bptest including all) inherits(bptest)';          
  end loop;          
end;          
$$;          
          
-- 产生若干字符长度的随机字符串          
CREATE OR REPLACE FUNCTION public.gen_rand_str(integer)          
 RETURNS text          
 LANGUAGE sql          
 STRICT          
AS $function$          
  select string_agg(a[random()*6+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);          
$function$;          
psql -f ./init.sql -p 1921          
...          
...          
psql -f ./init.sql -p 1932          

测试

灌入测试数据,例如每张表插入2亿,一个数据库插入24亿(约6TB),总共插入288亿(约72TB)。

每10条一批灌入。

vi test.sh          
          
for ((i=1;i<13;i++))          
do          
  echo "insert into bptest${i} (uid,ts,content) select random()*100000 , extract(epoch from now()), to_tsvector('english',gen_rand_str(512)) from generate_series(1,10);" > ~/test${i}.sql          
done          
          
for ((i=1;i<13;i++))          
do          
  . /home/digoal/env_pg10.sh ${i}          
          
  for ((x=1;x<13;x++))          
  do          
    nohup pgbench -M prepared -n -r -P 3 -f ./test${x}.sql -c 1 -j 1 -t 200000000 >/tmp/bptest_${i}_${x}.log 2>&1 &          
  done          
done          
          
          
chmod 500 test.sh          

查询测试数据如下,数据非常随机,每条记录的content约40个元素,长度限定在512字符。

select * from bptest1 limit 1;          
          
pk      | 1        
uid     | 849185          
ts      | 1494928859          
content | 'aaaefba':14 'acddcfd':39 'acdeeaadbffdbbecceb':50 'aceeedfbaefbdfcbd':59 'adbbeddbecfdcffaeedcedaeeddaeaaeebfbdcdcecfbbebfcebabceffbfdbfbfa':60           
'adcdf':61 'aead':47 'afddf':70 'ba':8 'bae':37 'bbaacffbcafeffafefdf':38 'bbe':55 'bbecfdf':32 'bcbfd':27 'bdce':45 'bdeccbcdeaabefbeeebcdbfddd':19           
'bed':17 'beedeadccbbbecbfcbf':44 'bfccaeddaddbc':2 'cafdfcf':5 'cbcacefaff':3 'cbcfc':52 'cbfef':63 'ccdcbedb':33 'ccdcd':20 'cd':6 'cfecfeeccabf':42           
'cffb':15 'dabdfddeeabfdcefb':16 'dacdeecfbcefebfabeedfabbaccec':57 'daee':1 'daffcdffadddbaffd':68 'dbcddacefcd':9           
'dbdbcbfadfffbdddaaabdcbcecdbecbbdecffbfcfecbbfebfebcadefecfceadaeffd':11 'dcdf':23 'dd':53 'ddec':31 'debdcdebfffebdbfdeefffbcfbccbececdbeaffffedfbefdcccbbccadedecfbeccccbbb':48           
'deefaeeaabdbbdafcfcbeecc':71 'df':26 'dfcbbcd':46 'e':7,51,56 'eafddcaac':43 'ecbaffa':21 'ecdeeceddbdcbfcabdc':10 'ecedcec':41 'ed':66           
'edcbaecfcdfbcbcdedeebdbfceeeececfac':35 'eeca':25 'eeebafeacfebfdbdbddaacabebabbfbfdefeddefccfbeaefdbf':29 'eefdbfcadebcbbfffaefcaecafbddbdbfcf':13           
'ef':58 'efbdc':67 'efccdddaebfbdaffcdfcbfdcbdeb':54 'efccebdddededdeda':64 'effcbfdfdeebfbbcfaabfd':12 'f':24,28,65 'fbbfccfcbcba':30 'fc':4           
'fcbbdbbaefcefefdf':34 'fd':18 'fdffcbe':69 'fea':62 'feeabdcd':36 'feeadcedecedebaedccffbfddadcfececbefddcbeaedbebfadefedcbd':22 'feffceceaeec':49 'ffaffde':40          

用户全文检索请求,输入4个查询条件,流式返回PK。

建议使用流式返回接口,因为结果集可能非常大。

select pk from bptest1 where uid=$1 and ts between $2 and $3 and content @@ to_tsquery('english', $4);        

压测

./test.sh          

资源使用

dstat

CPU基本耗尽,磁盘的写入也非常的充分

cpu大部分为user的开销,后面使用perf看一下

dstat          
          
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--          
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw           
 92   7   1   0   0   1|1075M 2435M|4048B 2297B|   0     0 | 142k  167k          
 92   7   0   0   0   1|1137M 2075M|2391B 1945B|   0     0 | 135k  161k          
 91   8   0   0   0   0|1182M 2125M|3483B 2845B|   0     0 | 140k  166k          
 91   8   0   0   0   1|1193M 1971M|3788B 1633B|   0     0 | 135k  159k          
 91   8   0   0   0   0|1089M 2305M|2232B 1478B|   0     0 | 139k  159k          
 92   7   0   0   0   1| 986M 2795M|2176B 1568B|   0     0 | 127k  142k          
 92   7   1   0   0   0| 760M 2864M|6028B 1408B|   0     0 | 116k  118k          
 90   8   0   0   0   0|1029M 3057M|1565B 2116B|   0     0 | 132k  150k          
 90   9   1   0   0   1|1000M 3237M|2336B 4850B|   0     0 | 133k  154k          
 90   8   1   1   0   1| 659M 4399M|2872B 7992B|   0     0 | 115k  119k          
 91   7   0   0   0   1| 925M 2996M|1293B 1059B|   0     0 | 122k  127k          
 90   8   1   1   0   1| 996M 3350M| 664B  574B|   0     0 | 133k  148k          
 91   7   1   0   0   1| 948M 2927M|3525B 2500B|   0     0 | 132k  146k          
 90   8   0   1   0   0|1114M 2869M|1751B 2645B|   0     0 | 132k  150k          
 90   8   0   1   0   1|1267M 2408M|3003B 2244B|   0     0 | 137k  167k          
 91   8   0   1   0   1|1086M 2539M| 900B  347B|   0     0 | 133k  154k          
 91   8   0   0   0   1| 998M 2614M|1975B 1757B|   0     0 | 130k  151k          
 91   8   0   0   0   0|1120M 2150M|1466B 4911B|   0     0 | 130k  154k          
 92   7   0   0   0   0|1163M 2387M|1356B  498B|   0     0 | 136k  163k          
 90   8   1   1   0   1| 864M 2656M|2601B 3373B|   0     0 | 130k  143k          
 91   8   0   0   0   1| 987M 2651M|2052B  898B|   0     0 | 135k  154k          
 91   8   0   0   0   0|1073M 2205M|2479B 2319B|   0     0 | 130k  144k          
 90   8   1   1   0   1| 951M 2941M|1390B 1001B|   0     0 | 130k  148k          

磁盘使用率

iostat -x           
          
avg-cpu:  %user   %nice %system %iowait  %steal   %idle          
          85.29    0.49    9.89    1.90    0.00    2.43          
          
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util          
sdb               0.00    92.40   99.40  389.00  2822.40 136435.20   285.13     2.25    4.60   0.65  31.64          
sdc               0.00    33.60  154.20  211.60  4838.40 85700.80   247.51     1.14    3.10   0.63  22.96          
sdd               0.00    63.00  232.40  238.40  7316.80 109648.00   248.44     2.17    4.61   0.99  46.54          
sde               0.00    78.80  269.60  340.80  7980.80 102419.20   180.87     2.53    4.14   0.94  57.62          
sdf               0.00    58.40  283.00  234.20  8204.80 99129.60   207.53     2.30    4.45   0.93  48.20          
sdg               0.00    50.80  207.60  236.60  6652.80 94337.60   227.35     1.42    3.19   0.68  30.34          
sdh               0.00   102.20  109.40  475.20  3489.60 131211.20   230.42     2.60    4.45   0.52  30.40          
sdi               0.00    70.20  107.00  337.00  3228.80 79603.20   186.56     1.35    3.04   0.53  23.38          
sdj               0.00    31.00   70.60  158.80  2534.40 85124.80   382.12     0.82    3.59   0.86  19.72          
sdk               0.20    58.40  190.60  295.80  5587.20 123539.20   265.47     1.74    3.57   0.68  33.28          
sdl               0.00    91.00  162.80  396.40  4441.60 119507.20   221.65     1.98    3.54   0.59  33.26          
sdm               0.00   274.80  103.20  359.20  2296.00 158908.80   348.63     3.81    8.23   1.08  50.06          

perf

大部分的开销是postgres进程消耗的,建议使用以下开关重新编译一下.

《PostgreSQL 源码性能诊断(perf profiling)指南》

top -ag          
          
   PerfTop:    9171 irqs/sec  kernel:63.7%  exact:  0.0% [1000Hz cycles],  (all, 64 CPUs)        
----------------------------------------------------------------------------------------------                           
        
 samples  pcnt function                    DSO        
 _______ _____ ___________________________ _______________________________________        
        
23044.00  4.5% tsCompareString             /home/digoal/pgsql10_8k/bin/postgres         
19821.00  3.9% ExecInterpExpr              /home/digoal/pgsql10_8k/bin/postgres         
12258.00  2.4% gintuple_get_key            /home/digoal/pgsql10_8k/bin/postgres         
12208.00  2.4% pg_detoast_datum_packed     /home/digoal/pgsql10_8k/bin/postgres         
11111.00  2.2% hash_search_with_hash_value /home/digoal/pgsql10_8k/bin/postgres         
10318.00  2.0% memcpy                      /lib64/libc-2.12.so                            
 9078.00  1.8% AllocSetAlloc               /home/digoal/pgsql10_8k/bin/postgres         
 8944.00  1.7% advance_aggregates          /home/digoal/pgsql10_8k/bin/postgres         
 8547.00  1.7% cmpEntryAccumulator         /home/digoal/pgsql10_8k/bin/postgres         
 7311.00  1.4% array_seek                  /home/digoal/pgsql10_8k/bin/postgres         
 6744.00  1.3% gin_cmp_tslexeme            /home/digoal/pgsql10_8k/bin/postgres         
 6650.00  1.3% __closure_wake_up           [bcache]                                     
 6550.00  1.3% appendBinaryStringInfo      /home/digoal/pgsql10_8k/bin/postgres         
 6475.00  1.3% TParserGet                  /home/digoal/pgsql10_8k/bin/postgres         
 5578.00  1.1% ginFindLeafPage             /home/digoal/pgsql10_8k/bin/postgres         
 5543.00  1.1% PyParser_AddToken           /lib64/libpython2.7.so.1.0                  
 5412.00  1.1% array_get_element           /home/digoal/pgsql10_8k/bin/postgres         
 5355.00  1.0% heap_fill_tuple             /home/digoal/pgsql10_8k/bin/postgres         
 4936.00  1.0% entryLocateLeafEntry        /home/digoal/pgsql10_8k/bin/postgres         
 4732.00  0.9% heap_form_minimal_tuple     /home/digoal/pgsql10_8k/bin/postgres         
 4512.00  0.9% rb_insert                   /home/digoal/pgsql10_8k/bin/postgres        

top

top -c -u digoal          
          
top - 19:20:47 up 179 days,  5:38,  8 users,  load average: 183.79, 189.01, 166.41          
Tasks: 2939 total, 159 running, 2780 sleeping,   0 stopped,   0 zombie          
Cpu(s): 87.7%us,  8.9%sy,  0.9%ni,  0.9%id,  1.2%wa,  0.0%hi,  0.5%si,  0.0%st          
Mem:  529321828k total, 512395020k used, 16926808k free,   299780k buffers          
Swap:        0k total,        0k used,        0k free, 482162560k cached          
          
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                   
11258 digoal  20   0 20.7g 2.9g 2.9g R 78.7  0.6   5:11.98 postgres: postgres postgres 127.0.0.1(52848) INSERT                                                            
11253 digoal  20   0 20.7g 3.0g 2.9g R 70.7  0.6   5:14.48 postgres: postgres postgres 127.0.0.1(52843) INSERT                                                            
11264 digoal  20   0 20.7g 3.0g 2.9g R 68.1  0.6   5:14.63 postgres: postgres postgres 127.0.0.1(52854) INSERT                                                            
11263 digoal  20   0 20.7g 3.0g 2.9g R 65.2  0.6   5:14.89 postgres: postgres postgres 127.0.0.1(52853) INSERT                                                            
11250 digoal  20   0 20.7g 3.0g 3.0g R 59.4  0.6   5:16.66 postgres: postgres postgres 127.0.0.1(52840) idle                                                              
11268 digoal  20   0 20.7g 2.9g 2.9g R 53.3  0.6   5:11.36 postgres: postgres postgres 127.0.0.1(52858) INSERT                                                            
11266 digoal  20   0 20.7g 3.0g 3.0g R 52.3  0.6   5:09.00 postgres: postgres postgres 127.0.0.1(52856) INSERT                                                            
11068 digoal  20   0 20.7g 2.6g 2.6g R 51.4  0.5   4:44.47 postgres: postgres postgres 127.0.0.1(45823) INSERT                                                            
11097 digoal  20   0 20.7g 2.6g 2.5g R 49.5  0.5   4:47.85 postgres: postgres postgres 127.0.0.1(45839) INSERT                                                            
11161 digoal  20   0 20.7g 2.6g 2.6g R 49.5  0.5   4:47.87 postgres: postgres postgres 127.0.0.1(44881) INSERT                                                            
11256 digoal  20   0 20.7g 3.0g 3.0g R 49.5  0.6   5:14.69 postgres: postgres postgres 127.0.0.1(52846) INSERT                                                            
10819 digoal  20   0 20.7g 2.7g 2.7g R 48.8  0.5   4:58.17 postgres: postgres postgres 127.0.0.1(47342) INSERT                                                            
11107 digoal  20   0 20.7g 2.7g 2.7g R 48.8  0.5   5:02.00 postgres: postgres postgres 127.0.0.1(59612) INSERT                                                            
11255 digoal  20   0 20.7g 3.0g 3.0g R 48.2  0.6   5:15.68 postgres: postgres postgres 127.0.0.1(52845) INSERT                                                            
11267 digoal  20   0 20.7g 3.0g 3.0g R 47.9  0.6   5:18.82 postgres: postgres postgres 127.0.0.1(52857) INSERT                                                            
11066 digoal  20   0 20.7g 2.6g 2.6g R 46.9  0.5   4:44.97 postgres: postgres postgres 127.0.0.1(45821) INSERT                                                            
11222 digoal  20   0 20.7g 2.6g 2.6g R 45.9  0.5   5:00.43 postgres: postgres postgres 127.0.0.1(40379) idle                                                              
11207 digoal  20   0 20.7g 2.6g 2.6g R 45.6  0.5   5:04.59 postgres: postgres postgres 127.0.0.1(46160) INSERT                                                            
11224 digoal  20   0 20.7g 2.6g 2.6g R 45.3  0.5   5:02.60 postgres: postgres postgres 127.0.0.1(40381) INSERT                                                            
11249 digoal  20   0 20.7g 2.6g 2.6g R 45.3  0.5   4:59.58 postgres: postgres postgres 127.0.0.1(46187) INSERT                                                            
11045 digoal  20   0 20.7g 2.6g 2.6g R 44.6  0.5   4:39.75 postgres: postgres postgres 127.0.0.1(64424) idle                                                              
11064 digoal  20   0 20.7g 2.6g 2.6g R 44.6  0.5   4:44.69 postgres: postgres postgres 127.0.0.1(45819) INSERT                                                            
11145 digoal  20   0 20.7g 2.6g 2.6g S 44.3  0.5   4:46.18 postgres: postgres postgres 127.0.0.1(44876) INSERT                                                            
10865 digoal  20   0 20.7g 2.6g 2.6g R 44.0  0.5   4:59.89 postgres: postgres postgres 127.0.0.1(49769) INSERT                                                            
11080 digoal  20   0 20.7g 2.6g 2.5g R 44.0  0.5   4:43.70 postgres: postgres postgres 127.0.0.1(45825) INSERT                                                            
11247 digoal  20   0 20.7g 2.6g 2.6g R 43.4  0.5   5:01.91 postgres: postgres postgres 127.0.0.1(40391) idle                                                              
11163 digoal  20   0 20.7g 2.6g 2.6g R 42.7  0.5   4:48.34 postgres: postgres postgres 127.0.0.1(44882) idle                                                              
11164 digoal  20   0 20.7g 2.6g 2.6g R 42.4  0.5   4:53.21 postgres: postgres postgres 127.0.0.1(44883) INSERT                                                            
10882 digoal  20   0 20.7g 2.6g 2.6g R 41.8  0.5   5:04.78 postgres: postgres postgres 127.0.0.1(49772) INSERT                                                            
10868 digoal  20   0 20.7g 2.6g 2.6g R 41.4  0.5   5:00.30 postgres: postgres postgres 127.0.0.1(49770) INSERT          

写入tps

换算成单机的写入,约6.5万行/s。

progress: 729.0 s, 55.0 tps, lat 11.610 ms stddev 4.836          
progress: 732.0 s, 59.7 tps, lat 20.071 ms stddev 107.984          
progress: 735.0 s, 57.0 tps, lat 20.492 ms stddev 125.445          
progress: 738.0 s, 38.7 tps, lat 25.891 ms stddev 154.607          
progress: 741.0 s, 41.0 tps, lat 24.405 ms stddev 140.247          
progress: 744.0 s, 43.0 tps, lat 13.550 ms stddev 10.448          
progress: 747.0 s, 60.0 tps, lat 20.691 ms stddev 131.640          
progress: 750.0 s, 60.0 tps, lat 17.394 ms stddev 83.385          
progress: 753.0 s, 44.3 tps, lat 25.510 ms stddev 146.719          
progress: 756.0 s, 25.0 tps, lat 39.819 ms stddev 213.642          
progress: 759.0 s, 50.0 tps, lat 11.439 ms stddev 5.319          
progress: 762.0 s, 60.0 tps, lat 20.979 ms stddev 106.782          
progress: 765.0 s, 60.0 tps, lat 18.778 ms stddev 167.714          
progress: 768.0 s, 58.0 tps, lat 18.017 ms stddev 99.949          
progress: 771.0 s, 51.0 tps, lat 19.636 ms stddev 124.429          

写入性能基本上取决于tsvector字段的元素个数,散列程度,本例每条记录约40个元素。如果元素个数下降一半,性能将提升一倍左右。

postgres=# select array_length(tsvector_to_array(content),1) from bptest1 limit 10;        
 array_length         
--------------        
           40        
           37        
           40        
           45        
           35        
           42        
           38        
           46        
           30        
           40        
(10 rows)        

评估每秒构建了多少个索引条目

1. 全文检索索引条目

每条记录约40个元素,当插入的tps=6.5万时,构建的全文检索条目数约 260万/s。

65000*40 = 2600000        

2. uid索引条目,较小,忽略不计。

3. ts索引条目,使用BRIN块级索引,忽略不计。

性能影响最大,资源消耗最多的就是全文检索索引条目的构建。

查询性能

举例

postgres=# \dt+ bptest1      
                      List of relations      
 Schema |  Name   | Type  |  Owner   |  Size   | Description       
--------+---------+-------+----------+---------+-------------      
 public | bptest1 | table | postgres | 1689 MB |       
(1 row)      
      
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest1 where ts between 1494999617 and 1495999617 and content @@ to_tsquery ('english','abc');      
                                                              QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.bptest1  (cost=175.95..23691.41 rows=20015 width=811) (actual time=7.017..23.376 rows=19755 loops=1)      
   Output: uid, ts, content      
   Recheck Cond: (bptest1.content @@ '''abc'''::tsquery)      
   Filter: ((bptest1.ts >= 1494999617) AND (bptest1.ts <= 1495999617))      
   Heap Blocks: exact=18933      
   Buffers: shared hit=18948      
   ->  Bitmap Index Scan on bptest1_content_idx  (cost=0.00..170.94 rows=20019 width=0) (actual time=3.811..3.811 rows=19755 loops=1)      
         Index Cond: (bptest1.content @@ '''abc'''::tsquery)      
         Buffers: shared hit=15      
 Planning time: 0.097 ms      
 Execution time: 24.517 ms      
(11 rows)       
      
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest1 where ts between 1494999617 and 1495999617 and content @@ to_tsquery ('english','abc & bc');      
                                                            QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.bptest1  (cost=36.27..2598.42 rows=1996 width=811) (actual time=4.577..6.711 rows=2125 loops=1)      
   Output: uid, ts, content      
   Recheck Cond: (bptest1.content @@ '''abc'' & ''bc'''::tsquery)      
   Filter: ((bptest1.ts >= 1494999617) AND (bptest1.ts <= 1495999617))      
   Heap Blocks: exact=2116      
   Buffers: shared hit=2239      
   ->  Bitmap Index Scan on bptest1_content_idx  (cost=0.00..35.77 rows=1997 width=0) (actual time=4.291..4.291 rows=2125 loops=1)      
         Index Cond: (bptest1.content @@ '''abc'' & ''bc'''::tsquery)      
         Buffers: shared hit=123      
 Planning time: 0.125 ms      
 Execution time: 6.849 ms      
(11 rows)      

纯SSD fsync=on 写入性能

1. 写入TPS

7万/s ,构建的全文检索条目数约 280万/s。

性能比较平稳。

纯SATA+SSD bcache fsync=off 写入性能

1. 写入TPS

7.5万/s ,构建的全文检索条目数约 300万/s。

性能比较平稳。

小结

1. 查询聚合

由于日志数据打散分布在多个集群,多个表内,建议使用plproxy进行查询的聚合。

参考

《A Smart PostgreSQL extension plproxy 2.2 practices》

《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 4 水平分库 之 节点扩展》

《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 3 水平分库 vs 单机 性能》

《阿里云ApsaraDB RDS for PostgreSQL 最佳实践 - 2 教你RDS PG的水平分库》

2. 写入分片

写入分片,可以在业务层完成,随机打散写入。

实际应用时,可以根据需要,切分成更多的分区。

3. 主要的开销是postgres的开销,如果需要详细的分析,建议重新编译postgres

4. gin索引的优化

https://www.postgresql.org/docs/9.6/static/sql-createindex.html

GIN indexes accept different parameters:          
          
1. fastupdate          
          
This setting controls usage of the fast update technique described in Section 63.4.1.           
It is a Boolean parameter: ON enables fast update, OFF disables it.           
(Alternative spellings of ON and OFF are allowed as described in Section 19.1.) The default is ON.          
          
Note: Turning fastupdate off via ALTER INDEX prevents future insertions from going into the list of pending index entries,           
but does not in itself flush previous entries.           
          
You might want to VACUUM the table or call gin_clean_pending_list function afterward to ensure the pending list is emptied.          
          
2. gin_pending_list_limit          
          
Custom gin_pending_list_limit parameter.           
This value is specified in kilobytes.          

gin_pending_list_limit的目的是延迟合并,因为一条记录中可能涉及较多的GIN KEY,如果实时更新,GIN索引的写入量会非常大,性能受到影响。

本例gin_pending_list_limit设置为2MB,tps比较平缓,如果设置过大,当CPU资源不足时,抖动会比较严重。

用户可以根据实际测试,设置合理的gin_pending_list_limit值。

5. 如果把PostgreSQL完全当成索引库使用,并且允许数据丢失,那么可以使用fsync=off的开关,(检查点fsync对IO的影响比较大,本例使用的是SATA盘,将会导致较大的性能抖动)。

postgresql.auto.conf      
      
fsync = off      
zero_damaged_pages = on        

如果有ha的话,丢失的风险又会更小。(但是服务器CRASH后,需要重建备库,这么大的量,还是挺恐怖的。)

建议用更多的数据库实例,每个实例的大小可控(例如 < 2TB),重建的时间也相对可控。

6. 为了达到更好的响应速度(RT),建议明细和索引分开存放,明细要求写入RT低,索引可以存在一定的延迟。 并且索引与明细数据的可靠性要求也不一样。

Flag Counter

digoal’s 大量PostgreSQL文章入口