PostgreSQL 11 1万亿 tpcb 性能测试 on 阿里云ECS + ESSD + zfs/lvm2条带 + block_size=32K

32 minute read

背景

最近的几个PostgreSQL OLTP与OLAP的测试:

《PostgreSQL 11 tpcc 测试(103万tpmC on ECS) - use sysbench-tpcc by Percona-Lab》

《(TPC-H测试 SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen》

《PostgreSQL 100亿 tpcb 性能 on ECS》

《[未完待续] PostgreSQL on 阿里云ECS+ESSD - 1000亿 tpcb、1000W tpcc 测试》

覆盖面:

1、SF=10, SF=200 TPCH

2、1000W TPCC

3、100亿 TPCB

4、1000亿 TPCB

5、1万亿 TPCB(约125TB 单表。本文要测试的)

本文使用的是16块ESSD云盘,测试时,使用了两套文件系统,ZFS与EXT4,都使用到了条带。

环境

1、ecs,CentOS 7.4 x64

2、CPU

lscpu  
  
Architecture:          x86_64  
CPU op-mode(s):        32-bit, 64-bit  
Byte Order:            Little Endian  
CPU(s):                64  
On-line CPU(s) list:   0-63  
Thread(s) per core:    2  
Core(s) per socket:    32  
Socket(s):             1  
NUMA node(s):          1  
Vendor ID:             GenuineIntel  
CPU family:            6  
Model:                 85  
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz  
Stepping:              4  
CPU MHz:               2499.996  
BogoMIPS:              4999.99  
Hypervisor vendor:     KVM  
Virtualization type:   full  
L1d cache:             32K  
L1i cache:             32K  
L2 cache:              1024K  
L3 cache:              33792K  
NUMA node0 CPU(s):     0-63  
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1  

3、内核

uname -a  
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux  

4、内存

free -g  
              total        used        free      shared  buff/cache   available  
Mem:            503           2         500           0           0         498  
Swap:             0           0           0  

5、时钟

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource  

6、块设备

lsblk  
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT  
vda    253:0    0  200G  0 disk   
└─vda1 253:1    0  200G  0 part /  
vdb    253:16   0   20T  0 disk   
vdc    253:32   0   20T  0 disk   
vdd    253:48   0   20T  0 disk   
vde    253:64   0   20T  0 disk   
vdf    253:80   0   20T  0 disk   
vdg    253:96   0   20T  0 disk   
vdh    253:112  0   20T  0 disk   
vdi    253:128  0   20T  0 disk   
vdj    253:144  0   20T  0 disk   
vdk    253:160  0   20T  0 disk   
vdl    253:176  0   20T  0 disk   
vdm    253:192  0   20T  0 disk   
vdn    253:208  0   20T  0 disk   
vdo    253:224  0   20T  0 disk   
vdp    253:240  0   20T  0 disk   
vdq    253:256  0   20T  0 disk   

配置ECS虚拟机OS参数

1、内核参数

vi /etc/sysctl.conf    
    
# add by digoal.zhou        
fs.aio-max-nr = 1048576        
fs.file-max = 76724600        
        
# 可选:kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p                 
# /data01/corefiles 事先建好,权限777,如果是软链接,对应的目录修改为777        
        
kernel.sem = 4096 2147483647 2147483646 512000            
# 信号量, ipcs -l 或 -u 查看,每16个进程一组,每组信号量需要17个信号量。        
        
kernel.shmall = 107374182              
# 所有共享内存段相加大小限制 (建议内存的80%),单位为页。        
kernel.shmmax = 274877906944           
# 最大单个共享内存段大小 (建议为内存一半), >9.2的版本已大幅降低共享内存的使用,单位为字节。        
kernel.shmmni = 819200                 
# 一共能生成多少共享内存段,每个PG数据库集群至少2个共享内存段        
        
net.core.netdev_max_backlog = 10000        
net.core.rmem_default = 262144               
# The default setting of the socket receive buffer in bytes.        
net.core.rmem_max = 4194304                  
# The maximum receive socket buffer size in bytes        
net.core.wmem_default = 262144               
# The default setting (in bytes) of the socket send buffer.        
net.core.wmem_max = 4194304                  
# The maximum send socket buffer size in bytes.        
net.core.somaxconn = 4096        
net.ipv4.tcp_max_syn_backlog = 4096        
net.ipv4.tcp_keepalive_intvl = 20        
net.ipv4.tcp_keepalive_probes = 3        
net.ipv4.tcp_keepalive_time = 60        
net.ipv4.tcp_mem = 8388608 12582912 16777216        
net.ipv4.tcp_fin_timeout = 5        
net.ipv4.tcp_synack_retries = 2        
net.ipv4.tcp_syncookies = 1            
# 开启SYN Cookies。当出现SYN等待队列溢出时,启用cookie来处理,可防范少量的SYN攻击        
net.ipv4.tcp_timestamps = 1            
# 减少time_wait        
net.ipv4.tcp_tw_recycle = 0            
# 如果=1则开启TCP连接中TIME-WAIT套接字的快速回收,但是NAT环境可能导致连接失败,建议服务端关闭它        
net.ipv4.tcp_tw_reuse = 1              
# 开启重用。允许将TIME-WAIT套接字重新用于新的TCP连接        
net.ipv4.tcp_max_tw_buckets = 262144        
net.ipv4.tcp_rmem = 8192 87380 16777216        
net.ipv4.tcp_wmem = 8192 65536 16777216        
        
net.nf_conntrack_max = 1200000        
net.netfilter.nf_conntrack_max = 1200000        
        
vm.dirty_background_bytes = 409600000               
#  系统脏页到达这个值,系统后台刷脏页调度进程 pdflush(或其他) 自动将(dirty_expire_centisecs/100)秒前的脏页刷到磁盘        
#  默认为10%,大内存机器建议调整为直接指定多少字节        
        
vm.dirty_expire_centisecs = 3000                     
#  比这个值老的脏页,将被刷到磁盘。3000表示30秒。        
vm.dirty_ratio = 95                                  
#  如果系统进程刷脏页太慢,使得系统脏页超过内存 95 % 时,则用户进程如果有写磁盘的操作(如fsync, fdatasync等调用),则需要主动把系统脏页刷出。        
#  有效防止用户进程刷脏页,在单机多实例,并且使用CGROUP限制单实例IOPS的情况下非常有效。          
        
vm.dirty_writeback_centisecs = 100                    
#  pdflush(或其他)后台刷脏页进程的唤醒间隔, 100表示1秒。        
        
vm.swappiness = 0        
#  不使用交换分区        
        
vm.mmap_min_addr = 65536        
vm.overcommit_memory = 0             
#  在分配内存时,允许少量over malloc, 如果设置为 1, 则认为总是有足够的内存,内存较少的测试环境可以使用 1 .          
        
vm.overcommit_ratio = 90             
#  当overcommit_memory = 2 时,用于参与计算允许指派的内存大小。        
vm.swappiness = 0                    
#  关闭交换分区        
vm.zone_reclaim_mode = 0             
# 禁用 numa, 或者在vmlinux中禁止.         
net.ipv4.ip_local_port_range = 40000 65535            
# 本地自动分配的TCP, UDP端口号范围        
fs.nr_open=20480000        
# 单个进程允许打开的文件句柄上限        
        
# 以下参数请注意        
# vm.extra_free_kbytes = 4096000        
# vm.min_free_kbytes = 2097152        
# 如果是小内存机器,以上两个值不建议设置        
# vm.nr_hugepages = 66536            
#  建议shared buffer设置超过64GB时 使用大页,页大小 /proc/meminfo Hugepagesize        
# vm.lowmem_reserve_ratio = 1 1 1        
# 对于内存大于64G时,建议设置,否则建议默认值 256 256 32    

2、资源限制

vi /etc/security/limits.conf    
    
# nofile超过1048576的话,一定要先将sysctl的fs.nr_open设置为更大的值,并生效后才能继续设置nofile.        
        
* soft    nofile  1024000        
* hard    nofile  1024000        
* soft    nproc   unlimited        
* hard    nproc   unlimited        
* soft    core    unlimited        
* hard    core    unlimited        
* soft    memlock unlimited        
* hard    memlock unlimited       

3、关闭透明大页,使用精准时钟(可选)

vi /etc/rc.local    
    
touch /var/lock/subsys/local    
         
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then        
   echo never > /sys/kernel/mm/transparent_hugepage/enabled        
fi      
  
# 时钟  
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource  
  
su - postgres -c "pg_ctl start"      

部署 PostgreSQL 11

rpm -ivh https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm    
    
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm     
    
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex  openjade bzip2 git iotop lvm2 perf      
    
yum install -y postgresql11*    

块设备部署策略1 - zfs

zfsonlinux

《[未完待续] PostgreSQL on ECS 高效率持续备份设计 - By ZFS on Linux》

《PostgreSQL OLTP on ZFS 性能优化》

1、zfs yum配置

yum install -y http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm  

2、当前内核对应的kernel-devel

http://vault.centos.org

uname -a  
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux  
rpm -ivh http://vault.centos.org/7.4.1708/updates/x86_64/Packages/kernel-devel-3.10.0-693.2.2.el7.x86_64.rpm  

3、安装zfs

yum install -y zfs   

查看日志,是否有报错,正常情况下没有报错

测试是否可用

modprobe zfs  
  
zfs get -o all  

系统启动将自动加载zfs

创建zpool

essd底层三副本,无需再使用zfs的RAID功能。

parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE  
zpool create zp1 -f -o ashift=13 vdb1 vdc1 vdd1 vde1 vdf1 vdg1 vdh1 vdi1 vdj1 vdk1 vdl1 vdm1 vdn1 vdo1 vdp1 vdq1  
zpool get all zp1  
NAME  PROPERTY                       VALUE                          SOURCE  
zp1   size                           318T                           -  
zp1   capacity                       0%                             -  
zp1   altroot                        -                              default  
zp1   health                         ONLINE                         -  
zp1   guid                           12407519490197584982           -  
zp1   version                        -                              default  
zp1   bootfs                         -                              default  
zp1   delegation                     on                             default  
zp1   autoreplace                    off                            default  
zp1   cachefile                      -                              default  
zp1   failmode                       wait                           default  
zp1   listsnapshots                  off                            default  
zp1   autoexpand                     off                            default  
zp1   dedupditto                     0                              default  
zp1   dedupratio                     1.00x                          -  
zp1   free                           318T                           -  
zp1   allocated                      960K                           -  
zp1   readonly                       off                            -  
zp1   ashift                         13                             local  
zp1   comment                        -                              default  
zp1   expandsize                     -                              -  
zp1   freeing                        0                              -  
zp1   fragmentation                  0%                             -  
zp1   leaked                         0                              -  
zp1   multihost                      off                            default  
zp1   feature@async_destroy          enabled                        local  
zp1   feature@empty_bpobj            enabled                        local  
zp1   feature@lz4_compress           active                         local  
zp1   feature@multi_vdev_crash_dump  enabled                        local  
zp1   feature@spacemap_histogram     active                         local  
zp1   feature@enabled_txg            active                         local  
zp1   feature@hole_birth             active                         local  
zp1   feature@extensible_dataset     active                         local  
zp1   feature@embedded_data          active                         local  
zp1   feature@bookmarks              enabled                        local  
zp1   feature@filesystem_limits      enabled                        local  
zp1   feature@large_blocks           enabled                        local  
zp1   feature@large_dnode            enabled                        local  
zp1   feature@sha512                 enabled                        local  
zp1   feature@skein                  enabled                        local  
zp1   feature@edonr                  enabled                        local  
zp1   feature@userobj_accounting     active                         local  

创建zfs

zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off -o primarycache=metadata -o logbias=throughput -o secondarycache=none zp1/data01  
  
  
zfs set canmount=off zp1  

优化两个参数

cd /sys/module/zfs/parameters/  
echo 1 > zfs_prefetch_disable  
echo 15 > zfs_arc_shrink_shift   

查看当前参数

zfs get all zp1/data01  
NAME        PROPERTY              VALUE                  SOURCE  
zp1/data01  type                  filesystem             -  
zp1/data01  creation              Wed Sep 19 10:26 2018  -  
zp1/data01  used                  192K                   -  
zp1/data01  available             308T                   -  
zp1/data01  referenced            192K                   -  
zp1/data01  compressratio         1.00x                  -  
zp1/data01  mounted               yes                    -  
zp1/data01  quota                 none                   default  
zp1/data01  reservation           none                   default  
zp1/data01  recordsize            8K                     local  
zp1/data01  mountpoint            /data01                local  
zp1/data01  sharenfs              off                    default  
zp1/data01  checksum              on                     default  
zp1/data01  compression           off                    default  
zp1/data01  atime                 off                    local  
zp1/data01  devices               on                     default  
zp1/data01  exec                  on                     default  
zp1/data01  setuid                on                     default  
zp1/data01  readonly              off                    default  
zp1/data01  zoned                 off                    default  
zp1/data01  snapdir               hidden                 default  
zp1/data01  aclinherit            restricted             default  
zp1/data01  createtxg             81                     -  
zp1/data01  canmount              on                     default  
zp1/data01  xattr                 on                     default  
zp1/data01  copies                1                      default  
zp1/data01  version               5                      -  
zp1/data01  utf8only              off                    -  
zp1/data01  normalization         none                   -  
zp1/data01  casesensitivity       sensitive              -  
zp1/data01  vscan                 off                    default  
zp1/data01  nbmand                off                    default  
zp1/data01  sharesmb              off                    default  
zp1/data01  refquota              none                   default  
zp1/data01  refreservation        none                   default  
zp1/data01  guid                  3373300831209850945    -  
zp1/data01  primarycache          metadata               local  
zp1/data01  secondarycache        none                   default  
zp1/data01  usedbysnapshots       0B                     -  
zp1/data01  usedbydataset         192K                   -  
zp1/data01  usedbychildren        0B                     -  
zp1/data01  usedbyrefreservation  0B                     -  
zp1/data01  logbias               throughput             local  
zp1/data01  dedup                 off                    default  
zp1/data01  mlslabel              none                   default  
zp1/data01  sync                  standard               default  
zp1/data01  dnodesize             legacy                 default  
zp1/data01  refcompressratio      1.00x                  -  
zp1/data01  written               192K                   -  
zp1/data01  logicalused           76K                    -  
zp1/data01  logicalreferenced     76K                    -  
zp1/data01  volmode               default                default  
zp1/data01  filesystem_limit      none                   default  
zp1/data01  snapshot_limit        none                   default  
zp1/data01  filesystem_count      none                   default  
zp1/data01  snapshot_count        none                   default  
zp1/data01  snapdev               hidden                 default  
zp1/data01  acltype               off                    default  
zp1/data01  context               none                   default  
zp1/data01  fscontext             none                   default  
zp1/data01  defcontext            none                   default  
zp1/data01  rootcontext           none                   default  
zp1/data01  relatime              off                    default  
zp1/data01  redundant_metadata    all                    default  
zp1/data01  overlay               off                    default  

初始化数据库

1、目录

mkdir /data01/pg11  
  
chown postgres:postgres /data01/pg11  

2、环境变量

su - postgres  
  
vi .bash_profile  
  
export PS1="$USER@`/bin/hostname -s`-> "        
export PGPORT=1921        
export PGDATA=/data01/pg11/pg_root$PGPORT        
export LANG=en_US.utf8        
export PGHOME=/usr/pgsql-11      
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH        
export DATE=`date +"%Y%m%d%H%M"`      
export PATH=$PGHOME/bin:$PATH:.        
export MANPATH=$PGHOME/share/man:$MANPATH        
export PGHOST=$PGDATA        
export PGUSER=postgres        
export PGDATABASE=postgres        
alias rm='rm -i'        
alias ll='ls -lh'        
unalias vi   

3、初始化

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024   

huge page配置

zfs 可以绕过文件系统cache,所以建议一配置较大postgresql shared buffer,并使用huge page

《PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意》

300GB/2MB=153600  
sysctl -w vm.nr_hugepages=159600     
  
echo "vm.nr_hugepages=159600" >> /etc/sysctl.conf  

postgresql.auto.conf

listen_addresses = '0.0.0.0'    
port = 1921    
max_connections = 2000    
superuser_reserved_connections = 3    
unix_socket_directories = '., /var/run/postgresql, /tmp'    
tcp_keepalives_idle = 60    
tcp_keepalives_interval = 10    
tcp_keepalives_count = 10    
huge_pages = on  # 使用huge page  
shared_buffers = 300GB    
max_prepared_transactions = 2000    
work_mem = 32MB    
maintenance_work_mem = 2GB    
dynamic_shared_memory_type = posix    
vacuum_cost_delay = 0    
bgwriter_delay = 10ms    
bgwriter_lru_maxpages = 1000    
bgwriter_lru_multiplier = 10.0    
effective_io_concurrency = 0    
max_worker_processes = 128    
max_parallel_workers = 64    
max_parallel_maintenance_workers = 64   
max_parallel_workers_per_gather = 0    
parallel_leader_participation = on    
min_parallel_table_scan_size=0  
min_parallel_index_scan_size=0  
parallel_setup_cost=0  
parallel_tuple_cost=0  
wal_level = minimal      
synchronous_commit = off    
full_page_writes=off       #  zfs内置了checksum,cow. 关闭 fpw . 如果BLOCKDEV能保证8K原子写时,也可以关闭   
wal_writer_delay = 10ms    
checkpoint_timeout = 30min    
max_wal_size = 600GB    
min_wal_size = 150GB    
checkpoint_completion_target = 0.1    
max_wal_senders = 0    
effective_cache_size = 200GB    
log_destination = 'csvlog'    
logging_collector = on    
log_directory = 'log'    
log_filename = 'postgresql-%a.log'    
log_truncate_on_rotation = on    
log_rotation_age = 1d    
log_rotation_size = 0    
log_checkpoints = on     
log_connections = on    
log_disconnections = on    
log_error_verbosity = verbose     
log_line_prefix = '%m [%p] '    
log_timezone = 'PRC'    
log_autovacuum_min_duration = 0    
autovacuum_max_workers = 16    
autovacuum_freeze_max_age = 1200000000    
autovacuum_multixact_freeze_max_age = 1400000000    
autovacuum_vacuum_cost_delay = 0ms    
vacuum_freeze_table_age = 1150000000    
vacuum_multixact_freeze_table_age = 1150000000    
datestyle = 'iso, mdy'    
timezone = 'PRC'    
lc_messages = 'C'    
lc_monetary = 'C'    
lc_numeric = 'C'    
lc_time = 'C'    
default_text_search_config = 'pg_catalog.english'    
jit = off    
cpu_tuple_cost=0.00018884145574257426      
cpu_index_tuple_cost = 0.00433497085216479990      
cpu_operator_cost = 0.00216748542608239995      
seq_page_cost=0.014329      
random_page_cost = 0.016   
  

restart 数据库

pg_ctl restart -m fast  

zfs on linux 性能问题

本例测试时,(vacuum 时很慢,可以看出问题 pgbench -i -s 1000 可复现)

vacuum 进程D状态,stack如下

[<ffffffffc0174132>] cv_wait_common+0xb2/0x150 [spl]
[<ffffffffc0174208>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffc073c42b>] zio_wait+0x10b/0x1b0 [zfs]
[<ffffffffc0687124>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[<ffffffffc06885f2>] dmu_read_uio_dnode+0x52/0x100 [zfs]
[<ffffffffc06886ec>] dmu_read_uio_dbuf+0x4c/0x70 [zfs]
[<ffffffffc07242d5>] zfs_read+0x135/0x3f0 [zfs]
[<ffffffffc0743990>] zpl_read_common_iovec.constprop.9+0x80/0xd0 [zfs]
[<ffffffffc0743aa6>] zpl_aio_read+0xc6/0xf0 [zfs]
[<ffffffff812001ad>] do_sync_read+0x8d/0xd0
[<ffffffff81200bac>] vfs_read+0x9c/0x170
[<ffffffff81201a6f>] SyS_read+0x7f/0xe0
[<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

块设备部署策略1 - lvm2 , ext4

1、停库,消除zfs

pg_ctl stop -m immediate  
  
zfs destroy zp1/data01  
zpool destroy zp1  

2、清理块设备头信息

wipefs -f -a /dev/vd[b-q]  
  
  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdb  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdc  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdd  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vde  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdf  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdg  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdh  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdi  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdj  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdk  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdl  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdm  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdn  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdo  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdp  
dd bs=1024 count=1000 if=/dev/zero of=/dev/vdq  
  
  
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdk mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdl mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdm mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdn mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdo mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdp mklabel gpt mkpart primary 1MiB 100%FREE  
parted -a optimal -s /dev/vdq mklabel gpt mkpart primary 1MiB 100%FREE  
  
  
wipefs -f -a /dev/vd[b-q]1  

3、创建PV

pvcreate /dev/vd[b-q]1  

4、创建vg

vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]1    

5、创建逻辑卷,配置条带

lvcreate -A y -i 16 -I 8 -L 4TiB -n lv03 vgdata01    
lvcreate -A y -i 16 -I 8 -L 220TiB -n lv01 vgdata01    
lvcreate -A y -i 16 -I 8 -l 100%FREE -n lv02 vgdata01    

6、查看

[root@pg11-320tb-zfs ~]# pvs  
  PV         VG       Fmt  Attr PSize   PFree  
  /dev/vdb   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdc   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdd   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vde   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdf   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdg   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdh   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdi   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdj   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdk   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdl   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdm   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdn   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdo   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdp   vgdata01 lvm2 a--  <20.00t    0   
  /dev/vdq   vgdata01 lvm2 a--  <20.00t    0   
[root@pg11-320tb-zfs ~]# vgs  
  VG       #PV #LV #SN Attr   VSize    VFree  
  vgdata01  16   3   0 wz--n- <320.00t    0   
[root@pg11-320tb-zfs ~]# lvs  
  LV   VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert  
  lv01 vgdata01 -wi-a----- 220.00t                                                      
  lv02 vgdata01 -wi-a----- <96.00t                                                      
  lv03 vgdata01 -wi-a-----   4.00t   

7、创建ext4文件系统,配置条带

mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01    
mkfs.ext4 /dev/mapper/vgdata01-lv02 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv02    
mkfs.ext4 /dev/mapper/vgdata01-lv03 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv03    

8、配置挂载

vi /etc/fstab     
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
LABEL=lv02 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
LABEL=lv03 /data03 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0  
    
    
mkdir /data01    
mkdir /data02    
mkdir /data03    
    
    
mount -a    
[root@pg11-320tb-zfs ~]# df -h  
Filesystem                 Size  Used Avail Use% Mounted on  
/dev/vda1                  197G  2.1G  185G   2% /  
devtmpfs                   252G     0  252G   0% /dev  
tmpfs                      252G     0  252G   0% /dev/shm  
tmpfs                      252G  596K  252G   1% /run  
tmpfs                      252G     0  252G   0% /sys/fs/cgroup  
tmpfs                       51G     0   51G   0% /run/user/0  
/dev/mapper/vgdata01-lv01  220T   20K  220T   1% /data01  
/dev/mapper/vgdata01-lv02   96T   20K   96T   1% /data02  
/dev/mapper/vgdata01-lv03  4.0T   89M  4.0T   1% /data03  

9、创建数据库数据、表空间、WAL日志目录

[root@pg11-320tb-zfs ~]# mkdir /data01/pg11  
[root@pg11-320tb-zfs ~]# mkdir /data02/pg11  
[root@pg11-320tb-zfs ~]# mkdir /data03/pg11  
[root@pg11-320tb-zfs ~]# chown postgres:postgres /data0*/pg11  

10、配置环境变量

su - postgres  
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=1921  
export PGDATA=/data01/pg11/pg_root$PGPORT  
export LANG=en_US.utf8  
export PGHOME=/usr/pgsql-11  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=$PGDATA  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

11、初始化数据库

initdb -D $PGDATA -X /data03/pg11/pg_wal1921 -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024   

12、创建表空间

mkdir /data01/pg11/tbs1  
mkdir /data02/pg11/tbs2  
  
  
create tablespace tbs1 location '/data01/pg11/tbs1';  
create tablespace tbs2 location '/data02/pg11/tbs2';  

1万亿 tpcb test

初始化数据

nohup pgbench -i -s 10000000 -I dtg -n --tablespace=tbs1 >./init.log 2>&1 &  

耗时810688秒,约123.3万行/s

999999100000 of 1000000000000 tuples (99%) done (elapsed 810688.06 s, remaining 0.73 s)
999999200000 of 1000000000000 tuples (99%) done (elapsed 810688.14 s, remaining 0.65 s)
999999300000 of 1000000000000 tuples (99%) done (elapsed 810688.22 s, remaining 0.57 s)
999999400000 of 1000000000000 tuples (99%) done (elapsed 810688.30 s, remaining 0.49 s)
999999500000 of 1000000000000 tuples (99%) done (elapsed 810688.38 s, remaining 0.41 s)
999999600000 of 1000000000000 tuples (99%) done (elapsed 810688.46 s, remaining 0.32 s)
999999700000 of 1000000000000 tuples (99%) done (elapsed 810688.54 s, remaining 0.24 s)
999999800000 of 1000000000000 tuples (99%) done (elapsed 810688.61 s, remaining 0.16 s)
999999900000 of 1000000000000 tuples (99%) done (elapsed 810688.69 s, remaining 0.08 s)
1000000000000 of 1000000000000 tuples (100%) done (elapsed 810688.77 s, remaining 0.00 s)
done.

数据占用空间120TB

postgres=# \l+
                                                                 List of databases
   Name    |  Owner   | Encoding  | Collate |   Ctype    |   Access privileges   |  Size  | Tablespace |                Description                 
-----------+----------+-----------+---------+------------+-----------------------+--------+------------+--------------------------------------------
 postgres  | postgres | SQL_ASCII | C       | en_US.utf8 |                       | 120 TB | pg_default | default administrative connection database
 template0 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | unmodifiable empty database
           |          |           |         |            | postgres=CTc/postgres |        |            | 
 template1 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | default template for new databases
           |          |           |         |            | postgres=CTc/postgres |        |            | 
(3 rows)



postgres=# \dt+
                          List of relations
 Schema |       Name       | Type  |  Owner   |  Size   | Description 
--------+------------------+-------+----------+---------+-------------
 public | pgbench_accounts | table | postgres | 120 TB  | 
 public | pgbench_branches | table | postgres | 344 MB  | 
 public | pgbench_history  | table | postgres | 0 bytes | 
 public | pgbench_tellers  | table | postgres | 4201 MB | 
 public | t                | table | postgres | 804 MB  | 
(5 rows)


postgres=# \db+
                                        List of tablespaces
    Name    |  Owner   |      Location       | Access privileges | Options |  Size   | Description 
------------+----------+---------------------+-------------------+---------+---------+-------------
 pg_default | postgres |                     |                   |         | 850 MB  | 
 pg_global  | postgres |                     |                   |         | 2206 kB | 
 tbs1       | postgres | /data01/digoal/tbs1 |                   |         | 120 TB  | 
 tbs2       | postgres | /data02/digoal/tbs2 |                   |         | 213 MB  | 
(4 rows)

新增、删除字段测试,秒级

新增、删除字段都只需要改元数据,秒级完成。

PostgreSQL新增带default值的字段,也是秒级完成。

《PostgreSQL 11 preview - 添加非空默认值不需要 rewrite table - online add column with default value》

postgres=# \timing
Timing is on.
postgres=# select * from pgbench_accounts limit 10;
 aid | bid | abalance |                                        filler                                        
-----+-----+----------+--------------------------------------------------------------------------------------
   1 |   1 |        0 |                                                                                     
   2 |   1 |        0 |                                                                                     
   3 |   1 |        0 |                                                                                     
   4 |   1 |        0 |                                                                                     
   5 |   1 |        0 |                                                                                     
   6 |   1 |        0 |                                                                                     
   7 |   1 |        0 |                                                                                     
   8 |   1 |        0 |                                                                                     
   9 |   1 |        0 |                                                                                     
  10 |   1 |        0 |                                                                                     
(10 rows)

Time: 498.051 ms

-- 秒级添加字段

postgres=# alter table pgbench_accounts add column col1 text;
ALTER TABLE
Time: 1254.611 ms (00:01.255)
  
-- 秒级添加非空默认值字段
postgres=# alter table pgbench_accounts add column col2 text default 'hello digoal';
ALTER TABLE
Time: 1253.689 ms (00:01.254)


postgres=# select * from pgbench_accounts limit 10;
 aid | bid | abalance |                                        filler                                        | col1 |     col2     
-----+-----+----------+--------------------------------------------------------------------------------------+------+--------------
   1 |   1 |        0 |                                                                                      |      | hello digoal
   2 |   1 |        0 |                                                                                      |      | hello digoal
   3 |   1 |        0 |                                                                                      |      | hello digoal
   4 |   1 |        0 |                                                                                      |      | hello digoal
   5 |   1 |        0 |                                                                                      |      | hello digoal
   6 |   1 |        0 |                                                                                      |      | hello digoal
   7 |   1 |        0 |                                                                                      |      | hello digoal
   8 |   1 |        0 |                                                                                      |      | hello digoal
   9 |   1 |        0 |                                                                                      |      | hello digoal
  10 |   1 |        0 |                                                                                      |      | hello digoal
(10 rows)

Time: 502.608 ms
postgres=# explain analyze select * from pgbench_accounts limit 10;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.00 rows=10 width=168) (actual time=0.022..0.026 rows=10 loops=1)
   ->  Seq Scan on pgbench_accounts  (cost=0.00..184941625.46 rows=673387096855 width=168) (actual time=0.020..0.022 rows=10 loops=1)
 Planning Time: 0.057 ms
 Execution Time: 0.041 ms
(4 rows)

Time: 0.453 ms

-- 分析表
postgres=# analyze pgbench_accounts ;
ANALYZE
Time: 67373.884 ms (01:07.374)

删除字段,毫秒级
postgres=# alter table pgbench_accounts drop column col1;
ALTER TABLE
Time: 7.610 ms
postgres=# alter table pgbench_accounts drop column col2;
ALTER TABLE
Time: 0.546 ms

创建索引

加载初始化数据结束后,创建索引

1、修改并行度

psql  
  
analyze;  
alter table pgbench_accounts set (parallel_workers=64);  
alter table pgbench_tellers set (parallel_workers=64);  
alter table pgbench_branches set (parallel_workers=64);  

2、创建索引

nohup pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2 >./init_pkey.log 2>&1 &  

开启了64并行度,开始时的消耗资源情况

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.20    0.00   38.52   16.31    0.00   41.97

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb            2783.00  6362.00 3972.00  169.00 49328.00 52360.00    49.11     5.35    1.29    0.93    9.91   0.24  98.90
vdc            2760.00  6361.00 4020.00  167.00 49296.00 51764.00    48.27     5.19    1.23    0.88    9.59   0.24  98.70
vdd            2747.00  6361.00 4010.00  169.00 49332.00 51860.00    48.43     5.18    1.24    0.88    9.64   0.24  98.80
vde            2757.00  6362.00 4032.00  169.00 49344.00 51864.00    48.18     5.13    1.22    0.87    9.54   0.23  98.50
vdf            2732.00  6360.00 4012.00  191.00 49336.00 52532.00    48.47     5.21    1.24    0.85    9.50   0.24  99.10
vdg            2716.00  6361.00 4039.00  191.00 49320.00 52036.00    47.92     5.28    1.25    0.86    9.35   0.23  99.20
vdh            2742.00  6361.00 4038.00  191.00 49340.00 52032.00    47.94     5.46    1.29    0.91    9.26   0.23  98.80
vdi            2749.00  6361.00 4041.00  193.00 49328.00 52544.00    48.12     5.35    1.26    0.88    9.40   0.23  99.20
vdj            2776.00  6385.00 3953.00  169.00 49344.00 52020.00    49.18     5.41    1.31    0.93   10.12   0.24  99.10
vdk            2767.00  6384.00 3999.00  171.00 49328.00 52028.00    48.61     5.52    1.32    0.96    9.76   0.24  99.10
vdl            2775.00  6386.00 3984.00  172.00 49328.00 52032.00    48.78     5.56    1.34    0.97    9.94   0.24  99.10
vdm            2759.00  6385.00 4039.00  172.00 49296.00 52416.00    48.31     5.58    1.32    0.95    9.87   0.23  98.90
vdn            2776.00  6369.00 3967.00  163.00 49352.00 51840.00    49.00     5.48    1.32    0.95   10.33   0.24  99.70
vdo            2776.00  6370.00 3978.00  163.00 49352.00 52220.00    49.06     5.42    1.31    0.93   10.56   0.24  99.30
vdp            2782.00  6370.00 4004.00  162.00 49356.00 51840.00    48.58     5.60    1.34    0.97   10.44   0.24  99.70
vdq            2759.00  6370.00 4033.00  161.00 49352.00 51828.00    48.25     5.61    1.34    0.97   10.48   0.24  99.50
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00 108392.00 104717.00 790740.00 837136.00    15.28  1293.57    5.82    0.94   10.87   0.00 100.90
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


top - 07:41:20 up 12 days, 22:02,  2 users,  load average: 61.15, 26.97, 10.80
Tasks: 607 total,  39 running, 568 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us, 54.3 sy,  0.0 ni, 21.2 id, 21.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 52819500+total,  2192660 free, 33286761+used, 19313472+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 19267680+avail Mem 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                 
65466 digoal    20   0  0.296t  25764   2044 R  68.4  0.0   1:11.05 postgres: parallel worker for PID 65299  
65450 digoal    20   0  0.296t  25756   2076 D  63.2  0.0   1:10.51 postgres: parallel worker for PID 65299  
65460 digoal    20   0  0.296t  25764   2048 R  63.2  0.0   1:10.37 postgres: parallel worker for PID 65299  
65469 digoal    20   0  0.296t  25752   2056 R  63.2  0.0   1:10.48 postgres: parallel worker for PID 65299  
65474 digoal    20   0  0.296t  25764   2052 R  63.2  0.0   1:10.36 postgres: parallel worker for PID 65299  
65478 digoal    20   0  0.296t  25764   2060 R  63.2  0.0   1:10.64 postgres: parallel worker for PID 65299  
65479 digoal    20   0  0.296t  25752   2056 R  63.2  0.0   1:10.47 postgres: parallel worker for PID 65299  
65484 digoal    20   0  0.296t  25760   2056 R  63.2  0.0   1:10.63 postgres: parallel worker for PID 65299  
65485 digoal    20   0  0.296t  25748   2068 R  63.2  0.0   1:11.10 postgres: parallel worker for PID 65299  



Total DISK READ :     834.93 M/s | Total DISK WRITE :    1006.90 M/s
Actual DISK READ:     835.23 M/s | Actual DISK WRITE:     994.87 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                      
65461 be/4 digoal     13.90 M/s   12.74 M/s  0.00 % 51.43 % postgres: parallel worker for PID 65299
65457 be/4 digoal     13.86 M/s   12.95 M/s  0.00 % 51.25 % postgres: parallel worker for PID 65299
65505 be/4 digoal     14.82 M/s   13.05 M/s  0.00 % 51.09 % postgres: parallel worker for PID 65299
65466 be/4 digoal     12.57 M/s   13.48 M/s  0.00 % 50.72 % postgres: parallel worker for PID 65299
65462 be/4 digoal     13.16 M/s   13.23 M/s  0.00 % 50.70 % postgres: parallel worker for PID 65299
65491 be/4 digoal      8.85 M/s   12.99 M/s  0.00 % 50.59 % postgres: parallel worker for PID 65299
65451 be/4 digoal     12.33 M/s   13.48 M/s  0.00 % 50.57 % postgres: parallel worker for PID 65299
65477 be/4 digoal     12.37 M/s   13.20 M/s  0.00 % 50.38 % postgres: parallel worker for PID 65299
65459 be/4 digoal      8.45 M/s   19.33 M/s  0.00 % 50.27 % postgres: parallel worker for PID 65299
65488 be/4 digoal     12.34 M/s   12.74 M/s  0.00 % 50.21 % postgres: parallel worker for PID 65299
65495 be/4 digoal     13.83 M/s   13.26 M/s  0.00 % 50.19 % postgres: parallel worker for PID 65299
65450 be/4 digoal      9.20 M/s   19.45 M/s  0.00 % 50.14 % postgres: parallel worker for PID 65299
65503 be/4 digoal     14.02 M/s   19.66 M/s  0.00 % 50.13 % postgres: parallel worker for PID 65299

等待事件情况

postgres=# select wait_event,wait_event_type,count(*) from pg_stat_activity where wait_event is not null group by 1,2;
     wait_event      | wait_event_type | count 
---------------------+-----------------+-------
 BufFileRead         | IO              |    59
 BufFileWrite        | IO              |     1
 CheckpointerMain    | Activity        |     1
 BgWriterHibernate   | Activity        |     1
 AutoVacuumMain      | Activity        |     1
 LogicalLauncherMain | Activity        |     1
 WalWriterMain       | Activity        |     1
(7 rows)
  
-[ RECORD 7 ]----+---------------------------------------------------------------------------------
datid            | 13220
datname          | postgres
pid              | 65448
usesysid         | 10
usename          | postgres
application_name | pgbench
client_addr      | 
client_hostname  | 
client_port      | 
backend_start    | 2018-10-02 07:38:46.003833+08
xact_start       | 2018-10-02 07:38:46.003114+08
query_start      | 2018-10-02 07:38:46.003114+08
state_change     | 2018-10-02 07:38:46.00439+08
wait_event_type  | IO
wait_event       | BufFileRead
state            | active
backend_xid      | 
backend_xmin     | 598
query            | alter table pgbench_accounts add primary key (aid) using index tablespace "tbs2"
backend_type     | parallel worker

临时空间使用情况,大概19TB,worker工作结束后,开始合并索引。

digoal@pg11-320tb-zfs-> cd $PGDATA/base
digoal@pg11-320tb-zfs-> du -sh *
16M     1
16M     13219
16M     13220
19T     pgsql_tmp

每个并行的worker进程完成自己的任务后,开始合并索引,合并速度

Total DISK READ :     116.21 M/s | Total DISK WRITE :     169.91 M/s
Actual DISK READ:     116.21 M/s | Actual DISK WRITE:     197.28 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                       
65299 be/4 digoal    116.21 M/s  169.91 M/s  0.00 %  8.13 % postgres: postgres postgres [local] ALTER TABLE
65298 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2
51030 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres                 
51032 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: logger
51034 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: checkpointer
51035 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: background writer
51036 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: walwriter
51037 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: autovacuum launcher
51038 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: stats collector
51039 be/4 digoal      0.00 B/s    0.00 B/s  0.00 %  0.00 % postgres: logical replication launcher

索引占用空间 20 TB

postgres=# \di+
                                      List of relations
 Schema |         Name          | Type  |  Owner   |      Table       |  Size   | Description 
--------+-----------------------+-------+----------+------------------+---------+-------------
 public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 20 TB   | 
 public | pgbench_branches_pkey | index | postgres | pgbench_branches | 213 MB  | 
 public | pgbench_tellers_pkey  | index | postgres | pgbench_tellers  | 2125 MB | 
(3 rows)

索引创建耗时

7130分钟

开始
2018-10-02 07:51:00

结束
2018-10-07 06:41:08

压测脚本

只读

vi ro.sql  
  

\set aid random_gaussian(1, :range, 10.0)  
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000000

读写

vi rw.sql  
  
\set aid random_gaussian(1, :range, 10.0)  
\set bid random(1, 1 * :scale)  
\set tid random(1, 10 * :scale)  
\set delta random(-5000, 5000)  
BEGIN;  
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
END;  
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000000

1、1万亿 tpcb 只读测试 - 1万亿数据活跃

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 7737610
latency average = 2.977 ms
latency stddev = 35.840 ms
tps = 21492.371917 (including connections establishing)
tps = 21495.359217 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         2.975  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

1、1万亿 tpcb 读写测试 - 1万亿数据活跃

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 5812634
latency average = 3.963 ms
latency stddev = 48.480 ms
tps = 16143.312370 (including connections establishing)
tps = 16145.557184 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.025  BEGIN;
         3.511  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.063  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.155  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.119  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.047  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.041  END;

2、1万亿 tpcb 只读测试 - 1000亿数据活跃

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 8317050
latency average = 2.770 ms
latency stddev = 34.609 ms
tps = 23101.921465 (including connections establishing)
tps = 23105.640572 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         2.766  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

2、1万亿 tpcb 读写测试 - 1000亿数据活跃

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 6526147
latency average = 3.529 ms
latency stddev = 50.638 ms
tps = 18126.367839 (including connections establishing)
tps = 18134.592726 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.025  BEGIN;
         3.102  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.061  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.159  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.091  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.046  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.041  END;

3、1万亿 tpcb 只读测试 - 100亿数据活跃

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 212661629
latency average = 0.108 ms
latency stddev = 12.493 ms
tps = 590692.703049 (including connections establishing)
tps = 590774.219034 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set aid random_gaussian(1, :range, 10.0)
         0.107  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

3、1万亿 tpcb 读写测试 - 100亿数据活跃

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 27002477
latency average = 0.853 ms
latency stddev = 39.222 ms
tps = 75002.036277 (including connections establishing)
tps = 75012.139249 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.044  BEGIN;
         0.211  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.076  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.198  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.115  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.072  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.134  END;

4、1万亿 tpcb 只读测试 - 10亿数据活跃

TPS:

QPS:

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 374399291
latency average = 0.061 ms
latency stddev = 5.647 ms
tps = 1039962.270864 (including connections establishing)
tps = 1040949.958600 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.061  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

4、1万亿 tpcb 读写测试 - 10亿数据活跃

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 29574604
latency average = 0.779 ms
latency stddev = 16.197 ms
tps = 82148.432097 (including connections establishing)
tps = 82160.286498 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.043  BEGIN;
         0.144  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.074  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.207  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.106  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.070  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.131  END;

5、1万亿 tpcb 只读测试 - 1亿数据活跃

TPS: 1068052

QPS: 1068052

transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 384510720
latency average = 0.060 ms
latency stddev = 4.332 ms
tps = 1068052.373377 (including connections establishing)
tps = 1068206.696327 (excluding connections establishing)
statement latencies in milliseconds:
         0.002  \set aid random_gaussian(1, :range, 10.0)
         0.059  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

5、1万亿 tpcb 读写测试 - 1亿数据活跃

TPS:

QPS:

transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 28314309
latency average = 0.814 ms
latency stddev = 16.406 ms
tps = 78647.191352 (including connections establishing)
tps = 78658.751759 (excluding connections establishing)
statement latencies in milliseconds:
         0.003  \set aid random_gaussian(1, :range, 10.0)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.001  \set delta random(-5000, 5000)
         0.043  BEGIN;
         0.184  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.076  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.217  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.096  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.069  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         0.125  END;

性能小结

环境:阿里云 ECS + 320T ESSD

表SIZE: 120 TB 写入耗时 810688秒,约123.3万行/s

索引SIZE: 20 TB 创建耗时 427800秒

索引深度: 4级(BLOCK_SIZE=32K, 每一页可以存储的item比8k多,所以1万亿的索引层级比1000亿低(8K))

单表数据量 TEST CASE QPS TPS
1万亿 tpcb 活跃数据1亿 只读 1068052 1068052
1万亿 tpcb 活跃数据10亿 只读 1039962 1039962
1万亿 tpcb 活跃数据100亿 只读 590692 590692
1万亿 tpcb 活跃数据1000亿 只读 23101 23101
1万亿 tpcb 活跃数据10000亿 只读 21492 21492
1万亿 tpcb 活跃数据1亿 读写 393235 78647
1万亿 tpcb 活跃数据10亿 读写 410740 82148
1万亿 tpcb 活跃数据100亿 读写 375010 75002
1万亿 tpcb 活跃数据1000亿 读写 90630 18126
1万亿 tpcb 活跃数据10000亿 读写 80715 16143

添加字段(含default值)耗时:1.25 秒。

删除字段耗时:1 毫秒。

附录 - pgbench_accounts 分区, 并行加载测试数据, 动态查询

1万亿单表,会带来什么问题?

1、单表125TB,创建索引耗时增加。PG 11 引入并行创建索引,解决。

2、单表125TB,垃圾回收时间拉长。PG 12 使用zheap引擎彻底杜绝。

3、单表125TB,FREEZE耗时拉长,甚至可能无法在20亿个事务内完成。PG未来版本,使用超过32位的XID,彻底解决。

4、单表125TB,必须放在单个目录下,可能导致文件系统上限(INODE,容量等上限)。

5、单表125TB,要做一些数据清理时不方便,如果有时间维度老化概念,用分区表,可以更好的管理冷热数据,例如pg_pathman。

pgbench转换为分区表。

1、建议使用pg_pathman,性能损失低。内置分区功能,目前还有性能问题。

《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》

《PostgreSQL 10 内置分区 vs pg_pathman perf profiling》

《分区表锁粒度差异 - pg_pathman VS native partition table》

《PostgreSQL 查询涉及分区表过多导致的性能问题 - 性能诊断与优化(大量BIND, spin lock, SLEEP进程)》

使用内部分区,建议使用动态SQL,避免BIND问题。

分区demo

《PostgreSQL pgbench tpcb 数据生成与SQL部分源码解读》

《PostgreSQL pgbench tpcb 海量数据库测试 - 分区表测试优化》

装载数据

1、表

pgbench -i -I dt --tablespace=tbs1 -s 10000000  

2、分区

create table p (like pgbench_accounts) partition by RANGE ( aid ) tablespace tbs1;  
  
do language plpgsql $$                                                           
declare  
  i_rows_perpartition int8 := 244140625;  
begin  
  for i in 0..4096 loop  
    execute format ('create table pgbench_accounts%s partition of p for values from (%s) to (%s) tablespace tbs1', i, i*i_rows_perpartition, (i+1)*i_rows_perpartition);  
  end loop;  
end;  
$$;  
  
drop table pgbench_accounts;  
  
alter table p rename to pgbench_accounts;  
  
-- alter table pgbench_accounts add constraint pk_pgbench_accounts_aid primary key (aid) using index tablespace tbs2;  

3、加载任务

drop table task;  
create table task(id int primary key);  
insert into task select i from generate_series(0,4095) t(i);  

4、初始化记录

create table init_accounts(aid int8);  
insert into init_accounts select generate_series(0,244140624);  

5、并行状态UDF

create or replace function tpcb_init_accounts() returns void as $$  
declare  
  v_id int;  
begin  
  with tmp as (select * from task limit 1 for update skip locked),  
    tmp1 as (delete from task using tmp where task.id=tmp.id)  
    select id into v_id from tmp;  
    
  if found then  
    execute format ('insert into pgbench_accounts%s select aid+%s*244140625::int8, ((aid+%s*244140625::int8)-1)/100000 + 1, 0 from init_accounts on conflict do nothing', v_id, v_id, v_id);  
  end if;  
end;  
$$ language plpgsql strict;  

6、并行装载数据

vi test.sql  
select tpcb_init_accounts();  
  
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &  

初始化索引

1、任务表

drop table task;  
create table task(id int primary key);  
insert into task select i from generate_series(0,4095) t(i);  

2、并行创建索引UDF

create or replace function tpcb_init_accounts_pkey() returns void as $$  
declare  
  v_id int;  
begin  
  with tmp as (select * from task limit 1 for update skip locked),  
    tmp1 as (delete from task using tmp where task.id=tmp.id)  
    select id into v_id from tmp;  
    
  if found then  
    execute format ('analyze pgbench_accounts%s', v_id);  
    execute format ('alter table pgbench_accounts%s add constraint pk_pgbench_accounts%s_aid primary key (aid) using index tablespace tbs2', v_id, v_id);  
  end if;  
end;  
$$ language plpgsql strict;  

3、并行创建索引

vi test.sql  
select tpcb_init_accounts_pkey();  
  
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &  

小结

1、8K的block size,单表最大32TB,(由于ctid的block num是32BIT的寻址,所以8K block算出来的最大容量就是32TB,本文测试的单表1万亿,已经超过了32TB,所以需要选择更大的BLOCK SIZE才行,32K即可到达256TB单表)。

编译时加上--with-blocksize=

./configure --with-blocksize=32

2、这么大的数据库怎么高效的备份,时间点恢复?

全量备份:1、ZFS快照,将快照发送到备份机(万兆网可以把网卡带宽跑满)。2、使用pg_basebackup备份全量。3、使用pg_rman备份全量。4、使用云盘快照备份全量。

增量备份:1、ZFS快照,将快照增量发送到备份机。2、pg_basebackup只能备份全量。3、使用pg_rman备份增量(通过BLOCK LSN号区分上一次备份以来修改过的数据块)。4、使用云盘快照备份增量。

归档备份:备份wal文件归档。

时间点恢复: 1、zfs快照克隆+归档恢复到时间点。 2、全量恢复+归档恢复到时间点。4、全量+增量+归档恢复到时间点。

3、此次测试tpcb,并发64时,前十几秒bind耗费的时间较多。

4、建议使用pg_pathman对大表进行分区,多大的表需要进行分区?

《HTAP数据库 PostgreSQL 场景与性能测试之 45 - (OLTP) 数据量与性能的线性关系(10亿+无衰减), 暨单表多大需要分区》

《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》

参考

《PostgreSQL 11 tpcc 测试(103万tpmC on ECS) - use sysbench-tpcc by Percona-Lab》

《(TPC-H测试 SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen》

《PostgreSQL 100亿 tpcb 性能 on ECS》

《[未完待续] PostgreSQL on 阿里云ECS+ESSD - 1000亿 tpcb、1000W tpcc 测试》

Flag Counter

digoal’s 大量PostgreSQL文章入口