PostgreSQL 11 one-trillion-row tpcb benchmark on Alibaba Cloud ECS + ESSD + zfs/lvm2 striping + block_size=32K
Background
Recent PostgreSQL OLTP and OLAP tests:
《PostgreSQL 11 tpcc 测试(103万tpmC on ECS) - use sysbench-tpcc by Percona-Lab》
《(TPC-H测试 SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen》
《PostgreSQL 100亿 tpcb 性能 on ECS》
《[未完待续] PostgreSQL on 阿里云ECS+ESSD - 1000亿 tpcb、1000W tpcc 测试》
Coverage:
1. TPCH at SF=10 and SF=200
2. TPCC with 10 million warehouses
3. TPCB with 10 billion rows
4. TPCB with 100 billion rows
5. TPCB with 1 trillion rows (a ~125 TB single table; the test in this post)
This test uses 16 ESSD cloud disks. Two file systems were used, ZFS and EXT4, both striped.
Environment
1. ECS, CentOS 7.4 x64
2. CPU
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2499.996
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
3. Kernel
uname -a
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
4. Memory
free -g
total used free shared buff/cache available
Mem: 503 2 500 0 0 498
Swap: 0 0 0
5. Clock source
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
6. Block devices
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 200G 0 disk
└─vda1 253:1 0 200G 0 part /
vdb 253:16 0 20T 0 disk
vdc 253:32 0 20T 0 disk
vdd 253:48 0 20T 0 disk
vde 253:64 0 20T 0 disk
vdf 253:80 0 20T 0 disk
vdg 253:96 0 20T 0 disk
vdh 253:112 0 20T 0 disk
vdi 253:128 0 20T 0 disk
vdj 253:144 0 20T 0 disk
vdk 253:160 0 20T 0 disk
vdl 253:176 0 20T 0 disk
vdm 253:192 0 20T 0 disk
vdn 253:208 0 20T 0 disk
vdo 253:224 0 20T 0 disk
vdp 253:240 0 20T 0 disk
vdq 253:256 0 20T 0 disk
Configure ECS guest OS parameters
1. Kernel parameters
vi /etc/sysctl.conf
# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# Optional: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# Create /data01/corefiles beforehand with mode 777; if it is a symlink, chmod the target directory to 777
kernel.sem = 4096 2147483647 2147483646 512000
# Semaphores; inspect with ipcs -l or -u. PostgreSQL uses one semaphore set per 16 processes, 17 semaphores per set.
kernel.shmall = 107374182
# Limit on the total size of all shared memory segments (suggested: 80% of RAM), in pages.
kernel.shmmax = 274877906944
# Maximum size of a single shared memory segment (suggested: half of RAM), in bytes. Versions >9.2 use far less System V shared memory.
kernel.shmmni = 819200
# Maximum number of shared memory segments; each PostgreSQL cluster needs at least 2.
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144
# The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304
# The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144
# The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304
# The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
# Enable SYN cookies: when the SYN backlog overflows, cookies handle the overflow, mitigating small-scale SYN floods.
net.ipv4.tcp_timestamps = 1
# Reduce TIME_WAIT sockets
net.ipv4.tcp_tw_recycle = 0
# If 1, enables fast recycling of TIME-WAIT sockets, but this can break connections behind NAT; keep it off on servers.
net.ipv4.tcp_tw_reuse = 1
# Allow TIME-WAIT sockets to be reused for new TCP connections.
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# When dirty pages reach this many bytes, the background flusher (pdflush or equivalent) writes out pages dirtied more than (dirty_expire_centisecs/100) seconds ago.
# The default is 10% of RAM; on large-memory machines, specify an absolute byte value instead.
vm.dirty_expire_centisecs = 3000
# Dirty pages older than this are flushed to disk. 3000 means 30 seconds.
vm.dirty_ratio = 95
# If background flushing cannot keep up and dirty pages exceed 95% of RAM, user processes that write (fsync, fdatasync, etc.) must flush dirty pages themselves.
# Setting this high effectively keeps user processes from having to flush; very useful with multiple instances per host whose per-instance IOPS are capped via cgroups.
vm.dirty_writeback_centisecs = 100
# Wakeup interval of the background flusher (pdflush or equivalent); 100 means 1 second.
vm.swappiness = 0
# Avoid using swap.
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# Allow modest overcommit at allocation time. 1 means always assume enough memory; usable on low-memory test machines.
vm.overcommit_ratio = 90
# With overcommit_memory = 2, used to compute the amount of memory that may be committed.
vm.zone_reclaim_mode = 0
# Disable NUMA zone reclaim (or disable NUMA at boot via vmlinux options).
net.ipv4.ip_local_port_range = 40000 65535
# Port range for automatically assigned local TCP/UDP ports.
fs.nr_open=20480000
# Per-process limit on open file handles.
# Note the parameters below
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152
# On low-memory machines, do not set the two values above.
# vm.nr_hugepages = 66536
# Use huge pages when shared_buffers exceeds 64GB; the page size is Hugepagesize in /proc/meminfo.
# vm.lowmem_reserve_ratio = 1 1 1
# Recommended when RAM exceeds 64GB; otherwise keep the default 256 256 32.
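The shared-memory limits above can be sanity-checked against this 512 GB host; kernel.shmall is counted in 4 KiB pages and kernel.shmmax in bytes (a quick arithmetic sketch):

```shell
# shmall: 107374182 pages * 4096 bytes/page, expressed in GiB
echo $(( 107374182 * 4096 / 1073741824 ))   # → 409 GiB, ~80% of 512 GiB
# shmmax, expressed in GiB
echo $(( 274877906944 / 1073741824 ))       # → 256 GiB, half of 512 GiB
```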
2. Resource limits
vi /etc/security/limits.conf
# If nofile exceeds 1048576, first raise sysctl fs.nr_open to a larger value and apply it before setting nofile.
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
3. Disable transparent huge pages; use the precise TSC clock source (optional)
vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
# clock source
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
su - postgres -c "pg_ctl start"
Deploy PostgreSQL 11
rpm -ivh https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf
yum install -y postgresql11*
Block device deployment strategy 1 - zfs
zfsonlinux
《[未完待续] PostgreSQL on ECS 高效率持续备份设计 - By ZFS on Linux》
1. zfs yum repository configuration
yum install -y http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
2. kernel-devel matching the running kernel
http://vault.centos.org
uname -a
Linux pg11-320tb-zfs 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
rpm -ivh http://vault.centos.org/7.4.1708/updates/x86_64/Packages/kernel-devel-3.10.0-693.2.2.el7.x86_64.rpm
3. Install zfs
yum install -y zfs
Check the logs for errors; normally there are none.
Verify that the module works:
modprobe zfs
zfs get -o all
zfs loads automatically at system boot.
Create the zpool
ESSD already keeps three replicas at the storage layer, so zfs RAID is unnecessary.
for dev in /dev/vd{b..q}; do
  parted -a optimal -s $dev mklabel gpt mkpart primary 1MiB 100%
done
zpool create zp1 -f -o ashift=13 vdb1 vdc1 vdd1 vde1 vdf1 vdg1 vdh1 vdi1 vdj1 vdk1 vdl1 vdm1 vdn1 vdo1 vdp1 vdq1
zpool get all zp1
NAME PROPERTY VALUE SOURCE
zp1 size 318T -
zp1 capacity 0% -
zp1 altroot - default
zp1 health ONLINE -
zp1 guid 12407519490197584982 -
zp1 version - default
zp1 bootfs - default
zp1 delegation on default
zp1 autoreplace off default
zp1 cachefile - default
zp1 failmode wait default
zp1 listsnapshots off default
zp1 autoexpand off default
zp1 dedupditto 0 default
zp1 dedupratio 1.00x -
zp1 free 318T -
zp1 allocated 960K -
zp1 readonly off -
zp1 ashift 13 local
zp1 comment - default
zp1 expandsize - -
zp1 freeing 0 -
zp1 fragmentation 0% -
zp1 leaked 0 -
zp1 multihost off default
zp1 feature@async_destroy enabled local
zp1 feature@empty_bpobj enabled local
zp1 feature@lz4_compress active local
zp1 feature@multi_vdev_crash_dump enabled local
zp1 feature@spacemap_histogram active local
zp1 feature@enabled_txg active local
zp1 feature@hole_birth active local
zp1 feature@extensible_dataset active local
zp1 feature@embedded_data active local
zp1 feature@bookmarks enabled local
zp1 feature@filesystem_limits enabled local
zp1 feature@large_blocks enabled local
zp1 feature@large_dnode enabled local
zp1 feature@sha512 enabled local
zp1 feature@skein enabled local
zp1 feature@edonr enabled local
zp1 feature@userobj_accounting active local
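Note ashift=13 in the output above: the pool's minimum allocation and alignment unit is 2^13 bytes, i.e. 8 KiB, which lines up with the 8K recordsize set on the dataset below (a sketch of the arithmetic):

```shell
echo $(( 2 ** 13 ))   # → 8192 bytes = 8 KiB
```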
Create the zfs dataset
zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off -o primarycache=metadata -o logbias=throughput -o secondarycache=none zp1/data01
zfs set canmount=off zp1
Tune two zfs module parameters
cd /sys/module/zfs/parameters/
echo 1 > zfs_prefetch_disable
echo 15 > zfs_arc_shrink_shift
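Values echoed into /sys/module/zfs/parameters do not survive a reboot. One common way to persist them, not part of the original steps, is a modprobe options file (a sketch; the file name is an assumed convention):

```shell
# persist the two tunings across reboots (assumed location; adjust as needed)
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_prefetch_disable=1 zfs_arc_shrink_shift=15
EOF
```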
View current properties
zfs get all zp1/data01
NAME PROPERTY VALUE SOURCE
zp1/data01 type filesystem -
zp1/data01 creation Wed Sep 19 10:26 2018 -
zp1/data01 used 192K -
zp1/data01 available 308T -
zp1/data01 referenced 192K -
zp1/data01 compressratio 1.00x -
zp1/data01 mounted yes -
zp1/data01 quota none default
zp1/data01 reservation none default
zp1/data01 recordsize 8K local
zp1/data01 mountpoint /data01 local
zp1/data01 sharenfs off default
zp1/data01 checksum on default
zp1/data01 compression off default
zp1/data01 atime off local
zp1/data01 devices on default
zp1/data01 exec on default
zp1/data01 setuid on default
zp1/data01 readonly off default
zp1/data01 zoned off default
zp1/data01 snapdir hidden default
zp1/data01 aclinherit restricted default
zp1/data01 createtxg 81 -
zp1/data01 canmount on default
zp1/data01 xattr on default
zp1/data01 copies 1 default
zp1/data01 version 5 -
zp1/data01 utf8only off -
zp1/data01 normalization none -
zp1/data01 casesensitivity sensitive -
zp1/data01 vscan off default
zp1/data01 nbmand off default
zp1/data01 sharesmb off default
zp1/data01 refquota none default
zp1/data01 refreservation none default
zp1/data01 guid 3373300831209850945 -
zp1/data01 primarycache metadata local
zp1/data01 secondarycache none default
zp1/data01 usedbysnapshots 0B -
zp1/data01 usedbydataset 192K -
zp1/data01 usedbychildren 0B -
zp1/data01 usedbyrefreservation 0B -
zp1/data01 logbias throughput local
zp1/data01 dedup off default
zp1/data01 mlslabel none default
zp1/data01 sync standard default
zp1/data01 dnodesize legacy default
zp1/data01 refcompressratio 1.00x -
zp1/data01 written 192K -
zp1/data01 logicalused 76K -
zp1/data01 logicalreferenced 76K -
zp1/data01 volmode default default
zp1/data01 filesystem_limit none default
zp1/data01 snapshot_limit none default
zp1/data01 filesystem_count none default
zp1/data01 snapshot_count none default
zp1/data01 snapdev hidden default
zp1/data01 acltype off default
zp1/data01 context none default
zp1/data01 fscontext none default
zp1/data01 defcontext none default
zp1/data01 rootcontext none default
zp1/data01 relatime off default
zp1/data01 redundant_metadata all default
zp1/data01 overlay off default
Initialize the database
1. Directories
mkdir /data01/pg11
chown postgres:postgres /data01/pg11
2. Environment variables
su - postgres
vi .bash_profile
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg11/pg_root$PGPORT
export LANG=en_US.utf8
export PGHOME=/usr/pgsql-11
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
3. initdb
initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
Huge page configuration
zfs can bypass the file system cache, so configure a large PostgreSQL shared_buffers and use huge pages.
《PostgreSQL Huge Page 使用建议 - 大内存主机、实例注意》
300GB/2MB = 153600 pages minimum; 159600 is set below to leave some headroom.
sysctl -w vm.nr_hugepages=159600
echo "vm.nr_hugepages=159600" >> /etc/sysctl.conf
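The page count follows from shared_buffers divided by the 2 MB huge page size; setting 159600 rather than the 153600-page minimum leaves roughly 12 GB of headroom:

```shell
# 300 GiB shared_buffers / 2 MiB huge pages
echo $(( 300 * 1024 / 2 ))          # → 153600 pages minimum
# headroom from the configured 159600 pages, in MiB
echo $(( (159600 - 153600) * 2 ))   # → 12000 MiB
```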
postgresql.auto.conf
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 2000
superuser_reserved_connections = 3
unix_socket_directories = '., /var/run/postgresql, /tmp'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 10
huge_pages = on # use huge pages
shared_buffers = 300GB
max_prepared_transactions = 2000
work_mem = 32MB
maintenance_work_mem = 2GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 0
max_worker_processes = 128
max_parallel_workers = 64
max_parallel_maintenance_workers = 64
max_parallel_workers_per_gather = 0
parallel_leader_participation = on
min_parallel_table_scan_size=0
min_parallel_index_scan_size=0
parallel_setup_cost=0
parallel_tuple_cost=0
wal_level = minimal
synchronous_commit = off
full_page_writes=off # zfs has built-in checksums and copy-on-write, so FPW can be off. It can also be off when the block device guarantees atomic 8K writes.
wal_writer_delay = 10ms
checkpoint_timeout = 30min
max_wal_size = 600GB
min_wal_size = 150GB
checkpoint_completion_target = 0.1
max_wal_senders = 0
effective_cache_size = 200GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_line_prefix = '%m [%p] '
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_max_workers = 16
autovacuum_freeze_max_age = 1200000000
autovacuum_multixact_freeze_max_age = 1400000000
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_table_age = 1150000000
vacuum_multixact_freeze_table_age = 1150000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
jit = off
cpu_tuple_cost=0.00018884145574257426
cpu_index_tuple_cost = 0.00433497085216479990
cpu_operator_cost = 0.00216748542608239995
seq_page_cost=0.014329
random_page_cost = 0.016
Restart the database
pg_ctl restart -m fast
zfs on linux performance issue
During this test, vacuum was very slow; the problem is reproducible with pgbench -i -s 1000.
The vacuum process sat in D state with the following kernel stack:
[<ffffffffc0174132>] cv_wait_common+0xb2/0x150 [spl]
[<ffffffffc0174208>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffc073c42b>] zio_wait+0x10b/0x1b0 [zfs]
[<ffffffffc0687124>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[<ffffffffc06885f2>] dmu_read_uio_dnode+0x52/0x100 [zfs]
[<ffffffffc06886ec>] dmu_read_uio_dbuf+0x4c/0x70 [zfs]
[<ffffffffc07242d5>] zfs_read+0x135/0x3f0 [zfs]
[<ffffffffc0743990>] zpl_read_common_iovec.constprop.9+0x80/0xd0 [zfs]
[<ffffffffc0743aa6>] zpl_aio_read+0xc6/0xf0 [zfs]
[<ffffffff812001ad>] do_sync_read+0x8d/0xd0
[<ffffffff81200bac>] vfs_read+0x9c/0x170
[<ffffffff81201a6f>] SyS_read+0x7f/0xe0
[<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Block device deployment strategy 2 - lvm2, ext4
1. Stop the database and remove zfs
pg_ctl stop -m immediate
zfs destroy zp1/data01
zpool destroy zp1
2. Wipe block device headers
wipefs -f -a /dev/vd[b-q]
for dev in /dev/vd{b..q}; do
  dd bs=1024 count=1000 if=/dev/zero of=$dev
done
for dev in /dev/vd{b..q}; do
  parted -a optimal -s $dev mklabel gpt mkpart primary 1MiB 100%
done
wipefs -f -a /dev/vd[b-q]1
3. Create PVs
pvcreate /dev/vd[b-q]1
4. Create the VG
vgcreate -A y -s 128M vgdata01 /dev/vd[b-q]1
5. Create striped logical volumes
lvcreate -A y -i 16 -I 8 -L 4TiB -n lv03 vgdata01
lvcreate -A y -i 16 -I 8 -L 220TiB -n lv01 vgdata01
lvcreate -A y -i 16 -I 8 -l 100%FREE -n lv02 vgdata01
6. Inspect
[root@pg11-320tb-zfs ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/vdb vgdata01 lvm2 a-- <20.00t 0
/dev/vdc vgdata01 lvm2 a-- <20.00t 0
/dev/vdd vgdata01 lvm2 a-- <20.00t 0
/dev/vde vgdata01 lvm2 a-- <20.00t 0
/dev/vdf vgdata01 lvm2 a-- <20.00t 0
/dev/vdg vgdata01 lvm2 a-- <20.00t 0
/dev/vdh vgdata01 lvm2 a-- <20.00t 0
/dev/vdi vgdata01 lvm2 a-- <20.00t 0
/dev/vdj vgdata01 lvm2 a-- <20.00t 0
/dev/vdk vgdata01 lvm2 a-- <20.00t 0
/dev/vdl vgdata01 lvm2 a-- <20.00t 0
/dev/vdm vgdata01 lvm2 a-- <20.00t 0
/dev/vdn vgdata01 lvm2 a-- <20.00t 0
/dev/vdo vgdata01 lvm2 a-- <20.00t 0
/dev/vdp vgdata01 lvm2 a-- <20.00t 0
/dev/vdq vgdata01 lvm2 a-- <20.00t 0
[root@pg11-320tb-zfs ~]# vgs
VG #PV #LV #SN Attr VSize VFree
vgdata01 16 3 0 wz--n- <320.00t 0
[root@pg11-320tb-zfs ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
lv01 vgdata01 -wi-a----- 220.00t
lv02 vgdata01 -wi-a----- <96.00t
lv03 vgdata01 -wi-a----- 4.00t
7. Create ext4 file systems with stripe alignment
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01
mkfs.ext4 /dev/mapper/vgdata01-lv02 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv02
mkfs.ext4 /dev/mapper/vgdata01-lv03 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv03
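The -E stride/stripe_width values follow from the LVM layout above: an 8 KiB stripe unit (-I 8) across 16 PVs, with 4 KiB ext4 blocks. This is the standard ext4 stripe-alignment arithmetic:

```shell
stride=$(( 8 / 4 ))               # stripe unit / fs block size → 2 blocks
stripe_width=$(( stride * 16 ))   # stride * number of data disks → 32 blocks
echo "$stride $stripe_width"      # → 2 32
```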
8. Configure mounts
vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
LABEL=lv02 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
LABEL=lv03 /data03 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
mkdir /data01
mkdir /data02
mkdir /data03
mount -a
[root@pg11-320tb-zfs ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 197G 2.1G 185G 2% /
devtmpfs 252G 0 252G 0% /dev
tmpfs 252G 0 252G 0% /dev/shm
tmpfs 252G 596K 252G 1% /run
tmpfs 252G 0 252G 0% /sys/fs/cgroup
tmpfs 51G 0 51G 0% /run/user/0
/dev/mapper/vgdata01-lv01 220T 20K 220T 1% /data01
/dev/mapper/vgdata01-lv02 96T 20K 96T 1% /data02
/dev/mapper/vgdata01-lv03 4.0T 89M 4.0T 1% /data03
9. Create directories for data, tablespaces, and WAL
[root@pg11-320tb-zfs ~]# mkdir /data01/pg11
[root@pg11-320tb-zfs ~]# mkdir /data02/pg11
[root@pg11-320tb-zfs ~]# mkdir /data03/pg11
[root@pg11-320tb-zfs ~]# chown postgres:postgres /data0*/pg11
10. Configure environment variables
su - postgres
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg11/pg_root$PGPORT
export LANG=en_US.utf8
export PGHOME=/usr/pgsql-11
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
11. Initialize the database
initdb -D $PGDATA -X /data03/pg11/pg_wal1921 -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024
12. Create tablespaces
mkdir /data01/pg11/tbs1
mkdir /data02/pg11/tbs2
create tablespace tbs1 location '/data01/pg11/tbs1';
create tablespace tbs2 location '/data02/pg11/tbs2';
One-trillion-row tpcb test
Load the data
nohup pgbench -i -s 10000000 -I dtg -n --tablespace=tbs1 >./init.log 2>&1 &
Elapsed: 810688 seconds, about 1.23 million rows/s (-I dtg runs only the drop, create-table, and generate steps; vacuum and primary keys are deferred).
999999100000 of 1000000000000 tuples (99%) done (elapsed 810688.06 s, remaining 0.73 s)
999999200000 of 1000000000000 tuples (99%) done (elapsed 810688.14 s, remaining 0.65 s)
999999300000 of 1000000000000 tuples (99%) done (elapsed 810688.22 s, remaining 0.57 s)
999999400000 of 1000000000000 tuples (99%) done (elapsed 810688.30 s, remaining 0.49 s)
999999500000 of 1000000000000 tuples (99%) done (elapsed 810688.38 s, remaining 0.41 s)
999999600000 of 1000000000000 tuples (99%) done (elapsed 810688.46 s, remaining 0.32 s)
999999700000 of 1000000000000 tuples (99%) done (elapsed 810688.54 s, remaining 0.24 s)
999999800000 of 1000000000000 tuples (99%) done (elapsed 810688.61 s, remaining 0.16 s)
999999900000 of 1000000000000 tuples (99%) done (elapsed 810688.69 s, remaining 0.08 s)
1000000000000 of 1000000000000 tuples (100%) done (elapsed 810688.77 s, remaining 0.00 s)
done.
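The load rate quoted above checks out: one trillion rows in 810688 seconds is about 1.23 million rows per second:

```shell
echo $(( 1000000000000 / 810688 ))   # → 1233520 rows/s
```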
The data occupies 120 TB
postgres=# \l+
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
-----------+----------+-----------+---------+------------+-----------------------+--------+------------+--------------------------------------------
postgres | postgres | SQL_ASCII | C | en_US.utf8 | | 120 TB | pg_default | default administrative connection database
template0 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | unmodifiable empty database
| | | | | postgres=CTc/postgres | | |
template1 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | default template for new databases
| | | | | postgres=CTc/postgres | | |
(3 rows)
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+----------+---------+-------------
public | pgbench_accounts | table | postgres | 120 TB |
public | pgbench_branches | table | postgres | 344 MB |
public | pgbench_history | table | postgres | 0 bytes |
public | pgbench_tellers | table | postgres | 4201 MB |
public | t | table | postgres | 804 MB |
(5 rows)
postgres=# \db+
List of tablespaces
Name | Owner | Location | Access privileges | Options | Size | Description
------------+----------+---------------------+-------------------+---------+---------+-------------
pg_default | postgres | | | | 850 MB |
pg_global | postgres | | | | 2206 kB |
tbs1 | postgres | /data01/digoal/tbs1 | | | 120 TB |
tbs2 | postgres | /data02/digoal/tbs2 | | | 213 MB |
(4 rows)
Add/drop column test: seconds
Adding or dropping a column only touches catalog metadata and completes in seconds.
In PostgreSQL 11, even adding a column with a default value completes in seconds.
《PostgreSQL 11 preview - 添加非空默认值不需要 rewrite table - online add column with default value》
postgres=# \timing
Timing is on.
postgres=# select * from pgbench_accounts limit 10;
aid | bid | abalance | filler
-----+-----+----------+--------------------------------------------------------------------------------------
1 | 1 | 0 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 1 | 0 |
5 | 1 | 0 |
6 | 1 | 0 |
7 | 1 | 0 |
8 | 1 | 0 |
9 | 1 | 0 |
10 | 1 | 0 |
(10 rows)
Time: 498.051 ms
-- add a column: about a second
postgres=# alter table pgbench_accounts add column col1 text;
ALTER TABLE
Time: 1254.611 ms (00:01.255)
-- add a column with a non-null default: about a second
postgres=# alter table pgbench_accounts add column col2 text default 'hello digoal';
ALTER TABLE
Time: 1253.689 ms (00:01.254)
postgres=# select * from pgbench_accounts limit 10;
aid | bid | abalance | filler | col1 | col2
-----+-----+----------+--------------------------------------------------------------------------------------+------+--------------
1 | 1 | 0 | | | hello digoal
2 | 1 | 0 | | | hello digoal
3 | 1 | 0 | | | hello digoal
4 | 1 | 0 | | | hello digoal
5 | 1 | 0 | | | hello digoal
6 | 1 | 0 | | | hello digoal
7 | 1 | 0 | | | hello digoal
8 | 1 | 0 | | | hello digoal
9 | 1 | 0 | | | hello digoal
10 | 1 | 0 | | | hello digoal
(10 rows)
Time: 502.608 ms
postgres=# explain analyze select * from pgbench_accounts limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.00 rows=10 width=168) (actual time=0.022..0.026 rows=10 loops=1)
-> Seq Scan on pgbench_accounts (cost=0.00..184941625.46 rows=673387096855 width=168) (actual time=0.020..0.022 rows=10 loops=1)
Planning Time: 0.057 ms
Execution Time: 0.041 ms
(4 rows)
Time: 0.453 ms
-- analyze the table
postgres=# analyze pgbench_accounts ;
ANALYZE
Time: 67373.884 ms (01:07.374)
Dropping columns: milliseconds
postgres=# alter table pgbench_accounts drop column col1;
ALTER TABLE
Time: 7.610 ms
postgres=# alter table pgbench_accounts drop column col2;
ALTER TABLE
Time: 0.546 ms
Create the indexes
After the initial data load finishes, create the indexes.
1. Raise parallelism
psql
analyze;
alter table pgbench_accounts set (parallel_workers=64);
alter table pgbench_tellers set (parallel_workers=64);
alter table pgbench_branches set (parallel_workers=64);
2. Create the indexes
nohup pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2 >./init_pkey.log 2>&1 &
With 64-way parallelism enabled, resource consumption at the start:
avg-cpu: %user %nice %system %iowait %steal %idle
3.20 0.00 38.52 16.31 0.00 41.97
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdb 2783.00 6362.00 3972.00 169.00 49328.00 52360.00 49.11 5.35 1.29 0.93 9.91 0.24 98.90
vdc 2760.00 6361.00 4020.00 167.00 49296.00 51764.00 48.27 5.19 1.23 0.88 9.59 0.24 98.70
vdd 2747.00 6361.00 4010.00 169.00 49332.00 51860.00 48.43 5.18 1.24 0.88 9.64 0.24 98.80
vde 2757.00 6362.00 4032.00 169.00 49344.00 51864.00 48.18 5.13 1.22 0.87 9.54 0.23 98.50
vdf 2732.00 6360.00 4012.00 191.00 49336.00 52532.00 48.47 5.21 1.24 0.85 9.50 0.24 99.10
vdg 2716.00 6361.00 4039.00 191.00 49320.00 52036.00 47.92 5.28 1.25 0.86 9.35 0.23 99.20
vdh 2742.00 6361.00 4038.00 191.00 49340.00 52032.00 47.94 5.46 1.29 0.91 9.26 0.23 98.80
vdi 2749.00 6361.00 4041.00 193.00 49328.00 52544.00 48.12 5.35 1.26 0.88 9.40 0.23 99.20
vdj 2776.00 6385.00 3953.00 169.00 49344.00 52020.00 49.18 5.41 1.31 0.93 10.12 0.24 99.10
vdk 2767.00 6384.00 3999.00 171.00 49328.00 52028.00 48.61 5.52 1.32 0.96 9.76 0.24 99.10
vdl 2775.00 6386.00 3984.00 172.00 49328.00 52032.00 48.78 5.56 1.34 0.97 9.94 0.24 99.10
vdm 2759.00 6385.00 4039.00 172.00 49296.00 52416.00 48.31 5.58 1.32 0.95 9.87 0.23 98.90
vdn 2776.00 6369.00 3967.00 163.00 49352.00 51840.00 49.00 5.48 1.32 0.95 10.33 0.24 99.70
vdo 2776.00 6370.00 3978.00 163.00 49352.00 52220.00 49.06 5.42 1.31 0.93 10.56 0.24 99.30
vdp 2782.00 6370.00 4004.00 162.00 49356.00 51840.00 48.58 5.60 1.34 0.97 10.44 0.24 99.70
vdq 2759.00 6370.00 4033.00 161.00 49352.00 51828.00 48.25 5.61 1.34 0.97 10.48 0.24 99.50
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 108392.00 104717.00 790740.00 837136.00 15.28 1293.57 5.82 0.94 10.87 0.00 100.90
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
top - 07:41:20 up 12 days, 22:02, 2 users, load average: 61.15, 26.97, 10.80
Tasks: 607 total, 39 running, 568 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.4 us, 54.3 sy, 0.0 ni, 21.2 id, 21.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 52819500+total, 2192660 free, 33286761+used, 19313472+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 19267680+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
65466 digoal 20 0 0.296t 25764 2044 R 68.4 0.0 1:11.05 postgres: parallel worker for PID 65299
65450 digoal 20 0 0.296t 25756 2076 D 63.2 0.0 1:10.51 postgres: parallel worker for PID 65299
65460 digoal 20 0 0.296t 25764 2048 R 63.2 0.0 1:10.37 postgres: parallel worker for PID 65299
65469 digoal 20 0 0.296t 25752 2056 R 63.2 0.0 1:10.48 postgres: parallel worker for PID 65299
65474 digoal 20 0 0.296t 25764 2052 R 63.2 0.0 1:10.36 postgres: parallel worker for PID 65299
65478 digoal 20 0 0.296t 25764 2060 R 63.2 0.0 1:10.64 postgres: parallel worker for PID 65299
65479 digoal 20 0 0.296t 25752 2056 R 63.2 0.0 1:10.47 postgres: parallel worker for PID 65299
65484 digoal 20 0 0.296t 25760 2056 R 63.2 0.0 1:10.63 postgres: parallel worker for PID 65299
65485 digoal 20 0 0.296t 25748 2068 R 63.2 0.0 1:11.10 postgres: parallel worker for PID 65299
Total DISK READ : 834.93 M/s | Total DISK WRITE : 1006.90 M/s
Actual DISK READ: 835.23 M/s | Actual DISK WRITE: 994.87 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
65461 be/4 digoal 13.90 M/s 12.74 M/s 0.00 % 51.43 % postgres: parallel worker for PID 65299
65457 be/4 digoal 13.86 M/s 12.95 M/s 0.00 % 51.25 % postgres: parallel worker for PID 65299
65505 be/4 digoal 14.82 M/s 13.05 M/s 0.00 % 51.09 % postgres: parallel worker for PID 65299
65466 be/4 digoal 12.57 M/s 13.48 M/s 0.00 % 50.72 % postgres: parallel worker for PID 65299
65462 be/4 digoal 13.16 M/s 13.23 M/s 0.00 % 50.70 % postgres: parallel worker for PID 65299
65491 be/4 digoal 8.85 M/s 12.99 M/s 0.00 % 50.59 % postgres: parallel worker for PID 65299
65451 be/4 digoal 12.33 M/s 13.48 M/s 0.00 % 50.57 % postgres: parallel worker for PID 65299
65477 be/4 digoal 12.37 M/s 13.20 M/s 0.00 % 50.38 % postgres: parallel worker for PID 65299
65459 be/4 digoal 8.45 M/s 19.33 M/s 0.00 % 50.27 % postgres: parallel worker for PID 65299
65488 be/4 digoal 12.34 M/s 12.74 M/s 0.00 % 50.21 % postgres: parallel worker for PID 65299
65495 be/4 digoal 13.83 M/s 13.26 M/s 0.00 % 50.19 % postgres: parallel worker for PID 65299
65450 be/4 digoal 9.20 M/s 19.45 M/s 0.00 % 50.14 % postgres: parallel worker for PID 65299
65503 be/4 digoal 14.02 M/s 19.66 M/s 0.00 % 50.13 % postgres: parallel worker for PID 65299
Wait events
postgres=# select wait_event,wait_event_type,count(*) from pg_stat_activity where wait_event is not null group by 1,2;
wait_event | wait_event_type | count
---------------------+-----------------+-------
BufFileRead | IO | 59
BufFileWrite | IO | 1
CheckpointerMain | Activity | 1
BgWriterHibernate | Activity | 1
AutoVacuumMain | Activity | 1
LogicalLauncherMain | Activity | 1
WalWriterMain | Activity | 1
(7 rows)
-[ RECORD 7 ]----+---------------------------------------------------------------------------------
datid | 13220
datname | postgres
pid | 65448
usesysid | 10
usename | postgres
application_name | pgbench
client_addr |
client_hostname |
client_port |
backend_start | 2018-10-02 07:38:46.003833+08
xact_start | 2018-10-02 07:38:46.003114+08
query_start | 2018-10-02 07:38:46.003114+08
state_change | 2018-10-02 07:38:46.00439+08
wait_event_type | IO
wait_event | BufFileRead
state | active
backend_xid |
backend_xmin | 598
query | alter table pgbench_accounts add primary key (aid) using index tablespace "tbs2"
backend_type | parallel worker
Temporary space usage was about 19 TB; once the workers finished, index merging began.
digoal@pg11-320tb-zfs-> cd $PGDATA/base
digoal@pg11-320tb-zfs-> du -sh *
16M 1
16M 13219
16M 13220
19T pgsql_tmp
After each parallel worker completed its share, the leader merged the index. Merge throughput:
Total DISK READ : 116.21 M/s | Total DISK WRITE : 169.91 M/s
Actual DISK READ: 116.21 M/s | Actual DISK WRITE: 197.28 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
65299 be/4 digoal 116.21 M/s 169.91 M/s 0.00 % 8.13 % postgres: postgres postgres [local] ALTER TABLE
65298 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % pgbench -i -s 10000000 -I p -n --index-tablespace=tbs2
51030 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres
51032 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: logger
51034 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: checkpointer
51035 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: background writer
51036 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: walwriter
51037 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: autovacuum launcher
51038 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: stats collector
51039 be/4 digoal 0.00 B/s 0.00 B/s 0.00 % 0.00 % postgres: logical replication launcher
Index size: 20 TB
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-----------------------+-------+----------+------------------+---------+-------------
public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 20 TB |
public | pgbench_branches_pkey | index | postgres | pgbench_branches | 213 MB |
public | pgbench_tellers_pkey | index | postgres | pgbench_tellers | 2125 MB |
(3 rows)
Index build time: 7130 minutes
Start: 2018-10-02 07:51:00
End: 2018-10-07 06:41:08
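The reported duration can be cross-checked from the start and end timestamps:

```python
from datetime import datetime

# Start and end of the index build, as logged above.
start = datetime(2018, 10, 2, 7, 51, 0)
end = datetime(2018, 10, 7, 6, 41, 8)

# Elapsed wall-clock time in minutes.
minutes = (end - start).total_seconds() / 60
print(round(minutes))  # 7130 minutes (about 427800 seconds)
```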
Benchmark scripts
Read-only
vi ro.sql
\set aid random_gaussian(1, :range, 10.0)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./ro.sql -c 64 -j 64 -T 360 -D range=1000000000000
Read-write
vi rw.sql
\set aid random_gaussian(1, :range, 10.0)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=10000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=100000000000
pgbench -M prepared -n -r -P 1 -f ./rw.sql -c 64 -j 64 -T 360 -D scale=10000000 -D range=1000000000000
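The `:range` parameter controls how much of the table is "hot": pgbench's `random_gaussian(lb, ub, param)` concentrates values around the midpoint of `[lb, ub]`, and a larger `param` means tighter concentration. A rough Python simulation of that behavior (a sketch, not pgbench's exact algorithm):

```python
import random

def random_gaussian(lb, ub, param, rng):
    """Approximate pgbench's random_gaussian: draw a standard normal,
    reject values outside [-param, param], then map onto [lb, ub]."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if -param <= z < param:
            break
    return lb + int((ub - lb + 1) * (z + param) / (2.0 * param))

rng = random.Random(42)
range_ = 100_000_000
samples = [random_gaussian(1, range_, 10.0, rng) for _ in range(10_000)]
mid = (1 + range_) / 2

# Values cluster near the middle of [1, :range], so shrinking :range
# shrinks the active working set and makes it cacheable.
print(min(samples), max(samples), round(sum(samples) / len(samples)))
```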
1. 1 trillion-row tpcb read-only test - 1 trillion rows active
TPS: 21492
QPS: 21492
transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 7737610
latency average = 2.977 ms
latency stddev = 35.840 ms
tps = 21492.371917 (including connections establishing)
tps = 21495.359217 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random_gaussian(1, :range, 10.0)
2.975 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
1. 1 trillion-row tpcb read-write test - 1 trillion rows active
TPS: 16143
QPS: 80715
transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 5812634
latency average = 3.963 ms
latency stddev = 48.480 ms
tps = 16143.312370 (including connections establishing)
tps = 16145.557184 (excluding connections establishing)
statement latencies in milliseconds:
0.003 \set aid random_gaussian(1, :range, 10.0)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.025 BEGIN;
3.511 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.063 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.155 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.119 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.047 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.041 END;
2. 1 trillion-row tpcb read-only test - 100 billion rows active
TPS: 23101
QPS: 23101
transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 8317050
latency average = 2.770 ms
latency stddev = 34.609 ms
tps = 23101.921465 (including connections establishing)
tps = 23105.640572 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random_gaussian(1, :range, 10.0)
2.766 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
2. 1 trillion-row tpcb read-write test - 100 billion rows active
TPS: 18126
QPS: 90630
transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 6526147
latency average = 3.529 ms
latency stddev = 50.638 ms
tps = 18126.367839 (including connections establishing)
tps = 18134.592726 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random_gaussian(1, :range, 10.0)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.025 BEGIN;
3.102 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.061 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.159 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.091 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.046 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.041 END;
3. 1 trillion-row tpcb read-only test - 10 billion rows active
TPS: 590692
QPS: 590692
transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 212661629
latency average = 0.108 ms
latency stddev = 12.493 ms
tps = 590692.703049 (including connections establishing)
tps = 590774.219034 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set aid random_gaussian(1, :range, 10.0)
0.107 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
3. 1 trillion-row tpcb read-write test - 10 billion rows active
TPS: 75002
QPS: 375010
transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 27002477
latency average = 0.853 ms
latency stddev = 39.222 ms
tps = 75002.036277 (including connections establishing)
tps = 75012.139249 (excluding connections establishing)
statement latencies in milliseconds:
0.003 \set aid random_gaussian(1, :range, 10.0)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.044 BEGIN;
0.211 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.076 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.198 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.115 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.072 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.134 END;
4. 1 trillion-row tpcb read-only test - 1 billion rows active
TPS: 1039962
QPS: 1039962
transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 374399291
latency average = 0.061 ms
latency stddev = 5.647 ms
tps = 1039962.270864 (including connections establishing)
tps = 1040949.958600 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random_gaussian(1, :range, 10.0)
0.061 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
4. 1 trillion-row tpcb read-write test - 1 billion rows active
TPS: 82148
QPS: 410740
transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 29574604
latency average = 0.779 ms
latency stddev = 16.197 ms
tps = 82148.432097 (including connections establishing)
tps = 82160.286498 (excluding connections establishing)
statement latencies in milliseconds:
0.003 \set aid random_gaussian(1, :range, 10.0)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.043 BEGIN;
0.144 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.074 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.207 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.106 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.070 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.131 END;
5. 1 trillion-row tpcb read-only test - 100 million rows active
TPS: 1068052
QPS: 1068052
transaction type: ./ro.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 384510720
latency average = 0.060 ms
latency stddev = 4.332 ms
tps = 1068052.373377 (including connections establishing)
tps = 1068206.696327 (excluding connections establishing)
statement latencies in milliseconds:
0.002 \set aid random_gaussian(1, :range, 10.0)
0.059 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
5. 1 trillion-row tpcb read-write test - 100 million rows active
TPS: 78647
QPS: 393235
transaction type: ./rw.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 360 s
number of transactions actually processed: 28314309
latency average = 0.814 ms
latency stddev = 16.406 ms
tps = 78647.191352 (including connections establishing)
tps = 78658.751759 (excluding connections establishing)
statement latencies in milliseconds:
0.003 \set aid random_gaussian(1, :range, 10.0)
0.001 \set bid random(1, 1 * :scale)
0.001 \set tid random(1, 10 * :scale)
0.001 \set delta random(-5000, 5000)
0.043 BEGIN;
0.184 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.076 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.217 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.096 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.069 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.125 END;
Performance summary
Environment: Alibaba Cloud ECS + 320 TB ESSD
Table size: 120 TB; load time 810688 seconds, about 1.233 million rows/s
Index size: 20 TB; build time 427800 seconds
Index depth: 4 levels (with BLOCK_SIZE=32K each page holds more items than with 8K, so the index on 1 trillion rows has fewer levels than an 8K-block index on 100 billion rows)
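A back-of-the-envelope fanout estimate shows why; the ~20 bytes per index entry (key plus item overhead) is an assumed round figure, not a measured one:

```python
import math

def btree_levels(nrows, block_size, entry_bytes=20):
    """Rough b-tree depth estimate: fanout ~ page bytes / entry size,
    depth = ceil(log_fanout(nrows))."""
    fanout = block_size // entry_bytes
    return math.ceil(math.log(nrows) / math.log(fanout))

# 8K pages: fanout ~400, so 100 billion keys need 5 levels.
print(btree_levels(100_000_000_000, 8192))
# 32K pages: fanout ~1600, so even 1 trillion keys fit in 4 levels.
print(btree_levels(1_000_000_000_000, 32768))
```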
Rows in table | TEST CASE | QPS | TPS |
---|---|---|---|
1 trillion | tpcb read-only, 100 million rows active | 1068052 | 1068052 |
1 trillion | tpcb read-only, 1 billion rows active | 1039962 | 1039962 |
1 trillion | tpcb read-only, 10 billion rows active | 590692 | 590692 |
1 trillion | tpcb read-only, 100 billion rows active | 23101 | 23101 |
1 trillion | tpcb read-only, 1 trillion rows active | 21492 | 21492 |
1 trillion | tpcb read-write, 100 million rows active | 393235 | 78647 |
1 trillion | tpcb read-write, 1 billion rows active | 410740 | 82148 |
1 trillion | tpcb read-write, 10 billion rows active | 375010 | 75002 |
1 trillion | tpcb read-write, 100 billion rows active | 90630 | 18126 |
1 trillion | tpcb read-write, 1 trillion rows active | 80715 | 16143 |
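In the read-write rows, QPS is TPS times the 5 SQL statements per transaction in rw.sql (3 UPDATEs, 1 SELECT, 1 INSERT; BEGIN/END excluded); the read-only script runs a single SELECT, so there QPS equals TPS. A quick consistency check of the table:

```python
# (TPS, QPS) for the read-write rows, keyed by active-row count.
rw_results = {
    "100M": (78647, 393235),
    "1B": (82148, 410740),
    "10B": (75002, 375010),
    "100B": (18126, 90630),
    "1T": (16143, 80715),
}
# Each rw.sql transaction issues exactly 5 SQL statements.
for active, (tps, qps) in rw_results.items():
    assert qps == tps * 5, active
print("ok")
```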
Adding a column (with a default value) took 1.25 seconds.
Dropping a column took 1 millisecond.
Appendix - partitioning pgbench_accounts, loading data in parallel, dynamic SQL
What problems does a single table with 1 trillion rows bring?
1. At 125 TB, index creation takes much longer. Parallel index builds, introduced in PG 11, address this.
2. At 125 TB, garbage collection takes much longer. The zheap engine proposed for PG 12 is intended to eliminate this entirely.
3. At 125 TB, FREEZE takes much longer and may not even complete within 2 billion transactions. A future PG version with XIDs wider than 32 bits will solve this for good.
4. A 125 TB table must sit in a single directory, which can run into file-system limits (inodes, capacity, and so on).
5. A 125 TB table is awkward to clean up. If the data ages along a time dimension, a partitioned table (e.g. via pg_pathman) manages hot and cold data far better.
Converting pgbench to a partitioned table:
1. pg_pathman is recommended; its performance overhead is minimal. The built-in partitioning still has performance problems.
《PostgreSQL 9.x, 10, 11 hash分区表 用法举例》
《PostgreSQL 10 内置分区 vs pg_pathman perf profiling》
《分区表锁粒度差异 - pg_pathman VS native partition table》
《PostgreSQL 查询涉及分区表过多导致的性能问题 - 性能诊断与优化(大量BIND, spin lock, SLEEP进程)》
If you use built-in partitioning, issue the SQL dynamically to avoid the BIND problem.
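With built-in partitioning, a prepared statement's generic plan has to lock and examine many partitions at BIND time; issuing the SQL dynamically from inside a function sidesteps that, because the planner sees the literal key and prunes to one partition. A minimal sketch (the function name is illustrative):

```sql
create or replace function get_abalance(i_aid int8) returns int as $$
declare
  res int;
begin
  -- EXECUTE plans a fresh statement per call with the literal aid,
  -- so only the single matching partition is touched.
  execute format('select abalance from pgbench_accounts where aid=%s', i_aid) into res;
  return res;
end;
$$ language plpgsql strict;
```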
Partitioning demo
《PostgreSQL pgbench tpcb 数据生成与SQL部分源码解读》
《PostgreSQL pgbench tpcb 海量数据库测试 - 分区表测试优化》
Loading the data
1. Tables
pgbench -i -I dt --tablespace=tbs1 -s 10000000
2. Partitions
create table p (like pgbench_accounts) partition by RANGE ( aid ) tablespace tbs1;
do language plpgsql $$
declare
i_rows_perpartition int8 := 244140625;
begin
for i in 0..4096 loop
execute format ('create table pgbench_accounts%s partition of p for values from (%s) to (%s) tablespace tbs1', i, i*i_rows_perpartition, (i+1)*i_rows_perpartition);
end loop;
end;
$$;
drop table pgbench_accounts;
alter table p rename to pgbench_accounts;
-- alter table pgbench_accounts add constraint pk_pgbench_accounts_aid primary key (aid) using index tablespace tbs2;
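The partition arithmetic behind the loop above: 4096 partitions of 244140625 rows each cover exactly 10^12 aids, and because a range partition's upper bound (`FOR VALUES FROM (lo) TO (hi)`) is exclusive, the loop's extra partition 4096 exists to catch the top boundary value:

```python
rows_per_partition = 244_140_625

# 4096 equal partitions cover aids 0 .. 10**12 - 1.
assert 4096 * rows_per_partition == 10**12

# The upper bound of partition 4095 is exclusive, so aid = 10**12
# (the largest pgbench aid at this scale) lands in partition 4096.
last_regular_upper = 4096 * rows_per_partition
assert last_regular_upper == 10**12
print("partition layout ok")
```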
3. Task queue
drop table task;
create table task(id int primary key);
insert into task select i from generate_series(0,4095) t(i);
4. Seed rows
create table init_accounts(aid int8);
insert into init_accounts select generate_series(0,244140624);
5. Parallel-load UDF
create or replace function tpcb_init_accounts() returns void as $$
declare
v_id int;
begin
with tmp as (select * from task limit 1 for update skip locked),
tmp1 as (delete from task using tmp where task.id=tmp.id)
select id into v_id from tmp;
if found then
execute format ('insert into pgbench_accounts%s select aid+%s*244140625::int8, ((aid+%s*244140625::int8)-1)/100000 + 1, 0 from init_accounts on conflict do nothing', v_id, v_id, v_id);
end if;
end;
$$ language plpgsql strict;
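The offset and bid arithmetic inside the UDF can be sanity-checked: worker v_id shifts the 244140625 seed aids into its own partition's range, and `((aid)-1)/100000 + 1` reproduces pgbench's 100000-accounts-per-branch layout (integer division, as in SQL):

```python
ROWS = 244_140_625  # rows per partition / per seed batch

def shifted_aid(seed_aid, v_id):
    # aid + v_id * 244140625, as in the UDF's INSERT
    return seed_aid + v_id * ROWS

def bid(aid):
    # ((aid)-1)/100000 + 1 with integer division (positive aids)
    return (aid - 1) // 100000 + 1

# Worker 1's batch starts exactly where worker 0's ends.
assert shifted_aid(0, 1) == ROWS
# pgbench packs 100000 accounts per branch.
assert bid(1) == 1 and bid(100_000) == 1 and bid(100_001) == 2
print("ok")
```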
6. Load data in parallel
vi test.sql
select tpcb_init_accounts();
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &
Building the indexes
1. Task table
drop table task;
create table task(id int primary key);
insert into task select i from generate_series(0,4095) t(i);
2. Parallel index-build UDF
create or replace function tpcb_init_accounts_pkey() returns void as $$
declare
v_id int;
begin
with tmp as (select * from task limit 1 for update skip locked),
tmp1 as (delete from task using tmp where task.id=tmp.id)
select id into v_id from tmp;
if found then
execute format ('analyze pgbench_accounts%s', v_id);
execute format ('alter table pgbench_accounts%s add constraint pk_pgbench_accounts%s_aid primary key (aid) using index tablespace tbs2', v_id, v_id);
end if;
end;
$$ language plpgsql strict;
3. Build indexes in parallel
vi test.sql
select tpcb_init_accounts_pkey();
nohup pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 100 >./init.log 2>&1 &
Summary
1. With an 8K block size, a single table maxes out at 32 TB (the block number in a ctid is a 32-bit address, so 8K blocks give a 32 TB ceiling; the 1-trillion-row table in this test exceeds 32 TB, so a larger block size is required, and 32K lifts the single-table ceiling to 128 TB).
Set the block size at compile time with --with-blocksize:
./configure --with-blocksize=32
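The ceiling follows directly from the 32-bit block number: maximum table size is roughly 2^32 blocks times the block size:

```python
MAX_BLOCKS = 2**32  # ctid block numbers are 32-bit

def max_table_tb(block_size):
    """Approximate single-table size ceiling in TB (TiB)."""
    return MAX_BLOCKS * block_size // 2**40

print(max_table_tb(8192))   # default 8K blocks: 32 TB
print(max_table_tb(32768))  # 32K blocks: 128 TB, enough for this 125 TB table
```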
2. How do you back up a database this large efficiently, and do point-in-time recovery?
Full backup: 1. ZFS snapshot shipped to a backup host (a 10GbE NIC can be saturated). 2. pg_basebackup. 3. pg_rman. 4. Cloud-disk snapshot.
Incremental backup: 1. ZFS incremental snapshot sent to a backup host. 2. pg_basebackup supports full backups only. 3. pg_rman incremental backup (blocks changed since the last backup are identified by their LSN). 4. Cloud-disk incremental snapshot.
WAL archiving: archive the WAL files.
Point-in-time recovery: 1. ZFS snapshot clone + WAL replay to the target time. 2. Full restore + WAL replay. 3. Full + incremental restore + WAL replay.
3. In this tpcb run with 64 concurrent clients, the first ten-odd seconds were dominated by BIND time.
4. pg_pathman is recommended for partitioning large tables. How large does a table have to be before partitioning pays off? See:
《HTAP数据库 PostgreSQL 场景与性能测试之 45 - (OLTP) 数据量与性能的线性关系(10亿+无衰减), 暨单表多大需要分区》
《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》
References
《PostgreSQL 11 tpcc 测试(103万tpmC on ECS) - use sysbench-tpcc by Percona-Lab》
《(TPC-H测试 SF=10,SF=200) PostgreSQL 11 vs 10 vs Deepgreen》
《PostgreSQL 100亿 tpcb 性能 on ECS》
《[未完待续] PostgreSQL on 阿里云ECS+ESSD - 1000亿 tpcb、1000W tpcc 测试》