PostgreSQL on xfs 性能优化 - 1

5 minute read

背景

性能优化主要分4块，

1. 逻辑卷优化部分

2. XFS mkfs 优化部分

3. XFS mount 优化部分

4. xfsctl 优化部分

以上几个部分都可以通过man手册查看，了解原理和应用场景后着手优化。

man lvcreate  
man xfs  
man mkfs.xfs  
man mount  
man xfsctl  

例子

1. 逻辑卷优化部分

1.1 创建PV前，将块设备对齐，前面1MB最好不要分配，从2048 sector开始分配。

fdisk -c -u /dev/dfa  
start  2048  
end + (2048*n) - 1  

或者使用parted创建并对齐分区。

1.2 主要指定2个参数，

条带数量，和pv数量一致即可

       -i, --stripes Stripes  
              Gives the number of stripes.  This is equal to the number of physical volumes to scatter the logical volume.  

条带大小，和数据库块大小一致，例如postgresql默认为 8KB。

       -I, --stripesize StripeSize  
              Gives the number of kilobytes for the granularity of the stripes.  
              StripeSize must be 2^n (n = 2 to 9) for metadata in LVM1 format.  For metadata in LVM2 format, the stripe size may be a larger power of 2 but must not exceed the physical extent size.  

创建快照时，指定的参数

chunksize, 最好和数据库的块大小一致, 例如postgresql默认为 8KB。

       -c, --chunksize ChunkSize  
              Power of 2 chunk size for the snapshot logical volume between 4k and 512k.  

例如：

预留1%给xfs的LOG DEV (实际2GB就够了)

#lvcreate -i 3 -I 8 -n lv01 -L 2G vgdata01  
  Logical volume "lv01" created  
#lvcreate -i 3 -I 8 -n lv02 -l 100%FREE vgdata01  
  Logical volume "lv02" created  
#lvs  
  LV   VG       Attr   LSize   Origin Snap%  Move Log Copy%  Convert  
  lv02 vgdata01 -wi-a-  17.29t                                        
  lv01 vgdata01 -wi-a-   2g   

2. XFS mkfs 优化部分

首先要搞清楚XFS的layout。

xfs包含3个section，data, log, realtime files。

默认情况下 log存在data里面，没有realtime。所有的section都是由最小单位block组成，初始化xfs是-b指定block size。

2.1 data

包含 metadata(inode, 目录, 间接块), user file data, non-realtime files

data被拆分成多个allocation group，mkfs.xfs时可以指定group的个数，以及单个group的SIZE。

group越多，可以并行进行的文件和块的allocation就越多。你可以认为单个组的操作是串行的，多个组是并行的。

但是组越多，消耗的CPU会越多，需要权衡。对于并发写很高的场景，可以多一些组，（例如一台主机跑了很多小的数据库，每个数据库都很繁忙的场景下）

2.2 log

存储metadata的log，修改metadata前，必须先记录log，然后才能修改data section中的metadata。

也用于crash后的恢复。

2.3 realtime

被划分为很多个小的extents, 要将文件写入到realtime section中，必须使用xfsctl改一下文件描述符的bit位，并且一定要在数据写入前完成。在realtime中的文件大小是realtime extents的倍数关系。

2.4所以mkfs.xfs时，我们能做的优化是：

data section：

allocation group count数量和AGSIZE相乘等于块设备大小。

AG count数量多少和用户需求的并行度相关。

同时AG SIZE的取值范围是16M到1TB，PostgreSQL 建议1GB左右。

选项 -b size=8192 与数据库块大小一致（但不是所有的xfs版本都支持大于4K的block size，所以如果你发现mount失败并且告知只支持4K以下的BLOCK，那么请重新格式化）

选项 -d agcount=9000,sunit=16,swidth=48

假设有9000个并发写操作，使用9000个allocation groups

(单位512 bytes) 与lvm或RAID块设备的条带大小对齐

与lvm或RAID块设备条带跨度大小对齐，以上对应3*8 例如 -i 3 -I 8。

log section：

最好放在SSD上，速度越快越好。最好不要使用cgroup限制LOG块设备的iops操作。

realtime section:

不需要的话，不需要创建。

agsize绝对不能是条带宽度的倍数。(假设条带数为3，条带大小为8K，则宽度为24K。)

如果根据指定agcount算出的agsize是swidth的倍数，会弹出警告：

例如下面的例子，

agsize=156234 blks 是 swidth=6 blks 的倍数 26039。

给出的建议是减掉一个stripe unit即8K，即156234 blks - sunit 2 blks = 156232 blks。

156232 blks换算成字节数= 1562324096 = 639926272 bytes 或 1562324 = 624928K

#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agcount=30000,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02  
Warning: AG size is a multiple of stripe width.  This can cause performance  
problems by aligning all AGs on the same disk.  To avoid this, run mkfs with  
an AG size that is one stripe unit smaller, for example 156232.  
meta-data=/dev/mapper/vgdata01-lv02 isize=256    agcount=30000, agsize=156234 blks  
         =                       sectsz=4096  attr=2, projid32bit=1  
         =                       crc=0        finobt=0  
data     =                       bsize=4096   blocks=4686971904, imaxpct=5  
         =                       sunit=2      swidth=6 blks  
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0  
log      =/dev/mapper/vgdata01-lv01 bsize=4096   blocks=521728, version=2  
         =                       sectsz=512   sunit=2 blks, lazy-count=1  
realtime =none                   extsz=4096   blocks=0, rtextents=0  

对于上面这个mkfs.xfs操作，改成以下

#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=639926272,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02  
或  
#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=624928k,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02  

输出如下

meta-data=/dev/mapper/vgdata01-lv02 isize=256    agcount=30001, agsize=156232 blks  (约600MB)  
         =                       sectsz=4096  attr=2, projid32bit=1  
         =                       crc=0        finobt=0  
data     =                       bsize=4096   blocks=4686971904, imaxpct=5  
         =                       sunit=2      swidth=6 blks  
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0  
log      =/dev/mapper/vgdata01-lv01 bsize=4096   blocks=521728, version=2  
         =                       sectsz=512   sunit=2 blks, lazy-count=1  
realtime =none                   extsz=4096   blocks=0, rtextents=0  

3. XFS mount 优化部分

nobarrier  
largeio 针对数据仓库，流媒体这种大量连续读的应用  
nolargeio 针对OLTP  
logbsize=262144   指定 log buffer  
logdev=  指定log section对应的块设备，用最快的SSD。  
noatime,nodiratime  
swalloc  条带对齐  

例子

#mount -t xfs -o allocsize=16M,inode64,nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv01 /dev/mapper/vgdata01-lv02 /data01  

4. xfsctl 优化部分

略

排错

#mount -o noatime,swalloc /dev/mapper/vgdata01-lv01 /data01  
mount: Function not implemented  

原因是用了不支持的块大小

[ 5736.642924] XFS (dm-0): File system with blocksize 8192 bytes. Only pagesize (4096) or less will currently work.  
[ 5736.695146] XFS (dm-0): SB validate failed with error -38.  

排除，使用4 K的block size

#mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv01,size=2136997888,sunit=16 -d agsize=624928k,sunit=16,swidth=48 /dev/mapper/vgdata01-lv02  

重新mount成功。

#mount -t xfs -o allocsize=16M,inode64,nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv01 /dev/mapper/vgdata01-lv02 /data01  

参考

xfs(5)                                                                  xfs(5)  
  
NAME  
       xfs - layout of the XFS filesystem  
  
DESCRIPTION  
       An  XFS  filesystem  can  reside  on  a  regular  disk  partition  or on a logical volume.  An XFS filesystem has up to three parts: a data section, a log section, and a realtime section.  Using the default  
       mkfs.xfs(8) options, the realtime section is absent, and the log area is contained within the data section.  The log section can be either separate from  the  data  section  or  contained  within  it.   The  
       filesystem sections are divided into a certain number of blocks, whose size is specified at mkfs.xfs(8) time with the -b option.  
  
       The  data  section  contains all the filesystem metadata (inodes, directories, indirect blocks) as well as the user file data for ordinary (non-realtime) files and the log area if the log is internal to the  
       data section.  The data section is divided into a number of allocation groups.  The number and size of the allocation groups are chosen by mkfs.xfs(8) so that there is normally a small number of equal-sized  
       groups.   The number of allocation groups controls the amount of parallelism available in file and block allocation.  It should be increased from the default if there is sufficient memory and a lot of allo-  
       cation activity.  The number of allocation groups should not be set very high, since this can cause large amounts of CPU time to be used by the filesystem, especially when the  filesystem  is  nearly  full.  
       More allocation groups are added (of the original size) when xfs_growfs(8) is run.  
  
       The  log  section  (or  area,  if it is internal to the data section) is used to store changes to filesystem metadata while the filesystem is running until those changes are made to the data section.  It is  
       written sequentially during normal operation and read only during mount.  When mounting a filesystem after a crash, the log is read to complete operations that were in progress at the time of the crash.  
  
       The realtime section is used to store the data of realtime files.  These files had an attribute bit set through xfsctl(3) after file creation, before any data was written to the file.  The realtime  section  
       is divided into a number of extents of fixed size (specified at mkfs.xfs(8) time).  Each file in the realtime section has an extent size that is a multiple of the realtime section extent size.  
  
       Each allocation group contains several data structures.  The first sector contains the superblock.  For allocation groups after the first, the superblock is just a copy and is not updated after mkfs.xfs(8).  
       The next three sectors contain information for block and inode allocation within the allocation group.  Also contained within each allocation group are data structures to  locate  free  blocks  and  inodes;  
       these are located through the header structures.  
  
       Each  XFS filesystem is labeled with a Universal Unique Identifier (UUID).  The UUID is stored in every allocation group header and is used to help distinguish one XFS filesystem from another, therefore you  
       should avoid using dd(1) or other block-by-block copying programs to copy XFS filesystems.  If two XFS filesystems on the same machine have the same UUID, xfsdump(8) may become confused when doing incremen-  
       tal and resumed dumps.  xfsdump(8) and xfsrestore(8) are recommended for making copies of XFS filesystems.  
  
OPERATIONS  
       Some functionality specific to the XFS filesystem is accessible to applications through the xfsctl(3) and by-handle (see open_by_handle(3)) interfaces.  
  
MOUNT OPTIONS  
       Refer to the mount(8) manual entry for descriptions of the individual XFS mount options.  
  
SEE ALSO  
       xfsctl(3), mount(8), mkfs.xfs(8), xfs_info(8), xfs_admin(8), xfsdump(8), xfsrestore(8).  
  
                                                                        xfs(5)   

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

PostgreSQL on xfs 性能优化 - 1

背景

例子

data section：

log section：

realtime section:

排错

参考

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法