ZFS ARC & L2ARC (zfs-$ver/module/zfs/arc.c)


Background

The tunable parameters, their meanings, and their default values can be found in arc.c or under /sys/module/zfs/parameters/$parm_name.

On FreeBSD and other systems with native ZFS support, adjust the corresponding settings in sysctl.conf instead.

parm:           zfs_arc_min:Min arc size (ulong)  
parm:           zfs_arc_max:Max arc size (ulong)  
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)  
parm:           zfs_arc_meta_prune:Bytes of meta data to prune (int)  
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)  
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)  
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)  
parm:           zfs_disable_dup_eviction:disable duplicate buffer eviction (int)  
parm:           zfs_arc_memory_throttle_disable:disable memory throttle (int)  
parm:           zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)  
parm:           l2arc_write_max:Max write bytes per interval (ulong)  
parm:           l2arc_write_boost:Extra write bytes during device warmup (ulong)  
parm:           l2arc_headroom:Number of max device writes to precache (ulong)  
parm:           l2arc_headroom_boost:Compressed l2arc_headroom multiplier (ulong)  
parm:           l2arc_feed_secs:Seconds between L2ARC writing (ulong)  
parm:           l2arc_feed_min_ms:Min feed interval in milliseconds (ulong)  
parm:           l2arc_noprefetch:Skip caching prefetched buffers (int)  
parm:           l2arc_nocompress:Skip compressing L2ARC buffers (int)  
parm:           l2arc_feed_again:Turbo L2ARC warmup (int)  
parm:           l2arc_norw:No reads during writes (int)  
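
On Linux (ZFS on Linux), the current value of each parameter can be read from /sys/module/zfs/parameters, changed at runtime by writing to the corresponding file (for parameters that support it), and made persistent in /etc/modprobe.d/zfs.conf. A minimal sketch; the 8 GiB value for zfs_arc_max is purely illustrative:

# list all zfs module parameters and their descriptions
modinfo zfs | grep '^parm'

# read the current ARC size limit (0 means the built-in default is used)
cat /sys/module/zfs/parameters/zfs_arc_max

# change it at runtime (illustrative value: 8 GiB)
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# make the setting persistent across module reloads and reboots
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf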

A few points worth noting about the L2ARC:

1. The L2ARC is populated by the l2arc_feed_thread() function, which periodically and proactively reads buffers from the ARC. Consequently, anything that is not in the ARC can never end up in the L2ARC.

2. The L2ARC does not store dirty data, so it never needs to write anything back to disk. For that reason it is a poor fit for workloads with frequent modifications (e.g. update-heavy OLTP workloads).

3. If a block cached in the L2ARC becomes dirty in the ARC, the now stale copy is simply dropped from the L2ARC.

4. The L2ARC tuning parameters (set them in /etc/modprobe.d/zfs.conf, or change them on the fly via /sys/module/zfs/parameters/$PARM_NAME); an example configuration follows this list.

 *      l2arc_write_max         max write bytes per interval, i.e. the maximum amount written by a single L2ARC feed pass.  
 *      l2arc_write_boost       extra write bytes during device warmup  
 *      l2arc_noprefetch        skip caching prefetched buffers  
 *      l2arc_nocompress        skip compressing buffers  
 *      l2arc_headroom          number of max device writes to precache  
 *      l2arc_headroom_boost    when we find compressed buffers during ARC  
 *                              scanning, we multiply headroom by this  
 *                              percentage factor for the next scan cycle,  
 *                              since more compressed buffers are likely to  
 *                              be present  
 *      l2arc_feed_secs         seconds between L2ARC writing; shorten this interval to feed data from the ARC into the L2ARC faster.  
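
For example, to feed the L2ARC more aggressively, the write and feed tunables can be raised along the following lines. This is a hedged sketch: the values are illustrative, not recommendations, and should be sized against the write throughput and endurance of the cache device.

# /etc/modprobe.d/zfs.conf -- illustrative L2ARC feed tuning
# 32 MiB per feed interval (default 8 MiB)
options zfs l2arc_write_max=33554432
# 64 MiB per interval while the cache device warms up
options zfs l2arc_write_boost=67108864
# feed once per second (the default)
options zfs l2arc_feed_secs=1
# skip caching prefetched (sequential) buffers (the default)
options zfs l2arc_noprefetch=1

# the same parameters can also be changed on a running system, e.g.:
echo 33554432 > /sys/module/zfs/parameters/l2arc_write_max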

Reference

zfs-0.6.2/module/zfs/arc.c

ARC

/*  
 * DVA-based Adjustable Replacement Cache  
 *  
 * While much of the theory of operation used here is  
 * based on the self-tuning, low overhead replacement cache  
 * presented by Megiddo and Modha at FAST 2003, there are some  
 * significant differences:  
 *  
 * 1. The Megiddo and Modha model assumes any page is evictable.  
 * Pages in its cache cannot be "locked" into memory.  This makes  
 * the eviction algorithm simple: evict the last page in the list.  
 * This also make the performance characteristics easy to reason  
 * about.  Our cache is not so simple.  At any given moment, some  
 * subset of the blocks in the cache are un-evictable because we  
 * have handed out a reference to them.  Blocks are only evictable  
 * when there are no external references active.  This makes  
 * eviction far more problematic:  we choose to evict the evictable  
 * blocks that are the "lowest" in the list.  
 *  
 * There are times when it is not possible to evict the requested  
 * space.  In these circumstances we are unable to adjust the cache  
 * size.  To prevent the cache growing unbounded at these times we  
 * implement a "cache throttle" that slows the flow of new data  
 * into the cache until we can make space available.  
 *  
 * 2. The Megiddo and Modha model assumes a fixed cache size.  
 * Pages are evicted when the cache is full and there is a cache  
 * miss.  Our model has a variable sized cache.  It grows with  
 * high use, but also tries to react to memory pressure from the  
 * operating system: decreasing its size when system memory is  
 * tight.  
 *  
 * 3. The Megiddo and Modha model assumes a fixed page size. All  
 * elements of the cache are therefor exactly the same size.  So  
 * when adjusting the cache size following a cache miss, its simply  
 * a matter of choosing a single page to evict.  In our model, we  
 * have variable sized cache blocks (rangeing from 512 bytes to  
 * 128K bytes).  We therefor choose a set of blocks to evict to make  
 * space for a cache miss that approximates as closely as possible  
 * the space used by the new block.  
 *  
 * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"  
 * by N. Megiddo & D. Modha, FAST 2003  
 */  
  
/*  
 * The locking model:  
 *  
 * A new reference to a cache buffer can be obtained in two  
 * ways: 1) via a hash table lookup using the DVA as a key,  
 * or 2) via one of the ARC lists.  The arc_read() interface  
 * uses method 1, while the internal arc algorithms for  
 * adjusting the cache use method 2.  We therefor provide two  
 * types of locks: 1) the hash table lock array, and 2) the  
 * arc list locks.  
 *  
 * Buffers do not have their own mutexes, rather they rely on the  
 * hash table mutexes for the bulk of their protection (i.e. most  
 * fields in the arc_buf_hdr_t are protected by these mutexes).  
 *  
 * buf_hash_find() returns the appropriate mutex (held) when it  
 * locates the requested buffer in the hash table.  It returns  
 * NULL for the mutex if the buffer was not in the table.  
 *  
 * buf_hash_remove() expects the appropriate hash mutex to be  
 * already held before it is invoked.  
 *  
 * Each arc state also has a mutex which is used to protect the  
 * buffer list associated with the state.  When attempting to  
 * obtain a hash table lock while holding an arc list lock you  
 * must use: mutex_tryenter() to avoid deadlock.  Also note that  
 * the active state mutex must be held before the ghost state mutex.  
 *  
 * Arc buffers may have an associated eviction callback function.  
 * This function will be invoked prior to removing the buffer (e.g.  
 * in arc_do_user_evicts()).  Note however that the data associated  
 * with the buffer may be evicted prior to the callback.  The callback  
 * must be made with *no locks held* (to prevent deadlock).  Additionally,  
 * the users of callbacks must ensure that their private data is  
 * protected from simultaneous callbacks from arc_buf_evict()  
 * and arc_do_user_evicts().  
 *  
 * It as also possible to register a callback which is run when the  
 * arc_meta_limit is reached and no buffers can be safely evicted.  In  
 * this case the arc user should drop a reference on some arc buffers so  
 * they can be reclaimed and the arc_meta_limit honored.  For example,  
 * when using the ZPL each dentry holds a references on a znode.  These  
 * dentries must be pruned before the arc buffer holding the znode can  
 * be safely evicted.  
 *  
 * Note that the majority of the performance stats are manipulated  
 * with atomic operations.  
 *  
 * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:  
 *  
 *      - L2ARC buflist creation  
 *      - L2ARC buflist eviction  
 *      - L2ARC write completion, which walks L2ARC buflists  
 *      - ARC header destruction, as it removes from L2ARC buflists  
 *      - ARC header release, as it removes from L2ARC buflists  
 */  

L2ARC

/*  
 * Level 2 ARC  
 *  
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.  
 * It uses dedicated storage devices to hold cached data, which are populated  
 * using large infrequent writes.  The main role of this cache is to boost  
 * the performance of random read workloads.  The intended L2ARC devices  
 * include short-stroked disks, solid state disks, and other media with  
 * substantially faster read latency than disk.  
 *  
 *                 +-----------------------+  
 *                 |         ARC           |  
 *                 +-----------------------+  
 *                    |         ^     ^  
 *                    |         |     |  
 *      l2arc_feed_thread()    arc_read()  
 *                    |         |     |  
 *                    |  l2arc read   |  
 *                    V         |     |  
 *               +---------------+    |  
 *               |     L2ARC     |    |  
 *               +---------------+    |  
 *                   |    ^           |  
 *          l2arc_write() |           |  
 *                   |    |           |  
 *                   V    |           |  
 *                 +-------+      +-------+  
 *                 | vdev  |      | vdev  |  
 *                 | cache |      | cache |  
 *                 +-------+      +-------+  
 *                 +=========+     .-----.  
 *                 :  L2ARC  :    |-_____-|  
 *                 : devices :    | Disks |  
 *                 +=========+    `-_____-'  
 *  
 * Read requests are satisfied from the following sources, in order:  
 *  
 *      1) ARC  
 *      2) vdev cache of L2ARC devices  
 *      3) L2ARC devices  
 *      4) vdev cache of disks  
 *      5) disks  
 *  
 * Some L2ARC device types exhibit extremely slow write performance.  
 * To accommodate for this there are some significant differences between  
 * the L2ARC and traditional cache design:  
 *  
 * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from  
 * the ARC behave as usual, freeing buffers and placing headers on ghost  
 * lists.  The ARC does not send buffers to the L2ARC during eviction as  
 * this would add inflated write latencies for all ARC memory pressure.  
 *  
 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.  
 * It does this by periodically scanning buffers from the eviction-end of  
 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are  
 * not already there. It scans until a headroom of buffers is satisfied,  
 * which itself is a buffer for ARC eviction. If a compressible buffer is  
 * found during scanning and selected for writing to an L2ARC device, we  
 * temporarily boost scanning headroom during the next scan cycle to make  
 * sure we adapt to compression effects (which might significantly reduce  
 * the data volume we write to L2ARC). The thread that does this is  
 * l2arc_feed_thread(), illustrated below; example sizes are included to  
 * provide a better sense of ratio than this diagram:  
 *  
 *             head -->                        tail  
 *              +---------------------+----------+  
 *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC  
 *              +---------------------+----------+   |   o L2ARC eligible  
 *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer  
 *              +---------------------+----------+   |  
 *                   15.9 Gbytes      ^ 32 Mbytes    |  
 *                                 headroom          |  
 *                                            l2arc_feed_thread()  
 *                                                   |  
 *                       l2arc write hand <--[oooo]--'  
 *                               |           8 Mbyte  
 *                               |          write max  
 *                               V  
 *                +==============================+  
 *      L2ARC dev |####|#|###|###|    |####| ... |  
 *                +==============================+  
 *                           32 Gbytes  
 *  
 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of  
 * evicted, then the L2ARC has cached a buffer much sooner than it probably  
 * needed to, potentially wasting L2ARC device bandwidth and storage.  It is  
 * safe to say that this is an uncommon case, since buffers at the end of  
 * the ARC lists have moved there due to inactivity.  
 *  
 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,  
 * then the L2ARC simply misses copying some buffers.  This serves as a  
 * pressure valve to prevent heavy read workloads from both stalling the ARC  
 * with waits and clogging the L2ARC with writes.  This also helps prevent  
 * the potential for the L2ARC to churn if it attempts to cache content too  
 * quickly, such as during backups of the entire pool.  
 *  
 * 5. After system boot and before the ARC has filled main memory, there are  
 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru  
 * lists can remain mostly static.  Instead of searching from tail of these  
 * lists as pictured, the l2arc_feed_thread() will search from the list heads  
 * for eligible buffers, greatly increasing its chance of finding them.  
 *  
 * The L2ARC device write speed is also boosted during this time so that  
 * the L2ARC warms up faster.  Since there have been no ARC evictions yet,  
 * there are no L2ARC reads, and no fear of degrading read performance  
 * through increased writes.  
 *  
 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that  
 * the vdev queue can aggregate them into larger and fewer writes.  Each  
 * device is written to in a rotor fashion, sweeping writes through  
 * available space then repeating.  
 *  
 * 7. The L2ARC does not store dirty content.  It never needs to flush  
 * write buffers back to disk based storage.  
 *  
 * 8. If an ARC buffer is written (and dirtied) which also exists in the  
 * L2ARC, the now stale L2ARC buffer is immediately dropped.  
 *  
 * The performance of the L2ARC can be tweaked by a number of tunables, which  
 * may be necessary for different workloads:  
 *  
 *      l2arc_write_max         max write bytes per interval  
 *      l2arc_write_boost       extra write bytes during device warmup  
 *      l2arc_noprefetch        skip caching prefetched buffers  
 *      l2arc_nocompress        skip compressing buffers  
 *      l2arc_headroom          number of max device writes to precache  
 *      l2arc_headroom_boost    when we find compressed buffers during ARC  
 *                              scanning, we multiply headroom by this  
 *                              percentage factor for the next scan cycle,  
 *                              since more compressed buffers are likely to  
 *                              be present  
 *      l2arc_feed_secs         seconds between L2ARC writing  
 *  
 * Tunables may be removed or added as future performance improvements are  
 * integrated, and also may become zpool properties.  
 *  
 * There are three key functions that control how the L2ARC warms up:  
 *  
 *      l2arc_write_eligible()  check if a buffer is eligible to cache  
 *      l2arc_write_size()      calculate how much to write  
 *      l2arc_write_interval()  calculate sleep delay between writes  
 *  
 * These three functions determine what to write, how much, and how quickly  
 * to send writes.  
 */  
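
Whether such tuning has the intended effect can be checked against the kstat counters that ZFS on Linux exports; a minimal sketch using /proc/spl/kstat/zfs/arcstats (counter names as found in the 0.6.x series):

# ARC target size, hard limit, and current size
grep -E '^(c|c_max|size) ' /proc/spl/kstat/zfs/arcstats

# ARC and L2ARC hit/miss counters and current L2ARC payload
grep -E '^(hits|misses|l2_hits|l2_misses|l2_size) ' /proc/spl/kstat/zfs/arcstats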
