Linux 时钟精度 与 PostgreSQL auto_explain (explain timing on Linux时钟开销估算)
背景
我们在诊断SQL的执行计划时,通常会用explain analyze,analyze有几个开关,其中一个是timing,它会帮你记录下SQL每个NODE的执行时间。
但是这部分是有一定的性能开销的,而且这个开销与操作系统的时钟获取接口有关。
有时,你会发现explain analyze的执行时间远大于真实的执行时间,这是为什么呢?
时钟硬件与时间精确度
常见时钟方法精度 tsc > hpet ( 100 纳秒(十亿分之一秒) ) > acpi_pm ( 300 纳秒(十亿分之一秒) )
Clock hardware and timing accuracy
Collecting accurate timing information is normally done on computers using hardware clocks with various levels of accuracy.
With some hardware the operating systems can pass the system clock time almost directly to programs.
A system clock can also be derived from a chip that simply provides timing interrupts, periodic ticks at some known time interval.
In either case, operating system kernels provide a clock source that hides these details.
But the accuracy of that clock source and how quickly it can return results varies based on the underlying hardware.
Inaccurate time keeping can result in system instability. Test any change to the clock source very carefully.
Operating system defaults are sometimes made to favor reliability over best accuracy. And if you are using a virtual machine, look into the recommended time sources compatible with it.
Virtual hardware faces additional difficulties when emulating timers, and there are often per operating system settings suggested by vendors.
The Time Stamp Counter (TSC) clock source is the most accurate one available on current generation CPUs.
It's the preferred way to track the system time when it's supported by the operating system and the TSC clock is reliable.
There are several ways that TSC can fail to provide an accurate timing source, making it unreliable.
Older systems can have a TSC clock that varies based on the CPU temperature, making it unusable for timing.
以前就有遇到过机器时钟跳来跳去的问题,我记得是IBMX3950的堆叠服务器
Trying to use TSC on some older multicore CPUs can give a reported time that's inconsistent among multiple cores.
This can result in the time going backwards, a problem this program checks for. And even the newest systems can fail to provide accurate TSC timing with very aggressive power saving configurations.
Newer operating systems may check for the known TSC problems and switch to a slower, more stable clock source when they are seen.
If your system supports TSC time but doesn't default to that, it may be disabled for a good reason.
And some operating systems may not detect all the possible problems correctly, or will allow using TSC even in situations where it's known to be inaccurate.
The High Precision Event Timer (HPET) is the preferred timer on systems where it's available and TSC is not accurate.
The timer chip itself is programmable to allow up to 100 nanosecond resolution, but you may not see that much accuracy in your system clock.
Advanced Configuration and Power Interface (ACPI) provides a Power Management (PM) Timer, which Linux refers to as the acpi_pm.
The clock derived from acpi_pm will at best provide 300 nanosecond resolution.
Timers used on older PC hardware include the 8254 Programmable Interval Timer (PIT),
the real-time clock (RTC), the Advanced Programmable Interrupt Controller (APIC) timer,
and the Cyclone timer.
These timers aim for millisecond resolution.
explain analyze代码
当开启了explain analyze timing开关时,会设置instrument_option |= INSTRUMENT_TIMER;
src/backend/commands/explain.c
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
const instr_time *planduration)
{
DestReceiver *dest;
QueryDesc *queryDesc;
instr_time starttime;
double totaltime = 0;
int eflags;
int instrument_option = 0;
if (es->analyze && es->timing)
instrument_option |= INSTRUMENT_TIMER;
else if (es->analyze)
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
instrument_option |= INSTRUMENT_BUFFERS;
这个设置会直接影响EXECUTE时的时钟统计,这部分还不是很友好,其实没有必要每条TUPLE都统计这部分时间,只要统计进出NODE的时候的时间即可
src/backend/executor/instrument.c
/* Entry to a plan node */
void
InstrStartNode(Instrumentation *instr)
{
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
INSTR_TIME_SET_CURRENT(instr->starttime);
else
elog(ERROR, "InstrStartNode called twice in a row");
}
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
instr->bufusage_start = pgBufferUsage;
}
/* Exit from a plan node */
void
InstrStopNode(Instrumentation *instr, double nTuples)
{
instr_time endtime;
/* count the returned tuples */
instr->tuplecount += nTuples;
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
/* Add delta of buffer usage since entry to node's totals */
if (instr->need_bufusage)
BufferUsageAccumDiff(&instr->bufusage,
&pgBufferUsage, &instr->bufusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
{
instr->running = true;
instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
}
}
src/include/portability/instr_time.h
#define INSTR_TIME_SET_CURRENT(t) gettimeofday(&(t), NULL)
单条QUERY , EXECUTE可能被多次调用,所以如果要统计时间。
src/backend/executor/execMain.c
void
ExecutorRun(QueryDesc *queryDesc,
ScanDirection direction, uint64 count)
{
if (ExecutorRun_hook)
(*ExecutorRun_hook) (queryDesc, direction, count);
else
standard_ExecutorRun(queryDesc, direction, count);
}
void
standard_ExecutorRun(QueryDesc *queryDesc,
ScanDirection direction, uint64 count)
{
...
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
InstrStartNode(queryDesc->totaltime); // 时钟
...
if (queryDesc->totaltime)
InstrStopNode(queryDesc->totaltime, estate->es_processed);
如果需要处理的记录数非常多,由于频繁调用gettimeofday,就导致了explain analyze timing时间拉长的问题。
例子
1000万纪录的count(*)操作,我们看看实际的执行时间,以及开启analyze timing后,不同时钟硬件的性能影响
create table tbl_time(id int);
insert into tbl_time select generate_series(1,10000000);
\timing
postgres=# SELECT COUNT(*) FROM tbl_time;
count
----------
10000000
(1 row)
Time: 1171.956 ms
使用tsc hpet acpi_pm三种时钟硬件测试性能影响
#cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
1. tsc
#echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.71..169247.72 rows=1 width=8) (actual time=2113.432..2113.432 rows=1 loops=1)
-> Seq Scan on tbl_time (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.013..1128.860 rows=10000000 loops=1)
Planning time: 0.062 ms
Execution time: 2113.514 ms
(4 rows)
timing off
postgres=# explain (analyze on,verbose on,timing off,costs on,buffers on,summary on) select count(*) from item1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Aggregate (cost=254.75..254.75 rows=1 width=8) (actual rows=1 loops=1)
Output: count(*)
Buffers: shared hit=1334
-> Seq Scan on public.item1 (cost=0.00..38.00 rows=100000 width=0) (actual rows=100000 loops=1)
Output: i_id, i_im_id, i_name, i_price, i_data
Buffers: shared hit=1334
Planning Time: 0.044 ms
Execution Time: 10.328 ms
(8 rows)
timing on
postgres=# explain (analyze on,verbose on,timing on,costs on,buffers on,summary on) select count(*) from item1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=254.75..254.75 rows=1 width=8) (actual time=16.712..16.712 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=1334
-> Seq Scan on public.item1 (cost=0.00..38.00 rows=100000 width=0) (actual time=0.007..10.242 rows=100000 loops=1)
Output: i_id, i_im_id, i_name, i_price, i_data
Buffers: shared hit=1334
Planning Time: 0.043 ms
Execution Time: 16.730 ms
(8 rows)
2. hpet
#echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.71..169247.72 rows=1 width=8) (actual time=13968.218..13968.218 rows=1 loops=1)
-> Seq Scan on tbl_time (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..7067.711 rows=10000000 loops=1)
Planning time: 0.059 ms
Execution time: 13968.271 ms
(4 rows)
3. acpi_pm
#echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
postgres=# explain analyze SELECT COUNT(*) FROM tbl_time;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.71..169247.72 rows=1 width=8) (actual time=19641.242..19641.243 rows=1 loops=1)
-> Seq Scan on tbl_time (cost=0.00..144247.77 rows=9999977 width=0) (actual time=0.018..9896.285 rows=10000000 loops=1)
Planning time: 0.060 ms
Execution time: 19641.296 ms
(4 rows)
使用pg_test_timing测试不同的时钟硬件在使用gettimeofday时带来的性能影响
pg_test_timing是用来测试不同硬件时钟的损耗的,使用gettimeofday进行测试
下面是测试
1. tsc
pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 47.70 nsec
Histogram of timing durations:
< usec % of total count
1 95.23986 59893249
2 4.75540 2990515
4 0.00384 2414
8 0.00077 485
16 0.00013 79
32 0.00000 3
64 0.00000 1
2. hpet
pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 696.44 nsec
Histogram of timing durations:
< usec % of total count
1 31.81944 1370669
2 67.06767 2889038
4 1.03890 44752
8 0.05959 2567
16 0.01418 611
32 0.00016 7
64 0.00005 2
3. acpi_pm
pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 919.07 nsec
Histogram of timing durations:
< usec % of total count
1 12.25423 399999
2 84.17305 2747553
4 3.45019 112620
8 0.08648 2823
16 0.03468 1132
32 0.00132 43
64 0.00003 1
128 0.00003 1
根据以上测试,可以预估前面EXPLAIN ANALYZE带来的问题
acpi_pm
<1到底是多少不知道了,<1越多,误差越大
10000000*(1*0.1225+2*0.8417)/1000000
hpet
<1到底是多少不知道了,<1越多,误差越大
10000000*(1*0.3182+2*0.6706)/1000000
tsc
不太好估算,因为<1到底是多少不知道了
10000000*(1*0.9524+2*0.0475)/1000000
auto_explain log timing也有类似问题
当开启auto_explain的timing计数后,由于时钟开销的问题,可能严重的影响性能,如本文所示,即使使用TSC时钟,对于扫描记录数很多的时候,性能下降了一倍。
虽然auto_explain有一个超时阈值,但是当你开启了timing的记录后,就会导致它把所有的NODE执行时间都记录下来,因为执行结束前,并不知道总时间会不会超,所以每条QUERY的ANALYZE TIMING都会被开启。
建议如果不是特殊需求,不要开启auto_explain timing选项。
参考
https://www.ibm.com/developerworks/cn/linux/1308_liuming_linuxtime4/
https://www.postgresql.org/docs/9.6/static/pgtesttiming.html
https://www.postgresql.org/docs/9.6/static/auto-explain.html
https://www.postgresql.org/message-id/flat/31856.1400021891%40sss.pgh.pa.us