PostgreSQL 10.0 preview 优化器改进 - 不完整索引支持复合排序

3 minute read

背景

当我们在使用数据库时,排序是一个比较常见的需求,排序有几种方法,使用索引,或者访问堆表然后显示的排序。

当使用索引排序时,索引必须包含排序列,同时必须是驱动列包含排序列。

例如

order by a,b,c,那么可使用索引(a,b,c,*)  
  
但是order by a,b,c能使用索引(a,b)或者(a)吗?  

实际上使用不完整索引,复合排序时,我们可以分为两个阶段,一个节点是按索引扫描,当扫描到一样的索引数据时,再从HEAP读取其他需要排序的列,然后使用quicksort或者其他排序手段对剩余字段进行排序。

这样原本需要再创建一个复合索引来支持复合排序就不必了。

Hackers!  
  
Currently when we need to get ordered result from table we have to choose  
one of two approaches: get results from index in exact order we need or do  
sort of tuples. However, it could be useful to mix both methods: get  
results from index in order which partially meets our requirements and do  
rest of work from heap.  
  
Two attached patches are proof of concept for this approach.  
  
*partial-sort-1.patch*  
  
This patch allows to use index for order-by if order-by clause and index  
has non-empty common prefix. So, index gives right ordering for first n  
order-by columns. In order to provide right order for rest m columns, sort  
node is inserted. This sort node sorts groups of tuples where values of  
first n order-by columns are equal.  
  
See an example.  
  
create table test as (select id, (random()*10000)::int as v1, random() as  
v2 from generate_series(1,1000000) id);  
create index test_v1_idx on test (v1);  
  
We've index by v1 column, but we can get results ordered by v1, v2.  
  
postgres=# select * from test order by v1, v2 limit 10;  
   id   | v1 |         v2  
--------+----+--------------------  
 390371 |  0 | 0.0284479795955122  
 674617 |  0 | 0.0322008323855698  
 881905 |  0 |  0.042586590629071  
 972877 |  0 | 0.0531588457524776  
 364903 |  0 | 0.0594307743012905  
  82333 |  0 | 0.0666455538012087  
 266488 |  0 |  0.072808934841305  
 892215 |  0 | 0.0744258034974337  
  13805 |  0 | 0.0794667331501842  
 338435 |  0 |  0.171817752998322  
(10 rows)  
  
And it's fast using following plan.  
  
                                                                QUERY PLAN  
  
------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=69214.06..69214.08 rows=10 width=16) (actual  
time=0.097..0.099 rows=10 loops=1)  
   ->  Sort  (cost=69214.06..71714.06 rows=1000000 width=16) (actual  
time=0.096..0.097 rows=10 loops=1)  
         Sort Key: v1, v2  
         Sort Method: top-N heapsort  Memory: 25kB  
         ->  Index Scan using test_v1_idx on test  (cost=0.42..47604.42  
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)  
 Total runtime: 0.125 ms  
(6 rows)  
  
For sure, this approach is effective only when first n order-by columns we  
selected provides enough count of unique values (so, sorted groups are  
small). Patch is only PoC because it doesn't contains any try to estimate  
right cost of using partial sort.  
  
  
*partial-knn-1.patch*  
  
KNN-GiST provides ability to get ordered results from index, but this order  
is based only on index information. For instance, GiST index contains  
bounding rectangles for polygons, and we can't get exact distance to  
polygon from index (similar situation is in PostGIS). In attached patch,  
GiST distance method can set recheck flag (similar to consistent method).  
This flag means that distance method returned lower bound of distance and  
we should recheck it from heap.  
  
See an example.  
  
create table test as (select id, polygon(3+(random()*10)::int,  
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from  
generate_series(1,1000000) id);  
create index test_idx on test using gist (p);  
  
We can get results ordered by distance from polygon to point.  
  
postgres=# select id, p <-> point(0.5,0.5) from test order by p <->  
point(0.5,0.5) limit 10;  
   id   |       ?column?  
--------+----------------------  
 755611 | 0.000405855808916853  
 807562 | 0.000464123777564343  
 437778 | 0.000738524708741959  
 947860 |  0.00076250998760724  
 389843 | 0.000886362723569568  
  17586 | 0.000981960100555216  
 411329 |  0.00145338112316853  
 894191 |  0.00149399559703506  
 391907 |   0.0016647896049741  
 235381 |  0.00167554614889509  
(10 rows)  
  
It's fast using just index scan.  
  
                                                            QUERY PLAN  
  
----------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230  
rows=10 loops=1)  
   ->  Index Scan using test_idx on test  (cost=0.29..157672.29  
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)  
         Order By: (p <-> '(0.5,0.5)'::point)  
 Total runtime: 0.305 ms  
(4 rows)  
  
This patch is also only PoC because of following:  
1) It's probably wrong at all to get heap tuple from index scan node. This  
work should be done from another node.  
2) Assumption that order-by operator returns float8 comparable with GiST  
distance method result in general case is wrong.  
  
------  
With best regards,  
Alexander Korotkov.  

这个patch的讨论,详见邮件组,本文末尾URL。

PostgreSQL社区的作风非常严谨,一个patch可能在邮件组中讨论几个月甚至几年,根据大家的意见反复的修正,patch合并到master已经非常成熟,所以PostgreSQL的稳定性也是远近闻名的。

参考

https://commitfest.postgresql.org/13/1011/

https://www.postgresql.org/message-id/flat/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com#CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com

Flag Counter

digoal’s 大量PostgreSQL文章入口