# 从相似度算法谈起 - Effective similarity search in PostgreSQL

## 背景

《PostgreSQL 在视频、图片去重，图像搜索业务中的应用》

《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL(Haar wavelet)》

《聊一聊双十一背后的技术 - 毫秒分词算啥, 试试正则和相似度》

《PostgreSQL 全文检索加速 快到没有朋友 - RUM索引接口(潘多拉魔盒)》

《PostgreSQL 文本数据分析实践之 - 相似度分析》

## 从最简单的说起 - 如何计算两个数组的相似度

### 算法介绍

Na, Nb – the number of unique elements in the arrays

Nu – the number of unique elements in the union of sets

Ni – the number of unique elements in the intersection of arrays

1. 最简单的相似度算法如下

• 容易理解

• 速度=N*log(N)

• 当Nb, Na很大时，也可以很好的支持

2. 另一种相似度算法

• 速度=N*log(N)

• 当Nb, Na很大时，也可以很好的支持

• Few elements -> large scatter of similarity (当元素很少时，相似度可能会很分散)

• Frequent elements -> weight below (当元素频繁出现时，没有词频的权重，无法得到合理的相似度)

3. TF/IDF系数，解决以上问题

http://en.wikipedia.org/wiki/Tf*idf

### smlar相似度插件

``````git clone git://sigaev.ru/smlar
cd smlar
USE_PGXS=1 make
USE_PGXS=1 make install
``````

``````smlar.threshold = 0.8  # or any other value >0 and <1
``````

``````psql

test=# CREATE EXTENSION smlar;
CREATE EXTENSION
``````

``````test=# SELECT smlar('{1,4,6}'::int[], '{5,4,6}' );
smlar
----------
0.666667
(1 row)

test=# SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' );
smlar
----------
0.666667
(1 row)
``````

``````test=# SELECT '{1,4,6,5,7,9}'::int[] % '{1,5,4,6,7,8,9}'::int[] as similar;
similar
---------
t
(1 row)
``````

GiST/GIN support for % operation.

The parameter “similar.type” allows you to specify what kind of formula used to calculate the similarity: cosine (default), overlap or tfidf.

For “tfidf” need to make additional configuration, but I will not consider this in the article (all can be found in the README file).

Now let’s consider an example of using this extension.

``````test=# SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' );
smlar
----------
0.666667
(1 row)
``````

## 由数组的相似度运算到字符串、图片、….. 的相似度运算

### 字符串相似度

https://www.postgresql.org/docs/9.6/static/pgtrgm.html

``````postgres=# select similarity('hello digoal','hell digoal');
similarity
------------
0.785714
(1 row)
``````

pg_trgm很好用，有很多的索引检索，排序的支持。

《聊一聊双十一背后的技术 - 毫秒分词算啥, 试试正则和相似度》

### 图片相似度

``````CREATE TABLE images (
id serial PRIMARY KEY,
name varchar(50),
image_array integer[]
);

INSERT into images(image_array) VALUES ('{1010257,...,2424257}');

test=# SELECT count(*) from images;
count
--------
200000
(1 row)

test=# EXPLAIN ANALYZE SELECT id FROM images WHERE images.image_array % '{1010259,...,2424252}'::int[];

Aggregate  (cost=14.58..14.59 rows=1 width=0) (actual time=1.785..1.785 rows=1 loops=1)
-&gt;  Seq Scan on images  (cost=0.00..14.50 rows=33 width=0) (actual time=0.115..1.772 rows=20 loops=1)
Filter: (image_array % '{1010259,1011253,...,2423253,2424252}'::integer[])
Total runtime: 5152.819 ms
(4 rows)

CREATE INDEX image_array_gin ON images USING GIN(image_array _int4_sml_ops);

or

CREATE INDEX image_array_gist ON images USING GIST(image_array _int4_sml_ops);
``````

``````test=# EXPLAIN ANALYZE SELECT id FROM images WHERE images.image_array % '{1010259,1011253,...,2423253,2424252}'::int[];

Aggregate  (cost=815.75..815.76 rows=1 width=0) (actual time=320.428..320.428 rows=1 loops=1)
-&gt;  Bitmap Heap Scan on images  (cost=66.42..815.25 rows=200 width=0) (actual time=108.127..304.524 rows=40000 loops=1)
Recheck Cond: (image_array % '{1010259,1011253,...,2424252}'::integer[])
-&gt;  Bitmap Index Scan on image_array_gist  (cost=0.00..66.37 rows=200 width=0) (actual time=90.814..90.814 rows=40000 loops=1)
Index Cond: (image_array % '{1010259,1011253,...,2424252}'::integer[])
Total runtime: 320.487 ms
(6 rows)

test=# SELECT count(*) from images;
count
---------
1000000
(1 row)

test=# EXPLAIN ANALYZE SELECT count(*) FROM images WHERE images.image_array % '{1010259,1011253,...,2423253,2424252}'::int[];

Bitmap Heap Scan on images  (cost=286.64..3969.45 rows=986 width=4) (actual time=504.312..2047.533 rows=200000 loops=1)
Recheck Cond: (image_array % '{1010259,1011253,...,2423253,2424252}'::integer[])
-&gt;  Bitmap Index Scan on image_array_gist  (cost=0.00..286.39 rows=986 width=0) (actual time=446.109..446.109 rows=200000 loops=1)
Index Cond: (image_array % '{1010259,1011253,...,2423253,2424252}'::integer[])
Total runtime: 2152.411 ms
(5 rows)

EXPLAIN ANALYZE SELECT smlar(images.image_array, '{1010259,...,2424252}'::int[]) as similarity FROM images WHERE images.image_array % '{1010259,1011253, ...,2423253,2424252}'::int[] ORDER BY similarity DESC;

Sort  (cost=4020.94..4023.41 rows=986 width=924) (actual time=2888.472..2901.977 rows=200000 loops=1)
Sort Key: (smlar(image_array, '{...,2424252}'::integer[]))
Sort Method: quicksort  Memory: 15520kB
-&gt;  Bitmap Heap Scan on images  (cost=286.64..3971.91 rows=986 width=924) (actual time=474.436..2729.638 rows=200000 loops=1)
Recheck Cond: (image_array % '{...,2424252}'::integer[])
-&gt;  Bitmap Index Scan on image_array_gist  (cost=0.00..286.39 rows=986 width=0) (actual time=421.140..421.140 rows=200000 loops=1)
Index Cond: (image_array % '{...,2424252}'::integer[])
Total runtime: 2912.207 ms
(8 rows)
``````

## 文本的相似度分析

《PostgreSQL 全文检索加速 快到没有朋友 - RUM索引接口(潘多拉魔盒)》

《PostgreSQL 文本数据分析实践之 - 相似度分析》

## 更优秀的图片相似度分析方法

《PostgreSQL 在视频、图片去重，图像搜索业务中的应用》

《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL(Haar wavelet)》

``````float4 smlar(anyarray, anyarray)
- computes similary of two arrays. Arrays should be the same type.

float4 smlar(anyarray, anyarray, bool useIntersect)
-  computes similary of two arrays of composite types. Composite type looks like:
CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
useIntersect option points to use only intersected elements in denominator
see an exmaples in sql/composite_int4.sql or sql/composite_text.sql

float4 smlar( anyarray a, anyarray b, text formula );
- computes similary of two arrays by given formula, arrays should
be the same type.
Predefined variables in formula:
N.i   - number of common elements in both array (intersection)
N.a   - number of uniqueelements in first array
N.b   - number of uniqueelements in second array
Example:
smlar('{1,4,6}'::int[], '{5,4,6}' )
smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)' )
That calls are equivalent.

anyarray % anyarray
- returns true if similarity of that arrays is greater than limit

float4 show_smlar_limit()  - deprecated
- shows the limit for % operation

float4 set_smlar_limit(float4) - deprecated
- sets the limit for % operation

Use instead of show_smlar_limit/set_smlar_limit GUC variable
smlar.threshold (see below)

text[] tsvector2textarray(tsvector)
- transforms tsvector type to text array

anyarray array_unique(anyarray)
- sort and unique array

float4 inarray(anyarray, anyelement)
- returns zero if second argument does not present in a first one
and 1.0 in opposite case

float4 inarray(anyarray, anyelement, float4, float4)
- returns fourth argument if second argument does not present in
a first one and third argument in opposite case

GUC configuration variables:

smlar.threshold  FLOAT
Array's with similarity lower than threshold are not similar
by % operation

smlar.persistent_cache BOOL
Cache of global stat is stored in transaction-independent memory

smlar.type  STRING
Type of similarity formula: cosine(default), tfidf, overlap

smlar.stattable STRING
Name of table stored set-wide statistic. Table should be
defined as
CREATE TABLE table_name (
value   data_type UNIQUE,
ndoc    int4 (or bigint)  NOT NULL CHECK (ndoc>0)
);
And row with null value means total number of documents.
See an examples in sql/*g.sql files
Note: used on for smlar.type = 'tfidf'

smlar.tf_method STRING
Calculation method for term frequency. Values:
"n"     - simple counting of entries (default)
"log"   - 1 + log(n)
"const" - TF is equal to 1
Note: used on for smlar.type = 'tfidf'

smlar.idf_plus_one BOOL
If false (default), calculate idf as log(d/df),
if true - as log(1+d/df)
Note: used on for smlar.type = 'tfidf'

Module provides several GUC variables smlar.threshold, it's highly
custom_variable_classes = 'smlar'       # list of custom variable class names
smlar.threshold = 0.6  #or any other value > 0 and < 1
and other smlar.* variables

GiST/GIN support for % and  && operations for:
Array Type   |  GIN operator class  | GiST operator class
---------------+----------------------+----------------------
bit[]         | _bit_sml_ops         |
bytea[]       | _bytea_sml_ops       | _bytea_sml_ops
char[]        | _char_sml_ops        | _char_sml_ops
cidr[]        | _cidr_sml_ops        | _cidr_sml_ops
date[]        | _date_sml_ops        | _date_sml_ops
float4[]      | _float4_sml_ops      | _float4_sml_ops
float8[]      | _float8_sml_ops      | _float8_sml_ops
inet[]        | _inet_sml_ops        | _inet_sml_ops
int2[]        | _int2_sml_ops        | _int2_sml_ops
int4[]        | _int4_sml_ops        | _int4_sml_ops
int8[]        | _int8_sml_ops        | _int8_sml_ops
interval[]    | _interval_sml_ops    | _interval_sml_ops
money[]       | _money_sml_ops       |
numeric[]     | _numeric_sml_ops     | _numeric_sml_ops
oid[]         | _oid_sml_ops         | _oid_sml_ops
text[]        | _text_sml_ops        | _text_sml_ops
time[]        | _time_sml_ops        | _time_sml_ops
timestamp[]   | _timestamp_sml_ops   | _timestamp_sml_ops
timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops
timetz[]      | _timetz_sml_ops      | _timetz_sml_ops
varbit[]      | _varbit_sml_ops      |
varchar[]     | _varchar_sml_ops     | _varchar_sml_ops
``````

## 参考

https://github.com/postgrespro/imgsmlr

http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/

https://github.com/postgrespro/pg_trgm_pro

https://www.postgresql.org/docs/9.6/static/pgtrgm.html

Tags:

Updated: