在PostgreSQL中如何生成测试kmean算法的数据
背景
生成Kmeans的测试数据。
例如每10000为界,生成10个种子,每个节点以100内的随机数相加,生成一组测试数据。
postgres=# create table test(id int, rand int);
CREATE TABLE
postgres=# insert into test select id*10000,trunc(random()*100 + id*10000) from generate_series(1,10) t(id), generate_series(1,100000) t1(rand);
INSERT 0 1000000
postgres=# select id,count(*) from test group by id order by 1;
id | count
--------+--------
10000 | 100000
20000 | 100000
30000 | 100000
40000 | 100000
50000 | 100000
60000 | 100000
70000 | 100000
80000 | 100000
90000 | 100000
100000 | 100000
(10 rows)
直接使用kmeans分为10类,不设置种子的话,分得不是很准确。
postgres=# select k,id,count(*) from (select kmeans(array[rand], 10) over () k, id from test) t group by 1,2 order by 1,2;
k | id | count
---+--------+--------
0 | 10000 | 100000
0 | 20000 | 100000
1 | 30000 | 49707
2 | 30000 | 50293
3 | 40000 | 100000
4 | 50000 | 100000
5 | 60000 | 100000
6 | 70000 | 100000
7 | 80000 | 49871
8 | 80000 | 50129
9 | 90000 | 100000
9 | 100000 | 100000
(12 rows)
使用正确的种子后,分类精准。
postgres=# select k,id,count(*) from (select kmeans(array[rand], 10, array[10000,20000,30000,40000,50000,60000,70000,80000,90000,100000]) over () k, id from test) t group by 1,2 order by 1,2;
k | id | count
---+--------+--------
0 | 10000 | 100000
1 | 20000 | 100000
2 | 30000 | 100000
3 | 40000 | 100000
4 | 50000 | 100000
5 | 60000 | 100000
6 | 70000 | 100000
7 | 80000 | 100000
8 | 90000 | 100000
9 | 100000 | 100000
(10 rows)
参考
http://pgxn.org/dist/kmeans/