PostgreSQL Oracle 兼容性 之 - 数据采样与脱敏
背景
数据采样、脱敏是测试时常用的功能,例如使用线上业务建立测试数据库,不能把整个库脱下来,同时需要对敏感数据加密。
Oracle的例子
SELECT COUNT(innerQuery.C1) FROM (
SELECT ? AS C1 FROM RM_SALE_APPORTION SAMPLE BLOCK (?, ?) SEED (?) "RM_SALE_APPORTION"
) innerQuery
SAMPLE [ BLOCK ]
(sample_percent)
[ SEED (seed_value) ]
A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records has the same chance of being selected, 20% in our example.
Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.
sample_clause
The sample_clause lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table.
BLOCK
BLOCK instructs Oracle to perform random block sampling instead of random row sampling.
sample_percent
sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.
Restrictions on Sampling During Queries
You can specify SAMPLE only in a query that selects from a single table. Joins are not supported.
However, you can achieve the same results by using a CREATE TABLE ... AS SELECT query to materialize
a sample of an underlying table and then rewrite the original query to refer to the newly created table sample.
If you wish, you can write additional queries to materialize samples for other tables.
When you specify SAMPLE, Oracle automatically uses cost-based optimization. Rule-based optimization is not supported with this clause.
--------------------------------------------------------------------------------
Caution:
The use of statistically incorrect assumptions when using this feature can lead to incorrect or undesirable results.
--------------------------------------------------------------------------------
PostgreSQL同样提供了采样功能:
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
BERNOULLI 对应 Oracle SAMPLE(),行级采样
SYSTEM 对应 Oracle SAMPLE BLOCK(),块级采样
REPEATABLE 对应 Oracle SEED,采样种子
测试数据
postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);
CREATE TABLE
postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);
INSERT 0 10000000
postgres=# select * from test limit 10;
id | username | phonenum | addr | pwd | crt_time
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------
1 | test_1 | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132
2 | test_2 | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347 | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534
3 | test_3 | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535
4 | test_4 | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355
5 | test_5 | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359
6 | test_6 | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363
7 | test_7 | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368
8 | test_8 | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371
9 | test_9 | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333 | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375
10 | test_10 | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379
(10 rows)
采样
9.5以前的版本,高效采样请参考
9.5以及以后的版本,可以使用tablesample语法进行采样(注意,采样过滤器在where条件过滤器的前面)。
语法如下
https://www.postgresql.org/docs/9.6/static/sql-select.html
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
sampling_method指采样方法
argument指参数,例如采样比例。
REPEATABLE(seed) 指采样随机种子,如果种子一样,那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值,得到不同的结果。
例子1,BERNOULLI(百分比)采样,使用全表扫描的采样方法,按采样参数百分比返回。
postgres=# select * from test TABLESAMPLE bernoulli (1);
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
110 | test_110 | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434 | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585
128 | test_128 | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706 | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938
251 | test_251 | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165 | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423
252 | test_252 | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533 | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426
423 | test_423 | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588 | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717
......
例子2,SYSTEM(百分比)采样,使用块级采样方法,按采样参数百分比返回(被采样到的数据块,内的所有记录都将被返回)。因此离散度不如BERNOULLI,但是效率高很多。
postgres=# select * from test TABLESAMPLE system (1);
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
6986 | test_6986 | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011 | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697
6987 | test_6987 | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107 | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721
6988 | test_6988 | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711 | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725
6989 | test_6989 | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302 | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729
6990 | test_6990 | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112 | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734
6991 | test_6991 | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677 | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738
6992 | test_6992 | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897 | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741
......
用户还可以通过以下接口自定义采样方法
https://www.postgresql.org/docs/9.6/static/tablesample-method.html
脱敏
脱敏的手段很多,用户对脱敏的需求也可能很多。
常见的例如
1. 隐藏字符串中间的内容,使用*表示,同时保持原始长度
2. 隐藏字符串中间的内容,使用*表示,不保持原始长度
3. 返回加密值
不管什么需求,实际上就是数据的转换,从原始值,转换为目标值。在PostgreSQL中可以通过function实现这样的转换,对不同的需求,编写不同的转换逻辑即可。
例子,将字符串中间部分模糊化,只显示字符串头2个,末尾1个。
select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1);
id | ?column? | ?column? | ?column? | ?column? | crt_time
---------+-----------+-----------+-------------+-----------+----------------------------
69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624
297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558
330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677
335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721
416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312
460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332
479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393
485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418
692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326
1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091
1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095
1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235
1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671
......
如果需要更复杂的转换,写PostgreSQL的UDF对字段值进行转换即可。
将采样结果抽取到其他平台的方法也很多,例如copy到stdout,或者ETL工具等。
例子
psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1)
) to stdout" > ./sample_test.log
less sample_test.log
54 te******4 13******4 中国******3 52******b 2017-06-02 15:32:26.261451
58 te******8 13******6 中国******3 23******a 2017-06-02 15:32:26.261584
305 te******5 13******6 中国******9 c0******4 2017-06-02 15:32:26.262587
399 te******9 13******5 中国******4 71******7 2017-06-02 15:32:26.26298
421 te******1 13******0 中国******4 21******3 2017-06-02 15:32:26.263139
677 te******7 13******5 中国******5 e2******7 2017-06-02 15:32:26.264269
874 te******4 13******9 中国******2 a6******9 2017-06-02 15:32:26.265159
参考
https://www.postgresql.org/docs/9.6/static/tablesample-method.html
https://www.postgresql.org/docs/9.6/static/sql-select.html