PostgreSQL Oracle 兼容性 之 - 数据采样与脱敏

4 minute read

背景

数据采样、脱敏是测试时常用的功能,例如使用线上业务建立测试数据库,不能把整个库脱下来,同时需要对敏感数据加密。

Oracle的例子

SELECT COUNT(innerQuery.C1) FROM (	
  SELECT  ? AS C1	FROM RM_SALE_APPORTION SAMPLE BLOCK (?, ?) SEED (?) "RM_SALE_APPORTION"
) innerQuery
  
SAMPLE [ BLOCK ]
       (sample_percent)
       [ SEED (seed_value) ]

A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records has the same chance of being selected, 20% in our example. 
Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.

sample_clause
The sample_clause lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table.

BLOCK
BLOCK instructs Oracle to perform random block sampling instead of random row sampling.

sample_percent
sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.

Restrictions on Sampling During Queries
You can specify SAMPLE only in a query that selects from a single table. Joins are not supported. 
However, you can achieve the same results by using a CREATE TABLE ... AS SELECT query to materialize 
a sample of an underlying table and then rewrite the original query to refer to the newly created table sample. 
If you wish, you can write additional queries to materialize samples for other tables. 
When you specify SAMPLE, Oracle automatically uses cost-based optimization. Rule-based optimization is not supported with this clause. 

--------------------------------------------------------------------------------
Caution: 
The use of statistically incorrect assumptions when using this feature can lead to incorrect or undesirable results.
--------------------------------------------------------------------------------

PostgreSQL同样提供了采样功能:

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

BERNOULLI 对应 Oracle SAMPLE(),行级采样

SYSTEM 对应 Oracle SAMPLE BLOCK(),块级采样

REPEATABLE 对应 Oracle SEED,采样种子

测试数据

postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);  
CREATE TABLE  
  
postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);  
INSERT 0 10000000  
  
postgres=# select * from test limit 10;  
 id | username |  phonenum   |                    addr                     |               pwd                |          crt_time            
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------  
  1 | test_1   | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132  
  2 | test_2   | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347  | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534  
  3 | test_3   | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535  
  4 | test_4   | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355  
  5 | test_5   | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359  
  6 | test_6   | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363  
  7 | test_7   | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368  
  8 | test_8   | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371  
  9 | test_9   | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333  | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375  
 10 | test_10  | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379  
(10 rows)  

采样

9.5以前的版本,高效采样请参考

《PostgreSQL 巧妙的数据采样方法》

9.5以及以后的版本,可以使用tablesample语法进行采样(注意,采样过滤器在where条件过滤器的前面)。

语法如下

https://www.postgresql.org/docs/9.6/static/sql-select.html

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  
  
sampling_method指采样方法  
  
argument指参数,例如采样比例。  
  
REPEATABLE(seed) 指采样随机种子,如果种子一样,那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值,得到不同的结果。  

例子1,BERNOULLI(百分比)采样,使用全表扫描的采样方法,按采样参数百分比返回。

postgres=# select * from test TABLESAMPLE bernoulli (1);  
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time            
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------  
     110 | test_110     | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434    | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585  
     128 | test_128     | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706     | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938  
     251 | test_251     | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165   | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423  
     252 | test_252     | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533    | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426  
     423 | test_423     | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588    | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717  
......  

例子2,SYSTEM(百分比)采样,使用块级采样方法,按采样参数百分比返回(被采样到的数据块,内的所有记录都将被返回)。因此离散度不如BERNOULLI,但是效率高很多。

postgres=# select * from test TABLESAMPLE system (1);  
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time            
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------  
    6986 | test_6986    | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011    | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697  
    6987 | test_6987    | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107    | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721  
    6988 | test_6988    | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711    | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725  
    6989 | test_6989    | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302    | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729  
    6990 | test_6990    | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112    | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734  
    6991 | test_6991    | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677      | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738  
    6992 | test_6992    | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897    | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741  
......  

用户还可以通过以下接口自定义采样方法

https://www.postgresql.org/docs/9.6/static/tablesample-method.html

脱敏

脱敏的手段很多,用户对脱敏的需求也可能很多。

常见的例如

1. 隐藏字符串中间的内容,使用*表示,同时保持原始长度

2. 隐藏字符串中间的内容,使用*表示,不保持原始长度

3. 返回加密值

不管什么需求,实际上就是数据的转换,从原始值,转换为目标值。在PostgreSQL中可以通过function实现这样的转换,对不同的需求,编写不同的转换逻辑即可。

例子,将字符串中间部分模糊化,只显示字符串头2个,末尾1个。

select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),   
substring(addr,1,2)||'******'||substring(addr, length(addr),1),   
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),   
crt_time  
from test  
TABLESAMPLE bernoulli (1);  
  
   id    | ?column?  | ?column?  |  ?column?   | ?column?  |          crt_time            
---------+-----------+-----------+-------------+-----------+----------------------------  
      69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624  
     297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558  
     330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677  
     335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721  
     416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312  
     460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332  
     479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393  
     485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418  
     692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326  
    1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091  
    1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095  
    1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235  
    1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671  
......  

如果需要更复杂的转换,写PostgreSQL的UDF对字段值进行转换即可。

将采样结果抽取到其他平台的方法也很多,例如copy到stdout,或者ETL工具等。

例子

psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),   
substring(addr,1,2)||'******'||substring(addr, length(addr),1),   
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),   
crt_time  
from test  
TABLESAMPLE bernoulli (1)  
) to stdout" > ./sample_test.log  
  
  
less sample_test.log   
54      te******4       13******4       中国******3     52******b       2017-06-02 15:32:26.261451  
58      te******8       13******6       中国******3     23******a       2017-06-02 15:32:26.261584  
305     te******5       13******6       中国******9     c0******4       2017-06-02 15:32:26.262587  
399     te******9       13******5       中国******4     71******7       2017-06-02 15:32:26.26298  
421     te******1       13******0       中国******4     21******3       2017-06-02 15:32:26.263139  
677     te******7       13******5       中国******5     e2******7       2017-06-02 15:32:26.264269  
874     te******4       13******9       中国******2     a6******9       2017-06-02 15:32:26.265159  

参考

《PostgreSQL 巧妙的数据采样方法》

https://www.postgresql.org/docs/9.6/static/tablesample-method.html

https://www.postgresql.org/docs/9.6/static/sql-select.html

Flag Counter

digoal’s 大量PostgreSQL文章入口