Data sampling and desensitization are common test functions. For example, if you use online services to create a test database, you cannot extract the entire database but need to encrypt sensitive data.
Consider the following Oracle example.
SELECT COUNT(innerQuery.C1) FROM (
SELECT ? AS C1FROM RM_SALE_APPORTION SAMPLE BLOCK (?, ?) SEED (?) "RM_SALE_APPORTION"
) innerQuery
SAMPLE [ BLOCK ]
(sample_percent)
[ SEED (seed_value) ]
A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records has the same chance of being selected, 20% in our example.
Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.
sample_clause
The sample_clause lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table.
BLOCK
BLOCK instructs Oracle to perform random block sampling instead of random row sampling.
sample_percent
sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.
Restrictions on Sampling During Queries
You can specify SAMPLE only in a query that selects from a single table. Joins are not supported.
However, you can achieve the same results by using a CREATE TABLE ... AS SELECT query to materialize a sample of an underlying table and then rewrite the original query to refer to the newly created table sample.
If you wish, you can write additional queries to materialize samples for other tables.
When you specify SAMPLE, Oracle automatically uses cost-based optimization. Rule-based optimization is not supported by this clause.
--------------------------------------------------------------------------------
Caution:
The use of statistically incorrect assumptions when using this feature can lead to incorrect or undesirable results.
--------------------------------------------------------------------------------
PostgreSQL also provides the sampling function as shown below.
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
Consider the following as test data.
postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);
CREATE TABLE
postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);
INSERT 0 10000000
postgres=# select * from test limit 10;
id | username | phonenum | addr | pwd | crt_time
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------
1 | test_1 | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132
2 | test_2 | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347 | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534
3 | test_3 | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535
4 | test_4 | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355
5 | test_5 | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359
6 | test_6 | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363
7 | test_7 | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368
8 | test_8 | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371
9 | test_9 | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333 | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375
10 | test_10 | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379
(10 rows)
For efficient sampling in versions earlier than 9.5, see Data Sampling in PostgreSQL.
In version 9.5 and later, use the TABLESAMPLE syntax for sampling (note that the sampling filter is used before the where condition filter).
The syntax is as follows, refer to this page for more details.
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
sampling_method指采样方法
argument指参数,例如采样比例。
REPEATABLE(seed) 指采样随机种子,如果种子一样,那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值,得到不同的结果。
Example 1) BERNOULLI (Percentage) Sampling
Scan the full table to return the sampling result according to the percentage of sampling parameters.
postgres=# select * from test TABLESAMPLE bernoulli (1);
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
110 | test_110 | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434 | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585
128 | test_128 | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706 | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938
251 | test_251 | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165 | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423
252 | test_252 | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533 | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426
423 | test_423 | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588 | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717
......
Example 2) SYSTEM (Percentage) Sampling
Perform block-based sampling to return the sampling result according to the percentage of sampling parameters (all records in the sampled data block are returned). Therefore, the dispersion of SYSTEM is lower than that of BERNOULLI, but the efficiency is much higher.
postgres=# select * from test TABLESAMPLE system (1);
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
6986 | test_6986 | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011 | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697
6987 | test_6987 | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107 | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721
6988 | test_6988 | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711 | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725
6989 | test_6989 | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302 | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729
6990 | test_6990 | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112 | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734
6991 | test_6991 | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677 | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738
6992 | test_6992 | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897 | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741
......
To customize a sampling method visit this website.
Many desensitization methods are available for the many different scenarios in which users must desensitize data. Common examples include:
1) Use asterisks (*)
to hide the content in the middle of a string but keep the original length.
2) Use asterisks (*)
to hide the content in the middle of a string but do not keep the original length.
3) Return the encrypted value.
In all cases, the desensitization operation converts the original value to the target value. PostgreSQL allows using functions to implement such conversion. For different requirements, just write different conversion logic.
For example, use asterisks (*)
to hide the content in the middle of a string, with only the first two and the last one characters of the string displayed.
select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1);
id | ?column? | ?column? | ?column? | ?column? | crt_time
---------+-----------+-----------+-------------+-----------+----------------------------
69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624
297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558
330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677
335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721
416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312
460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332
479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393
485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418
692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326
1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091
1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095
1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235
1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671
......
For a more complex conversion, write a PostgreSQL UDF to change the field values.
There are also many methods to extract sampling results on other platforms, such as copying to StdOut or ETL tools.
Consider the example below.
psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1)
) to stdout" > ./sample_test.log
less sample_test.log
54 te******4 13******4 中国******3 52******b 2017-06-02 15:32:26.261451
58 te******8 13******6 中国******3 23******a 2017-06-02 15:32:26.261584
305 te******5 13******6 中国******9 c0******4 2017-06-02 15:32:26.262587
399 te******9 13******5 中国******4 71******7 2017-06-02 15:32:26.26298
421 te******1 13******0 中国******4 21******3 2017-06-02 15:32:26.263139
677 te******7 13******5 中国******5 e2******7 2017-06-02 15:32:26.264269
874 te******4 13******9 中国******2 a6******9 2017-06-02 15:32:26.265159
Automatic Cleaning and Scheduling of PostgreSQL Rotate Tables - Constraints and Triggers
Alibaba Clouder - May 20, 2020
ApsaraDB - July 23, 2021
digoal - December 14, 2018
Alibaba Clouder - April 11, 2018
digoal - February 5, 2024
digoal - March 25, 2020
Migrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.
Learn MoreTair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreTSDB is a stable, reliable, and cost-effective online high-performance time series database service.
Learn MoreMore Posts by digoal