×
Community Blog Compatibility between PostgreSQL and Oracle - Data Sampling and Desensitization

Compatibility between PostgreSQL and Oracle - Data Sampling and Desensitization

This article explores the compatibility between PostgreSQL and Oracle regarding the common test functions- data sampling and data desensitization.

Background

Data sampling and desensitization are common test functions. For example, if you use online services to create a test database, you cannot extract the entire database but need to encrypt sensitive data.

Consider the following Oracle example.

SELECT COUNT(innerQuery.C1) FROM (
  SELECT  ? AS C1FROM RM_SALE_APPORTION SAMPLE BLOCK (?, ?) SEED (?) "RM_SALE_APPORTION"
) innerQuery
  
SAMPLE [ BLOCK ]
       (sample_percent)
       [ SEED (seed_value) ]

A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records has the same chance of being selected, 20% in our example. 
Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.

sample_clause
The sample_clause lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table.

BLOCK
BLOCK instructs Oracle to perform random block sampling instead of random row sampling.

sample_percent
sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.

Restrictions on Sampling During Queries
You can specify SAMPLE only in a query that selects from a single table. Joins are not supported. 
However, you can achieve the same results by using a CREATE TABLE ... AS SELECT query to materialize a sample of an underlying table and then rewrite the original query to refer to the newly created table sample. 
If you wish, you can write additional queries to materialize samples for other tables. 
When you specify SAMPLE, Oracle automatically uses cost-based optimization. Rule-based optimization is not supported by this clause. 

--------------------------------------------------------------------------------
Caution: 
The use of statistically incorrect assumptions when using this feature can lead to incorrect or undesirable results.

--------------------------------------------------------------------------------

PostgreSQL also provides the sampling function as shown below.

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

  • BERNOULLI: corresponding to Oracle SAMPLE(), row-based sampling
  • SYSTEM: corresponding to Oracle SAMPLE BLOCK(), block-based sampling
  • REPEATABLE: corresponding to Oracle SEED, which is a sampling seed

Test Data

Consider the following as test data.

postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);  
CREATE TABLE  
  
postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);  
INSERT 0 10000000  
  
postgres=# select * from test limit 10;  
 id | username |  phonenum   |                    addr                     |               pwd                |          crt_time            
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------  
  1 | test_1   | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132  
  2 | test_2   | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347  | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534  
  3 | test_3   | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535  
  4 | test_4   | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355  
  5 | test_5   | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359  
  6 | test_6   | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363  
  7 | test_7   | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368  
  8 | test_8   | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371  
  9 | test_9   | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333  | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375  
 10 | test_10  | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379  
(10 rows)  

Sampling

For efficient sampling in versions earlier than 9.5, see Data Sampling in PostgreSQL.

In version 9.5 and later, use the TABLESAMPLE syntax for sampling (note that the sampling filter is used before the where condition filter).

The syntax is as follows, refer to this page for more details.

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  
  
sampling_method指采样方法  
  
argument指参数,例如采样比例。  
  
REPEATABLE(seed) 指采样随机种子,如果种子一样,那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值,得到不同的结果。  

Example 1) BERNOULLI (Percentage) Sampling

Scan the full table to return the sampling result according to the percentage of sampling parameters.

postgres=# select * from test TABLESAMPLE bernoulli (1);  
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time            
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------  
     110 | test_110     | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434    | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585  
     128 | test_128     | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706     | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938  
     251 | test_251     | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165   | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423  
     252 | test_252     | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533    | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426  
     423 | test_423     | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588    | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717  
......  

Example 2) SYSTEM (Percentage) Sampling

Perform block-based sampling to return the sampling result according to the percentage of sampling parameters (all records in the sampled data block are returned). Therefore, the dispersion of SYSTEM is lower than that of BERNOULLI, but the efficiency is much higher.

postgres=# select * from test TABLESAMPLE system (1);  
   id    |   username   |  phonenum   |                      addr                      |               pwd                |          crt_time            
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------  
    6986 | test_6986    | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011    | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697  
    6987 | test_6987    | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107    | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721  
    6988 | test_6988    | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711    | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725  
    6989 | test_6989    | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302    | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729  
    6990 | test_6990    | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112    | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734  
    6991 | test_6991    | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677      | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738  
    6992 | test_6992    | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897    | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741  
......  

To customize a sampling method visit this website.

Desensitization

Many desensitization methods are available for the many different scenarios in which users must desensitize data. Common examples include:

1) Use asterisks (*) to hide the content in the middle of a string but keep the original length.
2) Use asterisks (*) to hide the content in the middle of a string but do not keep the original length.
3) Return the encrypted value.

In all cases, the desensitization operation converts the original value to the target value. PostgreSQL allows using functions to implement such conversion. For different requirements, just write different conversion logic.

For example, use asterisks (*) to hide the content in the middle of a string, with only the first two and the last one characters of the string displayed.

select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),   
substring(addr,1,2)||'******'||substring(addr, length(addr),1),   
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),   
crt_time  
from test  
TABLESAMPLE bernoulli (1);  
  
   id    | ?column?  | ?column?  |  ?column?   | ?column?  |          crt_time            
---------+-----------+-----------+-------------+-----------+----------------------------  
      69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624  
     297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558  
     330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677  
     335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721  
     416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312  
     460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332  
     479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393  
     485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418  
     692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326  
    1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091  
    1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095  
    1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235  
    1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671  
......  

For a more complex conversion, write a PostgreSQL UDF to change the field values.

There are also many methods to extract sampling results on other platforms, such as copying to StdOut or ETL tools.

Consider the example below.

psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),   
substring(addr,1,2)||'******'||substring(addr, length(addr),1),   
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),   
crt_time  
from test  
TABLESAMPLE bernoulli (1)  
) to stdout" > ./sample_test.log  
  
  
less sample_test.log   
54      te******4       13******4       中国******3     52******b       2017-06-02 15:32:26.261451  
58      te******8       13******6       中国******3     23******a       2017-06-02 15:32:26.261584  
305     te******5       13******6       中国******9     c0******4       2017-06-02 15:32:26.262587  
399     te******9       13******5       中国******4     71******7       2017-06-02 15:32:26.26298  
421     te******1       13******0       中国******4     21******3       2017-06-02 15:32:26.263139  
677     te******7       13******5       中国******5     e2******7       2017-06-02 15:32:26.264269  
874     te******4       13******9       中国******2     a6******9       2017-06-02 15:32:26.265159  

References

0 0 0
Share on

digoal

282 posts | 24 followers

You may also like

Comments

digoal

282 posts | 24 followers

Related Products

  • Oracle Database Migration Solution

    Migrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.

    Learn More
  • Tair

    Tair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.

    Learn More
  • PolarDB for PostgreSQL

    Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.

    Learn More
  • Time Series Database (TSDB)

    TSDB is a stable, reliable, and cost-effective online high-performance time series database service.

    Learn More