All Products
Search
Document Center

Realtime Compute for Apache Flink:Deduplication

Last Updated:Aug 07, 2024

This topic describes the compatibility between a job and the state data used to start the job after you modify deduplication queries for the job.

Compatible modifications

  • Fully compatible: Modify non-partition keys in the selected fields for a deduplication query based on the ascending order of the proctime field.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Fully compatible: Add d to the selected fields. 
    SELECT a, b, c, d FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Fully compatible: Delete b from the selected fields. 
    SELECT a, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Fully compatible: Change a selected field from c to substring(c, 1, 5). 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM (SELECT a, b, SUBSTRING(c,1,5) as c, proctime FROM MyTable))
    WHERE rk = 1;
  • Fully compatible: Modify the sequence of partition keys.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a, b, c ORDER BY proctime ASC) as rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Fully compatible: Modify the sequence of partition keys.
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY c, a, b ORDER BY proctime ASC) as rk
       FROM MyTable)
    WHERE rk = 1;
  • Fully compatible: Modify a field without changing the schema in a deduplication query based on the rowtime field or the descending order of the proctime field.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1 AND c > 10;
    
    -- Fully compatible: Delete a field that does not change the schema. 
    SELECT a, b FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1 AND c > 10;

Incompatible modifications

  • Add, delete, or modify the partition keys or modify the computation logic of the fields included in the partition keys.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Add d as a partition key. 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a,d ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    
    -- Incompatible: Delete a from the partition keys. 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Change the partition key from a to a + 1. 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (ORDER BY proctime ASC) AS rk
       FROM (SELECT a + 1 AS a, b, c, proctime FROM MyTable))
    WHERE rk = 1;
  • Modify the fields or order specified in the ORDER BY clause.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Change the field used for sorting from proctime to ts. 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY ts ASC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Change the sorting order from ascending to descending. 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1;
  • Add, delete, or modify a field that changes the schema in a deduplication query based on the rowtime field or the descending order of the proctime field.

    -- Original SQL statement: 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Add d to the selected fields. 
    SELECT a, b, c, d FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Delete c from the selected fields. 
    SELECT a, b FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM MyTable)
    WHERE rk = 1;
    
    -- Incompatible: Change a selected field from c to substring(c, 1, 5). 
    SELECT a, b, c FROM (
      SELECT *,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY proctime DESC) AS rk
       FROM (select a, b, substring(c, 1, 5) as c, ts from MyTable))
    WHERE rk = 1;