
Hologres: Blink and Flink FAQ and troubleshooting

Last Updated: Feb 04, 2026

This topic describes common issues you might encounter when using Blink and Flink with Hologres.

Terms

  • Hologres performance

    • Write performance

      • Column-oriented tables (fastest to slowest): InsertOrIgnore > InsertOrReplace > InsertOrUpdate

      • Row-oriented tables (fastest to slowest): InsertOrReplace = InsertOrUpdate > InsertOrIgnore

      | Parameter | Description |
      | --- | --- |
      | InsertOrIgnore | If the sink table has a primary key and a duplicate primary key is encountered during a real-time write, the later record is discarded. |
      | InsertOrReplace | If the sink table has a primary key and a duplicate primary key is encountered during a real-time write, the record is updated based on the primary key. If the new row does not contain all columns, the missing columns are set to null. |
      | InsertOrUpdate | If the sink table has a primary key and a duplicate primary key is encountered during a real-time write, the record is updated based on the primary key. If the new row does not contain all columns, the existing values for the missing columns are retained. |

    • Point query performance

      Row-oriented storage outperforms row-column hybrid storage, which in turn outperforms column-oriented storage.
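
      As a sketch, a table intended for high-QPS point queries can be created as row-oriented with its primary key as the lookup key. The table and column names here are hypothetical:

      ```sql
      -- Hypothetical example: a row-oriented table for primary-key point lookups.
      BEGIN;
      CREATE TABLE kv_lookup (
          id  BIGINT NOT NULL,
          val TEXT,
          PRIMARY KEY (id)
      );
      -- Row orientation favors point queries; 'column' would favor analytical scans.
      CALL set_table_property('kv_lookup', 'orientation', 'row');
      COMMIT;
      ```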

  • Support for Blink, Flink (VVR), and open source Flink

    The source table, sink table, and dimension table columns describe the supported data storage types.

    | Form factor | Source table | Sink table | Dimension table | Binary logging | Hologres Catalog | Description |
    | --- | --- | --- | --- | --- | --- | --- |
    | Fully managed Flink | Supports row-oriented and column-oriented storage. | Supports row-oriented and column-oriented storage. | Use row-oriented storage. | Supported | Supported | None |
    | Blink Dedicated | Supports row-oriented and column-oriented storage. | Supports row-oriented and column-oriented storage. | Use row-oriented storage. | Hologres V0.8 supports only row-oriented storage. Hologres V0.9 and later support row-oriented and column-oriented storage. Use row-oriented storage. | Not supported | This product is being phased out. Use fully managed Flink on Alibaba Cloud. |
    | Open source Flink 1.10 | Supports row-oriented and column-oriented storage. | Supports row-oriented and column-oriented storage. | None | Not supported | Not supported | None |
    | Open source Flink 1.11 and later | Supports row-oriented and column-oriented storage. | Supports row-oriented and column-oriented storage. | Use row-oriented storage. | Not supported | Not supported | Starting from open source Flink 1.11, the Hologres code is open source. For more information, see GitHub. |

  • The following example shows how to use SQL to map a Flink table to a Hologres table.

    create table holo_source (
        `hg_binlog_lsn` BIGINT HEADER,
        `hg_binlog_event_type` BIGINT HEADER,
        `hg_binlog_timestamp_us` BIGINT HEADER,
        A int,
        B int,
        C timestamp
    ) with (
        type = 'hologres',
        'endpoint' = 'xxx.hologres.aliyuncs.com:80',  -- The endpoint of the Hologres instance.
        'userName' = '',                              -- The AccessKey ID of your Alibaba Cloud account.
        'password' = '',                              -- The AccessKey secret of your Alibaba Cloud account.
        'dbName' = 'binlog',                          -- The name of the database in the Hologres instance.
        'tableName' = 'test',                         -- The name of the table in the Hologres instance.
        'binlog' = 'true'
    );

    Blink, VVR, and Flink SQL all declare a Flink table and then map it to a specific physical table in Hologres using parameters. Therefore, mapping to a foreign table is not supported.

Troubleshoot slow real-time writes

  1. Confirm the configuration

    Check the following configuration information.

    • The storage format of the destination table, including row-oriented, column-oriented, and row-column hybrid store tables.

    • The insert mode, including InsertOrIgnore, InsertOrUpdate, and InsertOrReplace.

    • The Table Group and shard count of the destination table.

  2. Check the real-time write latency metric

    If the average write latency is high—in the hundreds of milliseconds or even seconds—the backend has likely reached a write bottleneck. The following issues may exist.

    • You are using InsertOrUpdate for a column-oriented table, which performs partial updates. If traffic is high, this can lead to high CPU load and write latency for the instance.

      Solution: Change the table type. Use a row-oriented table. If your instance is V1.1 or later, you can choose a row-column hybrid store table.

    • Check the CPU load of the instance in Cloud Monitor. If CPU usage is close to 100% but there are no partial updates on column-oriented tables, the cause is usually high QPS queries or a high write volume.

      Solution: Scale out the Hologres instance.

    • Check whether continuous INSERT INTO ... SELECT ... FROM statements are triggering BulkLoad writes to the table. BulkLoad writes currently block real-time writes.

      Solution: Convert BulkLoad writes to real-time writes, or run them during off-peak hours.

  3. Check for data skew

    Use the following SQL command to check for data skew.

    SELECT hg_shard_id, count(1) FROM t1 GROUP BY hg_shard_id ORDER BY hg_shard_id;

    Solution: Modify the distribution key to distribute data more evenly.
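
    If the counts per shard are heavily skewed, the distribution key can only be changed by recreating the table. The following is a sketch with hypothetical table and column names:

    ```sql
    -- Hypothetical sketch: recreate the table with a higher-cardinality distribution key.
    BEGIN;
    CREATE TABLE t1_new (
        user_id    BIGINT NOT NULL,
        event_time TIMESTAMPTZ,
        PRIMARY KEY (user_id)
    );
    -- Distribute by user_id instead of a low-cardinality column.
    CALL set_table_property('t1_new', 'distribution_key', 'user_id');
    COMMIT;
    -- Backfill, assuming compatible schemas.
    INSERT INTO t1_new SELECT * FROM t1;
    ```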

  4. Check for backend pressure

    If the previous steps do not reveal any issues but write performance suddenly drops, the backend cluster is likely under high pressure and experiencing a bottleneck. Contact technical support to confirm the situation. For more information, see How do I get more online support?.

  5. Check for backpressure on the Blink/Flink side

    If the previous steps show no obvious issues on the Hologres side, the bottleneck is usually on the client, that is, the Blink/Flink side. Check whether the sink node is experiencing backpressure. If the job has only one node, backpressure is not visible; separate the sink node and observe again. For details, contact Flink technical support.

Troubleshoot data write issues

This issue is usually caused by out-of-order data. For example, data with the same primary key is distributed across different Flink tasks, and the write order cannot be guaranteed. Check the Flink SQL logic to ensure that data is shuffled by the Hologres table's primary key before being written to Hologres.
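
One way to force such a shuffle in Flink SQL is to aggregate or deduplicate on the sink table's primary key before the INSERT; the GROUP BY hash-partitions the stream by that key, so all records with the same key go to the same task. This is a sketch with hypothetical table and column names:

```sql
-- Hypothetical sketch: records with the same primary key (id) are routed
-- to the same task by the GROUP BY, so their write order is deterministic.
INSERT INTO hologres_sink
SELECT id, LAST_VALUE(val) AS val
FROM source_table
GROUP BY id;
```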

Troubleshoot dimension table query issues

  • Dimension table joins and dual-stream joins

    When reading from Hologres, first confirm whether you are using a dimension table join correctly and not mistaking a dual-stream join for it. The following is an example of using Hologres as a dimension table. If the keywords proctime AS PROCTIME() and hologres_dim FOR SYSTEM_TIME AS are missing, it becomes a dual-stream join.

    CREATE TEMPORARY TABLE datagen_source (
       a INT,
       b BIGINT,
       c STRING,
       proctime AS PROCTIME()
    ) with (
       'connector' = 'datagen'
    );
    
    CREATE TEMPORARY TABLE hologres_dim (
       a INT, 
       b VARCHAR, 
       c VARCHAR
    ) with (
       'connector' = 'hologres',
       ...
    );
    
    CREATE TEMPORARY TABLE blackhole_sink (
       a INT,
       b STRING
    ) with (
       'connector' = 'blackhole'
    );
    
    insert into blackhole_sink select T.a,H.b
    FROM datagen_source AS T JOIN hologres_dim FOR SYSTEM_TIME AS OF T.proctime AS H ON T.a = H.a;
  • Dimension table queries

    1. Check the dimension table storage format

      Check if the dimension table is a row-oriented, column-oriented, or row-column hybrid store table.

    2. High latency in dimension table queries

      The most common issue with dimension tables is backpressure on the join node on the Flink/Blink side, which reduces the throughput of the entire job.

      1. Check the Flink dimension table join mode

        The dimension table join feature of the Hologres Flink Connector supports synchronous and asynchronous modes. The asynchronous mode performs better than the synchronous mode. You can distinguish them by checking the Flink SQL. The following is an example of an SQL statement that enables asynchronous dimension table queries.

        CREATE TABLE hologres_dim(
         id INT,
         len INT,
         content VARCHAR
        ) with (
          'connector'='hologres',
          'dbname'='<yourDbname>',  --The name of the Hologres database.
          'tablename'='<yourTablename>',  --The name of the table in Hologres that receives data.
          'username'='<yourUsername>',  --The AccessKey ID of your Alibaba Cloud account.
          'password'='<yourPassword>',  --The AccessKey secret of your Alibaba Cloud account.
          'endpoint'='<yourEndpoint>',  --The VPC endpoint of your Hologres instance.
          'async' = 'true'  --Enables asynchronous mode.
        );
      2. Check the backend query latency

        Check the real-time write latency metric:

        • Check if a column-oriented table is used as a dimension table. Dimension tables in column-oriented format have high overhead in high-QPS scenarios.

        • If it is a row-oriented table and the latency is high, the overall load on the instance is usually high. You need to scale out the instance.

    3. Check if the join key is the primary key of the Hologres table

      Starting from Ververica Runtime (VVR) 4.x (Flink 1.13), the Hologres Connector supports joins on non-primary-key columns of Hologres tables based on Holo Client. This usually results in poor performance and high instance load, especially if the table schema is not optimized. In this case, optimize the table schema. The most common optimization is to set the join key as the distribution key to enable shard pruning.
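
      For example, the distribution key can be set to the join key when the dimension table is created. The names below are hypothetical:

      ```sql
      -- Hypothetical sketch: lookups on customer_id prune to a single shard.
      BEGIN;
      CREATE TABLE dim_customers (
          customer_id BIGINT NOT NULL,
          name        TEXT,
          PRIMARY KEY (customer_id)
      );
      -- Row orientation for point lookups; distribution key = join key.
      CALL set_table_property('dim_customers', 'orientation', 'row');
      CALL set_table_property('dim_customers', 'distribution_key', 'customer_id');
      COMMIT;
      ```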

    4. Check for backpressure on the Blink side

      If the previous steps show no obvious issues on the Hologres side, the bottleneck is usually on the client, that is, the Blink side. Check whether the sink node is experiencing backpressure. If the job has only one node, backpressure is not visible; separate the sink node and observe again. You can also check whether the join node is causing backpressure. For details, contact Flink technical support.

Notes on connections

The Hologres Connector uses Java Database Connectivity (JDBC)-based modes by default.

  • The JDBC_FIXED mode is now supported. This mode does not occupy connections and is not limited by the maximum number of Walsenders when consuming binary logs. For more information, see Hologres.

  • Starting from Flink engine VVR-8.0.5-Flink-1.17, connection reuse is enabled by default with 'connectionPoolName' = 'default'. For most jobs, this has no impact. If a single job has many tables, performance may decrease after an upgrade. In this case, configure a separate connectionPoolName parameter for hot spot tables to optimize performance.
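
    For instance, a hot sink table can be given its own pool so that it no longer shares connections with the rest of the job. The table definition below is a hypothetical sketch; only connectionPoolName is the option under discussion:

    ```sql
    -- Hypothetical sketch: isolate a heavily written sink in its own connection pool.
    CREATE TEMPORARY TABLE hot_sink (
        id  INT,
        val VARCHAR
    ) WITH (
        'connector' = 'hologres',
        'dbname' = '<yourDbname>',
        'tablename' = '<yourTablename>',
        'username' = '<yourUsername>',
        'password' = '<yourPassword>',
        'endpoint' = '<yourEndpoint>',
        'connectionPoolName' = 'hot_pool'  -- separate from the shared 'default' pool
    );
    ```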

  • The JDBC mode occupies a certain number of connections. The default connection usage for different table types is as follows.

    Table type

    Default connections (per Flink job concurrency)

    Binary logging source table

    0

    Batch source table

    1

    Dimension table

    3 (can be adjusted with the connectionSize parameter)

    Sink table

    3 (can be adjusted with the connectionSize parameter)

    • Connection calculation method

      • Default case

        By default, the maximum number of connections used by a job can be calculated using the following formula:

        Maximum connections = (Number of batch source tables × 1 + Number of dimension tables × connectionSize + Number of sink tables × connectionSize) × Job concurrency.

        For example, a job has one full and incremental source table, two dimension tables, and three sink tables. All use the default connectionSize parameter value. The job concurrency is set to 5. The final number of connections used is (1 × 1 + 2 × 3 + 3 × 3) × 5 = 80.

      • Connection reuse

        Realtime Compute for Apache Flink versions 1.13-vvr-4.1.12 and later support connection reuse. Within the same concurrency of a job, dimension tables and sink tables with the same connectionPoolName will use the same connection pool. In the default example, if the two dimension tables and three sink tables are configured with the same connectionPoolName, and connectionSize is appropriately increased to 5, the final number of connections used is (1 × 1 + 5) × 5 = 30.

        Note

        The connection reuse mode is suitable for most scenarios. However, in some scenarios, such as when there are many dimension tables and neither asynchronous mode nor caching is enabled, synchronous point queries will be very frequent. In this case, multi-table connection reuse may cause queries to slow down. You can configure connection reuse only for sink tables.

      • Other scenarios that use connections

        • During job startup, connections are established for table metadata validation and other tasks. This may temporarily use 3 to 6 connections, which are released after the job is running normally.

        • Fully managed Flink supports features such as Hologres Catalog, CREATE TABLE AS SELECT (CTAS), and CREATE DATABASE AS (CDAS). Using these features also occupies connections. By default, a job that uses a catalog will occupy an additional three connections for DDL operations such as creating tables.

    • Diagnose connection usage

      When a job has many tables or high concurrency, it can occupy many connections, even exhausting the total connections of the Hologres instance. Use the following methods to understand and diagnose current connection usage.

      • Use the following command in HoloWeb to view the current active queries in the pg_stat_activity table. For more information, see Query the pg_stat_activity view. Queries where the application_name field is ververica-connector-hologres represent read and write connections from Realtime Compute for Apache Flink.

        SELECT application_name, COUNT(1) AS count
        FROM
          pg_stat_activity
        WHERE
          backend_type = 'client backend'
          AND application_name != 'hologres'
        GROUP BY application_name;
      • Sometimes the job concurrency is set too high. On the Monitoring Information page for the instance in the Hologres Instances list, the number of connections is high at startup and then drops after a period of time. This is because many connections are idle and are closed. This indicates that the job does not actually need such high concurrency or number of connections. Plan the task connections reasonably, reduce the concurrency or the connectionSize parameter value, or use the connection reuse mode.

      • Adjust the concurrency of Hologres nodes appropriately. By default, all operators in a Flink job have the same concurrency. In some scenarios, operators with complex calculation logic need to be configured with higher concurrency. However, this concurrency may be redundant for Hologres sink tables and may occupy many connections. In this case, refer to the job resource configuration, select expert mode, and set a suitable and smaller concurrency for the write operator to reduce the total connection usage.

Common errors

Error: ERPC TIMEOUT or ERPC CONNECTION CLOSED

  • Symptom: The error com.alibaba.blink.store.core.rpc.RpcException: request xx UpsertRecordBatchRequest failed on final try 4, maxAttempts=4, errorCode=3, msg=ERPC_ERROR_TIMEOUT occurs.

  • Possible cause: The write operation failed due to excessive pressure, or the cluster is busy. Check if the CPU load of the Hologres instance is maxed out. CONNECTION CLOSED may be caused by a backend node crashing due to excessive load, resulting in an out-of-memory (OOM) error or coredump.

  • Solution: Retry the write operation. If the issue persists, contact Hologres technical support to investigate the cause.

Error: BackPresure Exceed Reject Limit

  • Possible cause: The Hologres backend is under excessive write pressure, causing the Memtable to fail to flush to disk in time, which results in a write failure.

  • Solution: If the failure is occasional, you can ignore it. Or, add the parameter rpcRetries = '100' to the sink to increase the number of write retries. If this error persists, contact Hologres technical support to check the status of the backend instance.

Error: The requested table name xxx mismatches the version of the table xxx from server/org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend. Caused by: java.net.SocketTimeoutException: Read timed out

  • Possible cause: You performed an ALTER TABLE operation, causing the schema version number of the table carried by the Blink write to be lower than the server-side version number, and the number of client retries was exceeded.

  • Solution: If the error is occasional, you can ignore it. If this error persists, contact Hologres technical support.

Error: Failed to query table meta for table

  • Possible cause: You are reading from or writing to a Hologres foreign table, which the Hologres Connector does not support. If you are not using a foreign table, there may be an issue with the Hologres instance metadata.

  • Solution: Contact Hologres technical support.

Error: Cloud authentication failed for access id

  • Possible cause: The configured AccessKey information is incorrect, or the account has not been added to the Hologres instance.

  • Solution:

    • Check whether the AccessKey ID and AccessKey secret of the current account are entered correctly. A common cause is an incorrect AccessKey secret or one that contains extra spaces.

    • If you cannot find the cause, use the current AccessKey pair to connect to HoloWeb (log on with an account and password) and check the error message during the connectivity test. If the error is the same, there is an issue with the AccessKey pair. If the error is FATAL: role "ALIYUN$xxxx" does not exist, the account does not have permission on the instance, and the administrator needs to grant it.

Data cannot be joined from a Hologres dimension table

  • Possible cause: The Hologres dimension table is a partitioned table. Partitioned tables are not supported as dimension tables.

  • Solution: Convert the partitioned table to a standard table.

Error: Modify record by primary key is not on this table

  • Possible cause: The update mode was selected for a real-time write, but the Hologres sink table does not have a primary key.

  • Solution: Set a primary key.

Error: shard columns count is no match

  • Possible cause: When writing to Hologres, the complete distribution key columns (primary key by default) were not written.

  • Solution: Write the complete distribution key columns.

Error: Full row is required, but the column xxx is missing

  • Possible cause: This is an error message from an older version of Hologres. It usually means that you did not write data to a column that cannot be null.

  • Solution: Assign a value to the non-nullable column.

Surge in JDBC connections from VVP user access to Hologres

  • Possible cause: The VVR Hologres Connector uses JDBC mode to read from and write to Hologres (except for binary logging). It occupies a maximum of Number of Hologres tables being read/written × Concurrency × connectionSize (VVR table parameter, default is 3) connections.

  • Solution: Plan the task connections reasonably. Reduce the concurrency or connectionSize. If you cannot lower the concurrency or connectionSize, you can set the parameter useRpcMode = 'true' for the table to switch back to RPC mode.
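
    As a sketch, useRpcMode is set in the table's WITH clause; all other options below are placeholders:

    ```sql
    -- Hypothetical sketch: switch one table back to RPC mode to reduce JDBC connections.
    CREATE TEMPORARY TABLE rpc_sink (
        id  INT,
        val VARCHAR
    ) WITH (
        'connector' = 'hologres',
        'dbname' = '<yourDbname>',
        'tablename' = '<yourTablename>',
        'username' = '<yourUsername>',
        'password' = '<yourPassword>',
        'endpoint' = '<yourEndpoint>',
        'useRpcMode' = 'true'  -- RPC mode does not consume JDBC connections
    );
    ```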

Blink/VVR user gets an error indicating a connection failure when reading from or writing to Hologres

  • Possible cause: The Blink/VVR cluster has slow or no access to the public network by default.

  • Solution: Ensure that the cluster is in the same region as the Hologres instance and use the VPC Endpoint.

Error: Hologres rpc mode dimension table does not support one to many join

  • Possible cause: The RPC mode dimension table for Blink and VVR must be a row-oriented table, and the join field must be the primary key. The error is often caused by not meeting these two conditions.

  • Solution: Use JDBC mode, and use a row-oriented or row-column hybrid store table for the dimension table.

Error: DatahubClientException

  • Symptom: The error Caused by: com.aliyun.datahub.client.exception.DatahubClientException: [httpStatus:503, requestId:null, errorCode:null, errorMessage:{"ErrorCode":"ServiceUnavailable","ErrorMessage":"Queue Full"}] occurs.

  • Possible cause: Many binary log consumption jobs restart simultaneously for some reason, causing the thread pool to be exhausted.

  • Solution: Run the binary log consumption jobs in batches.

Error: Error occurs when reading data from datahub

  • Symptom: The error Error occurs when reading data from datahub, msg: [httpStatus:500, requestId:xxx, errorCode:InternalServerError, errorMessage:Get binlog timeout.] occurs.

  • Possible cause: Each piece of binary log data is too large. After batching, the size of each RPC request exceeds the maximum limit.

  • Solution: When each row of data has many fields and long strings, you can reduce the batching configuration.

Error: Caused by: java.lang.IllegalArgumentException: Column: created_time type does not match: flink row type: TIMESTAMP(6) WITH LOCAL TIME ZONE, hologres type: timestamp

  • Possible cause: The TIMESTAMP(6) type is used for a field in Flink. Mapping this type to Hologres is not currently supported.

  • Solution: Change the field type to TIMESTAMP.

Error: Caused by: org.postgresql.util.PSQLException: FATAL: Rejected by ip white list. db = xxx, usr=xxx, ip=xx.xx.xx.xx

  • Possible cause: An IP whitelist is set in Hologres, but the IP address from which Flink is accessing Hologres is not included in the whitelist, so the access is blocked.

  • Solution: Add the Flink IP address to the Hologres IP whitelist. For more information, see IP whitelist.

Error: Caused by: java.lang.RuntimeException: shaded.hologres.com.aliyun.datahub.client.exception.DatahubClientException: [httpStatus:400, requestId:xx, errorCode:TableVersionExpired, errorMessage:The specified table has been modified, please refresh cursor and try again

  • Possible cause: You performed a DDL operation on the source table, causing the table version to change and the consumption to fail.

  • Solution: Upgrade Flink to version 4.0.16 or later, which will retry in this situation.

Exception: Shard ID does not exist exception is thrown when a binary log job starts

  • Possible cause: The number of shards in the consumed table has changed, possibly due to a table rename or other operation. The job uses the old table's shard information when recovering from a checkpoint.

  • Solution: After operations such as recreating the table, the binary log consumption checkpoint information is no longer valid. Restart the job without a state.

Error: ERROR,22021,"invalid byte sequence for encoding ""UTF8"": 0x00"

  • Possible cause: During a dimension table point query, the primary key (string type) contains non-UTF-8 encoded characters, causing the SQL execution to fail.

  • Solution: Process the dirty data in the upstream.

Error: hologres.org.postgresql.util.PSQLException: ERROR: syntax error

  • Possible cause: When consuming a binary log table in JDBC mode, a slot must be specified. This error may occur if the created slot name contains unsupported characters (only lowercase letters, numbers, and underscores are supported).

  • Solution: Recreate the slot, or use the automatic slot creation feature in VVR-6.0.7.

Error: create table hologres.hg_replication_progress failed

  • Possible cause: When consuming binary logs via JDBC, the hg_replication_progress table may be needed. If this table does not exist in the current database, it needs to be created. However, the number of shards that can be created in the instance has reached the upper limit, causing the creation to fail.

  • Solution: Clean up unused databases.

Exception: The job gets stuck during runtime. A thread dump shows it is stuck at the JDBC driver loading point, usually at a position like Class.forName.

  • Possible cause: JDK 8 performs some static initialization operations when loading a JDBC driver. A race condition can occur when multiple threads load it simultaneously.

  • Solution: You can retry, or use version 6.0.7 of the connector, which handles this situation.

Exception: When consuming binary logs in JDBC mode, a "no table is defined in publication" or "The table xxx has no slot named xxx" exception is thrown.

  • Possible cause: When a table is deleted and a table with the same name is recreated, the publication bound to the table is not deleted.

  • Solution: When this exception occurs, you can execute the select * from pg_publication where pubname not in (select pubname from pg_publication_tables); statement in Hologres to query for publications that were not cleaned up. Then, execute the drop publication xx; statement to delete the residual publication and restart the job.

Error: A "permission denied for database" exception is thrown when the job goes online.

  • Possible cause: For Hologres V1.3 and V2.0, consuming binary logs in JDBC mode requires permission configuration.

  • Solution: Upgrade Hologres to V2.1 and use a connector of VVR-8.0.5 or later. Only read-only permission for the table is required to consume binary logs. If it is not convenient to upgrade, refer to the permission granting operation in Limits.

Error: table writer init failed: Fail to fetch table meta from sm

  • Possible cause: Writing to a table after a truncate or rename operation.

  • Solution: If this occurs occasionally, you can ignore it. The job will recover after a failover. In Hologres versions V2.1.1 to V2.1.14, the replay cache time for FE nodes is increased, which slows down DDL replay after a DML. The probability of similar exceptions may increase. Upgrade to the latest V2.1 version.

Exception: When developing a Datastream job locally using connector dependencies, an exception like java.lang.ClassNotFoundException: com.alibaba.ververica.connectors.hologres.binlog.source.reader.HologresBinlogRecordEmitter occurs.

  • Possible cause: The commercial connector JAR package for Realtime Compute for Apache Flink does not provide some runtime classes.

  • Solution: Refer to the Run and debug jobs that contain connectors locally document to adjust the dependencies for normal debugging and development.

Exception: When consuming binary logs in JDBC mode, a Binlog Convert Failed exception occurs, or data reading for some shards stops at a certain moment.

  • Possible cause: When the gateway of the Hologres instance receives a timeout exception from the backend, there is an issue in the process of returning the exception to the client, causing data reading to get stuck or a data parsing failure error.

  • Solution: This usually only occurs when the job has backpressure. If the job has a data reading stall issue, you can restart the job and recover from the latest checkpoint. To completely solve this problem, upgrade the Hologres version to 2.2.21 or later.