This topic describes how to use the Datagen connector.
Background information
The Datagen connector is intended for debugging. It periodically generates random data that matches the column types of the Datagen source table. If you need test data to quickly verify business logic during development or testing, you can use the Datagen connector to generate it.
The Datagen connector supports the computed column syntax to generate data in a flexible manner.
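As an illustration of the computed column syntax, the following sketch derives extra columns from generated values. The table name and column names are hypothetical and chosen only for this example. Only the physical columns are filled by the generator. The computed columns are evaluated from them.

```sql
-- Sketch only: datagen_orders and its columns are hypothetical names.
CREATE TABLE datagen_orders (
  quantity INT,
  unit_price INT,
  -- Computed column: derived from the generated values, not generated itself.
  total_price AS quantity * unit_price,
  -- Computed column: processing-time attribute for time-based operations.
  proc_time AS PROCTIME()
) WITH (
  'connector' = 'datagen',
  'fields.quantity.min' = '1',
  'fields.quantity.max' = '10',
  'fields.unit_price.min' = '5',
  'fields.unit_price.max' = '500'
);
```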
The following table describes the capabilities supported by the Datagen connector.
Item | Description |
Table type | Source table |
Running mode | Batch mode and streaming mode |
Data format | N/A |
Metric | N/A |
API type | SQL API |
Data update or deletion in a result table | N/A |
Limits
Only Realtime Compute for Apache Flink that uses Ververica Runtime (VVR) 2.0.0 or later supports the Datagen connector.
Syntax
CREATE TABLE datagen_source (
  name VARCHAR,
  score BIGINT
) WITH (
  'connector' = 'datagen'
);
Parameters in the WITH clause
Parameter | Description | Data type | Required | Default value | Remarks |
connector | The type of the source table. | STRING | Yes | No default value | Set the value to datagen. |
rows-per-second | The rate at which random data is generated. | LONG | No | 10000 (rows of data per second) | N/A. |
number-of-rows | The total number of rows of data that can be generated. | LONG | No | No default value | By default, the source table is unbounded. If any field uses a sequence generator, the source table is bounded: data generation ends as soon as the sequence of a field is exhausted. |
fields.<field>.kind | The type of the generator that generates data for <field>. | STRING | No | random | Valid values: random and sequence. For more information about generators, see Generators. |
fields.<field>.min | The minimum random value that can be generated. | Same as the data type of <field> | No | Minimum value for the data type of <field> | This parameter takes effect when the fields.<field>.kind parameter is set to random. Only numeric data types are supported. |
fields.<field>.max | The maximum random value that can be generated. | Same as the data type of <field> | No | Maximum value for the data type of <field> | This parameter takes effect when the fields.<field>.kind parameter is set to random. Only numeric data types are supported. |
fields.<field>.max-past | The maximum offset into the past, relative to the current timestamp of the local machine, that a randomly generated timestamp can have. | DURATION | No | 0 | Only the timestamp type is supported. |
fields.<field>.length | The length of the random string that is generated or the number of elements in the collection that is generated. | INTEGER | No | 100 | Applies to string fields (length) and to collection fields (number of elements). |
fields.<field>.start | The start value of the sequence generator. | Same as the data type of <field> | No | No default value | This parameter takes effect when the fields.<field>.kind parameter is set to sequence. |
fields.<field>.end | The end value of the sequence generator. | Same as the data type of <field> | No | No default value | This parameter takes effect when the fields.<field>.kind parameter is set to sequence. |
Replace <field> in the parameter with the name of the field that you defined in the DDL statement.
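The following sketch shows how several of the parameters above might be combined. The table name and column names are hypothetical, and the duration literal '1 h' assumes the standard Flink duration format. With these settings, the source would stop after 100 rows, which at 5 rows per second takes roughly 20 seconds.

```sql
-- Sketch only: datagen_events and its columns are hypothetical names.
CREATE TABLE datagen_events (
  user_name VARCHAR,
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'datagen',
  -- Generate 5 rows per second instead of the default 10000.
  'rows-per-second' = '5',
  -- Make the source bounded: stop after 100 rows in total.
  'number-of-rows' = '100',
  -- Random strings of 8 characters instead of the default 100.
  'fields.user_name.length' = '8',
  -- Timestamps fall within the last hour (assumed duration format).
  'fields.event_time.max-past' = '1 h'
);
```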
Generators
The Datagen connector can use one of the following types of generators to generate data for a field:
Random generator: generates random values. You can specify the maximum and minimum values for data that is randomly generated.
Sequence generator: generates ordered values within a specific range. When the generated sequence reaches the end value, the data generation process ends. Therefore, if a sequence generator is used, a bounded table is generated. You can specify the start and end values of the sequence.
The following table describes the generator types that are supported for each data type.
Data type | Supported generator | Remarks |
BOOLEAN | Random | N/A. |
CHAR | Random and sequence | N/A. |
VARCHAR | Random and sequence | N/A. |
BINARY | Random and sequence | N/A. |
VARBINARY | Random and sequence | N/A. |
STRING | Random and sequence | N/A. |
DECIMAL | Random and sequence | N/A. |
TINYINT | Random and sequence | N/A. |
SMALLINT | Random and sequence | N/A. |
INT | Random and sequence | N/A. |
BIGINT | Random and sequence | N/A. |
FLOAT | Random and sequence | N/A. |
DOUBLE | Random and sequence | N/A. |
DATE | Random | Uses the current date of the local machine. |
TIME | Random | Uses the current time of the local machine. |
TIMESTAMP | Random | Generates values within the maximum past time range relative to the current timestamp of the local machine. |
TIMESTAMP_LTZ | Random | Generates values within the maximum past time range relative to the current timestamp of the local machine. |
ROW | Random | Generates random subfields. |
ARRAY | Random | Generates random elements. |
MAP | Random | Generates random pairs of (key,value). |
MULTISET | Random | Generates random elements. |
Example
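In most cases, the Datagen connector is used together with the LIKE clause to simulate a table. The following sample code defines a Datagen source table directly. It uses a sequence generator for the id field and a random generator for the score field, so the table is bounded and contains 50 rows: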
CREATE TABLE datagen_source (
  id INT,
  score INT
) WITH (
  'connector' = 'datagen',
  'fields.id.kind' = 'sequence',
  'fields.id.start' = '1',
  'fields.id.end' = '50',
  'fields.score.kind' = 'random',
  'fields.score.min' = '70',
  'fields.score.max' = '100'
);
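To illustrate the LIKE clause usage mentioned above, the following sketch simulates an existing table by copying its schema into a Datagen source table. The table names, the Kafka topic, the broker address, and the format are hypothetical placeholders, and the sketch assumes that the standard Flink LIKE clause with EXCLUDING ALL is available in your environment.

```sql
-- Sketch only: orders_source and its connector options are hypothetical placeholders.
CREATE TABLE orders_source (
  order_id BIGINT,
  amount INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Reuse the columns of orders_source but generate the data with Datagen.
-- EXCLUDING ALL drops the original connector options so that only the
-- options in this WITH clause apply.
CREATE TABLE orders_mock WITH (
  'connector' = 'datagen',
  'number-of-rows' = '10'
)
LIKE orders_source (EXCLUDING ALL);
```

Queries written against orders_source can then be pointed at orders_mock during debugging without redefining the columns.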