DataStream job development often runs into JAR package conflicts and similar problems. This article explains which dependencies need to be introduced during job development and which need to be packaged into the job JAR, so that you can avoid adding unnecessary dependencies to the job JAR and prevent dependency conflicts.
A DataStream job involves the following dependencies:
Every Flink application depends on a set of related libraries, which include at least Flink's APIs. Many applications also rely on connector libraries such as the Kafka and Cassandra connectors. When you run a Flink application, whether in a distributed environment or in a local IDE for testing, the Flink runtime dependencies must also be available.
Like most systems that run user-defined applications, Flink has two broad categories of dependencies: Flink core dependencies, which are required to run the Flink runtime itself, and user application dependencies, which are the connectors, formats, and libraries that a specific job needs.
Developing any Flink application requires at least the basic dependencies on the relevant APIs.
When you configure a project manually, add a dependency on the Java/Scala API (the following uses Maven as an example; the same dependencies work with other build tools such as Gradle and SBT):
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.11</artifactId>
  <version>1.12.3</version>
  <scope>provided</scope>
</dependency>
Attention: All of these dependencies have their scope set to provided. This means they are needed to compile against, but they should not be packaged into the application JAR file produced by the project. These dependencies are Flink core dependencies, which are already available at runtime.
We strongly recommend keeping these dependencies in the provided scope. Otherwise, in the best case the generated JAR becomes bloated because it contains all of the Flink core dependencies; in the worst case, the Flink core dependencies added to the application JAR conflict with some of your own dependency versions (which is normally avoided by Flink's inverted class loading mechanism).
Note on IntelliJ: To run the application inside IntelliJ IDEA, you need to tick the Include dependencies with "Provided" scope box in the run configuration. If this option is not available (due to an older IntelliJ IDEA version), a simple workaround is to create a test case that calls the application's main() method.
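For reference, a job that needs nothing beyond these provided API dependencies could look like the following minimal sketch (the class and job names are made up for illustration):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UppercaseJob {
    public static void main(String[] args) throws Exception {
        // StreamExecutionEnvironment and the DataStream API come from
        // flink-streaming-java, which is provided by the Flink runtime.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> upper = env
                .fromElements("flink", "datastream", "dependencies")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                });

        upper.print();
        env.execute("provided-scope-example");
    }
}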
Most applications need specific connectors or libraries to run, such as Kafka and Cassandra connectors. These connectors are not part of Flink core dependencies and must be added to the application as additional dependencies.
The following code is an example of adding Kafka connector dependencies (in Maven syntax):
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.12.3</version>
</dependency>
We recommend packaging the application code and all of its required dependencies into one application JAR (a jar-with-dependencies, or "uber" JAR). This application JAR can be submitted to an existing Flink cluster or added to the container image of a Flink application.
For projects created from the Maven job template (see Maven job template below), the dependencies are packaged into the application JAR by the mvn clean package command. If you do not use the template, we recommend using the Maven Shade Plugin to build a JAR that contains the dependencies (the configuration is shown in the appendix).
Attention: For Maven (and other build tools) to package the application dependencies into the application JAR, the scope of these dependencies must be set to compile (unlike the core dependencies, whose scope must be set to provided).
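To make the distinction concrete, here is a hedged sketch of a job that uses the Kafka connector added above; since FlinkKafkaConsumer is not part of the Flink core dependencies, it must end up in the application JAR (the broker address, group id, topic, and class name are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka connection settings; the values below are placeholders.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        // FlinkKafkaConsumer comes from flink-connector-kafka (compile scope),
        // so it is packaged into the application JAR.
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);

        env.addSource(consumer).print();
        env.execute("kafka-connector-example");
    }
}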
Different versions of Scala (such as 2.11 and 2.12) are incompatible with each other. Therefore, the Flink version corresponding to Scala 2.11 cannot be used for applications that use Scala 2.12.
All Flink dependencies that depend on Scala (directly or transitively) are suffixed with the Scala version they were built against, for example flink-streaming-scala_2.11.
If you use Java for development, you can select any Scala version. If you use Scala for development, you must select the Flink dependency version that matches the Scala version of your application.
Note: Scala versions later than 2.12.8 are not binary compatible with earlier 2.12.x versions, so the Flink project cannot upgrade its 2.12.x builds beyond 2.12.8. You can build Flink against a newer Scala version locally; for this to work, add -Djapicmp.skip to skip the binary compatibility checks during the build.
General rule: Never add Hadoop-related dependencies to your application, except when using existing Hadoop input/output formats together with Flink's Hadoop compatibility wrappers.
If you want to use Flink together with Hadoop, your Flink setup must include the Hadoop dependencies rather than adding Hadoop as an application dependency. Flink uses the Hadoop dependencies specified by the HADOOP_CLASSPATH environment variable, which can be set as follows:
export HADOOP_CLASSPATH=`hadoop classpath`
There are two main reasons for this design: some Hadoop interaction happens in Flink's core, possibly before the user application is even started, for example setting up HDFS for checkpoints, authenticating via Hadoop's Kerberos tokens, or deploying on YARN; in addition, Flink's inverted class loading mechanism hides many transitive dependencies of the core framework, which applies not only to Flink's own core dependencies but also to any Hadoop dependencies present in the setup.
If you need Hadoop dependencies (such as HDFS access) during testing or development within the IDE, configure the scope of these dependencies to test or provided.
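As an illustration (paths and host names are made up), a job that reads from HDFS only references an hdfs:// URI in code; the Hadoop classes that back it are picked up from HADOOP_CLASSPATH at runtime, or from a test/provided dependency when running inside the IDE:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HdfsInputJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // No Hadoop classes are referenced directly; Flink resolves the hdfs://
        // scheme through the Hadoop dependencies found on HADOOP_CLASSPATH.
        env.readTextFile("hdfs://namenode:8020/tmp/input.txt")
           .print();

        env.execute("hdfs-input-example");
    }
}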
Flink uses Java's Service Provider Interface (SPI) mechanism to load a table's connector/format factory by a specific identifier. Because every table connector/format places its SPI resource file, named org.apache.flink.table.factories.Factory, in the same directory, META-INF/services, these resource files overwrite each other when you build an uber JAR for a project that uses multiple table connectors/formats, which causes Flink to fail to load the factory classes.
Therefore, the recommended approach is to merge these resource files in the META-INF/services directory with the ServicesResourceTransformer of the Maven Shade Plugin. The following shows the relevant part of the pom.xml file for an example project that includes the flink-sql-connector-hive-3.1.2 connector and the flink-parquet format.
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>myProject</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
  <!-- other project dependencies ... -->
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-sql-connector-hive-3.1.2_2.11</artifactId>
    <version>1.13.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet_2.11</artifactId>
    <version>1.13.0</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <id>shade</id>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers combine.children="append">
              <!-- The service transformer is needed to merge META-INF/services files -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              <!-- ... -->
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
With the ServicesResourceTransformer configured, the resource files in the META-INF/services directory are merged rather than overwriting each other when the project builds the uber JAR.
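For illustration, the sketch below shows why this matters: the 'connector' and 'format' identifiers in a table definition are resolved through the factories listed in the merged META-INF/services/org.apache.flink.table.factories.Factory files. The schema and path are made up, and the required Table API and planner dependencies are assumed to be on the classpath:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ParquetFactoryLookup {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inBatchMode().build());

        // 'filesystem' and 'parquet' are looked up via SPI; if the service files
        // had been overwritten in the uber JAR, this DDL would fail because no
        // matching factory could be found.
        tEnv.executeSql(
                "CREATE TABLE parquet_source (" +
                "  word STRING," +
                "  cnt  BIGINT" +
                ") WITH (" +
                "  'connector' = 'filesystem'," +
                "  'path' = 'file:///tmp/parquet-input'," +
                "  'format' = 'parquet'" +
                ")");

        tEnv.executeSql("SELECT word, cnt FROM parquet_source").print();
    }
}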
We strongly recommend configuring your project with the Maven job template, which saves a lot of repetitive configuration work.
The environment requirement is Maven 3.0.4 (or later) and Java 8.x.
Create a project using one of the following methods:
$ mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.12.3
This lets you name the newly created project; it will interactively ask you for the groupId, artifactId, and package name.
$ curl https://flink.apache.org/q/quickstart.sh | bash -s 1.12.3
We recommend importing this project into the IDE to develop and test it. IntelliJ IDEA supports Maven projects. If you use Eclipse, you can use the m2e plug-in to import the Maven project. Some Eclipse bundles include the plug-in by default. Otherwise, you need to install it manually.
Note: The default JVM heap size may be too small for Flink, and you have to increase it manually. In Eclipse, choose Run Configurations -> Arguments and enter -Xmx800m in the VM Arguments box. In IntelliJ IDEA, we recommend changing the JVM options via Help | Edit Custom VM Options. Please see this article for details.
If you want to build/package the project, go to the project directory and run the mvn clean package command. After it finishes, you will find a JAR file target/<artifact-id>-<version>.jar, which contains your application together with the connectors and libraries added to it as dependencies.
You can use the following Shade Plugin definition to build an application JAR that contains all the dependencies required by the connectors and libraries:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.1.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <excludes>
                <exclude>com.google.code.findbugs:jsr305</exclude>
                <exclude>org.slf4j:*</exclude>
                <exclude>log4j:*</exclude>
              </excludes>
            </artifactSet>
            <filters>
              <filter>
                <!-- Do not copy the signatures in the META-INF folder.
                     Otherwise, this might cause SecurityExceptions when using the JAR. -->
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>my.programs.main.clazz</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>