Fusion engine - E-MapReduce - Alibaba Cloud Documentation Center

The Fusion engine is a high-performance vectorized SQL execution engine built into E-MapReduce (EMR) Serverless Spark. The Fusion engine outperforms Apache Spark by twice in TPC-DS benchmark tests. The Fusion engine is fully compatible with Apache Spark. You do not need to modify your existing code. To enable the Fusion engine, you need to only turn on the Use Fusion Acceleration switch when you create a session.

Precautions

The Fusion engine uses the off-heap memory of Spark. When you create a session, you must configure spark.memory.offHeap.enabled=true for the Spark Configuration parameter to enable the off-heap memory of Spark. You can configure the spark.memory.offHeap.size parameter to specify the size of the off-heap memory based on your business requirements.

Supported scenarios

The Fusion engine is suitable for Spark SQL and DataFrame jobs. The Fusion engine can be used to improve the performance of most operators, expressions, and data types. However, the Fusion engine cannot accelerate the execution of Resilient Distributed Dataset (RDD) jobs or the execution of jobs in which user-defined functions (UDFs) are used.

Storage formats

The Fusion engine supports the following data storage formats:

Parquet
Paimon
ORC (partial)

Operators

The Fusion engine provides acceleration for most common operators. The following table describes the operators.

Type	Operator
Source	FileSourceScanExec HiveTableScanExec BatchScanExec InMemoryTableScanExec
Sink	DataWritingCommandExec
Common operation	FilterExec ProjectExec SortExec UnionExec
Aggregation	HashAggregateExec
Join	BroadcastHashJoinExec ShuffledHashJoinExec SortMergeJoinExec BroadcastNestedLoopJoinExec CartesianProductExec
Window	WindowExec WindowTopK
Exchange	ShuffleExchangeExec ReusedExchangeExec BroadcastExchangeExec CoalesceExec
Limit	GlobalLimitExec LocalLimitExec TakeOrderedAndProjectExec
Subquery	SubqueryBroadcastExec
Others	ExpandExec GenerateExec

Expressions

The following table describes the expressions supported by the Fusion engine.

Type	Expression
Comparison/Logic	!, !=, <, <=, >, >=, <=>, <>, =, ==, \|\|, and, between, is not null, is null, negative, null if, and or
Algorithm	%, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, and tanh
Bitwise	^, \|, &, ~, bit_and, bit_count, bit_or, bit_xor, and bit_length
Expression	case, if, and when
Set	in and find_in_set
String calculation	ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lower, lpad, and ltrim overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, and rlike
Aggregation	aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, and regr_2 regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, and variance
Window	cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, and row_number
Time	add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, and year
JSON	get_json_object and json_array_length
Array	array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, and sort_array
Map	map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, and str_to_map
Encoding	crc32, hash, md5, sha1, and sha2
Others	current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, and rand

Data types

The Fusion engine supports the following data types:

Byte, Short, Int, and Long
Boolean
String and Binary
Decimal
Float and Double
Date and Timestamp

Unsupported scenarios

Operators

Type	Operator
Aggregation	ObjectHashAggregateExec SortAggregateExec
Exchange	CustomShuffleReaderExec
Pandas	AggregateInPandasExec FlatMapGroupsInPandasExec ArrowEvalPythonExec MapInPandasExec WindowInPandasExec
Others	CollectLimitExec RangeExec SampleExec

Data types

Struct
Array
Map