What is the Fusion engine - E-MapReduce - Alibaba Cloud Documentation Center

The Fusion engine is a high-performance vectorized SQL execution engine built into EMR Serverless Spark. It performs three times better than open source Spark in TPC-DS benchmark tests. The Fusion engine is fully compatible with open source Spark. You do not need to modify your existing code. In EMR Serverless Spark, you can enable the engine by turning on the Use Fusion Acceleration switch when you create a session.

Supported scenarios

The Fusion engine is suitable for Spark SQL and DataFrame jobs. The Fusion engine can be used to improve the performance of most operators, expressions, and data types. However, the Fusion engine cannot accelerate the execution of Resilient Distributed Dataset (RDD) jobs or the execution of jobs in which user-defined functions (UDFs) are used.

Storage formats

The Fusion engine supports the following data storage formats:

Parquet
Paimon
ORC (partial)

Operators

The Fusion engine provides acceleration for most common operators. The following table describes the operators.

Type	Operator List
Source	FileSourceScanExec HiveTableScanExec BatchScanExec InMemoryTableScanExec
Sink	DataWritingCommandExec
Common operation	FilterExec ProjectExec SortExec UnionExec
Aggregation	HashAggregateExec
Join	BroadcastHashJoinExec ShuffledHashJoinExec SortMergeJoinExec BroadcastNestedLoopJoinExec CartesianProductExec
Window	WindowExec WindowTopK
Exchange	ShuffleExchangeExec ReusedExchangeExec BroadcastExchangeExec CoalesceExec
Limit	GlobalLimitExec LocalLimitExec TakeOrderedAndProjectExec
Subquery	SubqueryBroadcastExec
Others	ExpandExec GenerateExec

Expressions

The following table describes the expressions supported by the Fusion engine.

Type	Expression List
Comparison/Logic	=, ==, !=, <>, <, <=, >, >=, <=>, is null, is not null, between, and, or, \|\|, !, negative, null if
Arithmetic	%, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, and tanh
Bitwise	^, \|, &, ~, bit_and, bit_count, bit_or, bit_xor, and bit_length
Conditional Expression	case, if, and when
Set	in and find_in_set
String calculation	ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lower, lpad, and ltrim overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, and rlike
Aggregation	aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, regr_r2 regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, and variance
Window	cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, and row_number
Time	add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, and year
JSON	get_json_object and json_array_length
Array	array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, and sort_array
Map	map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, and str_to_map
Encoding	crc32, hash, md5, sha1, and sha2
Others	current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, and rand

Data types

The Fusion engine supports the following data types:

Byte, Short, Int, and Long
Boolean
String and Binary
Decimal
Float and Double
Date and Timestamp

Unsupported scenarios

Operators

Type	Operator
Aggregation	ObjectHashAggregateExec SortAggregateExec
Exchange	CustomShuffleReaderExec
Pandas	AggregateInPandasExec FlatMapGroupsInPandasExec ArrowEvalPythonExec MapInPandasExec WindowInPandasExec
Others	CollectLimitExec RangeExec SampleExec

Data types

Struct
Array
Map