The Fusion engine is a high-performance vectorized SQL execution engine built into EMR Serverless Spark. It performs three times better than open source Spark in TPC-DS benchmark tests. The Fusion engine is fully compatible with open source Spark. You do not need to modify your existing code. In EMR Serverless Spark, you can enable the engine by turning on the Use Fusion Acceleration switch when you create a session.
Supported scenarios
The Fusion engine is suitable for Spark SQL and DataFrame jobs. The Fusion engine can be used to improve the performance of most operators, expressions, and data types. However, the Fusion engine cannot accelerate the execution of Resilient Distributed Dataset (RDD) jobs or the execution of jobs in which user-defined functions (UDFs) are used.
Storage formats
The Fusion engine supports the following data storage formats:
Parquet
Paimon
ORC (partial)
Operators
The Fusion engine provides acceleration for most common operators. The following table describes the operators.
Type | Operator List |
Source |
|
Sink | DataWritingCommandExec |
Common operation |
|
Aggregation | HashAggregateExec |
Join |
|
Window |
|
Exchange |
|
Limit |
|
Subquery | SubqueryBroadcastExec |
Others |
|
Expressions
The following table describes the expressions supported by the Fusion engine.
Type | Expression List |
Comparison/Logic | =, ==, !=, <>, <, <=, >, >=, <=>, is null, is not null, between, and, or, ||, !, negative, null if |
Arithmetic | %, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, and tanh |
Bitwise | ^, |, &, ~, bit_and, bit_count, bit_or, bit_xor, and bit_length |
Conditional Expression | case, if, and when |
Set | in and find_in_set |
String calculation | ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lower, lpad, and ltrim overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, and rlike |
Aggregation | aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, regr_r2 regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, and variance |
Window | cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, and row_number |
Time | add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, and year |
JSON | get_json_object and json_array_length |
Array | array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, and sort_array |
Map | map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, and str_to_map |
Encoding | crc32, hash, md5, sha1, and sha2 |
Others | current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, and rand |
Data types
The Fusion engine supports the following data types:
Byte, Short, Int, and Long
Boolean
String and Binary
Decimal
Float and Double
Date and Timestamp
Unsupported scenarios
Operators
Type | Operator |
Aggregation |
|
Exchange | CustomShuffleReaderExec |
Pandas |
|
Others |
|
Data types
Struct
Array
Map