E-MapReduce: Fusion engine

Last Updated: Oct 28, 2024

The Fusion engine is a high-performance vectorized SQL execution engine built into E-MapReduce (EMR) Serverless Spark. In TPC-DS benchmark tests, the Fusion engine delivers twice the performance of Apache Spark. The Fusion engine is fully compatible with Apache Spark, so you do not need to modify your existing code. To enable the Fusion engine, you only need to turn on the Use Fusion Acceleration switch when you create a session.

Precautions

The Fusion engine uses the off-heap memory of Spark. When you create a session, you must add spark.memory.offHeap.enabled=true to the Spark Configuration parameter to enable off-heap memory. Use the spark.memory.offHeap.size parameter to set the size of the off-heap memory based on your business requirements.
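
The two settings named above can be assembled programmatically before a session is created. The following is a minimal sketch, not an official client: the helper names `fusion_offheap_conf` and `render_conf` are hypothetical, the size check accepts only a subset of Spark's size strings, and `4g` is a placeholder rather than a sizing recommendation. The `key=value` rendering mirrors the notation used in the text; the separator that the Spark Configuration field expects is not stated here, so `sep` is left adjustable.

```python
import re

# Accepts sizes such as "512m", "4g", or "1tb" (a subset of Spark's size strings).
_SIZE_PATTERN = re.compile(r"^[1-9]\d*[kmgt]b?$", re.IGNORECASE)


def fusion_offheap_conf(offheap_size: str) -> dict:
    """Return the two off-heap settings named in the precaution (hypothetical helper)."""
    if not _SIZE_PATTERN.match(offheap_size):
        raise ValueError(f"unexpected size string: {offheap_size!r}")
    return {
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": offheap_size,
    }


def render_conf(conf: dict, sep: str = "=") -> str:
    """Render one setting per line; the separator is an assumption, adjust as needed."""
    return "\n".join(f"{key}{sep}{value}" for key, value in conf.items())


if __name__ == "__main__":
    # "4g" is a placeholder value, not a recommendation.
    print(render_conf(fusion_offheap_conf("4g")))
```

Validating the size string up front makes a typo fail early instead of surfacing later as a session that does not start.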

Supported scenarios

The Fusion engine is suitable for Spark SQL and DataFrame jobs. The Fusion engine can be used to improve the performance of most operators, expressions, and data types. However, the Fusion engine cannot accelerate the execution of Resilient Distributed Dataset (RDD) jobs or the execution of jobs in which user-defined functions (UDFs) are used.

Storage formats

The Fusion engine supports the following data storage formats:

  • Parquet

  • Paimon

  • ORC (partially supported)

Operators

The Fusion engine accelerates most common operators. The following table lists the supported operators.

| Type | Operator |
| --- | --- |
| Source | FileSourceScanExec, HiveTableScanExec, BatchScanExec, InMemoryTableScanExec |
| Sink | DataWritingCommandExec |
| Common operation | FilterExec, ProjectExec, SortExec, UnionExec |
| Aggregation | HashAggregateExec |
| Join | BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec, BroadcastNestedLoopJoinExec, CartesianProductExec |
| Window | WindowExec, WindowTopK |
| Exchange | ShuffleExchangeExec, ReusedExchangeExec, BroadcastExchangeExec, CoalesceExec |
| Limit | GlobalLimitExec, LocalLimitExec, TakeOrderedAndProjectExec |
| Subquery | SubqueryBroadcastExec |
| Others | ExpandExec, GenerateExec |

Expressions

The following table describes the expressions supported by the Fusion engine.

| Type | Expression |
| --- | --- |
| Comparison/Logic | !, !=, <, <=, >, >=, <=>, <>, =, ==, \|\|, and, between, is not null, is null, negative, nullif, or |
| Arithmetic | %, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, tanh |
| Bitwise | ^, \|, &, ~, bit_and, bit_count, bit_or, bit_xor, bit_length |
| Conditional | case, if, when |
| Set | in, find_in_set |
| String calculation | ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lpad, ltrim, overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, rlike |
| Aggregation | aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, regr_r2, regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, variance |
| Window | cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, row_number |
| Time | add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, year |
| JSON | get_json_object, json_array_length |
| Array | array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, sort_array |
| Map | map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, str_to_map |
| Encoding | crc32, hash, md5, sha1, sha2 |
| Others | current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, rand |
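
To get a rough idea of whether a query stays within the listed expressions, the function names in its SQL text can be compared against the table. The following is a minimal sketch under stated assumptions: `LISTED_FUNCTIONS` copies only three rows of the table and must be extended before use, the helper names are hypothetical, and the regular expression is a heuristic that does not parse SQL (it also matches text inside string literals and comments).

```python
import re

# Abbreviated copy of three rows of the table above; extend it with the
# remaining rows before relying on the result.
LISTED_FUNCTIONS = {
    "window": {"cume_dist", "dense_rank", "lag", "lead", "nth_value", "ntile",
               "percent_rank", "rank", "row_number"},
    "json": {"get_json_object", "json_array_length"},
    "encoding": {"crc32", "hash", "md5", "sha1", "sha2"},
}

# Keywords that can directly precede "(" without being function calls.
SQL_KEYWORDS = {"over", "in", "and", "or", "not", "on", "as", "from", "where", "by"}

# An identifier followed by an opening parenthesis.
_CALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def called_functions(sql: str) -> set:
    """Return lower-cased names that appear directly before '(' in the SQL text."""
    names = {match.group(1).lower() for match in _CALL.finditer(sql)}
    return names - SQL_KEYWORDS


def unlisted_functions(sql: str) -> set:
    """Return called names that are absent from LISTED_FUNCTIONS."""
    listed = set().union(*LISTED_FUNCTIONS.values())
    return called_functions(sql) - listed


if __name__ == "__main__":
    # Hypothetical query; my_udf stands in for a user-defined function.
    query = """
    SELECT md5(payload) AS digest,
           row_number() OVER (PARTITION BY user_id ORDER BY ts) AS rn,
           my_udf(payload) AS custom
    FROM events
    """
    print(sorted(unlisted_functions(query)))  # ['my_udf']
```

A name reported as unlisted is only a hint to check the query more closely; it may be a UDF, a function from a row not yet copied into the set, or a false match.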

Data types

The Fusion engine supports the following data types:

  • Byte, Short, Int, and Long

  • Boolean

  • String and Binary

  • Decimal

  • Float and Double

  • Date and Timestamp

Unsupported scenarios

Operators

| Type | Operator |
| --- | --- |
| Aggregation | ObjectHashAggregateExec, SortAggregateExec |
| Exchange | CustomShuffleReaderExec |
| Pandas | AggregateInPandasExec, FlatMapGroupsInPandasExec, ArrowEvalPythonExec, MapInPandasExec, WindowInPandasExec |
| Others | CollectLimitExec, RangeExec, SampleExec |
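
The supported and unsupported operator lists can be combined into a simple lookup for reviewing a physical plan. The following is a minimal sketch: the sets are copied from the operator tables in this document, the function name `classify_operators` is hypothetical, and the input is assumed to be the operator class names of a plan produced by open source Spark. How those names are collected is left to the caller.

```python
# Operator names copied from the "Supported scenarios" table.
SUPPORTED_OPERATORS = {
    "FileSourceScanExec", "HiveTableScanExec", "BatchScanExec", "InMemoryTableScanExec",
    "DataWritingCommandExec",
    "FilterExec", "ProjectExec", "SortExec", "UnionExec",
    "HashAggregateExec",
    "BroadcastHashJoinExec", "ShuffledHashJoinExec", "SortMergeJoinExec",
    "BroadcastNestedLoopJoinExec", "CartesianProductExec",
    "WindowExec", "WindowTopK",
    "ShuffleExchangeExec", "ReusedExchangeExec", "BroadcastExchangeExec", "CoalesceExec",
    "GlobalLimitExec", "LocalLimitExec", "TakeOrderedAndProjectExec",
    "SubqueryBroadcastExec",
    "ExpandExec", "GenerateExec",
}

# Operator names copied from the "Unsupported scenarios" table.
UNSUPPORTED_OPERATORS = {
    "ObjectHashAggregateExec", "SortAggregateExec",
    "CustomShuffleReaderExec",
    "AggregateInPandasExec", "FlatMapGroupsInPandasExec", "ArrowEvalPythonExec",
    "MapInPandasExec", "WindowInPandasExec",
    "CollectLimitExec", "RangeExec", "SampleExec",
}


def classify_operators(operator_names):
    """Group operator class names by the table in which they appear."""
    report = {"supported": [], "unsupported": [], "not_listed": []}
    for name in operator_names:
        if name in SUPPORTED_OPERATORS:
            report["supported"].append(name)
        elif name in UNSUPPORTED_OPERATORS:
            report["unsupported"].append(name)
        else:
            # Neither table mentions this operator, so no claim is made about it.
            report["not_listed"].append(name)
    return report


if __name__ == "__main__":
    # Hypothetical plan: a scan, a filter, a sort-based aggregate, and a sample.
    plan = ["FileSourceScanExec", "FilterExec", "SortAggregateExec", "SampleExec"]
    print(classify_operators(plan))
```

Keeping a separate `not_listed` bucket avoids treating an operator as unsupported merely because neither table mentions it.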

Data types

The Fusion engine does not support the following data types:

  • Struct

  • Array

  • Map
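
A table schema can be checked against the two data type lists before a job is moved to the Fusion engine. The following is a minimal sketch: it assumes that column types are given as Spark simple type strings, such as the `(name, type)` pairs returned by PySpark's `DataFrame.dtypes` (for example `bigint` for Long and `tinyint` for Byte). The function names are hypothetical, and types that appear in neither list are reported as not listed rather than guessed.

```python
# Simple type strings for the supported types listed in this document:
# Byte, Short, Int, Long, Boolean, String, Binary, Float, Double, Date, Timestamp.
SUPPORTED_SIMPLE_TYPES = {
    "tinyint", "smallint", "int", "bigint",
    "boolean", "string", "binary",
    "float", "double", "date", "timestamp",
}

# Prefixes of the complex types listed as unsupported: Struct, Array, Map.
UNSUPPORTED_PREFIXES = ("struct<", "array<", "map<")


def classify_type(type_string: str) -> str:
    """Return 'supported', 'unsupported', or 'not_listed' for one type string."""
    normalized = type_string.strip().lower()
    # Decimal carries precision and scale, for example "decimal(10,2)".
    if normalized in SUPPORTED_SIMPLE_TYPES or normalized.startswith("decimal"):
        return "supported"
    if normalized.startswith(UNSUPPORTED_PREFIXES):
        return "unsupported"
    return "not_listed"


def columns_to_review(dtypes):
    """Return (column, type, status) for every column that is not 'supported'."""
    flagged = []
    for column, type_string in dtypes:
        status = classify_type(type_string)
        if status != "supported":
            flagged.append((column, type_string, status))
    return flagged


if __name__ == "__main__":
    # Hypothetical schema in the shape of DataFrame.dtypes.
    schema = [
        ("id", "bigint"),
        ("price", "decimal(10,2)"),
        ("tags", "array<string>"),
        ("attrs", "map<string,int>"),
    ]
    for row in columns_to_review(schema):
        print(row)
```

With a live session, the same check could be run as `columns_to_review(df.dtypes)`, where `df` is an existing DataFrame.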