By Jiaxiang
While recently profiling the performance of WebAssembly code, we found that many of the ideas and methods also apply to other just-in-time (JIT) compiled code. This article uses WebAssembly as an example to describe the performance profiling method.
First of all, we will briefly introduce WebAssembly (hereinafter referred to as WASM). WASM is a low-level assembly language in bytecode format, which can be compiled from high-level languages such as C/C++ and Rust. It was originally designed to achieve near-native execution efficiency on the Web (it is now supported by all mainstream browsers). Applications include the Web versions of Unity, TensorFlow, AutoCAD, and Google Earth.
Due to the portability of WASM, many projects have started to use it in non-browser environments, which means running WASM through a VM or runtime. This article discusses performance profiling of WASM running in such VMs. There are many WASM VMs and runtimes, and some of them have gained performance profiling support in recent years. We therefore chose to experiment with WAVM, which currently lacks such support.
WASM is executed by WAVM in JIT mode. As shown in the following figure, developers compile different high-level languages into WASM; WAVM then uses LLVM to lower the WASM into LLVM IR and finally compiles it into machine code for the target platform.
Back to our initial purpose: how do we perform performance profiling on WASM? Let's return to the most basic method: record a timestamp when entering a function, record another when exiting, and subtract the two; alternatively, record timestamps before and after the function invocation and subtract them to get the overhead. As shown in the following figure, since there is little difference between the two solutions, we choose the first one for convenience.
Since the same function may be invoked by different callers, perf_start and perf_end must record both the execution time and the call stack structure. The most intuitive design is to maintain a call stack during recording, as follows:

- At perf_start, push the current function name and entry time onto the stack.
- At perf_end, pop the stack and record the overhead of the popped function. At the same time, traverse the call stack once to obtain the execution path of that function, and store the path and overhead in a hash table.
However, in testing we found that although this solution is intuitive, it traverses the stack and performs string concatenation inside the instrumentation functions, which incurs significant overhead. This introduces an observer effect, making the profiling results unreliable. As shown in the following figure, for a program where the complex function has an overhead of 80 and the simple function has an overhead of 20, if the instrumentation overhead is 20 for each function call and both functions are called the same number of times, the observed results would be 100 for complex and 40 for simple.
To reduce the overhead, a tree can be used to record the call graph instead, as follows:

- At perf_start, query an existing child node (or create a new one) and record the entry time. The global node pointer moves to the child node.
- At perf_end, update the function's time overhead. The global node pointer moves back to the parent node.
In this way, for functions that are called many times, the overhead of the instrumentation functions is essentially reduced to the cost of fetching a timestamp. After a series of optimizations, the measured overhead dropped to about 3%, and the resulting error is within an acceptable range.
void perf_start(int32_t func_id) {
    PerfNode* cur_node = perf_data->perf_node();
    if (!cur_node) {
        return;
    }
    // Obtain or create a child node
    PerfNode* child_node = cur_node->GetChildNode(perf_data->buffer(), func_id);
    if (!child_node) {
        perf_data->UpdatePerfNode(NULL);
        return;
    }
    // Record the entry time
    child_node->RecordEntry();
    // g_cur_node points to child_node
    perf_data->UpdatePerfNode(child_node);
}

void perf_end() {
    PerfNode* cur_node = perf_data->perf_node();
    if (!cur_node) {
        return;
    }
    // Record the overhead
    cur_node->RecordExit();
    // g_cur_node points to parent
    perf_data->UpdatePerfNode(cur_node->parent());
}
The following is a comparison of C code and the corresponding WASM text format; on the right side of the figure, three parts deserve attention.
The specific instrumentation process includes the following four steps:

1. Add perf_start and perf_end to the Import Section.
2. Add their function types to the Type Section if they are not already present. Based on the instrumentation function design above, perf_start takes the index ID (i32) of the current function, perf_end takes no parameters, and neither has a return value.
3. Because two imported functions are inserted, the indices of all existing functions shift by 2, so the target index of every call instruction must be updated accordingly.
4. Insert the calls to perf_start and perf_end. perf_start requires the current function index as a parameter, so that index must be pushed onto the stack first; perf_end must be inserted before every return instruction and at the end of the function.
The corresponding WASM (text format) after the instrumentation is as follows:
WASM instrumentation is essentially parsing the WASM bytecode and then re-encoding it. In Rust, this can be done with the wasmparser, wasm-encoder, and wasmprinter crates; in C++, with wabt. Some instrumentation tools and libraries already exist, such as Ewasm's wasm-gas and paritytech/wasm-instrument. Here we modify wasm-instrument to perform the instrumentation.
The following is the core code of the WASM instrumentation, covering the call-index update and the perf_start/perf_end insertion described above. Other modifications to the WASM sections are not elaborated here.
// perf_start instrumentation: push the current function index, then call perf_start
func_builder.instruction(&Instruction::I32Const((current_func_index + 2) as i32));
func_builder.instruction(&Instruction::Call(entry_func_index));
// The block depth is used to determine whether the function ends
let mut block_depth = 0;
for op in operator_reader {
    let op = op?;
    match op {
        // The target index of the call increases by 2
        Operator::Call { function_index } => {
            handle_in_function_call(&mut func_builder, entry_func_index, exit_func_index, function_index)?;
        },
        // perf_end instrumentation before return
        Operator::Return => {
            func_builder.instruction(&Instruction::Call(exit_func_index));
            func_builder.instruction(&Instruction::Return);
        },
        // Count the block depth
        Operator::Block { .. } | Operator::Loop { .. } | Operator::If { .. } | Operator::Try { .. } => {
            block_depth += 1;
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        },
        // For functions without an explicit return, insert perf_end before the final end
        Operator::End => {
            if block_depth == 0 {
                func_builder.instruction(&Instruction::Call(exit_func_index));
            }
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
            block_depth -= 1;
        },
        _ => {
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        },
    }
}
code_section_builder.function(&func_builder);
As mentioned earlier, WAVM executes WASM in JIT mode with an LLVM backend: WASM is first compiled into LLVM IR and then into machine code for each platform. Besides instrumenting at the WASM level, you could also instrument at the LLVM IR level. The principle is almost the same, but every VM implements its backend differently, so this approach not only adds workload and complexity but also makes the solution harder to port; it is therefore not recommended.
Flame graphs are a useful way to visualize profiling results, showing the proportion of each function's overhead. Generating one with the FlameGraph scripts generally involves:
$FG_DIR/stackcollapse-perf.pl perf.unfold > perf.folded
$FG_DIR/flamegraph.pl perf.folded > perf.svg
Now that instrumentation has given us the call-overhead tree, how do we convert it into a flame graph? Let's look at the formats of perf.unfold and perf.folded.
perf.unfold contains the call stack of every sample taken by the perf tool, which makes the file large.
perf.folded aggregates those samples into one line per unique call stack, which makes the file small and simple.
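For reference, a .folded file is just one line per unique call stack: frame names joined by semicolons, followed by a count (in our case, the accumulated cost). For the fibonacci sample used later in this article, the output could look roughly like this (the numbers are made up for illustration):

```
wavm;main;fibonacci 120
wavm;main;fibonacci;fibonacci 3540
```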
Obviously, perf.folded is the simpler target, and since each call stack is a path in the call tree, the file can be produced directly by a depth-first traversal of the tree. Here is the core code:
// Assumes node_stack initially holds the root node and
// call_length_stack holds the length of the initial prefix.
std::string call_stack = "wavm";
while (!node_stack.empty()) {
    PerfNode* cur_node = node_stack.top();
    if (!cur_node->visited_) {
        // Push children onto the stack and compute the node's self cost
        uint64_t cur_cost = cur_node->time_cost_;
        for (size_t i = 0; i < cur_node->children_size_; i++) {
            node_stack.push(cur_node->children_[i]);
            cur_cost -= cur_node->children_[i]->time_cost_;
        }
        // Update the call stack string and output one .folded line
        call_stack.append(";");
        call_stack.append(wasm_func_names.functions[cur_node->func_id_].name);
        call_length_stack.push(call_stack.size());
        fprintf(fp, "%s %llu\n", call_stack.c_str(), (unsigned long long)cur_cost);
        cur_node->visited_ = true;
        visited_node.push_back(cur_node);
    } else {
        // All children processed: truncate the path back to the parent
        node_stack.pop();
        call_length_stack.pop();
        call_stack.resize(call_length_stack.top());
    }
}
After the above implementation, you only need to wrap a PerfOutput interface that outputs the .folded file required by the flame graph once WASM execution finishes.
Next, let's write a sample: a simple Fibonacci calculation.
#include <stdio.h>
#include <stdlib.h>

int fibonacci(int n) {
    if (n <= 0)
        return 0;
    if (n == 1)
        return 1;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    printf("fibonacci(%d)=%d\n", n, fibonacci(n));
    return 0;
}
Then, use the following commands to compile it into WASM and convert the bytecode into the readable text format (WAT).
# emcc comes from the Emscripten SDK and compiles C/C++ to WASM
# -O0 disables optimization to keep the source code structure; -g adds debug information to the WASM
emcc fib.c -O0 -g -o fib.wasm
# wasm2wat comes from the WebAssembly Binary Toolkit (WABT)
wasm2wat fib.wasm > fib.wat
The fibonacci function can be found in fib.wat:
(func $fibonacci (type 3) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
global.get $__stack_pointer
...
global.set $__stack_pointer
local.get 25
return)
(func indicates that a function is defined within this bracket, followed by the function signature and local variable declarations. The function body starts on the third line and continues until the final return.
Next, run the instrumentation:
wasm-instrument instrument fib.wasm -o fib_i.wasm
wasm2wat fib_i.wasm > fib_i.wat
The fib_i.wat content is as follows, showing that the instrumentation has been completed.
(func $fibonacci (type 3) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
i32.const 7
call $perf_start
global.get $__stack_pointer
...
global.set $__stack_pointer
local.get 25
call $perf_end
return)
Finally, run the instrumented module and generate the output.
# Run to generate the fib.folded file
wavm run -o fib.folded fib_i.wasm 40
# Convert to the flame graph
$FG_DIR/flamegraph.pl fib.folded > fib.svg
The following figure displays the flame graph obtained:
You can see the calling procedure and the overhead proportion of each function in fib_i.wasm, achieving the goal of performance profiling.
In general, it is feasible to profile WASM performance through instrumented timing, although the approach also has some shortcomings.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.