By Jiaxiang
While recently profiling the performance of WebAssembly code, we found that many of the ideas and methods also apply to other just-in-time (JIT) compiled code. This article uses WebAssembly as an example to describe the performance profiling method.
First of all, we will briefly introduce WebAssembly (hereinafter referred to as WASM). WASM is a low-level assembly language in bytecode format, which can be compiled from high-level languages such as C/C++ and Rust. It was originally designed to achieve near-native execution efficiency on the Web (it is now supported by all mainstream browsers). Applications include the Web versions of Unity, TensorFlow, AutoCAD, and Google Earth.
Due to the portability of WASM, many projects have started to use it in non-browser environments, which means running WASM through a VM or runtime. This article discusses performance profiling of WASM running in such VMs. There are many WASM VMs and runtimes, and some of them have gained performance profiling support in recent years. We therefore chose to experiment with WAVM, which currently lacks such support.
WASM is executed by WAVM in JIT mode. As shown in the following figure, developers compile different high-level languages into WASM; WAVM then uses LLVM to lower the WASM into LLVM IR and finally compiles it into machine code for the target platform.
Back to our initial purpose: how do we perform performance profiling on WASM? Let's return to the most basic method: record a timestamp when entering a function, record another when exiting, and subtract the two; alternatively, record timestamps before and after the function invocation and subtract them to get the overhead. As shown in the following figure, since there is little difference between the two solutions, we choose the first one for convenience.
Since the same function may be invoked by different callers, perf_start and perf_end must record both the execution time and the call stack structure. The most intuitive design is to maintain a call stack during recording, as follows:

- At perf_start, push the current function name and entry time onto the stack.
- At perf_end, pop the stack and record the overhead of the popped function. At the same time, traverse the call stack once to obtain the execution path of that function, and store the path and overhead in a hash table.
However, in testing we found that although this solution is intuitive, it traverses the stack and performs string concatenation inside the instrumentation functions, which incurs significant overhead. This introduces an observer effect, making the profiling results unreliable. As shown in the following figure, for a program where the complex function has an overhead of 80 and the simple function has an overhead of 20, if the instrumentation overhead is 20 for each function call and both functions are called the same number of times, the observed results would be 100 for complex and 40 for simple.
To reduce the overhead, a tree can be used to record the call graph instead, as follows:

- At perf_start, query an existing child node (or create a new one) and record the entry time. The global node pointer moves to the child node.
- At perf_end, update the function's time overhead. The global node pointer moves back to the parent node.
In this way, for functions that are called many times, the overhead of the instrumentation functions is essentially reduced to the cost of fetching a timestamp. After a series of optimizations, the measured overhead dropped to about 3%, and the resulting error is within an acceptable range.
void perf_start(int32_t func_id) {
    PerfNode* cur_node = perf_data->perf_node();
    if (!cur_node) {
        return;
    }
    // Obtain or create a child node
    PerfNode* child_node = cur_node->GetChildNode(perf_data->buffer(), func_id);
    if (!child_node) {
        perf_data->UpdatePerfNode(NULL);
        return;
    }
    // Record the entry time
    child_node->RecordEntry();
    // g_cur_node points to child_node
    perf_data->UpdatePerfNode(child_node);
}

void perf_end() {
    PerfNode* cur_node = perf_data->perf_node();
    if (!cur_node) {
        return;
    }
    // Record the overhead
    cur_node->RecordExit();
    // g_cur_node points to parent
    perf_data->UpdatePerfNode(cur_node->parent());
}
The following is a comparison of C code and the corresponding WASM text format; on the right side of the figure, three parts deserve attention.
The specific instrumentation process includes the following four steps:

1. Add perf_start and perf_end to the Import Section.
2. Add their function types to the Type Section if they are not already present. Based on the instrumentation function design above, perf_start takes the index ID (i32) of the current function, perf_end takes no parameters, and neither has a return value.
3. Because two imported functions are inserted, the indices of all existing functions shift by 2, so the target index of every call instruction must be updated accordingly.
4. Insert the calls to perf_start and perf_end. perf_start requires the current function index as a parameter, so that index must be pushed onto the stack first; perf_end must be inserted before every return instruction and at the end of the function.
The corresponding WASM (text format) after the instrumentation is as follows:
WASM instrumentation is essentially parsing the WASM bytecode and then re-encoding it. In Rust, this can be done with the wasmparser, wasm-encoder, and wasmprinter crates; in C++, with wabt. Some instrumentation tools and libraries already exist, such as Ewasm's wasm-gas and paritytech/wasm-instrument. Here we modify wasm-instrument to perform the instrumentation.
The following is the core code of the WASM instrumentation, covering the call-index update and the perf_start/perf_end insertion described above. Other modifications to the WASM sections are not elaborated here.
// perf_start instrumentation: push the current function index, then call perf_start
func_builder.instruction(&Instruction::I32Const((current_func_index + 2) as i32));
func_builder.instruction(&Instruction::Call(entry_func_index));
// The block depth is used to determine whether the function ends
let mut block_depth = 0;
for op in operator_reader {
    let op = op?;
    match op {
        // The target index of the call increases by 2
        Operator::Call { function_index } => {
            handle_in_function_call(&mut func_builder, entry_func_index, exit_func_index, function_index)?;
        },
        // perf_end instrumentation before return
        Operator::Return => {
            func_builder.instruction(&Instruction::Call(exit_func_index));
            func_builder.instruction(&Instruction::Return);
        },
        // Count the block depth
        Operator::Block { .. } | Operator::Loop { .. } | Operator::If { .. } | Operator::Try { .. } => {
            block_depth += 1;
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        },
        // For functions without an explicit return, insert perf_end before the final end
        Operator::End => {
            if block_depth == 0 {
                func_builder.instruction(&Instruction::Call(exit_func_index));
            }
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
            block_depth -= 1;
        },
        _ => {
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        },
    }
}
code_section_builder.function(&func_builder);
As mentioned earlier, WAVM executes WASM in JIT mode with an LLVM backend: WASM is first compiled into LLVM IR and then into machine code for each platform. Besides instrumenting at the WASM level, you could also instrument at the LLVM IR level. The principle is almost the same, but every VM implements its backend differently, so this approach not only adds workload and complexity but also makes the solution harder to port; it is therefore not recommended.
Flame graphs are a useful way to visualize profiling results, showing the proportion of each function's overhead. Generating one with the FlameGraph scripts generally involves:
$FG_DIR/stackcollapse-perf.pl perf.unfold > perf.folded
$FG_DIR/flamegraph.pl perf.folded > perf.svg
Now that instrumentation has given us the call-overhead tree, how do we convert it into a flame graph? Let's look at the formats of perf.unfold and perf.folded.
perf.unfold contains the call stack of every sample taken by the perf tool, which makes the file large.
perf.folded aggregates those samples into one line per unique call stack, which makes the file small and simple.
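For reference, a .folded file is just one line per unique call stack: frame names joined by semicolons, followed by a count (in our case, the accumulated cost). For the fibonacci sample used later in this article, the output could look roughly like this (the numbers are made up for illustration):

```
wavm;main;fibonacci 120
wavm;main;fibonacci;fibonacci 3540
```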
Obviously, perf.folded is the simpler target, and since each call stack is a path in the call tree, the file can be produced directly by a depth-first traversal of the tree. Here is the core code:
// Assumes node_stack initially holds the root node and
// call_length_stack holds the length of the initial prefix.
std::string call_stack = "wavm";
while (!node_stack.empty()) {
    PerfNode* cur_node = node_stack.top();
    if (!cur_node->visited_) {
        // Push children onto the stack and compute the node's self cost
        uint64_t cur_cost = cur_node->time_cost_;
        for (size_t i = 0; i < cur_node->children_size_; i++) {
            node_stack.push(cur_node->children_[i]);
            cur_cost -= cur_node->children_[i]->time_cost_;
        }
        // Update the call stack string and output one .folded line
        call_stack.append(";");
        call_stack.append(wasm_func_names.functions[cur_node->func_id_].name);
        call_length_stack.push(call_stack.size());
        fprintf(fp, "%s %llu\n", call_stack.c_str(), (unsigned long long)cur_cost);
        cur_node->visited_ = true;
        visited_node.push_back(cur_node);
    } else {
        // All children processed: truncate the path back to the parent
        node_stack.pop();
        call_length_stack.pop();
        call_stack.resize(call_length_stack.top());
    }
}
After the above implementation, you only need to wrap a PerfOutput interface that outputs the .folded file required by the flame graph once WASM execution finishes.
Next, let's write a sample: a simple Fibonacci calculation.
#include <stdio.h>
#include <stdlib.h>

int fibonacci(int n) {
    if (n <= 0)
        return 0;
    if (n == 1)
        return 1;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    printf("fibonacci(%d)=%d\n", n, fibonacci(n));
    return 0;
}
Then, use the following commands to compile it into WASM and convert the bytecode into the readable text format (WAT).
# emcc comes from the Emscripten SDK and compiles C/C++ to WASM
# -O0 disables optimization to keep the source code structure; -g adds debug information to the WASM
emcc fib.c -O0 -g -o fib.wasm
# wasm2wat comes from the WebAssembly Binary Toolkit (WABT)
wasm2wat fib.wasm > fib.wat
The fibonacci function can be found in fib.wat:
(func $fibonacci (type 3) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
global.get $__stack_pointer
...
global.set $__stack_pointer
local.get 25
return)
(func indicates that a function is defined within this bracket, followed by the function signature and local variable declarations. The function body starts on the third line and continues until the final return.
Next, run the instrumentation:
wasm-instrument instrument fib.wasm -o fib_i.wasm
wasm2wat fib_i.wasm > fib_i.wat
The fib_i.wat content is as follows, showing that the instrumentation has been completed.
(func $fibonacci (type 3) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
i32.const 7
call $perf_start
global.get $__stack_pointer
...
global.set $__stack_pointer
local.get 25
call $perf_end
return)
Finally, run the instrumented module and generate the output.
# Run to generate the fib.folded file
wavm run -o fib.folded fib_i.wasm 40
# Convert to the flame graph
$FG_DIR/flamegraph.pl fib.folded > fib.svg
The following figure displays the flame graph obtained:
You can see the calling procedure and the overhead proportion of each function in fib_i.wasm, achieving the goal of performance profiling.
In general, it is feasible to profile WASM performance through instrumented timing, although the approach also has some shortcomings.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.