The Art of Profiling and Benchmarking: Making Software Faster on AArch64

As the SPO600 course nears its end, I have learned a lot about profiling and benchmarking, two important skills for making software faster. This blog post covers both topics and what I learned about them in the course.

What Are Profiling and Benchmarking?

Profiling means measuring a running program to see which parts use the most resources, such as CPU time and memory. It helps find the slow parts (hotspots) that are worth fixing.

Benchmarking means running repeatable tests to measure how fast a program is. This makes it possible to compare different versions of the software, or different ways of solving the same problem.

Tools for Profiling and Benchmarking

In the course, we used several tools to profile and benchmark our code. These include:

  1. perf: A powerful profiler for Linux. It samples where CPU time is spent and can read hardware counters such as cache misses.
  2. gprof: A GNU tool that shows the time spent in each function. The program has to be compiled with -pg first.
  3. Valgrind: Best known for finding memory errors and leaks, but it also includes a profiler called Callgrind that records function calls.

Steps in Profiling and Benchmarking

1. Find the Target Code

The first step is to find the part of the code that needs optimization. This could be a single function or a whole module that is slow.

2. Collect Profiling Data

Use tools like perf or gprof to collect data while the program runs. For example, with perf:

```sh
perf record -g ./my_program
perf report
```

The first command records samples with call graphs into perf.data. The second opens a report showing which functions use the most CPU time.

3. Analyze the Data

Look at the data to find slow parts. Functions using a lot of time are good targets for optimization.

4. Make Optimizations

Optimizations can be:

  • Better algorithms: Replacing an algorithm with one that does less work for the same result.
  • Refactoring code: Making code simpler and faster.
  • Using hardware features: Using SIMD instructions and other processor-specific features.
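To give an idea of the last point, here is a minimal sketch in C of what using NEON on AArch64 can look like. The helper name count_below is hypothetical and not from the course material. It counts how many 32-bit integers fall below a limit, four at a time, and falls back to a plain loop on other architectures.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: count how many values in data[0..n) are below limit. */
static size_t count_below(const int32_t *data, size_t n, int32_t limit)
{
    size_t count = 0;
    size_t i = 0;

#if defined(__aarch64__)
    int32x4_t vlimit = vdupq_n_s32(limit);      /* limit copied into 4 lanes */
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(data + i);      /* load 4 values at once */
        /* Each lane becomes all ones where v < limit, otherwise zero. */
        uint32x4_t mask = vcltq_s32(v, vlimit);
        /* Reduce each lane to 0 or 1, then add the four lanes together. */
        count += vaddvq_u32(vshrq_n_u32(mask, 31));
    }
#endif

    /* Scalar loop: the leftover tail on AArch64, everything elsewhere. */
    for (; i < n; i++)
        count += (data[i] < limit);

    return count;
}
```

Modern compilers can often vectorize a loop like this on their own at higher optimization levels, so it is worth profiling before and after writing intrinsics by hand.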

5. Benchmark the Optimized Code

After optimizing, benchmark the new code to see if it’s faster. Run the same tests on the same inputs and the same machine, so that the numbers can be compared fairly.
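Here is a minimal sketch of such a comparison, assuming the code under test is a function that sorts an int array in place. The names sort_fn, best_time, and baseline_sort are my own for illustration, and baseline_sort just wraps the C library qsort as a stand-in. Each run starts from a fresh copy of the same input, and the fastest of several runs is kept to reduce noise from other processes.

```c
/* Ask for the POSIX clock API when compiling in strict ISO C mode. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the code under test. */
typedef void (*sort_fn)(int *data, size_t n);

/* Seconds on a monotonic clock, unaffected by wall-clock adjustments. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the "original" version: the C library sort. */
static void baseline_sort(int *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_int);
}

/* Run fn `runs` times on a fresh copy of input and return the fastest
   time in seconds. scratch must have room for n ints. */
static double best_time(sort_fn fn, const int *input, int *scratch,
                        size_t n, int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        memcpy(scratch, input, n * sizeof input[0]);
        double start = now_seconds();
        fn(scratch, n);
        double elapsed = now_seconds() - start;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling best_time once with the old function and once with the new one, on the same input, gives two numbers that can be compared directly.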

Example: Optimizing a Sorting Algorithm on AArch64

We worked on making a sorting algorithm faster. Here is the process:

Initial Profiling

Using perf, we found that the compare function in our quicksort was a hotspot, accounting for over 40% of CPU time.
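One likely reason a compare function ranks so high in a profile is that a comparison sort calls it on the order of n log n times, usually through a function pointer. The sketch below assumes the C library qsort, which may differ from the quicksort we profiled. The call counter and the names compare_ints and sort_and_count are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter, for illustration only (not thread-safe). */
static unsigned long compare_calls = 0;

/* Comparator in the form qsort expects: negative, zero, or positive. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    compare_calls++;
    return (x > y) - (x < y);   /* avoids the overflow that x - y can cause */
}

/* Sort data in place and report how many times the comparator ran. */
static unsigned long sort_and_count(int *data, size_t n)
{
    compare_calls = 0;
    qsort(data, n, sizeof data[0], compare_ints);
    return compare_calls;
}
```

Because the comparator is called so often, even a small saving per call can add up to a visible difference in total run time.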

Optimization

We made the sorting algorithm faster by:

  • Changing to introsort: It starts as quicksort and switches to heapsort when the recursion gets too deep, which keeps the worst case at O(n log n).
  • Using SIMD instructions: Writing a custom compare function with ARM’s NEON intrinsics.
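For readers who have not seen introsort, here is a simplified sketch for int arrays. It is not the code from our project: the depth limit of about 2*log2(n), the cutoff of 16 elements for insertion sort, and all function names are assumptions chosen for illustration.

```c
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Insertion sort: cheap for the small ranges quicksort leaves behind. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        for (; j > 0 && a[j - 1] > key; j--)
            a[j] = a[j - 1];
        a[j] = key;
    }
}

/* Restore the max-heap property below root within a[0..n). */
static void sift_down(int *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            return;
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                      /* pick the larger child */
        if (a[root] >= a[child])
            return;
        swap_int(&a[root], &a[child]);
        root = child;
    }
}

/* Heapsort: O(n log n) even in the worst case. */
static void heap_sort(int *a, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )     /* build the heap */
        sift_down(a, i, n);
    for (size_t end = n - 1; end > 0; end--) {
        swap_int(&a[0], &a[end]);         /* move the current max to the end */
        sift_down(a, 0, end);
    }
}

/* Lomuto partition around the middle element; returns the pivot's index. */
static size_t partition(int *a, size_t n)
{
    size_t store = 0;
    swap_int(&a[n / 2], &a[n - 1]);       /* park the pivot at the end */
    for (size_t i = 0; i + 1 < n; i++)
        if (a[i] < a[n - 1])
            swap_int(&a[i], &a[store++]);
    swap_int(&a[store], &a[n - 1]);
    return store;
}

static void introsort_rec(int *a, size_t n, unsigned depth_left)
{
    while (n > 16) {
        if (depth_left == 0) {            /* too deep: fall back to heapsort */
            heap_sort(a, n);
            return;
        }
        depth_left--;
        size_t p = partition(a, n);
        introsort_rec(a, p, depth_left);  /* recurse on the left part */
        a += p + 1;                       /* loop on the right part */
        n -= p + 1;
    }
    insertion_sort(a, n);
}

/* Entry point: the depth limit is about 2 * log2(n). */
static void introsort(int *a, size_t n)
{
    unsigned depth = 0;
    for (size_t m = n; m > 1; m >>= 1)
        depth += 2;
    introsort_rec(a, n, depth);
}

/* Small correctness check to run before any timing. */
static int is_sorted_int(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i])
            return 0;
    return 1;
}
```

Checking the output with something like is_sorted_int before timing it is a cheap way to make sure a faster version is still a correct one.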

Benchmarking

After optimization, we benchmarked the new code. It was 30% faster than the original, showing our optimizations worked.
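A figure like "30% faster" can be read in two ways, so it helps to state how the number is computed. The sketch below shows both calculations. The function names are hypothetical, and the timings in the comments are made-up round numbers, not our measurements.

```c
/* Percentage by which the run time shrank.
   Example with made-up timings: 10 s before, 7 s after gives 30. */
static double time_reduction_percent(double old_seconds, double new_seconds)
{
    return (old_seconds - new_seconds) / old_seconds * 100.0;
}

/* Speedup factor, i.e. how many times faster the new version is.
   The same made-up timings, 10 s and 7 s, give roughly 1.43. */
static double speedup_factor(double old_seconds, double new_seconds)
{
    return old_seconds / new_seconds;
}
```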
