ComPPare 1.0.0
User Guide – ComPPare: Validation & Benchmarking Framework
  • 1. Getting Started
    • 1.1. Install
    • 1.2. Basic Usage
      • `--tolerance` Numerical Error tolerance
    • 1.3. Basic Working Principle
  • 2. User API Documentation
    • 2.1. Macros
    • 2.2. ComPPare `main()`
    • 2.3. Google Benchmark Plugin
    • 2.4. `DoNotOptimize()`
    • Code Documentation

1. Getting Started

1.1. Install

1.1.1. Clone repository

git clone git@github.com:funglf/ComPPare.git --recursive

If submodules such as Google Benchmark or NVBench are not needed:

git clone git@github.com:funglf/ComPPare.git

1.1.2. (Optional) Build Google Benchmark

See Google Benchmark Instructions

1.1.3. Include ComPPare

In your C++ code, simply include the comppare header file:

#include <comppare/comppare.hpp>

This is the main include file for the ComPPare framework.

1.2. Basic Usage

There are a few rules to follow in order to use ComPPare.

1.2.1. Adopt the required function signature

The function must return void and take the input parameters first, followed by the output parameters:

void impl(const Inputs&... in, // read-only inputs
          Outputs&... out);    // outputs compared to reference

1.2.2. Add HOTLOOP Macros

To benchmark a specific region of code, the macros HOTLOOPSTART and HOTLOOPEND are needed. The region in between will be run multiple times to obtain an accurate timing: a number of warmup iterations are executed first, followed by the timed benchmark iterations.

How to use HOTLOOPSTART/HOTLOOPEND Macros

void impl(const Inputs&... in,
          Outputs&... out)
{
    /*
    setup or overhead you DO NOT want to benchmark
    -- memory allocation, data transfer, etc.
    */
    HOTLOOPSTART; // Macro marking the start of the benchmarking region of interest
    // ... perform core computation here ...
    HOTLOOPEND;   // Macro marking the end of the benchmarking region of interest
}

Alternate Macro

Alternatively, the macro HOTLOOP() can be used to wrap the whole region.

void impl(const Inputs&... in,
          Outputs&... out)
{
    HOTLOOP(
        // ... perform core computation here ...
    );
}

1.2.3. Setting up Comparison in main()

In main(), you can setup the comparison, such as defining the reference function, initializing input data, naming of benchmark etc.

The SAXPY example will be used throughout this section to demonstrate usage.

SAXPY stands for Single-Precision A·X Plus Y (also see examples/saxpy). It performs the following operation:

$$ y_{\text{out}}[i] = a \cdot x[i] + y[i] $$

Based on the rules in Section 1.2 (Basic Usage), we can define a saxpy function as:

void saxpy_cpu(/*Input types*/
               float a,
               const std::vector<float> &x,
               const std::vector<float> &y_in,
               /*Output types*/
               std::vector<float> &y_out)
{
    for (size_t i = 0; i < x.size(); ++i)
        y_out[i] = a * x[i] + y_in[i];
}

Step 1: Initialize Input data

The same input data is used across all implementations, and it is needed to initialize the comppare object.

/* Initialize input data */
float a = 1.1f;
std::vector<float> x(1000, 1.0f);
std::vector<float> y_in(1000, 2.0f);

Step 2: Create a Comparison Object

Before testing functions like saxpy_cpu, you must define:

  1. Output type(s) – the types of the outputs that the function produces
  2. Input variables – the same inputs that are used across different implementations

For example, saxpy_cpu produces a vector of floats:

std::vector<float>

Therefore, the output type must be specified as the template parameter of make_comppare. The function call then takes the input variables (a, x, y_in), initialized in Step 1, as arguments:

auto cmp = comppare::make_comppare<std::vector<float>>(a, x, y_in);

Step 3: Register/Add Functions into framework

After creating the cmp object, we can add functions into it.

To Define the reference function:

cmp.set_reference(/*Displayed Name After Benchmark*/"saxpy reference", /*Function*/saxpy_cpu);

To add more functions:

cmp.add("saxpy gpu", saxpy_gpu);

Step 4: Run the Benchmarks

The command line arguments are passed as the parameters of run():

cmp.run(argc, argv);

Step 5: Results

Example Output:

*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
============ ComPPare Framework ============
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Number of implementations: 4
Warmup iterations: 100
Benchmark iterations: 100
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Implementation     ROI µs/Iter      Func µs      Ovhd µs  Max|err|[0]  Mean|err|[0]  Total|err|[0]
saxpy reference           0.28        33.67         5.63     0.00e+00     0.00e+00      0.00e+00
saxpy gpu                10.89    137828.11    136739.02     5.75e+06     2.85e+06      2.92e+09   <-- FAIL

In this case, saxpy gpu failed the numerical comparison against the reference.

1.2.4. Command Line Options – Iterations

There are 2 command line options to control the number of warmup and benchmark iterations:

--warmups Warmup Iterations

When given an unsigned 64-bit integer, this sets the number of warmup iterations run before the timed benchmark iterations.

Example:

./saxpy --warmups 1000

--iters Benchmark Iterations

When given an unsigned 64-bit integer, this sets the number of benchmark iterations.

Example:

./saxpy --iters 1000

--tolerance Numerical Error tolerance

Floating-point operations are inherently inexact, so a tolerance must be set for the comparison.

For any floating-point type T, the tolerance defaults to:

std::numeric_limits<T>::epsilon() * 1e3

For integral types, the tolerance defaults to 0.

To define the tolerance, the --tolerance flag can be used to set both floating point and integral tolerance:

./saxpy --tolerance 1e-3 #FP tol = 1e-3; Int tol = 0;
./saxpy --tolerance 2 #FP tol = 2.0; Int tol = 2;

1.3. Basic Working Principle

1.3.1. HOTLOOP Macros

The HOTLOOP macros essentially wrap your region of interest in a lambda, then run that lambda across the warmup and benchmark iterations. The region of interest is timed across all benchmark iterations to obtain an average runtime.

In comppare/comppare.hpp, HOTLOOPSTART and HOTLOOPEND are defined as:

#define HOTLOOPSTART \
    auto &&hotloop_body = [&]() { /* start of lambda */
#define HOTLOOPEND \
    }; /* end of lambda */
Therefore, any code in between the two macros is simply wrapped into the lambda hotloop_body:

HOTLOOPSTART;
foo();
HOTLOOPEND;
/* is equivalent to */
auto &&hotloop_body = [&]() {
    foo();
};

After that, the lambda is run:

/* Warm-up */
for (std::uint64_t i = 0; i < warmup_iterations; ++i)
    hotloop_body();
/* Benchmark */
auto start = now();
for (std::uint64_t i = 0; i < benchmark_iterations; ++i)
    hotloop_body();
auto end = now();

1.3.2. Output Comparison

Each implementation follows the same signature:

void impl(const Inputs&... in, // read-only inputs
          Outputs&... out);    // outputs compared to reference

Input – passed in by the framework; all candidates get exactly the same data.

Output – the framework creates private output objects and passes them into the implementation by reference, so any change made to the objects is retained by the framework.

After each implementation has finished running, the framework compares each output against the reference implementation, and prints out the results in terms of difference/error and whether it has failed.

2. User API Documentation

2.1. Macros

2.1.1. Hotloop Macros

HOTLOOPSTART & HOTLOOPEND

Used to wrap the region of CPU functions/operations that the framework should benchmark.

example:

impl(...)
{
    HOTLOOPSTART;
    cpu_func();
    a+b;
    HOTLOOPEND;
}

HOTLOOP()

Alternative of HOTLOOPSTART/END

example:

impl(...)
{
    HOTLOOP(
        cpu_func();
        a+b;
    );
}

GPU_HOTLOOPSTART & GPU_HOTLOOPEND

Host macros to wrap the region of GPU host functions/operations that the framework should benchmark. Supports both CUDA and HIP, provided that the host function is compiled with the respective compiler wrapper: nvcc for CUDA or hipcc for HIP.

Warning: do NOT use these macros within GPU kernels.

example:

gpu_impl(...)
{
    GPU_HOTLOOPSTART;
    kernel<<<...>>>(...);
    cudaMemcpy(...);
    GPU_HOTLOOPEND;
}

GPU_HOTLOOP()

Alternative of GPU_HOTLOOPSTART/END

example:

gpu_impl(...)
{
    GPU_HOTLOOP(
        kernel<<<...>>>(...);
        cudaMemcpy(...);
    );
}

2.1.2. Manual Timer Macros

MANUAL_TIMER_START & MANUAL_TIMER_END

Used to wrap a region of CPU functions/operations within the hot loop.

This pair of macros can only be used ONCE within the hot loop.

Example – Only times a+b:

impl(...)
{
    HOTLOOPSTART;
    cpu_func();
    MANUAL_TIMER_START;
    a+b;
    MANUAL_TIMER_END;
    HOTLOOPEND;
}

GPU_MANUAL_TIMER_START & GPU_MANUAL_TIMER_END

Used to wrap a region of GPU host functions/operations within the hot loop. Supports both CUDA and HIP, provided that the host function is compiled with the respective compiler wrapper: nvcc for CUDA or hipcc for HIP.

This pair of macros can only be used ONCE within the hot loop.

Example – Only times kernel:

gpu_impl(...)
{
    GPU_HOTLOOPSTART;
    GPU_MANUAL_TIMER_START;
    kernel<<<...>>>(...);
    GPU_MANUAL_TIMER_END;
    cudaMemcpy(...);
    GPU_HOTLOOPEND;
}

2.1.3. Custom Iteration Timer Macro

SET_ITERATION_TIME(us)

This macro takes a floating-point number representing the time of the current iteration in microseconds (µs).

Example on mixed timers:

mixed_impl(...)
{
    cudaEvent_t gpu_start, gpu_end;
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_end);

    /* Timing CPU region of cpu_func() only */
    auto cpu_start = std::chrono::steady_clock::now();
    cpu_func();
    auto cpu_end = std::chrono::steady_clock::now();
    a+b;
    /* Microseconds taken by cpu_func() */
    float cpu_us = std::chrono::duration<float, std::micro>(cpu_end - cpu_start).count();

    /* Timing GPU region of kernel only */
    cudaEventRecord(gpu_start);
    kernel<<<...>>>(...);
    cudaEventRecord(gpu_end);
    cudaMemcpy(...);
    cudaEventSynchronize(gpu_end);

    /* Milliseconds taken by GPU kernel, converted to microseconds */
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_end);
    float gpu_us = gpu_ms * 1e3f;

    /* Set the total iteration time in us */
    float total_iteration_us = cpu_us + gpu_us;
    SET_ITERATION_TIME(total_iteration_us);
}

2.2. ComPPare main()

2.2.1. Creating the ComPPare object

Given your implementation signature:

void impl(const Inputs&... in,
          Outputs&... out);

Instantiate the comparison context by defining output types and forwarding the input arguments:

InputType1 input1{...}; // initialize input1
InputType2 input2{...}; // initialize input2
auto cmp = comppare::make_comppare<OutputType1, OutputType2, ...>(input1, input2, ...);
  • OutputType1, OutputType2, ...: types of output
  • input1, input2, ... : variables/values of input

The order of inputs... and OutputTypes... must match the order in the impl signature. After construction, cmp is ready to have implementations registered (via set_reference / add) and executed (via run).

2.2.2. Setting Implementations for Framework

set_reference

Registers the “reference” implementation and returns its corresponding Impl descriptor for further configuration, e.g. attaching plugins like Google Benchmark.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
comppare::Impl& set_reference(
std::string display_name,
std::function<void(const Inputs&... , Outputs&...)> f
);
Parameters
  • display_name (std::string) – Human-readable label for this reference implementation; used in the output report.
  • f (std::function<void(const Inputs&..., Outputs&...)>) – Function matching the signature void(const Inputs&... in, Outputs&... out).
Returns
  • **comppare::Impl&**

    Reference to the internal Impl object representing the just-registered implementation. Used mainly for attaching to plugins like Google Benchmark.

The return value can safely be discarded if no further configuration is needed.

Example
void ref_impl(const InputTypes&..., OutputTypes&...){};
auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
cmp.set_reference(/*display name*/"reference implementation", /*function*/ ref_impl);

add

Registers an additional implementation, which will be compared against the reference, and returns its corresponding Impl descriptor for further configuration, e.g. attaching plugins like Google Benchmark.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
comppare::Impl& add(
std::string display_name,
std::function<void(const Inputs&... , Outputs&...)> f
);
Parameters
  • display_name (std::string) – Human-readable label for this implementation; used in the output report.
  • f (std::function<void(const Inputs&..., Outputs&...)>) – Function matching the signature void(const Inputs&... in, Outputs&... out).

Returns
  • **comppare::Impl&**

    Reference to the internal Impl object representing the just-registered implementation. Used mainly for attaching to plugins like Google Benchmark.

The return value can safely be discarded if no further configuration is needed.

Example
void ref_impl(const InputTypes&..., OutputTypes&...){};
auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
cmp.set_reference(/*display name*/"reference implementation", /*function*/ ref_impl);
cmp.add(/*display name*/"Optimized memcpy", /*function*/ fast_memcpy_impl);

2.2.3. Running Framework

run

Runs all the implementations that were added to the comppare framework.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
void run(int argc, char** argv);
Parameters
Name Type Description
argc int Number of Command Line Arguments
argv char** Command Line Argument Vector
Example
int main(int argc, char **argv)
{
    /* Create cmp object */
    cmp.run(argc, argv);
}

2.2.4. Summary of main()

void reference_impl(const InputTypes&... in,
                    OutputTypes&... out);
void optimized_impl(const InputTypes&... in,
                    OutputTypes&... out);

int main(int argc, char** argv)
{
    auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
    cmp.set_reference("Reference", reference_impl);
    cmp.add("Optimized", optimized_impl);
    cmp.run(argc, argv);
}

For more concrete examples, please see examples/.

2.3. Google Benchmark Plugin

google_benchmark()

Attaches the Google Benchmark plugin to an implementation registered via set_reference or add, enabling Google Benchmark for that implementation.

Context

Member of an internal struct

comppare::Impl
Signature
benchmark::internal::Benchmark* google_benchmark();
Returns
  • **benchmark::internal::Benchmark*** Pointer to the underlying Google Benchmark Benchmark instance. Use this to chain additional benchmark configuration calls (e.g. ->Arg(), ->Threads(), ->Unit(), etc.).
Detailed Description

After registering an implementation via set_reference or add, both functions return a reference to an internal struct comppare::Impl (see here). google_benchmark() attaches the plugin to the current implementation. The returned pointer allows you to further customize the benchmark before execution.

Example
cmp.set_reference("Reference", reference_impl)
.google_benchmark();
cmp.add("Optimized", optimized_impl)
.google_benchmark();
cmp.run(argc, argv);

Manual Timing with Google Benchmark

Enable manual timing for any registered implementation by appending ->UseManualTime() to the Benchmark* returned from google_benchmark(). This instructs Google Benchmark to measure only the intervals you explicitly mark inside your implementation, with the Manual Timer Macros or the SET_ITERATION_TIME() macro.

UseManualTime() is a Google Benchmark API call that switches the benchmark into Manual Timing mode. (See Google Benchmark's Documentation)

impl_manualtimer_macro(...)
{
    ...
    MANUAL_TIMER_START;
    ...
    MANUAL_TIMER_END;
    ...
}
impl_setitertime_macro(...)
{
    ...
    double elapsed_us;
    SET_ITERATION_TIME(elapsed_us);
    ...
}
int main(int argc, char** argv)
{
    cmp.set_reference("Manual Timer Macro", impl_manualtimer_macro)
       .google_benchmark()
       ->UseManualTime();
    cmp.add("SetIterationTime Macro", impl_setitertime_macro)
       .google_benchmark()
       ->UseManualTime();
    cmp.run(argc, argv);
}

2.4. DoNotOptimize()

For a deep dive into the working principle of DoNotOptimize() please visit examples/advanced_demo/DoNotOptimize

Disclaimer

I, LF Fung, am not the author of DoNotOptimize(). The implementation of comppare::DoNotOptimize() is a verbatim copy of Google Benchmark's benchmark::DoNotOptimize().

References:

  1. CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"
  2. Google Benchmark Github Repository

2.4.1. Problem with Compiler Optimisation

Compiler optimization can sometimes remove operations and variables completely.

For instance in the following function:

void SAXPY(const float a, const float* x, const float* y)
{
    float yout;
    for (int i = 0; i < N; ++i)
    {
        yout = a * x[i] + y[i];
    }
}

When compiling at high optimization levels, the compiler realizes yout is just a temporary that is never used elsewhere. As a result, yout is optimized out, and with it the whole SAXPY operation.

SAXPY() in Assembly

When SAXPY() is compiled in AArch64 with Optimisation -O3

__Z5SAXPYfPKfS0_: ; @_Z5SAXPYfPKfS0_
.cfi_startproc
; %bb.0:
ret
.cfi_endproc
; -- End function

The function body is practically empty: a single ret, “return from subroutine” (Arm A64 Instruction Set: RET). In simple terms, it just returns; nothing happens.

2.4.2. Solution – Google Benchmark's DoNotOptimize()

Optimization is important for understanding the performance of operations in production builds, which creates a conflict: optimize, but do not optimize away. Google solved this in benchmark, their microbenchmarking library, which provides benchmark::DoNotOptimize() to prevent variables from being optimized away.

With the same SAXPY function, we simply add DoNotOptimize() around the temporary variable yout

void SAXPY_DONOTOPTIMIZE(const float a, const float* x, const float* y)
{
    float yout;
    for (int i = 0; i < N; ++i)
    {
        yout = a * x[i] + y[i];
        DoNotOptimize(yout);
    }
}

This DoNotOptimize call tells the compiler not to eliminate the temporary variable, so the operation itself won’t be optimized away.

SAXPY_DONOTOPTIMIZE() in Assembly

When SAXPY_DONOTOPTIMIZE() is compiled in AArch64 with Optimisation -O3:

5 __Z19SAXPY_DONOTOPTIMIZEfPKfS0_:        ; @_Z19SAXPY_DONOTOPTIMIZEfPKfS0_
6       .cfi_startproc
7 ; bb.0:
8       sub     sp, sp, #16
9       .cfi_def_cfa_offset 16
10      mov     w8, #16960
11      movk    w8, #15, lsl #16
12      add     x9, sp, #4
13      add     x10, sp, #8
14 LBB0_1:                                 ; =>This Inner Loop Header: Depth=1
15      ldr     s1, [x0], #4
16      ldr     s2, [x1], #4
17      fmadd   s1, s0, s1, s2
18      str     s1, [sp, #4]
19      str     x9, [sp, #8]
20      ; InlineAsm Start
21      ; InlineAsm End
22      subs    x8, x8, #1
23      b.ne    LBB0_1
24 ; bb.2:
25      add     sp, sp, #16
26      ret
27      .cfi_endproc
28                                         ; -- End function


Further inspection reveals the Fused-Multiply-Add instruction on line 17, indicating that the SAXPY operation was not optimized away.


17      fmadd   s1, s0, s1, s2

Reference to Arm A64 Instruction Set: FMADD

2.4.3. comppare::DoNotOptimize()

Given the usefulness of Google Benchmark's benchmark::DoNotOptimize(), comppare includes a verbatim copy of it as comppare::DoNotOptimize().

Example:

impl(...)
{
    ...
    comppare::DoNotOptimize(temporary_variable);
    ...
}

Code Documentation

To Generate code documentation, use doxygen:

cd docs/
doxygen

It should create a directory docs/html.

Open docs/html/index.html to view the documentation in your web browser.