ComPPare 1.0.0
User Guide – ComPPare: Validation & Benchmarking Framework
  • 1. Getting Started
    • 1.1. Install
    • 1.2. Basic Usage
      • `--tolerance` Numerical Error tolerance
    • 1.3. Basic Working Principle
  • 2. User API Documentation
    • 2.1. Macros
    • 2.2. ComPPare `main()`
    • 2.3. Google Benchmark Plugin
    • 2.4. `DoNotOptimize()`
    • Code Documentation

1. Getting Started

1.1. Install

1.1.1. Clone repository

git clone git@github.com:funglf/ComPPare.git --recursive

If submodules such as Google Benchmark or NVBench are not needed:

git clone git@github.com:funglf/ComPPare.git

1.1.2. (Optional) Build Google Benchmark

See Google Benchmark Instructions

1.1.3. Include ComPPare

In your C++ code, simply include the comppare header file:

#include <comppare/comppare.hpp>

This is the main include file for the ComPPare framework.

1.2. Basic Usage

There are a few rules to follow in order to use ComPPare.

1.2.1. Adopt the required function signature

The function must return void and take the input parameters first, followed by the output parameters:

void impl(const Inputs&... in, // read-only inputs
          Outputs&... out);    // outputs compared to reference

1.2.2. Add HOTLOOP Macros

To benchmark a specific region of code, the macros HOTLOOPSTART and HOTLOOPEND are needed. The region in between will be run multiple times to obtain an accurate timing: a number of warmup iterations are executed first, followed by the timed benchmark iterations.

How to use HOTLOOPSTART/HOTLOOPEND Macros

void impl(const Inputs&... in,
          Outputs&... out)
{
    /*
    setup or overhead you DO NOT want to benchmark
    -- memory allocation, data transfer, etc.
    */
    HOTLOOPSTART; // Macro marking the start of the benchmarking region of interest
    // ... perform core computation here ...
    HOTLOOPEND;   // Macro marking the end of the benchmarking region of interest
}

Alternate Macro

Alternatively, the macro HOTLOOP() can be used to wrap the whole region.

void impl(const Inputs&... in,
          Outputs&... out)
{
    HOTLOOP(
        // ... perform core computation here ...
    );
}

1.2.3. Setting up Comparison in main()

In main(), you can setup the comparison, such as defining the reference function, initializing input data, naming of benchmark etc.

The SAXPY example will be used throughout this section to demonstrate usage.

SAXPY stands for Single-Precision A·X Plus Y (also see examples/saxpy). It performs the following operation:

$$ y_{\text{out}}[i] = a \cdot x[i] + y[i] $$

Based on the rules in Section 1.2 (Basic Usage), we can define a saxpy function as:

void saxpy_cpu(/*Input types*/
               float a,
               const std::vector<float> &x,
               const std::vector<float> &y_in,
               /*Output types*/
               std::vector<float> &y_out)
{
    for (size_t i = 0; i < x.size(); ++i)
        y_out[i] = a * x[i] + y_in[i];
}

Step 1: Initialize Input data

The same input data is used across all implementations, and it is needed to initialize the comppare object.

/* Initialize input data */
float a = 1.1f;
std::vector<float> x(1000, 1.0f);
std::vector<float> y_in(1000, 2.0f);

Step 2: Create a Comparison Object

Before testing functions like saxpy_cpu, you must define:

  1. Output type(s) – the types of the outputs that the function produces
  2. Input variables – the same inputs that are used across different implementations

For example, saxpy_cpu produces a vector of floats:

std::vector<float>

Therefore, the output type must be specified as the template parameter of make_comppare. The function call then takes the input variables (a, x, y_in), initialized in Step 1, as arguments:

auto cmp = comppare::make_comppare<std::vector<float>>(a, x, y_in);

Step 3: Register/Add Functions into framework

After creating the cmp object, we can add functions into it.

To Define the reference function:

cmp.set_reference(/*Displayed Name After Benchmark*/"saxpy reference", /*Function*/saxpy_cpu);

To add more functions:

cmp.add("saxpy gpu", saxpy_gpu);

Step 4: Run the Benchmarks

The command line arguments are passed as the parameters of run():

cmp.run(argc, argv);

Step 5: Results

Example Output:

*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
============ ComPPare Framework ============
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Number of implementations: 4
Warmup iterations: 100
Benchmark iterations: 100
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Implementation     ROI µs/Iter      Func µs      Ovhd µs  Max|err|[0]  Mean|err|[0]  Total|err|[0]
saxpy reference           0.28        33.67         5.63     0.00e+00     0.00e+00      0.00e+00
saxpy gpu                10.89    137828.11    136739.02     5.75e+06     2.85e+06      2.92e+09   <-- FAIL

In this case, saxpy gpu failed the numerical comparison against the reference.

1.2.4. Command Line Options – Iterations

There are 2 command line options to control the number of warmup and benchmark iterations:

--warmups Warmup Iterations

When given an unsigned 64-bit integer, this sets the number of warmup iterations run before the timed benchmark iterations.

Example:

./saxpy --warmups 1000

--iters Benchmark Iterations

When given an unsigned 64-bit integer, this sets the number of benchmark iterations.

Example:

./saxpy --iters 1000

--tolerance Numerical Error tolerance

Floating-point operations are inherently inexact, so a tolerance must be set for the comparison.

For any floating-point type T, the tolerance defaults to:

std::numeric_limits<T>::epsilon() * 1e3

For integral types, the tolerance defaults to 0.

To define the tolerance, the --tolerance flag can be used to set both floating point and integral tolerance:

./saxpy --tolerance 1e-3 #FP tol = 1e-3; Int tol = 0;
./saxpy --tolerance 2 #FP tol = 2.0; Int tol = 2;

1.3. Basic Working Principle

1.3.1. HOTLOOP Macros

The HOTLOOP macros essentially wrap your region of interest in a lambda, then run that lambda across the warmup and benchmark iterations. The region of interest is timed across all benchmark iterations to obtain an average runtime.

In comppare/comppare.hpp, HOTLOOPSTART and HOTLOOPEND are defined as:

#define HOTLOOPSTART \
    auto &&hotloop_body = [&]() { /* start of lambda */
#define HOTLOOPEND \
    }; /* end of lambda */
Therefore, any code in between the two macros is simply wrapped into the lambda hotloop_body:

HOTLOOPSTART;
foo();
HOTLOOPEND;
/* is equivalent to */
auto &&hotloop_body = [&]() {
    foo();
};

After that, the lambda is run:

/* Warm-up */
for (std::uint64_t i = 0; i < warmup_iterations; ++i)
    hotloop_body();
/* Benchmark */
auto start = now();
for (std::uint64_t i = 0; i < benchmark_iterations; ++i)
    hotloop_body();
auto end = now();

1.3.2. Output Comparison

Each implementation follows the same signature:

void impl(const Inputs&... in, // read-only inputs
          Outputs&... out);    // outputs compared to reference

Input – passed in by the framework; all candidates get exactly the same data.

Output – the framework creates private output objects and passes them into the implementation by reference, so any change made to the objects is retained by the framework.

After each implementation has finished running, the framework compares each output against the reference implementation, and prints out the results in terms of difference/error and whether it has failed.

2. User API Documentation

2.1. Macros

2.1.1. Hotloop Macros

HOTLOOPSTART & HOTLOOPEND

Used to wrap the region of CPU functions/operations that the framework should benchmark.

example:

impl(...)
{
    HOTLOOPSTART;
    cpu_func();
    a+b;
    HOTLOOPEND;
}

HOTLOOP()

Alternative of HOTLOOPSTART/END

example:

impl(...)
{
    HOTLOOP(
        cpu_func();
        a+b;
    );
}

GPU_HOTLOOPSTART & GPU_HOTLOOPEND

Host macros to wrap the region of GPU host functions/operations that the framework should benchmark. Supports both CUDA and HIP, provided that the host function is compiled with the respective compiler wrapper: nvcc for CUDA or hipcc for HIP.

Warning: do NOT use these macros within GPU kernels.

example:

gpu_impl(...)
{
    GPU_HOTLOOPSTART;
    kernel<<<...>>>(...);
    cudaMemcpy(...);
    GPU_HOTLOOPEND;
}

GPU_HOTLOOP()

Alternative of GPU_HOTLOOPSTART/END

example:

gpu_impl(...)
{
    GPU_HOTLOOP(
        kernel<<<...>>>(...);
        cudaMemcpy(...);
    );
}

2.1.2. Manual Timer Macros

MANUAL_TIMER_START & MANUAL_TIMER_END

Used to wrap a region of CPU functions/operations within the hot loop.

This pair of macros can only be used ONCE within the hot loop.

Example – Only times a+b:

impl(...)
{
    HOTLOOPSTART;
    cpu_func();
    MANUAL_TIMER_START;
    a+b;
    MANUAL_TIMER_END;
    HOTLOOPEND;
}

GPU_MANUAL_TIMER_START & GPU_MANUAL_TIMER_END

Used to wrap a region of GPU host functions/operations within the hot loop. Supports both CUDA and HIP, provided that the host function is compiled with the respective compiler wrapper: nvcc for CUDA or hipcc for HIP.

This pair of macros can only be used ONCE within the hot loop.

Example – Only times kernel:

gpu_impl(...)
{
    GPU_HOTLOOPSTART;
    GPU_MANUAL_TIMER_START;
    kernel<<<...>>>(...);
    GPU_MANUAL_TIMER_END;
    cudaMemcpy(...);
    GPU_HOTLOOPEND;
}

2.1.3. Custom Iteration Timer Macro

SET_ITERATION_TIME(us)

This macro takes a floating-point number representing the time of the current iteration in microseconds (µs).

Example on mixed timers:

mixed_impl(...)
{
    cudaEvent_t gpu_start, gpu_end;
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_end);

    /* Timing CPU region of cpu_func() only */
    auto cpu_start = std::chrono::steady_clock::now();
    cpu_func();
    auto cpu_end = std::chrono::steady_clock::now();
    a+b;
    /* Microseconds taken by cpu_func() */
    float cpu_us = std::chrono::duration<float, std::micro>(cpu_end - cpu_start).count();

    /* Timing GPU region of kernel only */
    cudaEventRecord(gpu_start);
    kernel<<<...>>>(...);
    cudaEventRecord(gpu_end);
    cudaMemcpy(...);
    cudaEventSynchronize(gpu_end);

    /* Milliseconds taken by GPU kernel, converted to microseconds */
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_end);
    float gpu_us = gpu_ms * 1e3f;

    /* Set the total iteration time in us */
    float total_iteration_us = cpu_us + gpu_us;
    SET_ITERATION_TIME(total_iteration_us);
}

2.2. ComPPare main()

2.2.1. Creating the ComPPare object

Given your implementation signature:

void impl(const Inputs&... in,
          Outputs&... out);

Instantiate the comparison context by defining output types and forwarding the input arguments:

InputType1 input1{...}; // initialize input1
InputType2 input2{...}; // initialize input2
auto cmp = comppare::make_comppare<OutputType1, OutputType2, ...>(input1, input2, ...);
  • OutputType1, OutputType2, ...: types of output
  • input1, input2, ... : variables/values of input

The order of inputs... and OutputTypes... must match the order in the impl signature. After construction, cmp is ready to have implementations registered (via set_reference / add) and executed (via run).

2.2.2. Setting Implementations for Framework

set_reference

Registers the “reference” implementation and returns its corresponding Impl descriptor for further configuration, e.g. attaching plugins like Google Benchmark.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
comppare::Impl& set_reference(
std::string display_name,
std::function<void(const Inputs&... , Outputs&...)> f
);
Parameters
  • display_name (std::string) – Human-readable label for this reference implementation; used in the output report.
  • f (std::function<void(const Inputs&..., Outputs&...)>) – Function matching the signature void(const Inputs&... in, Outputs&... out).
Returns
  • **comppare::Impl&**

    Reference to the internal Impl object representing the just-registered implementation. Used mainly for attaching to plugins like Google Benchmark.

The return value can safely be discarded if no further configuration is needed.

Example
void ref_impl(const InputTypes&..., OutputTypes&...){};
auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
cmp.set_reference(/*display name*/"reference implementation", /*function*/ ref_impl);

add

Registers an additional implementation, which will be compared against the reference, and returns its corresponding Impl descriptor for further configuration, e.g. attaching plugins like Google Benchmark.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
comppare::Impl& add(
std::string display_name,
std::function<void(const Inputs&... , Outputs&...)> f
);
Parameters
  • display_name (std::string) – Human-readable label for this implementation; used in the output report.
  • f (std::function<void(const Inputs&..., Outputs&...)>) – Function matching the signature void(const Inputs&... in, Outputs&... out).

Returns
  • **comppare::Impl&**

    Reference to the internal Impl object representing the just-registered implementation. Used mainly for attaching to plugins like Google Benchmark.

The return value can safely be discarded if no further configuration is needed.

Example
void ref_impl(const InputTypes&..., OutputTypes&...){};
auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
cmp.set_reference(/*display name*/"reference implementation", /*function*/ ref_impl);
cmp.add(/*display name*/"Optimized memcpy", /*function*/ fast_memcpy_impl);

2.2.3. Running Framework

run

Runs all the implementations that were added to the comppare framework.

Context

Member of

comppare::InputContext<Inputs...>::OutputContext<Outputs...>
Signature
void run(int argc, char** argv);
Parameters
Name Type Description
argc int Number of Command Line Arguments
argv char** Command Line Argument Vector
Example
int main(int argc, char **argv)
{
    /* Create cmp object */
    cmp.run(argc, argv);
}

2.2.4. Summary of main()

void reference_impl(const InputTypes&... in,
                    OutputTypes&... out);
void optimized_impl(const InputTypes&... in,
                    OutputTypes&... out);

int main(int argc, char** argv)
{
    auto cmp = comppare::make_comppare<OutputTypes...>(inputs...);
    cmp.set_reference("Reference", reference_impl);
    cmp.add("Optimized", optimized_impl);
    cmp.run(argc, argv);
}

For more concrete examples, please see examples/.

2.3. Google Benchmark Plugin

google_benchmark()

Attaches the Google Benchmark plugin to an implementation registered via set_reference or add, enabling Google Benchmark for that implementation.

Context

Member of an internal struct

comppare::Impl
Signature
benchmark::internal::Benchmark* google_benchmark();
Returns
  • **benchmark::internal::Benchmark*** Pointer to the underlying Google Benchmark Benchmark instance. Use this to chain additional benchmark configuration calls (e.g. ->Arg(), ->Threads(), ->Unit(), etc.).
Detailed Description

After registering an implementation via set_reference or add, both functions return a reference to an internal struct comppare::Impl (see here). google_benchmark() attaches the plugin to the current implementation. The returned pointer allows you to further customize the benchmark before execution.

Example
cmp.set_reference("Reference", reference_impl)
.google_benchmark();
cmp.add("Optimized", optimized_impl)
.google_benchmark();
cmp.run(argc, argv);

Manual Timing with Google Benchmark

Enable manual timing for any registered implementation by appending ->UseManualTime() to the Benchmark* returned from google_benchmark(). This instructs Google Benchmark to measure only the intervals you explicitly mark inside your implementation, with the Manual Timer Macros or the SET_ITERATION_TIME() macro.

UseManualTime() is a Google Benchmark API call that switches the benchmark into Manual Timing mode. (See Google Benchmark's Documentation)

impl_manualtimer_macro(...)
{
    ...
    MANUAL_TIMER_START;
    ...
    MANUAL_TIMER_END;
    ...
}
impl_setitertime_macro(...)
{
    ...
    double elapsed_us;
    SET_ITERATION_TIME(elapsed_us);
    ...
}
int main(int argc, char** argv)
{
    cmp.set_reference("Manual Timer Macro", impl_manualtimer_macro)
       .google_benchmark()
       ->UseManualTime();
    cmp.add("SetIterationTime Macro", impl_setitertime_macro)
       .google_benchmark()
       ->UseManualTime();
    cmp.run(argc, argv);
}

2.4. DoNotOptimize()

For a deep dive into the working principle of DoNotOptimize() please visit examples/advanced_demo/DoNotOptimize

Disclaimer

I, LF Fung, am not the author of DoNotOptimize(). The implementation of comppare::DoNotOptimize() is a verbatim copy of Google Benchmark's benchmark::DoNotOptimize().

References:

  1. CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"
  2. Google Benchmark Github Repository

2.4.1. Problem with Compiler Optimisation

Compiler optimization can sometimes remove operations and variables completely.

For instance in the following function:

void SAXPY(const float a, const float* x, const float* y)
{
    float yout;
    for (int i = 0; i < N; ++i)
    {
        yout = a * x[i] + y[i];
    }
}

When compiling at high optimization levels, the compiler realizes yout is just a temporary that is never used elsewhere. As a result, yout is optimized out, and with it the whole SAXPY operation.

SAXPY() in Assembly

When SAXPY() is compiled in AArch64 with Optimisation -O3

__Z5SAXPYfPKfS0_: ; @_Z5SAXPYfPKfS0_
.cfi_startproc
; %bb.0:
ret
.cfi_endproc
; -- End function

The function body is practically empty: a single ret, “return from subroutine” (Arm A64 Instruction Set: RET). In simple terms, it just returns; nothing happens.

2.4.2. Solution – Google Benchmark's DoNotOptimize()

Optimization is important for understanding the performance of operations in production builds, which creates a conflict: optimize, but do not optimize away. Google solved this in benchmark, their microbenchmarking library, which provides benchmark::DoNotOptimize() to prevent variables from being optimized away.

With the same SAXPY function, we simply add DoNotOptimize() around the temporary variable yout

void SAXPY_DONOTOPTIMIZE(const float a, const float* x, const float* y)
{
    float yout;
    for (int i = 0; i < N; ++i)
    {
        yout = a * x[i] + y[i];
        DoNotOptimize(yout);
    }
}

This DoNotOptimize call tells the compiler not to eliminate the temporary variable, so the operation itself won’t be optimized away.

SAXPY_DONOTOPTIMIZE() in Assembly

When SAXPY_DONOTOPTIMIZE() is compiled in AArch64 with Optimisation -O3:

5 __Z19SAXPY_DONOTOPTIMIZEfPKfS0_:        ; @_Z19SAXPY_DONOTOPTIMIZEfPKfS0_
6       .cfi_startproc
7 ; bb.0:
8       sub     sp, sp, #16
9       .cfi_def_cfa_offset 16
10      mov     w8, #16960
11      movk    w8, #15, lsl #16
12      add     x9, sp, #4
13      add     x10, sp, #8
14 LBB0_1:                                 ; =>This Inner Loop Header: Depth=1
15      ldr     s1, [x0], #4
16      ldr     s2, [x1], #4
17      fmadd   s1, s0, s1, s2
18      str     s1, [sp, #4]
19      str     x9, [sp, #8]
20      ; InlineAsm Start
21      ; InlineAsm End
22      subs    x8, x8, #1
23      b.ne    LBB0_1
24 ; bb.2:
25      add     sp, sp, #16
26      ret
27      .cfi_endproc
28                                         ; -- End function


Further inspection reveals the Fused-Multiply-Add instruction on line 17, indicating that the SAXPY operation was not optimized away.


17      fmadd   s1, s0, s1, s2

Reference to Arm A64 Instruction Set: FMADD

2.4.3. comppare::DoNotOptimize()

Given the usefulness of Google Benchmark's benchmark::DoNotOptimize(), comppare includes a verbatim copy of it as comppare::DoNotOptimize().

Example:

impl(...)
{
    ...
    comppare::DoNotOptimize(temporary_variable);
    ...
}

Code Documentation

To Generate code documentation, use doxygen:

cd docs/
doxygen

It should create a directory docs/html.

Open docs/html/index.html to view the documentation in your web browser.