ComPPare 1.0.0
See Google Benchmark Instructions
In your C++ code, simply include the comppare header file by:
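For example (the header path matches the comppare/comppare.hpp path referenced later on this page):

```cpp
#include "comppare/comppare.hpp"
```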
There are a few rules for using comppare:

- The function return type must be `void`.
- The parameter list consists of the input types first, then the output types.
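A sketch of the required shape, with illustrative names (two inputs, one output):

```cpp
#include <vector>

// inputs first (by value or const reference), then outputs (by non-const reference)
void my_impl(const std::vector<float>& in1, float in2, std::vector<float>& out);
```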
### HOTLOOP Macros
In order to benchmark specific regions of code, the macros HOTLOOPSTART and HOTLOOPEND are needed. The region between them is run multiple times to obtain an accurate timing: it is first run for a number of warmup iterations before the actual benchmark iterations.

Alternatively, the macro HOTLOOP() can be used to wrap the whole region.
### main()
In main(), you set up the comparison: define the reference function, initialize the input data, name the benchmark, and so on.
### The SAXPY example will be used throughout this section to demonstrate usage
> SAXPY stands for Single-Precision A·X Plus Y. (Also see examples/saxpy)
> It performs the following operation:
> $$ y_{\text{out}}[i] = a \cdot x[i] + y[i] $$
Based on Section [Basic Usage], we can define a saxpy function as:
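A sketch of such a saxpy_cpu implementation (the exact code lives in examples/saxpy; the output parameter name y_out is illustrative):

```cpp
#include <cstddef>
#include <vector>
#include "comppare/comppare.hpp"

// inputs: a, x, y_in -- output: y_out
void saxpy_cpu(float a,
               const std::vector<float>& x,
               const std::vector<float>& y_in,
               std::vector<float>& y_out)
{
    y_out.resize(x.size());

    HOTLOOPSTART                     // region below is warmed up, then timed
    for (std::size_t i = 0; i < x.size(); ++i)
        y_out[i] = a * x[i] + y_in[i];
    HOTLOOPEND
}
```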
The same input data is used across the different implementations; it is also needed to initialize the comppare object.
Before testing functions like saxpy_cpu, you must first define the input data:
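For instance (the size and values are illustrative):

```cpp
#include <cstddef>
#include <vector>

// typically defined inside main()
constexpr std::size_t N = 1 << 20;      // illustrative problem size
float a = 2.0f;
std::vector<float> x(N, 1.0f);
std::vector<float> y_in(N, 2.0f);
```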
For example, saxpy_cpu produces a std::vector<float> as its output.
Therefore, the output type must be specified as the template parameter of make_comppare. The function call then takes the **input variables (a, x, y_in) as arguments** (initialized in step 1):
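A sketch of that call, assuming the factory function lives in the comppare namespace:

```cpp
// output type(s) as template parameters, input variables as arguments
auto cmp = comppare::make_comppare<std::vector<float>>(a, x, y_in);
```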
After creating the cmp object, we can add functions to it.

To define the reference function:
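For example, using the saxpy_cpu function defined above:

```cpp
cmp.set_reference("saxpy_cpu", saxpy_cpu);
```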
To add more functions:
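For example, adding a second candidate (the saxpy_gpu shown in the output below):

```cpp
cmp.add("saxpy_gpu", saxpy_gpu);
```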
Command-line arguments are passed as the parameters of run():
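For example:

```cpp
cmp.run(argc, argv);   // argc/argv forwarded from main()
```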
Example output:

In this case, saxpy_gpu failed.
There are two command-line options to control the number of warmup and benchmark iterations (see the example after this list):

- `--warmups <N>` – Warmup iterations. Takes an unsigned 64-bit integer and sets the number of warmup iterations run before the benchmark iterations.
- `--iters <N>` – Benchmark iterations. Takes an unsigned 64-bit integer and sets the number of benchmark iterations.
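For instance, assuming the built example executable is named saxpy (the name and values are illustrative):

```sh
./saxpy --warmups 1000 --iters 100
```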
`--tolerance` – Numerical error tolerance. Floating-point operations are never exact, so a tolerance must be set. For any floating-point type T the tolerance is given a type-dependent default, and for integral types the default tolerance is 0. The `--tolerance` flag sets both the floating-point and the integral tolerance:
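For instance (again with an illustrative executable name and value):

```sh
./saxpy --tolerance 1e-6
```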
### HOTLOOP Macros
The HOTLOOP macros essentially wrap your region of interest in a lambda and then run that lambda across the warmup and benchmark iterations. The region of interest is timed across all benchmark iterations to obtain an average runtime.

In comppare/comppare.hpp, HOTLOOPSTART and HOTLOOPEND are defined such that any code between the two macros is simply wrapped into a lambda, hotloop_body, which the framework then runs across the warmup and benchmark iterations.
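A schematic sketch of that runner (not the actual definitions in comppare/comppare.hpp; the real macros also handle plugins, manual timers, and result bookkeeping):

```cpp
#include <chrono>
#include <cstddef>

// HOTLOOPSTART effectively opens the hotloop_body lambda, HOTLOOPEND closes it
// and hands it to a runner along these lines:
template <typename Body>
void run_hotloop(Body&& hotloop_body,
                 std::size_t warmup_iters, std::size_t bench_iters)
{
    for (std::size_t i = 0; i < warmup_iters; ++i)
        hotloop_body();                         // warmup runs (not reported)

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < bench_iters; ++i)
        hotloop_body();                         // timed benchmark runs
    auto t1 = std::chrono::steady_clock::now();

    // average runtime per iteration, which the framework records/prints
    auto avg_us = std::chrono::duration<double, std::micro>(t1 - t0).count() / bench_iters;
    (void)avg_us;
}
```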
Each implementation follows the same signature:
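Concretely, for the SAXPY example (the output parameter name is illustrative):

```cpp
#include <vector>

void saxpy_impl(float a,                        // inputs: supplied by the framework,
                const std::vector<float>& x,    //         identical for every candidate
                const std::vector<float>& y_in,
                std::vector<float>& y_out);     // output: a private object created by the
                                                //         framework, written by reference
```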
- Input – passed in by the framework. All candidates get exactly the same data.
- Output – the framework creates private output objects and passes them into the implementation by reference, so any change to these objects is captured by the framework.

After each implementation has finished running, the framework compares its output against the reference implementation and prints the results in terms of difference/error and whether it has failed.
### HOTLOOPSTART & HOTLOOPEND
Used to wrap a region of CPU functions/operations for the framework to benchmark.
Example:
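A minimal sketch:

```cpp
#include <cstddef>
#include <vector>
#include "comppare/comppare.hpp"

void scale_cpu(const std::vector<float>& in, std::vector<float>& out)
{
    out.resize(in.size());

    HOTLOOPSTART                                  // this region is benchmarked
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 2.0f * in[i];
    HOTLOOPEND
}
```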
### HOTLOOP()
Alternative to HOTLOOPSTART/HOTLOOPEND.
Example:
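A sketch, under the assumption that HOTLOOP() receives the whole region as its macro argument:

```cpp
#include <cstddef>
#include <vector>
#include "comppare/comppare.hpp"

void scale_cpu(const std::vector<float>& in, std::vector<float>& out)
{
    out.resize(in.size());

    HOTLOOP(
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i] = 2.0f * in[i];
    )
}
```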
### GPU_HOTLOOPSTART & GPU_HOTLOOPEND
Host macros to wrap a region of GPU host functions/operations for the framework to benchmark. Supports both CUDA and HIP, provided the host function is compiled with the respective compiler wrapper (nvcc for CUDA, hipcc for HIP).

Warning: do not use these inside GPU kernels.
Example:
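A sketch of a CUDA host function (names such as saxpy_kernel are illustrative; a HIP version would use the corresponding hip* runtime calls):

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include "comppare/comppare.hpp"

__global__ void saxpy_kernel(float a, const float* x, const float* y,
                             float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

void saxpy_gpu(float a, const std::vector<float>& x,
               const std::vector<float>& y_in, std::vector<float>& y_out)
{
    const int n = static_cast<int>(x.size());
    y_out.resize(n);

    float *d_x, *d_y, *d_out;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    GPU_HOTLOOPSTART                 // only this region is warmed up and timed
    saxpy_kernel<<<(n + 255) / 256, 256>>>(a, d_x, d_y, d_out, n);
    GPU_HOTLOOPEND

    cudaMemcpy(y_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_out);
}
```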
### GPU_HOTLOOP()
Alternative to GPU_HOTLOOPSTART/GPU_HOTLOOPEND.
Example:
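A sketch reusing the saxpy_gpu setup above, under the assumption that GPU_HOTLOOP() receives the region as its macro argument; the launch is wrapped in a small helper so that the comma inside <<<...>>> does not split the macro argument:

```cpp
// helper around the kernel launch (saxpy_kernel as defined in the previous sketch)
inline void launch_saxpy(float a, const float* x, const float* y, float* out, int n)
{
    saxpy_kernel<<<(n + 255) / 256, 256>>>(a, x, y, out, n);
}

// inside saxpy_gpu, replacing the GPU_HOTLOOPSTART/GPU_HOTLOOPEND pair:
GPU_HOTLOOP(
    launch_saxpy(a, d_x, d_y, d_out, n);
)
```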
### MANUAL_TIMER_START & MANUAL_TIMER_END
Used to wrap a region of CPU functions/operations within the HOTLOOP. This pair of macros can only be used once within the hot loop.
Example – only a + b is timed:
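A sketch in the spirit of the description above, where only the addition is timed and other per-iteration work inside the hot loop is excluded:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include "comppare/comppare.hpp"

void add_cpu(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c)
{
    c.resize(a.size());

    HOTLOOPSTART
    std::fill(c.begin(), c.end(), 0.0f);      // per-iteration setup: NOT timed

    MANUAL_TIMER_START
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];                   // only this a + b loop is timed
    MANUAL_TIMER_END
    HOTLOOPEND
}
```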
### GPU_MANUAL_TIMER_START & GPU_MANUAL_TIMER_END
Used to wrap a region of GPU host functions/operations within the HOTLOOP. Supports both CUDA and HIP, provided the host function is compiled with the respective compiler wrapper (nvcc for CUDA, hipcc for HIP). This pair of macros can only be used once within the hot loop.
Example – only the kernel is timed:
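A sketch reusing the saxpy_gpu setup from the GPU_HOTLOOPSTART example above; only the kernel launch is timed, while the per-iteration transfers are excluded:

```cpp
GPU_HOTLOOPSTART
cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);        // NOT timed

GPU_MANUAL_TIMER_START
saxpy_kernel<<<(n + 255) / 256, 256>>>(a, d_x, d_y, d_out, n);               // only the kernel is timed
GPU_MANUAL_TIMER_END

cudaMemcpy(y_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);  // NOT timed
GPU_HOTLOOPEND
```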
### SET_ITERATION_TIME(us)
This macro takes a **floating-point number representing the time of the current iteration in $\mu s$**.
Example with mixed timers:
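A sketch with hypothetical work functions, where two separately measured intervals are summed and reported in microseconds:

```cpp
#include <chrono>
#include <vector>
#include "comppare/comppare.hpp"

void phase_one();          // hypothetical work functions, purely illustrative
void bookkeeping();
void phase_two();

void mixed_timing_impl(const std::vector<float>& in, std::vector<float>& out)
{
    HOTLOOPSTART
    auto t0 = std::chrono::steady_clock::now();
    phase_one();                               // timed
    auto t1 = std::chrono::steady_clock::now();

    bookkeeping();                             // excluded from the reported time

    auto t2 = std::chrono::steady_clock::now();
    phase_two();                               // timed
    auto t3 = std::chrono::steady_clock::now();

    // report the summed intervals for this iteration, in microseconds
    double us = std::chrono::duration<double, std::micro>((t1 - t0) + (t3 - t2)).count();
    SET_ITERATION_TIME(us);
    HOTLOOPEND
}
```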
### main()
Given your implementation signature, instantiate the comparison context by defining the output types and forwarding the input arguments:
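For instance, with the SAXPY example used earlier (and assuming the comppare::make_comppare factory named above):

```cpp
// implementation signature: inputs first, then outputs
void saxpy_cpu(float a, const std::vector<float>& x,
               const std::vector<float>& y_in, std::vector<float>& y_out);

// OutputTypes... as template parameters, inputs... forwarded as arguments
auto cmp = comppare::make_comppare<std::vector<float>>(a, x, y_in);
```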
- `OutputType1, OutputType2, ...`: types of the outputs
- `input1, input2, ...`: variables/values of the inputs

The order of `inputs...` and `OutputTypes...` must match the order in the `impl` signature. After construction, `cmp` is ready to have implementations registered (via `set_reference`/`add`) and executed (via `run`).
### set_reference
Registers the “reference” implementation and returns its corresponding `Impl` descriptor for further configuration – e.g. attaching to plugins like Google Benchmark.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| display_name | `std::string` | Human-readable label for this reference implementation. Used in the output report. |
| f | `std::function<void(const Inputs&..., Outputs&...)>` | Function matching the signature `void(const Inputs&... in, Outputs&... out)` |

Returns **`comppare::Impl&`** – a reference to the internal `Impl` object representing the just-registered implementation, used mainly for attaching plugins like Google Benchmark. It is recommended to discard the return value.
### add
Registers an additional implementation, which will be compared against the reference, and returns its corresponding `Impl` descriptor for further configuration – e.g. attaching to plugins like Google Benchmark.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| display_name | `std::string` | Human-readable label for this implementation. Used in the output report. |
| f | `std::function<void(const Inputs&..., Outputs&...)>` | Function matching the signature `void(const Inputs&... in, Outputs&... out)` |

Returns **`comppare::Impl&`** – a reference to the internal `Impl` object representing the just-registered implementation, used mainly for attaching plugins like Google Benchmark. It is recommended to discard the return value.
### run
Runs all the implementations added to the comppare framework.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| argc | `int` | Number of command-line arguments |
| argv | `char**` | Command-line argument vector |
### main()
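A minimal end-to-end sketch of main() under the assumptions used throughout this page (header path, the comppare::make_comppare factory, and the saxpy_cpu/saxpy_gpu functions from the earlier sketches):

```cpp
#include <cstddef>
#include <vector>
#include "comppare/comppare.hpp"

int main(int argc, char** argv)
{
    // 1. input data, shared by every implementation
    const std::size_t N = 1 << 20;
    float a = 2.0f;
    std::vector<float> x(N, 1.0f), y_in(N, 2.0f);

    // 2. comparison context: output types as template parameters, inputs as arguments
    auto cmp = comppare::make_comppare<std::vector<float>>(a, x, y_in);

    // 3. register implementations
    cmp.set_reference("saxpy_cpu", saxpy_cpu);
    cmp.add("saxpy_gpu", saxpy_gpu);

    // 4. run: command-line arguments control warmups, iters, tolerance, ...
    cmp.run(argc, argv);
    return 0;
}
```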
For more concrete examples, please see the examples directory.
### google_benchmark()
Attaches the Google Benchmark plugin when calling `set_reference` or `add`, enabling Google Benchmark for the current implementation.

Member of the internal struct `comppare::Impl`.

Returns **`benchmark::internal::Benchmark*`** – a pointer to the underlying Google Benchmark `Benchmark` instance. Use this to chain additional benchmark configuration calls (e.g. `->Arg()`, `->Threads()`, `->Unit()`, etc.).

After registering an implementation via `set_reference` or `add`, both functions return a reference to the internal struct `comppare::Impl` (see above). `google_benchmark()` attaches the plugin to the current implementation, and the returned pointer allows you to further customize the benchmark before execution.
Enable manual timing for any registered implementation by appending `->UseManualTime()` to the `Benchmark*` returned from `google_benchmark()`. This instructs Google Benchmark to measure only the intervals you explicitly mark inside your implementation, using the manual timer macros or the `SET_ITERATION_TIME()` macro. `UseManualTime()` is a Google Benchmark API call that switches the benchmark into manual timing mode (see Google Benchmark's documentation).
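A sketch of how this chaining could look, reusing the cmp object and saxpy_gpu from earlier:

```cpp
// attach the Google Benchmark plugin to this implementation and
// switch it to manual timing (only explicitly marked intervals are measured)
cmp.add("saxpy_gpu", saxpy_gpu)
   .google_benchmark()
   ->UseManualTime();
```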
### DoNotOptimize()
For a deep dive into the working principle of `DoNotOptimize()`, please visit examples/advanced_demo/DoNotOptimize.

I, LF Fung, am not the author of `DoNotOptimize()`. The implementation of `comppare::DoNotOptimize()` is a verbatim copy of Google Benchmark's `benchmark::DoNotOptimize()`.
Compiler optimization can sometimes remove operations and variables completely. For instance, consider the following function:
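A sketch consistent with the assembly discussed below (the exact listing of the original source is omitted here); the result is written only to a local temporary:

```cpp
#include <cstddef>

void SAXPY(float a, const float* x, const float* y)
{
    for (std::size_t i = 0; i < 1'000'000; ++i)
    {
        float yout = a * x[i] + y[i];
        (void)yout;   // silences the unused-variable warning, but does NOT stop
                      // the compiler from deleting the store and the whole loop
    }
}
```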
When compiling at high optimization levels, the compiler realizes that yout is just a temporary that is never used elsewhere. As a result, yout is optimized out, and with it the whole SAXPY operation.
When SAXPY() is compiled for AArch64 at optimization level -O3, the function body is practically empty: a single ret, which is "return from subroutine" (Arm A64 Instruction Set: RET). In simple terms, it just returns; nothing happens.
Optimization is important for understanding the performance of particular operations in production builds. This creates the conflicting goals of "optimize" but "do not optimize away". Google solved this in their microbenchmarking library, Google Benchmark, which provides benchmark::DoNotOptimize() to prevent variables from being optimized away.
With the same SAXPY function, we simply add DoNotOptimize() around the temporary variable yout:
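A sketch of the modified function (written here with comppare::DoNotOptimize(); Google Benchmark's benchmark::DoNotOptimize() behaves identically):

```cpp
#include <cstddef>
#include "comppare/comppare.hpp"

void SAXPY_DONOTOPTIMIZE(float a, const float* x, const float* y)
{
    for (std::size_t i = 0; i < 1'000'000; ++i)
    {
        float yout = a * x[i] + y[i];
        comppare::DoNotOptimize(yout);   // keeps yout (and the fmadd) alive
    }
}
```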
This DoNotOptimize call tells the compiler not to eliminate the temporary variable, so the operation itself will not be optimized away.
When SAXPY_DONOTOPTIMIZE() is compiled for AArch64 at optimization level -O3:
```asm
5 __Z19SAXPY_DONOTOPTIMIZEfPKfS0_: ; @_Z19SAXPY_DONOTOPTIMIZEfPKfS0_
6 .cfi_startproc
7 ; bb.0:
8 sub sp, sp, #16
9 .cfi_def_cfa_offset 16
10 mov w8, #16960
11 movk w8, #15, lsl #16
12 add x9, sp, #4
13 add x10, sp, #8
14 LBB0_1: ; =>This Inner Loop Header: Depth=1
15 ldr s1, [x0], #4
16 ldr s2, [x1], #4
17 fmadd s1, s0, s1, s2
18 str s1, [sp, #4]
19 str x9, [sp, #8]
20 ; InlineAsm Start
21 ; InlineAsm End
22 subs x8, x8, #1
23 b.ne LBB0_1
24 ; bb.2:
25 add sp, sp, #16
26 ret
27 .cfi_endproc
28 ; -- End function
```
Further inspection reveals the fused multiply-add instruction on line 17, indicating that the SAXPY operation was not optimized away:

`17 fmadd s1, s0, s1, s2`

Reference: Arm A64 Instruction Set: FMADD.
### comppare::DoNotOptimize()
Given the usefulness of Google Benchmark's benchmark::DoNotOptimize(), comppare includes a verbatim copy of it as comppare::DoNotOptimize().

Example:
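A minimal sketch of its use (same header as at the top of this page):

```cpp
#include <vector>
#include "comppare/comppare.hpp"

void sum_demo(const std::vector<float>& x)
{
    float sum = 0.0f;
    for (float v : x)
        sum += v;

    comppare::DoNotOptimize(sum);   // prevents the compiler from discarding sum
                                    // (and with it the whole loop)
}
```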
To generate the code documentation, use Doxygen:
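For example, from the repository root (assuming the Doxyfile lives there):

```sh
doxygen Doxyfile
```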
This should create a directory docs/html. Open docs/html/index.html to view the documentation in your web browser.