# Fleetbench Multiprocessing

This page describes the Fleetbench multiprocessing framework, which resides in
`/fleetbench/parallel/`.

[TOC]

## Overview

This framework runs Fleetbench benchmarks concurrently across multiple CPU cores
to simulate specific CPU utilization levels. It schedules benchmarks based on
their expected runtimes and configurable weighting strategies to achieve and
maintain the target load over a defined duration.

The main script is `parallel_bench.py`, which orchestrates the parallel
execution using helper modules for CPU management, benchmark execution, worker
threads, and reporting.

## Features

*   **Parallel Execution:** Leverages multiple CPU cores to run benchmarks
    simultaneously.
*   **Target Utilization:** Aims to achieve a user-defined average CPU
    utilization across selected cores.
*   **CPU Affinity:** Option to bind worker threads to specific CPU cores for
    predictable performance.
*   **Flexible Benchmark Selection:**
    *   Run all default benchmarks.
    *   Filter benchmarks by keywords.
    *   Filter benchmarks by predefined workload groups (e.g., `proto`, `libc`).
*   **Weighted Scheduling:** Selects benchmarks to run based on different
    strategies (e.g., workload runtime, individual benchmark runtime, custom
    weights) to match target distributions.
*   **Hyperthreading Control (x86_64):** Manages Simultaneous Multithreading
    (SMT) state and workload placement (disable, dynamic, skewed, balanced).
*   **Performance Monitoring:** Collects basic timing information and
    optionally integrates with `perf` counters.
*   **Reporting:** Generates summary reports in JSON format, including
    aggregated results and average utilization.

## Usage

1. Build the Fleetbench binary:

    ```
    bazel build --config=clang --config=opt --config=haswell fleetbench:fleetbench
    ```

1. Build the parallel framework binary:

    ```
    bazel build --config=clang --config=opt --config=haswell fleetbench/parallel:parallel_bench
    ```

1. Run the parallel framework. By default, it runs for 60 seconds and attempts
    to utilize 75% of the system's CPU.

    ```
    bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench
    ```

    If you want to run a single benchmark on multiple cores,
    `--benchmark_filter` can be helpful:

    ```
    bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench --benchmark_filter=BM_PROTO_Arena
    ```

    Tip: See [Command Flags](#command-flags) for the full list of flags.

## How it works

1. **Initialization:** Parses command-line flags, detects the CPU architecture
    (x86 vs. ARM), selects the specific CPU cores to use based on `--num_cpus`
    and `--hyperthreading_mode`, and creates the main `ParallelBench` object.
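
    The core-selection idea, as a minimal sketch (illustrative only, not the
    tool's actual code): one logical CPU is reserved for the controller and the
    rest go to workers, as described under `--num_cpus` below.

    ```
    import os

    available = sorted(os.sched_getaffinity(0))  # logical CPUs visible to us
    num_cpus = min(len(available), 8)            # e.g. --num_cpus=8
    selected = available[:num_cpus]
    controller_cpu, worker_cpus = selected[0], selected[1:]
    ```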

1. **Benchmark Setup:** Determines the list of benchmarks to run based on the
    filters, calculates initial benchmark selection weights/probabilities based
    on the chosen strategy, and initializes the controller and worker threads.

1. **Scheduling Loop:** This loop runs for the specified `--duration`.

    *   Periodically checks current CPU utilization across worker cores.
    *   Calculates the number of additional benchmark jobs needed to reach
        `--utilization` (see the sketch after this list).
    *   Selects which benchmarks to run next. Probabilities are adjusted
        dynamically based on observed runtimes of completed benchmarks.
    *   Dispatches the selected benchmark objects to idle worker threads.
    *   Collects completed `Result` objects from workers, updating runtime
        statistics for future scheduling decisions.
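
    A minimal sketch of the utilization-to-jobs calculation (function and
    variable names here are hypothetical, not Fleetbench internals):

    ```
    def jobs_needed(target_utilization, busy_workers, total_workers):
        # Workers that must be busy to hit the target, minus those already busy.
        target_busy = round(target_utilization * total_workers)
        return max(0, target_busy - busy_workers)

    print(jobs_needed(0.75, 10, 16))  # -> 2 more benchmarks to dispatch
    ```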

1. **Benchmark Execution:** Each worker thread waits for its assigned benchmark
    object.

    *   If `--cpu_affinity` is set, the worker binds itself to its assigned CPU.
    *   The worker executes the benchmark in a subprocess, passing the necessary
        benchmark flags. It reads the JSON output from the benchmark, parses key
        metrics, and stores the corresponding result (see the sketch after this
        list).
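
    A rough sketch of a worker's job, assuming the standard Google Benchmark
    flags (`--benchmark_filter`, `--benchmark_min_time`, `--benchmark_out`,
    `--benchmark_out_format`); the function name and structure are
    illustrative, not the tool's exact code:

    ```
    import json
    import os
    import subprocess
    import tempfile

    def run_one(binary, benchmark_name, assigned_cpu=None, min_time="2s"):
        if assigned_cpu is not None:
            os.sched_setaffinity(0, {assigned_cpu})  # what --cpu_affinity enables
        with tempfile.NamedTemporaryFile(suffix=".json") as out:
            subprocess.run(
                [binary,
                 f"--benchmark_filter=^{benchmark_name}$",
                 f"--benchmark_min_time={min_time}",
                 f"--benchmark_out={out.name}",
                 "--benchmark_out_format=json"],
                check=True,
            )
            with open(out.name) as f:
                return json.load(f)["benchmarks"]  # per-benchmark metrics
    ```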

1. **Shutdown & Reporting:**

    *   After the duration completes, the controller signals workers to stop
        and collects any remaining results.
    *   Results from all runs within a repetition are aggregated into a
        DataFrame; if perf counters were requested, they are parsed and added
        (a toy aggregation example follows this list).
    *   A summary report is generated and saved to `results.json` in the
        repetition's temporary directory.
    *   If `--keep_raw_data` is false, individual run files (`run_XXX`) are
        deleted. If multiple repetitions were run, a final aggregated report is
        created in the parent directory.
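
    As an illustration of the aggregation step only (the column names are made
    up, not the report's actual schema), using pandas:

    ```
    import pandas as pd

    results = [
        {"benchmark": "BM_PROTO_Arena", "cpu_time_s": 1.9},
        {"benchmark": "BM_PROTO_Arena", "cpu_time_s": 2.1},
    ]
    df = pd.DataFrame(results)
    summary = df.groupby("benchmark").mean()  # average across runs
    summary.to_json("results.json")
    ```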

## Command Flags

Flags are defined in
[parallel_bench.py](https://github.com/google/fleetbench/blob/main/fleetbench/parallel/parallel_bench.py).

Here's a breakdown:

### Basic Execution

*   `--duration`: (Integer, default: 60) Minimum duration in seconds for the
    entire parallel run. The tool will try to sustain load for at least this
    long.

*   `--utilization`: (Float, 0.0-1.0, default: 0.75) The desired average CPU
    utilization across the worker cores.

*   `--repetitions`: (Integer, default: 1) How many times to repeat the entire
    benchmarking experiment (each run lasts `--duration` seconds). Results
    across repetitions are averaged in the final report.

*   `--temp_dir`: (String, default: `/tmp/parallel_bench`) Directory for
    storing temporary files and results. Each repetition gets a subdirectory
    (e.g., `run_0`, `run_1`).

### Benchmark Selection & Configuration

*   `--benchmark_target`: (String, default: `fleetbench`) The name or full path
    of the Fleetbench benchmark executable.

*   `--benchmark_filter`: (String, repeatable) Selects benchmarks from the
    minimal set that contain any of the specified keywords.

    Example: `--benchmark_filter=PROTO --benchmark_filter=LIBC`

*   `--workload_filter`: (String, repeatable) Selects benchmarks based on
    workload groups; overrides `--benchmark_filter`. The format is
    `workload_name,keyword1,keyword2` or `workload_name,all`. Unlike
    `--benchmark_filter`, which only filters benchmarks within the minimal set,
    this flag selects from the entire collection of benchmark pools.

    Example: `--workload_filter=LIBC,Memcpy,Memcmp --workload_filter=PROTO,all`

*   `--scheduling_strategy`: (Enum, default: `WORKLOAD_WEIGHTED`) Strategy for
    choosing the next benchmark:

    *   `WORKLOAD_WEIGHTED`: Based on the expected aggregate runtime of
        benchmarks within a workload.
    *   `BM_WEIGHTED`: Based on the expected runtime of individual benchmarks.
    *   `DCTAX_WEIGHTED`: We provide a templated
        [weights.csv](https://github.com/google/fleetbench/blob/main/fleetbench/parallel/weights.csv)
        for this strategy; adjust the file as needed. With this strategy, the
        aggregated runtime of each benchmark will be proportional to the ratios
        defined in the CSV file.

*   `--benchmark_weights`: (String, repeatable) Assigns custom weights to
    benchmarks or filtered groups. Format:
    `"benchmark_name|filter_keyword:weight"`. Benchmarks not explicitly
    weighted default to 1.0. By adjusting the weights of different benchmarks,
    you can influence memory bandwidth utilization during the run (see the
    sketch after the example below).

    Example: `--benchmark_weights="PROTO:2.5" --benchmark_weights="COLD:0.5"`
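
    To illustrate how such weights could bias selection (a toy sketch with
    made-up benchmark names, not the scheduler's actual code):

    ```
    import random

    weights = {"BM_PROTO_Arena": 2.5, "BM_COLD_Sort": 0.5, "BM_LIBC_Memcpy": 1.0}
    total = sum(weights.values())
    probabilities = {name: w / total for name, w in weights.items()}
    next_benchmark = random.choices(list(weights), list(weights.values()))[0]
    ```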

### Performance & Tuning

*   `--benchmark_repetitions`: (Integer, default: 0) Number of times each
    individual benchmark invocation should repeat internally. This is the same
    flag supported by [Google Benchmark](https://github.com/google/benchmark).

*   `--benchmark_min_time`: (String, default: "2s") Minimum duration for each
    individual benchmark invocation. Again, this follows
    [Google Benchmark](https://github.com/google/benchmark) usage.

*   `--benchmark_perf_counters`: (String, default: "") Comma-separated list of
    perf counters to collect for each individual benchmark run, e.g.,
    `--benchmark_perf_counters=cycles,instructions`.

### CPU & Scheduling Control

*   `--num_cpus`: (Integer, default: all available) Total number of logical
    CPUs to use. One CPU is reserved for the controller thread; the rest are
    used for workers. Must be >= 2.
*   `--cpu_affinity`: (Boolean, default: True) If true, bind each worker thread
    tightly to its assigned CPU core. If false, allow the OS scheduler to
    manage worker placement.
*   `--hyperthreading_mode`: (Enum, default: DYNAMIC, x86_64 only) Controls SMT
    (Hyperthreading) behavior:

    *   `DISABLE`: SMT disabled. Attempts to select only one thread per
        physical core, effectively disabling SMT for the benchmark run. The
        utilization target applies to these selected cores (see the sketch
        below).
    *   `DYNAMIC`: SMT enabled. The OS scheduler manages placement across
        available cores (up to `--num_cpus`).
    *   `SKEWED`: SMT enabled. Tries to fill cores on Socket 0 first, then
        Socket 1. Uses lower-numbered sibling threads preferentially.
    *   `BALANCED`: SMT enabled. Tries to distribute work evenly across sockets
        and cores/threads.

    Note: `SKEWED` and `BALANCED` modes adjust `--num_cpus` internally and set
    `--utilization` effectively to 1.0 for the selected set of cores to ensure
    they are fully loaded.
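
    One way to pick one thread per physical core on Linux is via the sysfs
    topology files; this is a simplified sketch of that idea, not the tool's
    exact logic:

    ```
    def one_thread_per_core(cpus):
        selected, seen_sibling_sets = [], set()
        for cpu in cpus:
            path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
            with open(path) as f:
                siblings = f.read().strip()  # e.g. "0,32" lists SMT siblings
            if siblings not in seen_sibling_sets:
                seen_sibling_sets.add(siblings)
                selected.append(cpu)
        return selected
    ```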

### Output & Reporting

*   `--keep_raw_data`: (Boolean, default: False) If true, keeps the individual
    JSON output files generated by each benchmark run within the `temp_dir`.
    Otherwise, they are deleted after results are aggregated.