Commit e19a0a0: Add OSS Fleetbench multiprocessing doc (adds `fleetbench/parallel/README.md`, 222 lines).
# Fleetbench Multiprocessing

This page describes the Fleetbench multiprocessing framework, which resides in
`/fleetbench/parallel/`.

[TOC]
## Overview

This framework runs Fleetbench benchmarks concurrently across multiple CPU cores
to simulate specific CPU utilization levels. It schedules benchmarks based on
their expected runtimes and configurable weighting strategies to achieve and
maintain the target load over a defined duration.

The main script is `parallel_bench.py`, which orchestrates the parallel
execution using helper modules for CPU management, benchmark execution, worker
threads, and reporting.
## Features

* **Parallel Execution:** Leverages multiple CPU cores to run benchmarks
  simultaneously.
* **Target Utilization:** Aims to achieve a user-defined average CPU
  utilization across the selected cores.
* **CPU Affinity:** Optionally binds worker threads to specific CPU cores for
  predictable performance.
* **Flexible Benchmark Selection:**
  * Run all default benchmarks.
  * Filter benchmarks by keyword.
  * Filter benchmarks by predefined workload groups (e.g., `proto`, `libc`).
* **Weighted Scheduling:** Selects benchmarks to run based on different
  strategies (e.g., workload runtime, individual benchmark runtime, custom
  weights) to match target distributions.
* **Hyperthreading Control (x86_64):** Manages Simultaneous Multithreading
  (SMT) state and workload placement (disable, dynamic, skewed, balanced).
* **Performance Monitoring:** Collects basic timing information and optionally
  integrates with `perf` counters.
* **Reporting:** Generates summary reports in JSON format, including
  aggregated results and average utilization.
## Usage

1.  Build the Fleetbench binary:

    ```
    bazel build --config=clang --config=opt --config=haswell fleetbench:fleetbench
    ```

1.  Build the parallel framework binary:

    ```
    bazel build --config=clang --config=opt --config=haswell fleetbench/parallel:parallel_bench
    ```

1.  Run the framework. By default, it runs for 60 seconds and attempts to
    utilize 75% of the system's CPU.

    ```
    bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench
    ```

    To run a single benchmark across multiple cores, `--benchmark_filter`
    can be helpful:

    ```
    bazel-bin/fleetbench/parallel/parallel_bench --benchmark_target=bazel-bin/fleetbench/fleetbench --benchmark_filter=BM_PROTO_Arena
    ```

Tip: See [Command Flags](#command-flags) for the full list of flags.
## How it works

1.  **Initialization:** Parses command-line flags, detects the CPU architecture
    (x86 vs. Arm), selects the specific CPU cores to use based on `--num_cpus`
    and `--hyperthreading_mode`, and creates the main `ParallelBench` object.

1.  **Benchmark Setup:** Determines the list of benchmarks to run based on
    filters, calculates initial benchmark selection weights/probabilities based
    on the chosen strategy, and initializes the controller and worker threads.

1.  **Scheduling Loop:** This loop runs for the specified `--duration`.

    *   Periodically checks current CPU utilization across worker cores.
    *   Calculates the number of additional benchmark jobs needed to reach
        `--utilization`.
    *   Selects which benchmarks to run next. Probabilities are adjusted
        dynamically based on observed runtimes of completed benchmarks.
    *   Dispatches the selected benchmark objects to idle worker threads.
    *   Collects completed `Result` objects from workers, updating runtime
        statistics for future scheduling decisions.

1.  **Benchmark Execution:** Each worker thread waits for its assigned
    benchmark object.

    *   If `--cpu_affinity` is set, the worker binds itself to its assigned
        CPU.
    *   The worker executes the benchmark in a subprocess, passing the
        necessary benchmark flags. It reads the JSON output from the
        benchmark, parses key metrics, and stores the corresponding result.

1.  **Shutdown & Reporting:**

    *   After the duration completes, the controller signals workers to stop
        and collects any remaining results.
    *   Results from all runs within a repetition are aggregated into a
        DataFrame. If perf counters were requested, they are parsed and added.
    *   A summary report is generated and saved to `results.json` in the
        repetition's temporary directory.
    *   If `--keep_raw_data` is false, individual run files (`run_XXX`) are
        deleted. If multiple repetitions were run, a final aggregated report
        is created in the parent directory.
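The core of the scheduling loop can be sketched roughly as follows. This is an illustrative simplification, not the actual `parallel_bench.py` code; the helper names and the one-job-per-CPU utilization model are assumptions made for the sketch.

```python
import random


def jobs_needed(target_utilization, current_utilization, num_worker_cpus):
    """Estimate how many additional benchmark jobs to dispatch.

    Simplified model: each running job is assumed to saturate one worker CPU,
    so the utilization deficit scaled by the worker count gives a job count.
    """
    deficit = max(0.0, target_utilization - current_utilization)
    return round(deficit * num_worker_cpus)


def pick_benchmark(weights, rng=random):
    """Pick the next benchmark with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

In the real framework the weights are also adjusted over time from observed runtimes; here they are taken as given.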
## Command Flags

Flags are defined in
[parallel_bench.py](https://github.com/google/fleetbench/blob/main/fleetbench/parallel/parallel_bench.py).

Here's a breakdown:

### Basic Execution

* `--duration`: (Integer, default: 60) Minimum duration in seconds for the
  entire parallel run. The tool will try to sustain load for at least this
  long.

* `--utilization`: (Float, 0.0-1.0, default: 0.75) The desired average CPU
  utilization across the worker cores.

* `--repetitions`: (Integer, default: 1) How many times to repeat the entire
  benchmarking experiment (each run lasts `--duration` seconds). Results
  across repetitions are averaged in the final report.

* `--temp_dir`: (String, default: `/tmp/parallel_bench`) Directory for
  storing temporary files and results. Each repetition gets a subdirectory
  (e.g., `run_0`, `run_1`).
### Benchmark Selection & Configuration

* `--benchmark_target`: (String, default: `fleetbench`) The name or full path
  of the Fleetbench benchmark executable.

* `--benchmark_filter`: (String, repeatable) Selects benchmarks from the
  minimal set that contain any of the specified keywords.

  Example: `--benchmark_filter=PROTO --benchmark_filter=LIBC`

* `--workload_filter`: (String, repeatable) Selects benchmarks based on
  workload groups. Overrides `--benchmark_filter`. The format is
  `workload_name,keyword1,keyword2` or `workload_name,all`. Unlike
  `--benchmark_filter`, which only filters benchmarks within the minimal set,
  this flag selects benchmarks from the entire collection of benchmark pools.

  Example: `--workload_filter=LIBC,Memcpy,Memcmp --workload_filter=PROTO,all`
* `--scheduling_strategy`: (Enum, default: `WORKLOAD_WEIGHTED`) Strategy for
  choosing the next benchmark:

  * `WORKLOAD_WEIGHTED`: Based on the expected aggregate runtime of benchmarks
    within a workload.
  * `BM_WEIGHTED`: Based on the expected runtime of individual benchmarks.
  * `DCTAX_WEIGHTED`: A templated
    [weights.csv](https://github.com/google/fleetbench/blob/main/fleetbench/parallel/weights.csv)
    is provided for this strategy; adjust the file as needed. With this
    strategy, the aggregated runtime for each benchmark will be proportional
    to the ratios defined in the CSV file.

* `--benchmark_weights`: (String, repeatable) Assigns custom weights to
  benchmarks or filtered groups. Format:
  `"benchmark_name|filter_keyword:weight"`. Benchmarks not explicitly
  weighted default to 1.0. By adjusting the weights of different benchmarks,
  you can control the memory bandwidth utilization during the run.

  Example: `--benchmark_weights="PROTO:2.5" --benchmark_weights="COLD:0.5"`
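The `--benchmark_weights` format above can be resolved into a per-benchmark weight map along these lines. This is an illustrative sketch, not the framework's actual parsing code; `parse_weights` is a hypothetical helper, and the substring match is an assumed interpretation of "filter keyword".

```python
def parse_weights(entries, benchmarks, default=1.0):
    """Map each benchmark to a weight from "name_or_keyword:weight" entries.

    An entry's key matches a benchmark either exactly (benchmark name) or as
    a substring (filter keyword). Benchmarks with no matching entry keep the
    default weight of 1.0, as documented for --benchmark_weights.
    """
    weights = {b: default for b in benchmarks}
    for entry in entries:
        key, _, value = entry.rpartition(":")
        for b in benchmarks:
            if key == b or key in b:
                weights[b] = float(value)
    return weights
```

For example, `["PROTO:2.5", "COLD:0.5"]` would scale every benchmark whose name contains `PROTO` to 2.5 and every `COLD` benchmark to 0.5, leaving the rest at 1.0.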
### Performance & Tuning

* `--benchmark_repetitions`: (Integer, default: 0) Number of times each
  individual benchmark invocation should repeat internally. This is the same
  flag as supported by https://github.com/google/benchmark.

* `--benchmark_min_time`: (String, default: "2s") Minimum duration for each
  individual benchmark invocation. Again, this follows
  https://github.com/google/benchmark usage.

* `--benchmark_perf_counters`: (String, default: "") Comma-separated list of
  perf counters to collect for each individual benchmark run, e.g.
  `--benchmark_perf_counters=cycles,instructions`.
### CPU & Scheduling Control

* `--num_cpus`: (Integer, default: all available) Total number of logical
  CPUs to use. One CPU is reserved for the controller thread; the rest are
  used for workers. Must be >= 2.
* `--cpu_affinity`: (Boolean, default: True) If true, bind each worker thread
  tightly to its assigned CPU core. If false, allow the OS scheduler to
  manage worker placement.
* `--hyperthreading_mode`: (Enum, default: DYNAMIC, x86_64 only) Controls SMT
  (Hyperthreading) behavior:

  * `DISABLE`: Attempts to select only one thread per physical core,
    effectively disabling SMT for the benchmark run. The utilization target
    applies to these selected cores.
  * `DYNAMIC`: SMT enabled; the OS scheduler manages placement across
    available cores (up to `--num_cpus`).
  * `SKEWED`: SMT enabled. Tries to fill cores on Socket 0 first, then
    Socket 1. Uses lower-numbered sibling threads preferentially.
  * `BALANCED`: SMT enabled. Tries to distribute work evenly across sockets
    and cores/threads.

  Note: SKEWED and BALANCED modes adjust `--num_cpus` internally and set
  `--utilization` effectively to 1.0 for the selected set of cores to ensure
  they are fully loaded.
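Two of the selection policies above can be sketched as pure functions over the CPU topology. This is illustrative only; the framework's actual core-selection code lives in its CPU-management module and may differ. On Linux, the sibling and socket maps would typically come from sysfs (e.g. `/sys/devices/system/cpu/cpu*/topology/`).

```python
def one_thread_per_core(siblings_by_core):
    """DISABLE mode sketch: keep only the lowest-numbered sibling thread of
    each physical core, so one hardware thread per core receives work.

    siblings_by_core maps a physical core id to its logical CPU ids.
    """
    return sorted(min(threads) for threads in siblings_by_core.values())


def skewed_order(cpus_by_socket):
    """SKEWED mode sketch: fill Socket 0 first, then Socket 1, preferring
    lower-numbered logical CPUs within each socket.

    cpus_by_socket maps a socket id to its logical CPU ids.
    """
    order = []
    for socket in sorted(cpus_by_socket):
        order.extend(sorted(cpus_by_socket[socket]))
    return order
```

For a two-core machine where core 0 has threads {0, 4} and core 1 has {1, 5}, `one_thread_per_core` would select CPUs 0 and 1.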
### Output & Reporting

* `--keep_raw_data`: (Boolean, default: False) If true, keeps the individual
  JSON output files generated by each benchmark run within the `temp_dir`.
  Otherwise, they are deleted after results are aggregated.
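Averaging results across repetitions, as done for the final aggregated report, can be sketched as follows. This is illustrative only; the real report schema and aggregation logic in the framework may differ, and `aggregate_reports` is a hypothetical helper operating on dicts as loaded from each repetition's `results.json`.

```python
import statistics


def aggregate_reports(reports):
    """Average the numeric metrics appearing across per-repetition reports.

    Non-numeric fields are dropped; a metric present in only some
    repetitions is averaged over the repetitions that report it.
    """
    summary = {}
    keys = set().union(*(r.keys() for r in reports)) if reports else set()
    for key in sorted(keys):
        values = [r[key] for r in reports
                  if isinstance(r.get(key), (int, float))]
        if values:
            summary[key] = statistics.mean(values)
    return summary
```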
