CSHARP-5992: Add LINQ translation benchmark suite by adelinowona · Pull Request #2004 · mongodb/mongo-csharp-driver

adelinowona · 2026-05-19T16:00:28Z

Summary

Adds a 10-benchmark LinqBench suite that exercises the LINQ-to-aggregation translation layer in isolation (no DB I/O), plus a small LinqEndToEnd suite that compares LINQ vs hand-built BsonDocument query plans end-to-end. Wires LinqBench into the perf-job category filter and the composite-score path. Designed to give us a defensible signal when the translator (especially SerializerFinder, AstSimplifier, and the various method-call sub-translators) moves.

Cross-commit runs against pre-/post-SerializerFinder and against an in-flight optimization PR (#1961) show the suite catches the kinds of regressions and improvements we'd want it to catch — see Validation below.

Motivation

The driver has spec-driven benchmarks for I/O-heavy patterns (Find, BulkWrite, GridFS, BSON encode/decode) but nothing for the LINQ translator. As internal code paths shift — SerializerFinder overhauls, AstSimplifier changes, new visitor support — we currently have no signal on translation-cost movement. When CSHARP-5572 introduced SerializerFinder (#1700), we had no way to quantify what it cost us. This PR closes that gap.

What this adds

File	Purpose
`benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqTranslationBenchmark.cs`	10-benchmark translation suite (filter / field / projection / update / IQueryable entry points)
`benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqEndToEndBenchmark.cs`	12 LINQ-vs-raw end-to-end benchmarks (6 patterns × {LINQ, Raw}) against a local mongod
`benchmarks/MongoDB.Driver.Benchmarks/Linq/README.md`	Suite-level docs: what each benchmark exercises, how to interpret, threshold story
`benchmarks/MongoDB.Driver.Benchmarks/DriverBenchmarkCategory.cs`	New `LinqBench` const; `LinqBench` and `BulkWriteBench` added to `AllCategories` so composite scores are emitted
`benchmarks/MongoDB.Driver.Benchmarks/BenchmarkResult.cs` + `Exporters/*.cs`	Composite-score labelling for `LinqBench`
`evergreen/run-perf-tests.sh`	Adds `LinqBench` to the `--anyCategories` filter

Cedar/SPS auto-discovers new composite categories — no dashboard config required.

Design decisions

Translation-only as the primary suite. LinqTranslationBenchmark calls into translator entry points directly (LinqProviderAdapter.TranslateExpressionTo*, ExpressionToExecutableQueryTranslator.Translate). No DB execution. This isolates translator regressions from network and serialization noise. End-to-end coverage lives in LinqEndToEndBenchmark for visibility but is not what we'd alert on first.
Representative user queries, not a visitor matrix. An earlier proposal organized benchmarks by which SerializerFinderVisit*.cs file they exercised. Pushed back to "what users actually write" framing; matrix coverage remains an option for targeted gaps in the future.
LinqBench does not cross-tag with the spec categories (DriverBench, BsonBench, etc.). Keeps LINQ numbers out of the spec composite averages. LinqBench does land in AllCategories so its own composite is emitted.
BulkWriteBench composite addition bundled in this PR. Was previously excluded with a "not part of the benchmarking spec" comment. Team wanted its composite tracked too; doing it here avoids a tiny standalone PR.
[MemoryDiagnoser] on everything. Allocation regressions matter independently of time and surface earlier than time regressions on noisy hardware.

Benchmark inventory

Translation suite (10 benchmarks)

Filters (4) — TranslateExpressionToFilter:

Benchmark	Pattern	Translator path
`MultiFieldSearch`	`Status == s && CustomerName.StartsWith(prefix) && ShippingAddress.City == city && CreatedAt > cutoff && !IsPaid`	And → Comparison, MethodCall (StartsWith), nested MemberAccess, Not + boolean MemberAccess
`OrFilter`	4-way `==` OR with literal constants	Or → Comparison. No closures — fastest filter, most sensitive to small regressions
`BatchLookup`	`ids.Contains(x.Id)`	MethodCall → ContainsMethodToFilterTranslator → `$in`
`ArrayElementQuery`	`x.Items.Any(i => i.Price > t)`	MethodCall → AllOrAnyMethodToFilterTranslator → `$elemMatch`, `@<elem>` symbol
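The "no closures" note on `OrFilter` can be illustrated with plain expression trees, independent of the driver. This is a minimal sketch: the `Order` type and the helper names are hypothetical stand-ins, not the benchmark's actual code. It shows that a literal operand appears in the tree as a constant node, while a captured local appears as a member access on a compiler-generated closure object, which is one more node for the translator to evaluate.

```csharp
using System;
using System.Linq.Expressions;

public static class ClosureShapeDemo
{
    // Hypothetical stand-in for the benchmark's document type.
    public sealed class Order
    {
        public string Status { get; set; } = "";
    }

    // Returns the node type of the right-hand operand of an `x.Status == ...` filter.
    public static ExpressionType RightOperandNodeType(Expression<Func<Order, bool>> filter)
    {
        var comparison = (BinaryExpression)filter.Body;
        return comparison.Right.NodeType;
    }

    // Literal constant: the operand is embedded directly as a ConstantExpression.
    public static Expression<Func<Order, bool>> LiteralFilter()
        => x => x.Status == "Shipped";

    // Captured local: the compiler hoists `status` into a closure class,
    // so the operand becomes a field access on that closure instance.
    public static Expression<Func<Order, bool>> CapturedFilter()
    {
        var status = "Shipped";
        return x => x.Status == status;
    }

    public static void Main()
    {
        Console.WriteLine(RightOperandNodeType(LiteralFilter()));   // Constant
        Console.WriteLine(RightOperandNodeType(CapturedFilter()));  // MemberAccess
    }
}
```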

Field (1) — TranslateExpressionToField:

Benchmark	Pattern	Notes
`FieldSelection`	`x => x.Items[0].ProductId`	`GetItemMethodToFilterFieldTranslator` path. `List<T>[0]` is a `MethodCallExpression` in C#, not `IndexExpression`.
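The note about `List<T>[0]` can be checked with a few lines of plain C#. The sketch below uses a hypothetical `Order` type with a list and an array property (the benchmark's real document type is not shown here) and inspects the lambda bodies the compiler produces for each indexer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

public static class IndexerShapeDemo
{
    // Hypothetical document type, for illustration only.
    public sealed class Order
    {
        public List<string> Items { get; set; } = new List<string>();
        public string[] Tags { get; set; } = Array.Empty<string>();
    }

    // Indexing a List<T> compiles to a call to the indexer's get_Item accessor.
    public static Expression ListIndexBody()
    {
        Expression<Func<Order, string>> selector = x => x.Items[0];
        return selector.Body;
    }

    // Indexing an array compiles to a binary ArrayIndex node instead.
    public static Expression ArrayIndexBody()
    {
        Expression<Func<Order, string>> selector = x => x.Tags[0];
        return selector.Body;
    }

    public static void Main()
    {
        var listBody = ListIndexBody();
        Console.WriteLine(listBody.NodeType);                             // Call
        Console.WriteLine(((MethodCallExpression)listBody).Method.Name);  // get_Item
        Console.WriteLine(ArrayIndexBody().NodeType);                     // ArrayIndex
    }
}
```

In the benchmark's `x.Items[0].ProductId` pattern, the same method-call node would sit one level down, as the inner expression of the outer member access.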

Projections (2) — TranslateExpressionToProjection:

Benchmark	Pattern	Notes
`AggregationProjection`	4-field DTO with computed `Subtotal + Tax - Discount` and `Items.Select(i => i.ProductId)`	Full `ExpressionToAggregationExpressionTranslator` + `AstSimplifier`
`ProjectionSentinel`	`x => x`	Fast-path early return. Sentinel for breakage detection, not for translator perf

Update (1) — TranslateExpressionToSetStage:

Benchmark	Pattern	Notes
`UpdatePipeline`	Sets 3 fields including computed expression	`ExpressionToSetStageTranslator`, MemberInit pattern matching

IQueryable (2) — ExpressionToExecutableQueryTranslator.Translate:

Benchmark	Pattern	Notes
`QueryablePipeline`	`Where → Select → OrderBy → Take`	Filter + projection + sort + limit stages
`GroupByAggregation`	`GroupBy → Select` with Count + Sum	`GroupByMethodToPipelineTranslator`, `$group`, `IGroupingSerializer`, accumulators

End-to-end suite (12 benchmarks)

Six patterns (MultiFieldSearch, OrFilter, GroupBy, Projection, InFilter, PagedQuery), each run twice: once written in LINQ, once as a hand-built BsonDocument / pipeline. Seeds 500 documents plus secondary indexes on Status, CreatedAt, and ShippingAddress.City into a local mongod in [GlobalSetup]. The LINQ and Raw versions of each pair render to byte-equivalent BSON (verified via LinqProviderAdapter), so LINQ−Raw delta isolates translator + provider overhead from query-shape differences. Useful for the "how much of the user-visible latency is translation?" question. Not part of the regression-alert path.

Validation

Five lines of evidence that the suite produces actionable signal.

1. Within-run noise

Across all perf-hw runs (n=10 each, multiple commits), BDN-reported within-run StdDev is <1% on every non-sentinel benchmark (typically 0.2-0.9%). The ProjectionSentinel and FieldSelection micro-benchmarks land at 0.3-1.4%. Each individual run is a well-converged measurement.

2. Selectivity — targeted regression injection

Thread.SpinWait(300) (~10 µs on M1) injected into four specific translator code paths in turn:

Injection point	Expected affected benchmarks
`SerializerFinder.FindSerializers()`	All non-sentinel (universal path)
`GetItemMethodToFilterFieldTranslator`	`FieldSelection` only
`NotExpressionToFilterTranslator`	`MultiFieldSearch` (Not), `OrFilter` (chains Comparison dispatch)
`GroupByMethodToPipelineTranslator`	`GroupByAggregation` only

Result: each injection moved only the benchmarks that should have moved. ProjectionSentinel stayed flat across all injections. The suite has clean per-path selectivity — when something regresses, the benchmarks that move tell you which translator moved.

3. Cross-run drift on the actual perf-job hardware (n=10 on `rhel90-dbx-perf-large`)

Submitted via evergreen patch against test-csharp-spec-benchmarks with the runner looping 10× through --anyCategories "LinqBench". .NET 8.0, X64 RyuJIT. All runs in a single perf-task invocation on the same host.

Benchmark	Min	Median	Max	Range%	CV%	Within-run StdDev%
`MultiFieldSearch`	494.0 µs	497.2 µs	507.8 µs	2.8%	1.03%	0.23%
`OrFilter`	13.3 µs	13.5 µs	13.7 µs	3.1%	1.01%	0.23%
`BatchLookup`	144.9 µs	145.8 µs	148.1 µs	2.2%	0.74%	0.88%
`ArrayElementQuery`	177.4 µs	179.4 µs	181.4 µs	2.2%	0.77%	0.51%
`FieldSelection`	5,610 ns	5,795 ns	6,025 ns	7.1%	2.30%	0.32%
`AggregationProjection`	406.5 µs	412.6 µs	420.1 µs	3.3%	1.07%	0.47%
`ProjectionSentinel`	36.18 ns	37.22 ns	38.17 ns	5.3%	1.47%	0.54%
`UpdatePipeline`	95.5 µs	96.3 µs	98.4 µs	3.0%	1.06%	0.59%
`QueryablePipeline`	874.5 µs	890.7 µs	913.6 µs	4.4%	1.38%	0.52%
`GroupByAggregation`	489.7 µs	495.1 µs	506.6 µs	3.4%	0.94%	0.77%

Eight of the ten benchmarks land in a 2.2-4.4% range with CV ≤1.4%. That's a ~5× compression of the M1 drift bands we'd characterized earlier (which sat at 12-38% range on the heavier benchmarks). Only FieldSelection (~6 µs benchmark, ~7% range, CV 2.3%) and ProjectionSentinel (~37 ns, ~5% range) drift wider; both are fast enough that small absolute drift looks proportionally large.

Two consequences:

Thresholds can be 5-10× tighter on perf-hw than on M1 (see below).
Sub-10% deltas are now individually resolvable. That matters for Set B (in-flight optimization) where M1 noise swamped the signal.

For reference, M1 Max characterization on the same suite (n=7) showed ranges of 5-30%, with the heavier IQueryable benchmarks at CV 8-11%. M1 numbers are kept as a development-machine guide; perf-hw numbers are what we'd actually alert on.
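The Range% column in the drift table is consistent with (Max − Min) / Median, which reproduces the listed values to within rounding of the displayed inputs. The sketch below encodes that formula and also shows one common definition of CV% (sample standard deviation over mean). Both definitions are assumptions here: the description does not state exactly how the columns were computed, and the per-run samples needed to reproduce CV% are not included.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class DriftStats
{
    // Assumed definition: spread between fastest and slowest run, relative to the median.
    public static double RangePercent(double min, double median, double max)
        => (max - min) / median * 100.0;

    // Assumed definition: sample standard deviation (n - 1) divided by the mean.
    public static double CoefficientOfVariationPercent(double[] samples)
    {
        if (samples.Length < 2) throw new ArgumentException("Need at least two samples.", nameof(samples));
        var mean = samples.Average();
        var sumOfSquares = samples.Sum(s => (s - mean) * (s - mean));
        var stdDev = Math.Sqrt(sumOfSquares / (samples.Length - 1));
        return stdDev / mean * 100.0;
    }

    public static void Main()
    {
        // MultiFieldSearch row: Min 494.0 µs, Median 497.2 µs, Max 507.8 µs.
        var range = RangePercent(494.0, 497.2, 507.8);
        Console.WriteLine(range.ToString("F1", CultureInfo.InvariantCulture)); // 2.8
    }
}
```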

4. Cross-commit reality check on perf-hardware — does the suite catch real translator changes?

The suite was transplanted onto pinned commits and run on rhel90-dbx-perf-large (n=10 per commit). Set A captures a known historical regression (the SerializerFinder introduction in #1700). Set B captures an in-flight optimization (PR #1961). Together these answer "does the suite detect what we want it to detect?"

Set A — SerializerFinder cost-of-introduction. Parent of #1700 (46640eac98) vs the #1700 merge (59c9d34180). Both commits are from 2026-02-09; the only meaningful diff between them is the SerializerFinder introduction.

Benchmark	Pre-SF median	Post-SF median	Δ time	Pre-SF alloc	Post-SF alloc	Δ alloc
`MultiFieldSearch`	458.8 µs	497.8 µs	+8.5%	24.7 KB	27.1 KB	+10.0%
`OrFilter`	5,245 ns	13,100 ns	+150.7%	4.2 KB	5.5 KB	+31.3%
`BatchLookup`	120.0 µs	146.4 µs	+22.0%	7.6 KB	11.5 KB	+50.6%
`ArrayElementQuery`	150.2 µs	179.5 µs	+19.5%	8.7 KB	12.0 KB	+38.8%
`FieldSelection`	1,989 ns	5,644 ns	+183.8%	1.5 KB	2.3 KB	+49.7%
`AggregationProjection`	141.5 µs	408.1 µs	+188.3%	27.1 KB	67.0 KB	+147.5%
`ProjectionSentinel`	37.5 ns	40.3 ns	+7.5%	32 B	32 B	—
`UpdatePipeline`	90.6 µs	658.0 µs	+626.1%	9.1 KB	91.2 KB	+897.7%
`QueryablePipeline`	350.3 µs	880.7 µs	+151.4%	33.8 KB	105.6 KB	+212.2%
`GroupByAggregation`	158.3 µs	489.3 µs	+209.1%	28.7 KB	75.1 KB	+161.8%

Every non-sentinel benchmark moves above the 2-7% drift band. ProjectionSentinel correctly stays close to flat (the fast-path detection bypasses SerializerFinder entirely). The smallest non-sentinel delta is MultiFieldSearch at +8.5%/+10.0%, which still clears its 2.8% drift band cleanly. Most benchmarks cluster around 2-3× regressions, consistent with adding a per-translation serializer-discovery pass over the full expression tree.

UpdatePipeline is a structural outlier (+626% time, +898% allocation; 9 KB → 91 KB), and the suite revealed why. Reading the two commits side by side: TranslateExpressionToFilter, TranslateExpressionToProjection, and ExpressionToExecutableQueryTranslator.Translate all preprocess the lambda once at the top (LinqExpressionPreprocessor.Preprocess(expression)) and then let TranslationContext.Create run SerializerFinder on the canonicalized tree. TranslateExpressionToSetStage does it the other way around: it runs SerializerFinder on the raw lambda first and only preprocesses each field's value expression later, inside ExpressionToSetStageTranslator.TranslateNewWithOptionalMemberInitializers.

Since SerializerFinder is a fixed-point visitor that loops until it stops making progress, running it over an un-preprocessed tree (partial-evaluated constants not collapsed, CLR-compat rewrites not applied) costs disproportionately more passes for expressions like our Total = x.Subtotal + x.Tax - x.Discount arithmetic chain. A regression of similar magnitude shows up on M1 (+735%) and perf-hw (+626%), ruling out machine-specific artifacts. This is a real driver asymmetry; a follow-up ticket is noted below. (Note that the design isn't a bug: SetStage's dispatch pattern-matches on body is NewExpression | MemberInitExpression, which PartialEvaluator would collapse if applied at the top. Any fix has to preserve that dispatch shape.)

Set B — In-flight optimization PR #1961 (SerializerFinderVisitMethodCall switch → MethodInfo-keyed lookup table, plus a follow-up Lookup performance optimization commit). Base (66780341e7 = current main) vs head (54973d039a). Same perf-task host, runs interleaved with Set A.

Benchmark	B1 (main) median	B2 (PR head) median	Δ time	B1 alloc	B2 alloc	Δ alloc
`MultiFieldSearch`	497.2 µs	495.0 µs	—	27.1 KB	27.1 KB	—
`OrFilter`	13.5 µs	13.3 µs	-1.5% (noise)	5.5 KB	5.5 KB	—
`BatchLookup`	145.8 µs	145.0 µs	-0.6% (noise)	11.5 KB	11.1 KB	-3.3%
`ArrayElementQuery`	179.4 µs	176.5 µs	-1.6% (noise)	12.0 KB	10.5 KB	-12.4%
`FieldSelection`	5,795 ns	5,612 ns	-3.2% (noise)	2.3 KB	2.3 KB	—
`AggregationProjection`	412.6 µs	409.9 µs	-0.6% (noise)	67.5 KB	67.3 KB	—
`ProjectionSentinel`	37.22 ns	36.72 ns	-1.3% (noise)	32 B	32 B	—
`UpdatePipeline`	96.3 µs	96.7 µs	—	10.2 KB	10.2 KB	—
`QueryablePipeline`	890.7 µs	868.0 µs	-2.6%	106.3 KB	101.6 KB	-4.5%
`GroupByAggregation`	495.1 µs	483.0 µs	-2.4%	75.4 KB	72.9 KB	-3.3%

Honest reading: at perf-hw resolution, the PR delivers measurable allocation wins on four benchmarks, two filter and two IQueryable (ArrayElementQuery -12.4%, QueryablePipeline -4.5%, BatchLookup -3.3%, GroupByAggregation -3.3%), all of which walk method-call subtrees that SerializerFinderVisitMethodCall dispatches through. The two time deltas not already marked as noise (QueryablePipeline -2.6%, GroupByAggregation -2.4%) sit inside the cross-run ranges measured for those benchmarks in section 3 (4.4% and 3.4%): directionally consistent but not slam-dunk on a single n=10 sample. Everything else sits inside noise.

The takeaway for the suite isn't "PR #1961 is great" or "PR #1961 is marginal" — it's that the suite resolves what kind of improvement this is: a real allocation reduction in the method-call dispatch path, with smaller time effects. That distinction was invisible at M1 noise levels (where everything looked like noise or a 5% time improvement). At perf-hw, allocation deltas are the cleaner metric for changes this size.

5. End-to-end overhead — Atlas dev cluster, perf-hardware, n=10

To answer "how much of user-visible LINQ latency is translator cost?", the e2e suite runs six patterns twice each (LINQ-translated vs hand-built BsonDocument / pipeline) against a live Atlas dev cluster from the perf host. 500 documents seeded per run with secondary indexes on Status, CreatedAt, and ShippingAddress.City; dropped at cleanup. Each LINQ-Raw pair renders to byte-equivalent BSON — verified by running each LINQ expression through LinqProviderAdapter and diffing the resulting filter/pipeline — so LINQ−Raw delta reflects translator and provider overhead, not query-shape differences.

Pattern	LINQ median	Raw median	LINQ−Raw	LINQ/Raw	Translator share	LINQ alloc	Raw alloc	Alloc ratio
`MultiFieldSearch`	1,966 µs	1,185 µs	+782 µs	1.66×	39.7%	74.2 KB	43.6 KB	1.70×
`OrFilter`	7,321 µs	7,230 µs	+91 µs	1.01×	1.2%	719.1 KB	711.1 KB	1.01×
`GroupBy`	2,034 µs	1,476 µs	+558 µs	1.38×	27.4%	65.4 KB	20.6 KB	3.18×
`Projection`	2,160 µs	1,051 µs	+1,109 µs	2.05×	51.3%	124.8 KB	35.5 KB	3.52×
`InFilter`	1,100 µs	784 µs	+315 µs	1.40×	28.7%	41.5 KB	28.8 KB	1.44×
`PagedQuery`	1,645 µs	1,213 µs	+432 µs	1.36×	26.3%	63.3 KB	34.8 KB	1.82×

Translator share is the fraction of user-visible LINQ time that disappears if you write raw BsonDocument instead — an upper bound on translator cost (the LINQ delta also includes provider overhead like cursor construction and command serialization).
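The definition above reduces to (LINQ − Raw) / LINQ. A minimal sketch of that arithmetic, using medians from the table, follows; the class and method names are hypothetical. It also shows how a translator-only regression scales into user-visible latency, under the simplifying assumption that the whole LINQ−Raw delta is translator cost (the paragraph above notes that this is an upper bound).

```csharp
using System;
using System.Globalization;

public static class TranslatorShare
{
    // Fraction of LINQ latency that disappears when the query is hand-built, in percent.
    public static double SharePercent(double linqMicros, double rawMicros)
        => (linqMicros - rawMicros) / linqMicros * 100.0;

    // User-visible slowdown caused by a translator-only regression, in percent.
    // Assumes the entire share is translator cost (an upper bound).
    public static double UserVisibleRegressionPercent(double sharePercent, double translatorRegressionPercent)
        => sharePercent * translatorRegressionPercent / 100.0;

    public static void Main()
    {
        var projection = SharePercent(2160, 1051);   // Projection row
        var orFilter = SharePercent(7321, 7230);     // OrFilter row

        Console.WriteLine(projection.ToString("F1", CultureInfo.InvariantCulture)); // 51.3
        Console.WriteLine(orFilter.ToString("F1", CultureInfo.InvariantCulture));   // 1.2

        // A 10% translator regression on the Projection pattern.
        var visible = UserVisibleRegressionPercent(projection, 10.0);
        Console.WriteLine(visible.ToString("F1", CultureInfo.InvariantCulture));    // 5.1
    }
}
```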

Projection-heavy patterns (Projection 51%, MultiFieldSearch 40%): translator is ~40-50% of user-visible latency on indexed selective queries. Projection is the upper-bound case — LINQ has to translate a Where filter and construct a serializer for the anonymous-type projection. A 10% translator regression here is a ~5% user-visible regression.
Filter / aggregation patterns (GroupBy 27%, InFilter 29%, PagedQuery 26%): translator is ~25-30%. Each does meaningful server-side work ($group, $in, sort-skip-limit) so server time partially offsets translator overhead.
Broad scans (OrFilter 1.2%): translator is ~1% because the 4-way OR matches a large fraction of documents (no index helps), and the result set serializes ~700 KB. A 10% translator regression here is invisible to users.
Allocation ratios highlight LINQ-side overhead independently of time: Projection allocates 3.5× more in LINQ (125 KB vs 36 KB), GroupBy 3.2× more (65 KB vs 21 KB). Projected-type-serializer and IGroupingSerializer construction is allocation-heavy; this is the metric that would catch translator-side allocation regressions even when network time masks the time delta.

Caveat: Atlas dev cluster across the internet from the perf host. Per-run range on individual benchmarks is wide — typically ±15-30%, up to ±50% on patterns where server time dominates (GroupBy Raw, Projection Raw). The translator-share ratios are more stable than the absolute times because LINQ and Raw runs on the same iteration see correlated network noise. Single-batch result, 500 docs — multi-batch cursor iteration may behave differently.

An earlier iteration of this suite (v1, 3 patterns, no secondary indexes, partial LINQ-Raw shape divergences) reported different shares for the overlapping patterns (OrFilter 4.5%, GroupBy 32.6%). The shifts trace to v1's biases — missing indexes inflated server time on selective queries (lowering apparent translator share) and shape inequivalences in MultiFieldSearch and GroupBy Raw added non-translator contributions to LINQ−Raw delta. The v2 numbers above are the defensible ones.

Regression-alert thresholds (perf-hardware-calibrated)

Calibrated to observed drift on rhel90-dbx-perf-large:

Bucket	Threshold	Benchmarks	Rationale
Tight	8%	`MultiFieldSearch`, `BatchLookup`, `ArrayElementQuery`, `AggregationProjection`, `UpdatePipeline`, `QueryablePipeline`, `GroupByAggregation`, `OrFilter`	Observed range 2.2–4.4%. 8% gives ~2× headroom over the worst-observed drift.
Wider	12%	`FieldSelection`	~6µs benchmark; observed range 7.1%.
Sentinel	15%	`ProjectionSentinel`	~37 ns benchmark; observed range 5.3%. Meant to catch fast-path breakage, not measure translator perf.

Allocation thresholds should be even tighter — observed allocation drift is 0-1.2% across all benchmarks. A 5% allocation threshold would catch real allocation regressions while ignoring observed noise.

These replace the M1-calibrated 15%/30%/100% bands used during development.
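One way the buckets above could be applied to a pair of runs is sketched below. This is an illustrative gate only, not the alerting logic used by the perf pipeline: the class name, the fallback to the tight bucket for unlisted benchmarks, and the strict greater-than comparison are all assumptions. The threshold values come from the table and from the 5% allocation suggestion.

```csharp
using System;
using System.Collections.Generic;

public static class RegressionGate
{
    // Per-benchmark overrides from the threshold table; everything else uses the tight bucket.
    private static readonly Dictionary<string, double> TimeThresholdOverrides =
        new Dictionary<string, double>
        {
            ["FieldSelection"] = 12.0,      // Wider bucket
            ["ProjectionSentinel"] = 15.0,  // Sentinel bucket
        };

    private const double TightTimeThresholdPercent = 8.0;
    private const double AllocationThresholdPercent = 5.0;

    public static double DeltaPercent(double baseline, double candidate)
        => (candidate - baseline) / baseline * 100.0;

    public static bool IsTimeRegression(string benchmark, double baselineTime, double candidateTime)
    {
        var threshold = TimeThresholdOverrides.TryGetValue(benchmark, out var value)
            ? value
            : TightTimeThresholdPercent;
        return DeltaPercent(baselineTime, candidateTime) > threshold;
    }

    public static bool IsAllocationRegression(double baselineKb, double candidateKb)
        => DeltaPercent(baselineKb, candidateKb) > AllocationThresholdPercent;

    public static void Main()
    {
        // Set A, MultiFieldSearch: 458.8 µs -> 497.8 µs (+8.5%) exceeds the 8% tight bucket.
        Console.WriteLine(IsTimeRegression("MultiFieldSearch", 458.8, 497.8));   // True
        // Set B, QueryablePipeline: 890.7 µs -> 868.0 µs is an improvement, so no alert.
        Console.WriteLine(IsTimeRegression("QueryablePipeline", 890.7, 868.0));  // False
        // Set A, MultiFieldSearch allocation: 24.7 KB -> 27.1 KB exceeds the 5% threshold.
        Console.WriteLine(IsAllocationRegression(24.7, 27.1));                   // True
    }
}
```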

Follow-ups (not in this PR)

File a ticket to investigate the TranslateExpressionToSetStage preprocessing asymmetry surfaced by Set A. The current design (Preprocess each field's value individually, after dispatching on the un-preprocessed body shape) is intentional — PartialEvaluator would collapse MemberInitExpression if applied at the top — but it causes SerializerFinder to do disproportionate work on arithmetic update expressions. The fix is likely either a partial preprocess that preserves dispatch shape, or a SerializerFinder optimization on un-preprocessed trees. Not a one-line change.
Smaller-magnitude regression detection on real translator changes — current cross-commit experiments validate large deltas (Set A: 100%+) and modest allocation deltas (Set B: 3-12% alloc); sub-3% time-delta detection on real changes hasn't been measured end-to-end. The perf-hw drift bands suggest it's feasible but needs a controlled experiment to confirm.
Possible future additions for coverage gaps: Lookup/Join, Distinct, SelectMany, Cast, $expr fallback paths.

Test plan

dotnet run -c Release -- --filter "*LinqTranslation*" from benchmarks/MongoDB.Driver.Benchmarks/ — all 10 benchmarks run, no build errors, results land in BenchmarkDotNet.Artifacts/results/
dotnet run -c Release -- --filter "*LinqEndToEnd*" — requires a reachable mongod (local or remote), all 12 benchmarks run
dotnet run -c Release -- --driverBenchmarks --anyCategories "LinqBench" — full perf-task path: composite score is emitted for LinqBench
Perf-job dry run on Evergreen via evergreen patch against test-csharp-spec-benchmarks / driver-performance-tests — task succeeds, results uploadable
Cedar/SPS picks up LinqBench composite without dashboard changes (verified by team after merge)

Caveats reviewers should know

Unless explicitly labelled as M1 or Atlas end-to-end, all quantitative numbers in this description are from rhel90-dbx-perf-large runs (n=10 per commit). Cross-run drift is 2-7% on time, 0-1.2% on allocation. Set A and Set B results survive that noise floor as described per-row.
Set A's per-commit comparison required running the benchmark suite at two Feb-2026 commits. The benchmark code was transplanted onto those commits with one minor patch — subPathRoot parameter was added to TranslateExpressionToField between Feb and April, so the FieldSelection benchmark drops that arg at A1/A2. This isolates the comparison to driver code differences, not benchmark code differences.
Set B's PR1961 head commit (54973d039a) is not on mongodb/mongo-csharp-driver main's history. Submitting via evergreen patch-file --base <sha> silently runs the project's mainline task instead — we hit this once. Re-submitted via evergreen patch --uncommitted from a worktree at 54973d039a so the PR1961 commits land inside the patch diff. Documented for anyone who needs to do this in the future.
LinqEndToEndBenchmark uses a real MongoClient. Without a reachable cluster it fails at [GlobalSetup]. The translation suite has no such dependency.

Document benchmark inventory, code path coverage, interpretation guidance, and provisional thresholds. Record targeted injection test results validating selective sensitivity of each benchmark to its target translator code path.

…methods

LinqBench uses translations/second instead of MB/s since there is no data throughput to measure. Add Unit and MetricName to BenchmarkResult so exporters label scores correctly. Change all benchmark methods to return their values to prevent JIT dead-code elimination.
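As a rough illustration of the translations/second unit: a mean time per operation can be turned into a throughput figure by dividing one second by it. The sketch below assumes that simple conversion; the formula actually used in BenchmarkResult is not shown in this description, and the class and method names here are hypothetical.

```csharp
using System;
using System.Globalization;

public static class TranslationThroughput
{
    private const double NanosecondsPerSecond = 1_000_000_000.0;

    // Assumed conversion: one translation per measured operation.
    public static double TranslationsPerSecond(double meanNanoseconds)
    {
        if (meanNanoseconds <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(meanNanoseconds), "Mean time must be positive.");
        }
        return NanosecondsPerSecond / meanNanoseconds;
    }

    public static void Main()
    {
        // OrFilter median on perf hardware: 13.5 µs = 13,500 ns.
        var score = TranslationsPerSecond(13_500);
        Console.WriteLine(score.ToString("F0", CultureInfo.InvariantCulture)); // 74074
    }
}
```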

…d-to-end benchmarks; update regression thresholds

- Redesign LinqTranslationBenchmark.cs: 15 individual feature benchmarks → 10 representative user queries covering distinct translator code paths (MultiFieldSearch, OrStatusFilter, BatchLookup, ArrayElementQuery, FieldSelection, AggregationProjection, ProjectionSentinel, UpdatePipeline, QueryablePipeline, GroupByAggregation)
- Add LinqEndToEndBenchmark.cs: one-off characterization of translation overhead vs pre-built BsonDocument queries on a live collection; not wired into CI
- Update README regression thresholds based on 7-run M1 Max drift characterization: tight bucket (15%) for MultiFieldSearch/UpdatePipeline/BatchLookup/ArrayElementQuery; wider bucket (30%) for OrStatusFilter/FieldSelection/AggregationProjection/QueryablePipeline/GroupByAggregation

BorisDog

Looks very good overall.

BorisDog · 2026-05-20T22:27:22Z

+    // --- OrStatusFilter: simple OR filter ---
+
+    [Benchmark]
+    public List<OrderDocument> OrStatusFilterLinq()


minor: I think we need more descriptive naming, not tied to a specific field name. Something like OrFilterLinq, or FindOrFilterLinq

BorisDog · 2026-05-20T22:30:50Z

+    {
+        var pipeline = PipelineDefinition<OrderDocument, BsonDocument>.Create(_groupByPipeline);
+        return _collection.Aggregate(pipeline).ToList();
+    }


I would also add additional cases covering all of the following:

Select/Projection,

something with arrays, like $in, $all....

OrderBy.ThenBy, Take

ajcvickers · 2026-05-27T11:56:15Z

+    {
+        return _collection.Find(x =>
+            x.Status == _statusFilter &&
+            x.CustomerName.StartsWith(_prefix) &&


Does StartsWith create the exact same regex as in the raw version?

It might make sense to run a first translation somewhere in GlobalSetup and compare the produced MQL, so that if the translation changes in the future the benchmark will throw.

ajcvickers · 2026-05-27T12:02:00Z

+{
+    private const string DatabaseName = "linqbench";
+    private const string CollectionName = "orders";
+    private const int SeedCount = 500;


Consider creating indexes in the database such that server time is minimized and translation time changes will be more apparent.

ajcvickers · 2026-05-27T12:03:57Z

+    public List<BsonDocument> GroupByLinq()
+    {
+        return _collection.Aggregate()
+            .Group(x => x.Status, g => new { Status = g.Key, Count = g.Count(), TotalRevenue = g.Sum(x => x.Total) })


There is projection to an anonymous type here, followed by creation of the BSON document. The raw example does not project to an anonymous type.

Copilot

Pull request overview

Adds a new LINQ-focused benchmark suite to the driver benchmarks project, enabling perf-job tracking and composite scoring for LINQ translation performance (plus an optional end-to-end comparison suite).

Changes:

Introduces LinqTranslationBenchmark (translation-only, no query execution) and LinqEndToEndBenchmark (LINQ vs raw query plans with real DB execution).
Wires new LinqBench category into perf-job filtering and composite-score export (including score units/metric names).
Extends benchmark category constants and composite export output to include LinqBench (and now BulkWriteBench) in AllCategories.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File	Description
evergreen/run-perf-tests.sh	Adds `LinqBench` to Evergreen perf category filter.
benchmarks/MongoDB.Driver.Benchmarks/Linq/README.md	Documents the LINQ benchmark inventory and interpretation guidance.
benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqTranslationBenchmark.cs	Adds translation-focused benchmarks across filter/field/projection/update/IQueryable entry points.
benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqEndToEndBenchmark.cs	Adds end-to-end LINQ vs raw benchmarks that seed data and run against a live server.
benchmarks/MongoDB.Driver.Benchmarks/Exporters/LocalExporter.cs	Emits per-category and per-benchmark units (MB/s vs translations/s).
benchmarks/MongoDB.Driver.Benchmarks/Exporters/EvergreenExporter.cs	Emits per-category and per-benchmark metric names for Evergreen (MB/s vs translations/s).
benchmarks/MongoDB.Driver.Benchmarks/DriverBenchmarkCategory.cs	Adds `LinqBench` and includes it (and `BulkWriteBench`) in composite category list.
benchmarks/MongoDB.Driver.Benchmarks/BenchmarkResult.cs	Adds unit/metric metadata and computes translations/s scoring for `LinqBench`.

+| Benchmark | Expression | Translator path exercised |
+|---|---|---|
+| `MultiFieldSearch` | `x.Status == s && x.CustomerName.StartsWith(prefix) && x.ShippingAddress.City == city && x.CreatedAt > cutoff && !x.IsPaid` | And → Comparison, MethodCall (StartsWith), nested MemberAccess, Not + boolean MemberAccess |
+| `OrStatusFilter` | 4-way `==` OR with literal constants | Or → Comparison. No closures — fastest filter, most sensitive to small regressions |


+    [GlobalSetup]
+    public void Setup()
+    {
+        _client = MongoConfiguration.CreateClient();


+    private void SetupQueryableExpressions(string statusFilter)
+    {
+        var mongoUri = Environment.GetEnvironmentVariable("MONGODB_URI");
+        var settings = mongoUri != null ? MongoClientSettings.FromConnectionString(mongoUri) : new MongoClientSettings();
+        settings.ClusterSource = DisposingClusterSource.Instance;
+        _queryClient = new MongoClient(settings);
+
+        var collection = _queryClient.GetDatabase("linqbench").GetCollection<OrderDocument>("orders");
+        var queryable = collection.AsQueryable();
+


+
+**Allocation changes** are often more actionable than time changes. A new allocation in a hot path is a real regression even if the time delta is within noise. The `[MemoryDiagnoser]` columns (`Gen0`, `Allocated`) make allocation regressions visible.
+
+**`OrStatusFilter`** is the fastest filter (~7µs, ~5x faster than others) because it uses literal constants instead of captured variables, producing a simpler expression tree with less work at every stage. This makes it the most sensitive filter benchmark — small translator regressions that would be lost in the noise on slower benchmarks show up clearly here.


+| Bucket | Threshold | Benchmarks | Observed range |
+|---|---|---|---|
+| Tight | 15% | `MultiFieldSearch`, `UpdatePipeline`, `BatchLookup`, `ArrayElementQuery` | 9–15% |
+| Wider | 30% | `OrStatusFilter`, `FieldSelection`, `AggregationProjection`, `QueryablePipeline`, `GroupByAggregation` | 20–29% |
+| Sentinel | 100% | `ProjectionSentinel` | 5% |
+
+`OrStatusFilter` and `FieldSelection` land in the wider bucket for different reasons: `OrStatusFilter` is ~7µs and proportional noise is large at that scale; `FieldSelection` is ~2µs with similar characteristics. The three complex benchmarks (`AggregationProjection`, `QueryablePipeline`, `GroupByAggregation`) show higher drift because they allocate more and exercise more GC pressure.


sanych-sun · 2026-05-28T22:23:32Z

+    public void Cleanup()
+    {
+        _client.DropDatabase(DatabaseName);
+        _client.Dispose();


In v3 MongoDB.Driver this will not unregister the cluster. If we really want to clean up we should follow the clean up guide.

sanych-sun · 2026-05-28T22:28:18Z

+    {
+        return _collection.Find(x =>
+            x.Status == _statusFilter &&
+            x.CustomerName.StartsWith(_prefix) &&


It might make sense to run a first translation somewhere in GlobalSetup and compare the produced MQL, so that if the translation changes in the future the benchmark will throw.
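A minimal sketch of the guard suggested here, assuming the LINQ and Raw queries have already been rendered to MQL strings (for example, JSON text of the rendered filter or pipeline). The class and method names are hypothetical, and in the real suite the inputs would come from rendering through the driver rather than from string literals.

```csharp
using System;

public static class MqlEquivalenceGuard
{
    // Throws if the two rendered queries differ, so a translation change
    // fails the benchmark setup instead of silently skewing the comparison.
    public static void EnsureSameMql(string benchmarkName, string linqMql, string rawMql)
    {
        if (!string.Equals(linqMql, rawMql, StringComparison.Ordinal))
        {
            throw new InvalidOperationException(
                $"{benchmarkName}: LINQ and Raw queries diverged. LINQ: {linqMql} Raw: {rawMql}");
        }
    }

    public static void Main()
    {
        // Identical shapes pass silently.
        EnsureSameMql("InFilter", "{ \"_id\" : { \"$in\" : [1, 2] } }", "{ \"_id\" : { \"$in\" : [1, 2] } }");

        try
        {
            // A changed translation is reported at setup time.
            EnsureSameMql("InFilter", "{ \"_id\" : { \"$in\" : [1, 2] } }", "{ \"_id\" : 1 }");
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

Calling such a check once per LINQ/Raw pair from the setup method keeps the cost out of the measured iterations.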

sanych-sun · 2026-05-28T22:32:52Z

+{
+    private const string DatabaseName = "linqbench";
+    private const string CollectionName = "orders";
+    private const int SeedCount = 500;


Do we really need 500 documents? As far as I understood the point is to measure the LINQ translation overhead, so I would say it should be enough to have even a single document in the collection to reduce the server/network/serialization overhead.

adelinowona requested review from BorisDog, ajcvickers and sanych-sun May 19, 2026 16:00

adelinowona added the maintenance Non-code maintenance (deps, docs, configs, etc.). label May 19, 2026

adelinowona force-pushed the csharp5992 branch 3 times, most recently from ea1069d to 148d1de Compare May 20, 2026 18:15

adelinowona added 6 commits May 20, 2026 15:00

CSHARP-5992: Add LINQ translation benchmark suite

4416bd3

15 benchmarks covering filter, projection, and IQueryable composition translation paths. New LinqBench category added to AllCategories and the perf-test runner. BulkWriteBench also added to AllCategories.

Update OrChainFilter framing and record PartialEvaluator test

4af6c7d

PartialEvaluator injection test showed OrChainFilter is still affected (evaluator traverses all expressions). Reframed as a sensitivity amplifier rather than diagnostic isolator.

Fix composite score metric labels for LinqBench in exporters

7d9cf86

The composite score loop was using default MB/s labels for all categories including LinqBench. Now correctly labels LinqBench composites as translations_per_second.

adelinowona force-pushed the csharp5992 branch from 148d1de to 8e2d08d Compare May 20, 2026 19:00

BorisDog requested changes May 20, 2026

ajcvickers reviewed May 27, 2026

Add Projection, InFilter, PagedQuery e2e pairs with indexes and Raw e…

f0eafea

…quivalence fixes; rename OrStatusFilter to OrFilter

adelinowona changed the title ~~WIP: Add LINQ translation benchmark suite~~ CSHARP-5992: Add LINQ translation benchmark suite May 28, 2026

adelinowona marked this pull request as ready for review May 28, 2026 15:59

adelinowona requested a review from a team as a code owner May 28, 2026 15:59

Copilot AI review requested due to automatic review settings May 28, 2026 15:59

Copilot started reviewing on behalf of adelinowona May 28, 2026 15:59 View session

Copilot AI reviewed May 28, 2026

sanych-sun reviewed May 28, 2026

adelinowona wants to merge 7 commits into mongodb:main from adelinowona:csharp5992


