From 3025988408c802b97ddbb1f4388bb47f6bb12a63 Mon Sep 17 00:00:00 2001
From: Sergey Minaev <5072859+jovfer@users.noreply.github.com>
Date: Fri, 15 May 2026 12:35:05 +0100
Subject: [PATCH 1/4] Document regr_r2 coefficient of determination aggregate

Add a regr_r2(y, x) entry to the finance functions reference, placed
alphabetically between regr_intercept and regr_slope. The section covers
what R-squared measures, supported types, argument order, and the
SQL:2003 edge cases:

- null when X is constant (Sxx = 0)
- 1.0 when Y is constant and X varies (Syy = 0)

It also notes the divergence from corr(y, x), which returns null in the
constant-Y case rather than 1.0.

Three worked examples mirror the surrounding sections: a fleet telemetry
trend-detection query (slope + R-squared as a significance filter), a
basic dataset with R-squared = 0.81, and a GROUP BY example over
sales_data showing a near-perfect linear fit per category. A
null-handling example confirms that pairs with either argument null are
skipped.

Also add the function to the May 2026 reference changelog entry.

Pairs with questdb/questdb#7104.
---
 documentation/changelog.mdx              |   1 +
 documentation/query/functions/finance.md | 137 +++++++++++++++++++++++
 2 files changed, 138 insertions(+)

diff --git a/documentation/changelog.mdx b/documentation/changelog.mdx
index 6c9e379c6..e09ff69f4 100644
--- a/documentation/changelog.mdx
+++ b/documentation/changelog.mdx
@@ -19,6 +19,7 @@ This page tracks significant updates to the QuestDB documentation.
 ### Reference
 
 - Added [`ntile()`](/docs/query/functions/window-functions/reference/#ntile), [`cume_dist()`](/docs/query/functions/window-functions/reference/#cume_dist), and [`nth_value()`](/docs/query/functions/window-functions/reference/#nth_value) window functions
+- Added [`regr_r2()`](/docs/query/functions/finance/#regr_r2) coefficient of determination aggregate
 - Fixed `timestamp_ns` valid range documentation
 - Added precise microsecond-grained timestamp examples for PHP clients
 - Fixed PIVOT syntax: removed redundant `FOR` from multiple pivot statements
diff --git a/documentation/query/functions/finance.md b/documentation/query/functions/finance.md
index d4d22cf09..c192bc1a6 100644
--- a/documentation/query/functions/finance.md
+++ b/documentation/query/functions/finance.md
@@ -340,6 +340,143 @@ Result:
 Only the rows where both x and y are not null are considered in the
 calculation.
 
+## regr_r2
+
+`regr_r2(y, x)` - Calculates the coefficient of determination (R-squared) of
+the linear regression of y on x. R-squared measures how well the regression
+line fits the data, on a scale from 0 (no linear relationship) to 1 (perfect
+linear fit).
+
+- The function requires at least two valid (y, x) pairs to compute a value.
+  - If fewer than two pairs are available, the function returns null.
+- The function returns null when x is constant across all rows (zero variance
+  in the independent variable). This includes the single-row case.
+- The function returns 1 when y is constant and x varies (a horizontal line
+  fits constant y perfectly). This follows the SQL:2003 specification and
+  differs from `corr(y, x)`, which returns null in that case.
+- Supported data types for x and y include `double`, `float`, and `integer`
+  types.
+- The order of arguments in `regr_r2(y, x)` matters.
+  - Ensure that y is the dependent variable and x is the independent variable.
+- `regr_r2(y, x)` equals `corr(y, x) * corr(y, x)` everywhere except the
+  constant-y edge case noted above.
+
+### Calculation
+
+The coefficient of determination $r^2$ is the squared Pearson correlation:
+
+$$
+r^2 = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}
+$$
+
+Where:
+
+- $\bar{x}$ and $\bar{y}$ are the means of x and y across the valid pairs.
+- The numerator is the squared sum of cross-deviations $S_{xy}^2$.
+- The denominator is the product of the sums of squared deviations
+  $S_{xx} \cdot S_{yy}$.
+
+When $S_{xx} = 0$ the function returns null; when $S_{xx} \neq 0$ and
+$S_{yy} = 0$ the function returns 1.
+
+### Arguments
+
+- y: A numeric column representing the dependent variable.
+- x: A numeric column representing the independent variable.
+
+### Return value
+
+Return value type is `double`.
+
+The function returns a value in the range $[0, 1]$. A value close to 1
+indicates a strong linear relationship; a value close to 0 indicates that x
+has little linear predictive power for y.
+
+### Examples
+
+#### Detect a strong trend in fleet telemetry
+
+A common use case is filtering for series whose value is trending with
+statistical confidence, separating real drift from noise:
+
+```questdb-sql
+SELECT robot_id,
+       regr_slope(motor_temp, elapsed_seconds) AS temp_trend,
+       regr_r2(motor_temp, elapsed_seconds) AS r_squared
+FROM telemetry
+WHERE ts > now() - 7d
+GROUP BY robot_id
+HAVING regr_r2(motor_temp, elapsed_seconds) > 0.7
+   AND regr_slope(motor_temp, elapsed_seconds) > 0
+ORDER BY temp_trend DESC;
+```
+
+Only robots whose temperature is rising and whose rise explains more than 70%
+of the variance over the past week are returned.
+
+#### Compute R-squared on a known dataset
+
+Using the same measurements table:
+
+| x   | y   |
+| --- | --- |
+| 1.0 | 2.0 |
+| 2.0 | 3.0 |
+| 3.0 | 5.0 |
+| 4.0 | 4.0 |
+| 5.0 | 6.0 |
+
+```questdb-sql
+SELECT regr_r2(y, x) AS r_squared FROM measurements;
+```
+
+Result:
+
+| r_squared |
+| --------- |
+| 0.81      |
+
+About 81% of the variance in y is explained by a linear relationship with x.
+
+#### Compute R-squared grouped by category
+
+Using the same sales_data table:
+
+| category | advertising_spend | sales |
+| -------- | ---------------- | ----- |
+| A        | 1000             | 15000 |
+| A        | 2000             | 22000 |
+| A        | 3000             | 28000 |
+| B        | 1500             | 18000 |
+| B        | 2500             | 26000 |
+| B        | 3500             | 31000 |
+
+```questdb-sql
+SELECT category, regr_r2(sales, advertising_spend) AS fit_quality
+FROM sales_data
+GROUP BY category;
+```
+
+Both categories show a near-perfect linear relationship between advertising
+spend and sales, with R-squared values close to 1.
+
+#### Handling null values
+
+The function ignores rows where either x or y is null:
+
+```questdb-sql
+SELECT regr_r2(y, x) AS r_squared
+FROM (
+  SELECT 1 AS x, 2 AS y
+  UNION ALL SELECT 2, NULL
+  UNION ALL SELECT NULL, 4
+  UNION ALL SELECT 4, 5
+);
+```
+
+Only the rows where both x and y are not null are considered in the
+calculation.
+
 ## regr_slope
 
 `regr_slope(y, x)` - Calculates the slope of the linear regression line for the

From d0cbd3c460cfc60791431112c9a7d664fe5f4840 Mon Sep 17 00:00:00 2001
From: Sergey Minaev <5072859+jovfer@users.noreply.github.com>
Date: Wed, 3 Jun 2026 17:31:59 +0100
Subject: [PATCH 2/4] Replace regr_r2 examples with demo-validated trips queries

Address @javier review on #449:

- The fleet-telemetry example was not valid QuestDB SQL (fictional
  table, HAVING on aggregates, now() - 7d interval arithmetic). Replace
  all examples with queries against the demo instance `trips` table,
  each verified on https://demo.questdb.io with its exact result.
- Add a grouped fit-quality example over `payment_type` (R-squared as an
  anomaly lens) and a null-handling example with a non-trivial result.
- Fix the y/x argument order in the supported-types and means bullets for
  consistency with the regr_r2(y, x) signature.

Co-Authored-By: Claude Opus 4.8
---
 documentation/query/functions/finance.md | 116 ++++++++++-------------
 1 file changed, 52 insertions(+), 64 deletions(-)

diff --git a/documentation/query/functions/finance.md b/documentation/query/functions/finance.md
index c192bc1a6..ce99a6c40 100644
--- a/documentation/query/functions/finance.md
+++ b/documentation/query/functions/finance.md
@@ -354,7 +354,7 @@ linear fit).
 - The function returns 1 when y is constant and x varies (a horizontal line
   fits constant y perfectly). This follows the SQL:2003 specification and
   differs from `corr(y, x)`, which returns null in that case.
-- Supported data types for x and y include `double`, `float`, and `integer`
+- Supported data types for y and x include `double`, `float`, and `integer`
   types.
 - The order of arguments in `regr_r2(y, x)` matters.
   - Ensure that y is the dependent variable and x is the independent variable.
@@ -371,7 +371,7 @@ $$
 
 Where:
 
-- $\bar{x}$ and $\bar{y}$ are the means of x and y across the valid pairs.
+- $\bar{y}$ and $\bar{x}$ are the means of y and x across the valid pairs.
 - The numerator is the squared sum of cross-deviations $S_{xy}^2$.
 - The denominator is the product of the sums of squared deviations
   $S_{xx} \cdot S_{yy}$.
@@ -394,88 +394,76 @@ has little linear predictive power for y.
 
 ### Examples
 
-#### Detect a strong trend in fleet telemetry
-
-A common use case is filtering for series whose value is trending with
-statistical confidence, separating real drift from noise:
-
-```questdb-sql
-SELECT robot_id,
-       regr_slope(motor_temp, elapsed_seconds) AS temp_trend,
-       regr_r2(motor_temp, elapsed_seconds) AS r_squared
-FROM telemetry
-WHERE ts > now() - 7d
-GROUP BY robot_id
-HAVING regr_r2(motor_temp, elapsed_seconds) > 0.7
-   AND regr_slope(motor_temp, elapsed_seconds) > 0
-ORDER BY temp_trend DESC;
-```
-
-Only robots whose temperature is rising and whose rise explains more than 70%
-of the variance over the past week are returned.
+The following examples use the `trips` table of NYC taxi rides from the
+[QuestDB demo instance](https://demo.questdb.io/), where the fare of a ride is
+largely driven by the distance travelled.
 
-#### Compute R-squared on a known dataset
+#### Measure how well one variable explains another
 
-Using the same measurements table:
+Treating `fare_amount` as the dependent variable y and `trip_distance` as the
+independent variable x measures how much of the fare is explained by distance:
 
-| x   | y   |
-| --- | --- |
-| 1.0 | 2.0 |
-| 2.0 | 3.0 |
-| 3.0 | 5.0 |
-| 4.0 | 4.0 |
-| 5.0 | 6.0 |
-
-```questdb-sql
-SELECT regr_r2(y, x) AS r_squared FROM measurements;
+```questdb-sql title="R-squared of fare against distance" demo
+SELECT regr_r2(fare_amount, trip_distance) AS r2 FROM trips;
 ```
 
-Result:
-
-| r_squared |
-| --------- |
-| 0.81      |
-
-About 81% of the variance in y is explained by a linear relationship with x.
+| r2     |
+| ------ |
+| 0.8562 |
 
-#### Compute R-squared grouped by category
+About 86% of the variation in fare is explained by a straight-line relationship
+with distance. Because `regr_r2(y, x)` is the square of `corr(y, x)`, the same
+value is returned by
+`corr(fare_amount, trip_distance) * corr(fare_amount, trip_distance)`.
 
-Using the same sales_data table:
+#### Compare fit quality across groups
 
-| category | advertising_spend | sales |
-| -------- | ---------------- | ----- |
-| A        | 1000             | 15000 |
-| A        | 2000             | 22000 |
-| A        | 3000             | 28000 |
-| B        | 1500             | 18000 |
-| B        | 2500             | 26000 |
-| B        | 3500             | 31000 |
+R-squared is a useful data-quality lens. Splitting the same regression by
+`payment_type` shows that ordinary metered trips paid by `Card` or `Cash` track
+distance closely, while voided, disputed, and no-charge trips fit far worse:
 
-```questdb-sql
-SELECT category, regr_r2(sales, advertising_spend) AS fit_quality
-FROM sales_data
-GROUP BY category;
+```questdb-sql title="R-squared of fare against distance per payment type" demo
+SELECT payment_type, regr_r2(fare_amount, trip_distance) AS r2
+FROM trips
+ORDER BY r2 DESC;
 ```
 
-Both categories show a near-perfect linear relationship between advertising
-spend and sales, with R-squared values close to 1.
+| payment_type | r2     |
+| ------------ | ------ |
+| Card         | 0.8604 |
+| Cash         | 0.8600 |
+| Unknown      | 0.6502 |
+| Dispute      | 0.4250 |
+| No Charge    | 0.3964 |
+| Voided       | 0.3156 |
+
+The low R-squared values for the anomalous payment types highlight trips whose
+fare does not scale with distance.
 
 #### Handling null values
 
-The function ignores rows where either x or y is null:
+`regr_r2` ignores any row where either argument is null. The query below
+supplies six rows, two of which contain a null, so only the four complete (y, x)
+pairs are used in the calculation:
 
-```questdb-sql
-SELECT regr_r2(y, x) AS r_squared
+```questdb-sql title="Null pairs are ignored"
+SELECT regr_r2(y, x) AS r2
 FROM (
   SELECT 1 AS x, 2 AS y
-  UNION ALL SELECT 2, NULL
-  UNION ALL SELECT NULL, 4
-  UNION ALL SELECT 4, 5
+  UNION ALL SELECT 2, 3
+  UNION ALL SELECT 3, null
+  UNION ALL SELECT null, 5
+  UNION ALL SELECT 4, 4
+  UNION ALL SELECT 5, 6
 );
 ```
 
-Only the rows where both x and y are not null are considered in the
-calculation.
+| r2     |
+| ------ |
+| 0.9257 |
+
+The result matches running the query over only the four complete pairs; the two
+rows that contain a null are skipped.
 
 ## regr_slope
 

From 791897a0462a79c4122b2bd29089ab6c4e524011 Mon Sep 17 00:00:00 2001
From: Sergey Minaev <5072859+jovfer@users.noreply.github.com>
Date: Wed, 3 Jun 2026 17:40:42 +0100
Subject: [PATCH 3/4] Use pure-finance FX examples for regr_r2

This is the finance section, so use finance data and mirror the
neighbouring regr_slope / regr_intercept example structure.

- Replace the NYC-taxi `trips` examples with two examples on the demo
  `market_data_ohlc_1d` FX table, both using regr_r2(close, open) so the
  basic and grouped examples share one regression like the neighbours do.
- The grouped example uses GROUP BY symbol (matching regr_slope) and
  surfaces a real signal: the pegged USDHKD scores far lower than the
  floating majors.
- Keep it to two focused examples; null handling is already covered in
  the function's bullet list.

All values verified on https://demo.questdb.io.

Co-Authored-By: Claude Opus 4.8
---
 documentation/query/functions/finance.md | 90 +++++++++++------------------
 1 file changed, 35 insertions(+), 55 deletions(-)

diff --git a/documentation/query/functions/finance.md b/documentation/query/functions/finance.md
index ce99a6c40..8fb540ca1 100644
--- a/documentation/query/functions/finance.md
+++ b/documentation/query/functions/finance.md
@@ -394,76 +394,56 @@ has little linear predictive power for y.
 
 ### Examples
 
-The following examples use the `trips` table of NYC taxi rides from the
-[QuestDB demo instance](https://demo.questdb.io/), where the fare of a ride is
-largely driven by the distance travelled.
+The following examples use the `market_data_ohlc_1d` table on the
+[QuestDB demo instance](https://demo.questdb.io/), which records the daily
+`open`, `high`, `low`, and `close` price of each FX pair.
 
-#### Measure how well one variable explains another
+#### Measure how well the open explains the close
 
-Treating `fare_amount` as the dependent variable y and `trip_distance` as the
-independent variable x measures how much of the fare is explained by distance:
+Treating the daily `close` as the dependent variable y and the `open` as the
+independent variable x shows how much of the closing price is explained by where
+the pair opened:
 
-```questdb-sql title="R-squared of fare against distance" demo
-SELECT regr_r2(fare_amount, trip_distance) AS r2 FROM trips;
+```questdb-sql title="R-squared of close against open for EURUSD" demo
+SELECT regr_r2(close, open) AS r2
+FROM market_data_ohlc_1d
+WHERE symbol = 'EURUSD';
 ```
 
 | r2     |
 | ------ |
-| 0.8562 |
+| 0.7973 |
 
-About 86% of the variation in fare is explained by a straight-line relationship
-with distance. Because `regr_r2(y, x)` is the square of `corr(y, x)`, the same
-value is returned by
-`corr(fare_amount, trip_distance) * corr(fare_amount, trip_distance)`.
+About 80% of the variation in the EURUSD daily close is explained by a
+straight-line relationship with the open. Because `regr_r2(y, x)` is the square
+of `corr(y, x)`, the same value is returned by
+`corr(close, open) * corr(close, open)`.
 
-#### Compare fit quality across groups
+#### Compare fit quality across instruments
 
-R-squared is a useful data-quality lens. Splitting the same regression by
-`payment_type` shows that ordinary metered trips paid by `Card` or `Cash` track
-distance closely, while voided, disputed, and no-charge trips fit far worse:
+Running the same regression per `symbol` turns R-squared into a
+trending-versus-ranging gauge. Freely floating majors drift from day to day, so
+the open explains most of the close, while the tightly managed USDHKD peg barely
+moves and its small daily fluctuations are mostly noise:
 
-```questdb-sql title="R-squared of fare against distance per payment type" demo
-SELECT payment_type, regr_r2(fare_amount, trip_distance) AS r2
-FROM trips
+```questdb-sql title="R-squared of close against open per pair" demo
+SELECT symbol, regr_r2(close, open) AS r2
+FROM market_data_ohlc_1d
+WHERE symbol IN ('USDCHF', 'EURUSD', 'GBPUSD', 'EURGBP', 'USDHKD')
+GROUP BY symbol
 ORDER BY r2 DESC;
 ```
 
-| payment_type | r2     |
-| ------------ | ------ |
-| Card         | 0.8604 |
-| Cash         | 0.8600 |
-| Unknown      | 0.6502 |
-| Dispute      | 0.4250 |
-| No Charge    | 0.3964 |
-| Voided       | 0.3156 |
+| symbol | r2     |
+| ------ | ------ |
+| USDCHF | 0.8895 |
+| EURUSD | 0.7973 |
+| GBPUSD | 0.6946 |
+| EURGBP | 0.1811 |
+| USDHKD | 0.0564 |
 
-The low R-squared values for the anomalous payment types highlight trips whose
-fare does not scale with distance.
-
-#### Handling null values
-
-`regr_r2` ignores any row where either argument is null. The query below
-supplies six rows, two of which contain a null, so only the four complete (y, x)
-pairs are used in the calculation:
-
-```questdb-sql title="Null pairs are ignored"
-SELECT regr_r2(y, x) AS r2
-FROM (
-  SELECT 1 AS x, 2 AS y
-  UNION ALL SELECT 2, 3
-  UNION ALL SELECT 3, null
-  UNION ALL SELECT null, 5
-  UNION ALL SELECT 4, 4
-  UNION ALL SELECT 5, 6
-);
-```
-
-| r2     |
-| ------ |
-| 0.9257 |
-
-The result matches running the query over only the four complete pairs; the two
-rows that contain a null are skipped.
+The pegged USDHKD scores far lower than the floating pairs, flagging an
+instrument whose daily close is largely disconnected from its open.
 
 ## regr_slope
 

From b7a909e1f025eb3114999c6b4d91b2248a3afabe Mon Sep 17 00:00:00 2001
From: Sergey Minaev <5072859+jovfer@users.noreply.github.com>
Date: Thu, 4 Jun 2026 12:02:34 +0100
Subject: [PATCH 4/4] Drop redundant group by in regr_r2 example

---
 documentation/query/functions/finance.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/documentation/query/functions/finance.md b/documentation/query/functions/finance.md
index 8fb540ca1..ac88e7b9b 100644
--- a/documentation/query/functions/finance.md
+++ b/documentation/query/functions/finance.md
@@ -430,7 +430,6 @@ moves and its small daily fluctuations are mostly noise:
 SELECT symbol, regr_r2(close, open) AS r2
 FROM market_data_ohlc_1d
 WHERE symbol IN ('USDCHF', 'EURUSD', 'GBPUSD', 'EURGBP', 'USDHKD')
-GROUP BY symbol
 ORDER BY r2 DESC;
 ```
 
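
Reviewer note: the figures quoted for the synthetic datasets in patches 1 and 2 (0.81 and 0.9257) can be cross-checked against the formula in the Calculation section with a few lines of Python. This is only a minimal sketch under stated assumptions, not QuestDB's implementation: the helper name `regr_r2`, the `(y, x)` tuple layout, and the use of `None` in place of SQL null are all hypothetical choices made for illustration. The input pairs are copied from the `measurements` table and the null-handling query in the series.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation: one (y, x) tuple per row, None standing in for SQL null.
Pair = Tuple[Optional[float], Optional[float]]


def regr_r2(pairs: Iterable[Pair]) -> Optional[float]:
    """Illustrative R-squared following the documented formula and edge cases."""
    # Skip any pair where either argument is null.
    valid = [(y, x) for y, x in pairs if y is not None and x is not None]
    if len(valid) < 2:
        return None  # fewer than two valid pairs

    n = len(valid)
    mean_y = sum(y for y, _ in valid) / n
    mean_x = sum(x for _, x in valid) / n

    # Sums of squared deviations and of cross-deviations.
    sxx = sum((x - mean_x) ** 2 for _, x in valid)
    syy = sum((y - mean_y) ** 2 for y, _ in valid)
    sxy = sum((x - mean_x) * (y - mean_y) for y, x in valid)

    if sxx == 0:
        return None  # constant x: null
    if syy == 0:
        return 1.0  # constant y with varying x: 1
    return (sxy * sxy) / (sxx * syy)


# (y, x) pairs from the `measurements` example in patch 1.
known = [(2.0, 1.0), (3.0, 2.0), (5.0, 3.0), (4.0, 4.0), (6.0, 5.0)]
# (y, x) pairs from the null-handling example in patch 2.
with_nulls = [(2, 1), (3, 2), (None, 3), (5, None), (4, 4), (6, 5)]

print(round(regr_r2(known), 4))       # 0.81
print(round(regr_r2(with_nulls), 4))  # 0.9257
```

Both printed values agree with the result tables in the patches. The demo-instance figures (`trips`, `market_data_ohlc_1d`) depend on live data and cannot be reproduced this way.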