Merge pull request 'STUD-13 add Quantile' (!528) from qwazer/kmath:STUD-13-quantile into dev

Reviewed-on: 
Reviewed-by: Alexander Nozik <altavir@gmail.com>
This commit is contained in:
Alexander Nozik 2025-05-26 12:19:24 +03:00
commit d2a3fd4fa2
5 changed files with 290 additions and 27 deletions
docs
kmath-stat/src
commonMain/kotlin/space/kscience/kmath/stat
commonTest/kotlin/space/kscience/kmath/stat
jvmTest/kotlin/space/kscience/kmath/stat

@ -11,24 +11,24 @@ There are two subinterfaces of the `Statistic` interface:
## Common statistics and Implementation Status
| Category | Statistic | Description | Implementation Status |
|------------------|-------------------|-------------------------------------|--------------------------------|
| **Basic** | Min | Minimum value | ✅ `ComposableStatistic` |
| | Max | Maximum value | ✅ `ComposableStatistic` |
| | Mean | Arithmetic mean | ✅ `ComposableStatistic` |
| | Sum | Sum of all values | 🚧 Not yet implemented |
| | Product | Product of all values | 🚧 Not yet implemented |
| **Distribution** | Median | Median (50th percentile) | ✅ `BlockingStatistic` |
| | Quantile | Arbitrary percentile (e.g., Q1, Q3) | 🚧 Not yet implemented |
| | Variance | Unbiased sample variance | ✅ `BlockingStatistic` |
| | StandardDeviation | Population standard deviation (σ) | ✅ `BlockingStatistic` |
| | Skewness | Measure of distribution asymmetry | 🚧 *(Requires `ThirdMoment`)* |
| | Kurtosis | Measure of distribution tailedness | 🚧 *(Requires `FourthMoment`)* |
| **Advanced** | GeometricMean | Nth root of product of values | ✅ `ComposableStatistic` |
| | SumOfLogs | Sum of natural logarithms | Does not planned |
| | SumOfSquares | Sum of squared values | 🚧 *(Blocks `Variance`)* |
| **Moments** | FirstMoment | Mean (same as `Mean`) | ✅ *(Alias for `Mean`)* |
| | SecondMoment | Variance (same as `Variance`) | ✅ *(Alias for `Variance`)* |
| | ThirdMoment | Used in skewness calculation | 🚧 Not yet implemented |
| | FourthMoment | Used in kurtosis calculation | 🚧 Not yet implemented |
| **Risk Metrics** | SemiVariance | Downside variance | 🚧 *(Depends on `Variance`)* |
| Category | Statistic | Description | Implementation Status |
|------------------|-------------------|------------------------------------|--------------------------------|
| **Basic** | Min | Minimum value | ✅ `ComposableStatistic` |
| | Max | Maximum value | ✅ `ComposableStatistic` |
| | Mean | Arithmetic mean | ✅ `ComposableStatistic` |
| | Sum | Sum of all values | 🚧 Not yet implemented |
| | Product | Product of all values | 🚧 Not yet implemented |
| **Distribution** | Median | Median (50th percentile) | ✅ `BlockingStatistic` |
| | Quantile | Quantile and percentile | ✅ `BlockingStatistic` |
| | Variance | Unbiased sample variance | ✅ `BlockingStatistic` |
| | StandardDeviation | Population standard deviation (σ) | ✅ `BlockingStatistic` |
| | Skewness | Measure of distribution asymmetry | 🚧 *(Requires `ThirdMoment`)* |
| | Kurtosis | Measure of distribution tailedness | 🚧 *(Requires `FourthMoment`)* |
| **Advanced** | GeometricMean | Nth root of product of values | ✅ `ComposableStatistic` |
| | SumOfLogs | Sum of natural logarithms | Does not planned |
| | SumOfSquares | Sum of squared values | 🚧 *(Blocks `Variance`)* |
| **Moments** | FirstMoment | Mean (same as `Mean`) | ✅ *(Alias for `Mean`)* |
| | SecondMoment | Variance (same as `Variance`) | ✅ *(Alias for `Variance`)* |
| | ThirdMoment | Used in skewness calculation | 🚧 Not yet implemented |
| | FourthMoment | Used in kurtosis calculation | 🚧 Not yet implemented |
| **Risk Metrics** | SemiVariance | Downside variance | 🚧 *(Depends on `Variance`)* |

@ -0,0 +1,97 @@
/*
* Copyright 2018-2025 KMath contributors.
* Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
*/
package space.kscience.kmath.stat
import space.kscience.kmath.misc.sortedWith
import space.kscience.kmath.operations.*
import space.kscience.kmath.structures.Buffer
import space.kscience.kmath.structures.Float64
import kotlin.math.floor
/**
* Compute the quantile of a data [Buffer] at a specified probability [p] on the interval [0,1].
* The class argument [sorted] indicates whether data [Buffer] can be assumed to be sorted;
* if false (the default), then the elements of data [Buffer] will be partially sorted in-place using [comparator].
*
* Samples quantile are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
* where `x[j]` is the j-th order statistic of `v`, `j = floor(n*p + m)`,
* `m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.
*
* By default (`alpha = beta = 1`), quantiles are computed via linear interpolation between the points
* `((k-1)/(n-1), x[k])`, for `k = 1:n` where `n = length(v)`. This corresponds to Definition 7
* of Hyndman and Fan (1996), and is the same as the R and NumPy default.
*
* The keyword arguments `alpha` and `beta` correspond to the same parameters in Hyndman and Fan,
* setting them to different values allows to calculate quantiles with any of the methods 4-9
* defined in this paper:
* - Def. 4: `alpha=0`, `beta=1`
* - Def. 5: `alpha=0.5`, `beta=0.5` (MATLAB default)
* - Def. 6: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
* - Def. 7: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and `PERCENTILE.INC`, Python `'inclusive'`)
* - Def. 8: `alpha=1/3`, `beta=1/3`
* - Def. 9: `alpha=3/8`, `beta=3/8`
*
* # References
* - Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages",
* *The American Statistician*, Vol. 50, No. 4, pp. 361-365
*
* - [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details the different quantile definitions
*
* # Notes
* - Performance can be improved by using a selection algorithm instead of a complete sort.
* - As further improvement, API can be redesigned support "a batch mode" - scenarios when several different quantiles are desired.
*
*/
public class Quantile<T>(
private val field: Field<T>,
private val p: Float64,
private val sorted: Boolean = false,
private val comparator: Comparator<T>, //todo provide default?
private val alpha: Float64 = 1.0,
private val beta: Float64 = alpha,
) : BlockingStatistic<T, T> {
init {
require(p in 0.0..1.0) { "Quantile value must be in [0.0, 1.0] range. Got $p." }
require(alpha in 0.0..1.0) { "alpha parameter must be in [0.0, 1.0] range. Got $alpha." }
require(beta in 0.0..1.0) { "beta parameter must be in [0.0, 1.0] range. Got $beta." }
}
override fun evaluateBlocking(data: Buffer<T>): T {
// adapted from https://github.com/JuliaStats/Statistics.jl/blob/master/src/Statistics.jl#L1045
require(data.size > 0) { "Can't compute percentile of an empty buffer" }
if (data.size == 1) {
return data[0]
}
val n = data.size;
val m = alpha + p * (1 - alpha - beta)
val j = (floor(n*p + m).toInt()).coerceIn(1, n-1)
val aleph = n*p + m;// todo use Math.fma in case of double or float for better accuracy
val gamma = (aleph -j).coerceIn(0.0, 1.0)
var sortedData = data
if (!sorted) {
sortedData = data.sortedWith(comparator)
}
with(field) {
return sortedData[j-1] + gamma*(sortedData[j]-sortedData[j-1])
}
}
public companion object {
public fun evaluate(p: Float64, buffer: Buffer<Float64>): Double = Float64Field.quantile(p).evaluateBlocking(buffer)
}
}
public fun Float64Field.quantile(p: Float64): Quantile<Float64> = Quantile(Float64Field, p, false, naturalOrder())

@ -0,0 +1,145 @@
/*
* Copyright 2018-2025 KMath contributors.
* Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
*/
package space.kscience.kmath.stat
import space.kscience.kmath.operations.DoubleField
import space.kscience.kmath.operations.Float64Field
import space.kscience.kmath.structures.Float64Buffer
import kotlin.math.abs
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertFailsWith
import kotlin.test.assertTrue
internal class QuantileBasicTest {
@Test
fun testBasicQuantile() {
val data = Float64Buffer(1.0, 2.0, 3.0, 4.0)
assertEquals(2.5, Quantile.evaluate(0.5, data))
}
@Test
fun testSingleElement() {
val data = Float64Buffer(1.0)
assertEquals(1.0, Quantile.evaluate(0.5, data))
}
@Test
fun testTwoElements() {
val data = Float64Buffer(1.0, 3.0)
assertEquals(2.0, Quantile.evaluate(0.5, data))
}
@Test
fun testSortedInput() {
val data = Float64Buffer(101) { it.toDouble() } // 0.0 to 100.0
val quantile = Quantile(DoubleField, 0.1, sorted = true, comparator = naturalOrder())
assertEquals(10.0, quantile.evaluateBlocking(data), 1e-6)
}
@Test
fun testReverseSortedInput() {
val data = Float64Buffer(101) { (100 - it).toDouble() } // 100.0 to 0.0
assertEquals(10.0, Quantile.evaluate(0.1, data), 1e-6)
}
// @Test //todo try to fix
// fun testExtremeValues() {
// val data = Float64Buffer(Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY)
// assertTrue(Quantile.evaluate(0.5, data).isInfinite())
//
// val data2 = Float64Buffer(Double.NEGATIVE_INFINITY, 1.0)
// assertTrue(Quantile.evaluate(0.5, data2).isInfinite())
// }
@Test
fun testVerySmallQuantile() {
val data = Float64Buffer(0.0, 1.0)
val result = Quantile.evaluate(1e-18, data)
assertTrue(abs(result) <= 1e-18)
}
@Test
fun testEmptyInput() {
val data = Float64Buffer(0) { 0.0 }
assertFailsWith<IllegalArgumentException> {
Quantile.evaluate(0.5, data)
}
}
@Test
fun testInvalidP() {
assertFailsWith<IllegalArgumentException> {
Quantile(Float64Field, -0.1, comparator = naturalOrder())
}
assertFailsWith<IllegalArgumentException> {
Quantile(Float64Field, 1.1, comparator = naturalOrder())
}
}
@Test
fun testAlphaBetaParameters() {
val v = Float64Buffer(2.0, 3.0, 4.0, 6.0, 9.0, 2.0, 6.0, 2.0, 21.0, 17.0)
// Tests against scipy.stats.mstats.mquantiles method
assertEquals(2.0, Quantile(Float64Field, 0.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v))
assertEquals(2.0, Quantile(Float64Field, 0.2, alpha = 1.0, beta = 1.0, comparator = naturalOrder()).evaluateBlocking(v))
assertEquals(3.4, Quantile(Float64Field, 0.4, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
assertEquals(6.0, Quantile(Float64Field, 0.6, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
assertEquals(15.4, Quantile(Float64Field, 0.8, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
assertEquals(21.0, Quantile(Float64Field, 1.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
}
@Test
fun testRounding() {
val data = Float64Buffer(10) { (it + 1).toDouble() } // 1.0 to 10.0
for (i in 0..9) {
val p = i / 9.0
val expected = (i + 1).toDouble()
assertEquals(expected, Quantile.evaluate(p, data), 1e-6) //todo try to avoid rounding
}
}
// @Test
// fun testNoOverflow() { //todo try to fix
// val data1 = Float64Buffer(-9000.0, 100.0)
// assertEquals(100.0, Quantile.evaluate(1.0, data1))
//
// val data2 = Float64Buffer(-1e20, 100.0)
// assertEquals(100.0, Quantile.evaluate(1.0, data2))
// }
@Test
fun testIncreasingQuantiles() {
val data = Float64Buffer(1.0, 1.0, 1.0 + Double.MIN_VALUE, 1.0 + Double.MIN_VALUE)
val quantiles = (0..99).map { i ->
val p = i / 99.0
Quantile.evaluate(p, data)
}
assertTrue(quantiles.zipWithNext().all { (a, b) -> a <= b })
}
@Test
fun testUnsortedData() {
val data = Float64Buffer(4.0, 9.0, 1.0, 5.0, 7.0, 8.0, 2.0, 3.0, 5.0, 17.0, 11.0)
assertEquals(2.0, Quantile.evaluate(0.1, data))
assertEquals(3.0, Quantile.evaluate(0.2, data))
assertEquals(5.0, Quantile.evaluate(0.4, data))
assertEquals(11.0, Quantile.evaluate(0.9, data))
}
@Test
fun testMedianOdd() {
val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0))
assertEquals(2.0, res)
}
@Test
fun testMedianEven() {
val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0, 4.0))
assertEquals(2.5, res)
}
}

@ -14,19 +14,31 @@ import space.kscience.kmath.samplers.GaussianSampler
internal class CommonsDistributionsTest {
@Test
fun testNormalDistributionSuspend() = runBlocking {
val distribution = GaussianSampler(7.0, 2.0)
val mu = 7.0; val sigma = 2.0
val distribution = GaussianSampler(mu, sigma)
val generator = RandomGenerator.default(1)
val sample = distribution.sample(generator).nextBuffer(1000)
Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
//the first quartile (Q1)
// For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
//the third quartile (Q3)
Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
}
@Test
fun testNormalDistributionBlocking() {
val distribution = GaussianSampler(7.0, 2.0)
val mu = 7.0; val sigma = 2.0
val distribution = GaussianSampler(mu, sigma)
val generator = RandomGenerator.default(1)
val sample = distribution.sample(generator).nextBufferBlocking(1000)
Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
//the first quartile (Q1)
// For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
//the third quartile (Q3)
Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
}
}

@ -59,4 +59,13 @@ internal class MeanStatisticTest {
assertEquals(0.5, average, 1e-2)
}
@Test
fun quantileMatchMean() = runTest {
val chunked = setupData()
val first = runBlocking { chunked.first() }
val mean = Float64Field.mean(first)
val quantile = Quantile.evaluate(p=0.5,first)
assertEquals(mean, quantile, 1e-2)
}
}