Merge pull request 'STUD-13 add Quantile' (!528) from qwazer/kmath:STUD-13-quantile into dev

Reviewed-on: #528 Reviewed-by: Alexander Nozik <altavir@gmail.com>
2025-05-26 12:19:24 +03:00 · 2025-05-26 12:19:24 +03:00 · d2a3fd4fa2
commit d2a3fd4fa2
parent f6f9984c8e 4afd350a4e
5 changed files with 290 additions and 27 deletions
--- a/docs/statistics.md
+++ b/docs/statistics.md
@ -11,24 +11,24 @@ There are two subinterfaces of the `Statistic` interface:

 ## Common statistics and Implementation Status

-| Category         | Statistic         | Description                         | Implementation Status          |
-|------------------|-------------------|-------------------------------------|--------------------------------|
-| **Basic**        | Min               | Minimum value                       | ✅ `ComposableStatistic`        |
-|                  | Max               | Maximum value                       | ✅ `ComposableStatistic`        |
-|                  | Mean              | Arithmetic mean                     | ✅ `ComposableStatistic`        |
-|                  | Sum               | Sum of all values                   | 🚧 Not yet implemented         |
-|                  | Product           | Product of all values               | 🚧 Not yet implemented         |
-| **Distribution** | Median            | Median (50th percentile)            | ✅ `BlockingStatistic`          |
-|                  | Quantile          | Arbitrary percentile (e.g., Q1, Q3) | 🚧 Not yet implemented         |
-|                  | Variance          | Unbiased sample variance            | ✅ `BlockingStatistic`          |
-|                  | StandardDeviation | Population standard deviation (σ)   | ✅ `BlockingStatistic`          |
-|                  | Skewness          | Measure of distribution asymmetry   | 🚧 *(Requires `ThirdMoment`)*  |
-|                  | Kurtosis          | Measure of distribution tailedness  | 🚧 *(Requires `FourthMoment`)* |
-| **Advanced**     | GeometricMean     | Nth root of product of values       | ✅ `ComposableStatistic`        |
-|                  | SumOfLogs         | Sum of natural logarithms           | Does not planned               |
-|                  | SumOfSquares      | Sum of squared values               | 🚧 *(Blocks `Variance`)*       |
-| **Moments**      | FirstMoment       | Mean (same as `Mean`)               | ✅ *(Alias for `Mean`)*         |
-|                  | SecondMoment      | Variance (same as `Variance`)       | ✅ *(Alias for `Variance`)*     |
-|                  | ThirdMoment       | Used in skewness calculation        | 🚧 Not yet implemented         |
-|                  | FourthMoment      | Used in kurtosis calculation        | 🚧 Not yet implemented         |
-| **Risk Metrics** | SemiVariance      | Downside variance                   | 🚧 *(Depends on `Variance`)*   |
+| Category         | Statistic         | Description                        | Implementation Status          |
+|------------------|-------------------|------------------------------------|--------------------------------|
+| **Basic**        | Min               | Minimum value                      | ✅ `ComposableStatistic`        |
+|                  | Max               | Maximum value                      | ✅ `ComposableStatistic`        |
+|                  | Mean              | Arithmetic mean                    | ✅ `ComposableStatistic`        |
+|                  | Sum               | Sum of all values                  | 🚧 Not yet implemented         |
+|                  | Product           | Product of all values              | 🚧 Not yet implemented         |
+| **Distribution** | Median            | Median (50th percentile)           | ✅ `BlockingStatistic`          |
+|                  | Quantile          | Quantile and percentile            | ✅ `BlockingStatistic`          |
+|                  | Variance          | Unbiased sample variance           | ✅ `BlockingStatistic`          |
+|                  | StandardDeviation | Population standard deviation (σ)  | ✅ `BlockingStatistic`          |
+|                  | Skewness          | Measure of distribution asymmetry  | 🚧 *(Requires `ThirdMoment`)*  |
+|                  | Kurtosis          | Measure of distribution tailedness | 🚧 *(Requires `FourthMoment`)* |
+| **Advanced**     | GeometricMean     | Nth root of product of values      | ✅ `ComposableStatistic`        |
+|                  | SumOfLogs         | Sum of natural logarithms          | Does not planned               |
+|                  | SumOfSquares      | Sum of squared values              | 🚧 *(Blocks `Variance`)*       |
+| **Moments**      | FirstMoment       | Mean (same as `Mean`)              | ✅ *(Alias for `Mean`)*         |
+|                  | SecondMoment      | Variance (same as `Variance`)      | ✅ *(Alias for `Variance`)*     |
+|                  | ThirdMoment       | Used in skewness calculation       | 🚧 Not yet implemented         |
+|                  | FourthMoment      | Used in kurtosis calculation       | 🚧 Not yet implemented         |
+| **Risk Metrics** | SemiVariance      | Downside variance                  | 🚧 *(Depends on `Variance`)*   |
--- a/kmath-stat/src/commonMain/kotlin/space/kscience/kmath/stat/Quantile.kt
+++ b/kmath-stat/src/commonMain/kotlin/space/kscience/kmath/stat/Quantile.kt
@ -0,0 +1,97 @@
+/*
+ * Copyright 2018-2025 KMath contributors.
+ * Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
+ */
+
+package space.kscience.kmath.stat
+
+import space.kscience.kmath.misc.sortedWith
+import space.kscience.kmath.operations.*
+import space.kscience.kmath.structures.Buffer
+import space.kscience.kmath.structures.Float64
+import kotlin.math.floor
+
+/**
+ * Compute the quantile of a data [Buffer] at a specified probability [p] on the interval [0,1].
+ * The class argument [sorted] indicates whether data [Buffer] can be assumed to be sorted;
+ * if false (the default), then the elements of data [Buffer] will be partially sorted in-place using [comparator].
+ *
+ * Samples quantile are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
+ * where `x[j]` is the j-th order statistic of `v`, `j = floor(n*p + m)`,
+ * `m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.
+ *
+ * By default (`alpha = beta = 1`), quantiles are computed via linear interpolation between the points
+ * `((k-1)/(n-1), x[k])`, for `k = 1:n` where `n = length(v)`. This corresponds to Definition 7
+ * of Hyndman and Fan (1996), and is the same as the R and NumPy default.
+ *
+ * The keyword arguments `alpha` and `beta` correspond to the same parameters in Hyndman and Fan,
+ * setting them to different values allows to calculate quantiles with any of the methods 4-9
+ * defined in this paper:
+ * - Def. 4: `alpha=0`, `beta=1`
+ * - Def. 5: `alpha=0.5`, `beta=0.5` (MATLAB default)
+ * - Def. 6: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
+ * - Def. 7: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and `PERCENTILE.INC`, Python `'inclusive'`)
+ * - Def. 8: `alpha=1/3`, `beta=1/3`
+ * - Def. 9: `alpha=3/8`, `beta=3/8`
+ *
+ * # References
+ * - Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages",
+ *   *The American Statistician*, Vol. 50, No. 4, pp. 361-365
+ *
+ * - [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details the different quantile definitions
+ *
+ * # Notes
+ * - Performance can be improved by using a selection algorithm instead of a complete sort.
+ * - As further improvement, API can be redesigned support "a batch mode" - scenarios when several different quantiles are desired.
+ *
+ */
+public class Quantile<T>(
+    private val field: Field<T>,
+    private val p: Float64,
+    private val sorted: Boolean = false,
+    private val comparator: Comparator<T>, //todo provide default?
+    private val alpha: Float64 = 1.0,
+    private val beta: Float64 = alpha,
+) : BlockingStatistic<T, T> {
+
+
+    init {
+        require(p in 0.0..1.0) { "Quantile value must be in [0.0, 1.0] range. Got $p." }
+        require(alpha in 0.0..1.0) { "alpha parameter must be in [0.0, 1.0] range. Got $alpha." }
+        require(beta in 0.0..1.0) { "beta parameter must be in [0.0, 1.0] range. Got $beta." }
+    }
+
+    override fun evaluateBlocking(data: Buffer<T>): T {
+
+        // adapted from https://github.com/JuliaStats/Statistics.jl/blob/master/src/Statistics.jl#L1045
+
+        require(data.size > 0) { "Can't compute percentile of an empty buffer" }
+
+        if (data.size == 1) {
+            return data[0]
+        }
+        val n = data.size;
+        val m = alpha + p * (1 - alpha - beta)
+        val j = (floor(n*p + m).toInt()).coerceIn(1, n-1)
+        val aleph = n*p + m;// todo use Math.fma in case of double or float for better accuracy
+        val gamma = (aleph -j).coerceIn(0.0, 1.0)
+
+        var sortedData = data
+        if (!sorted) {
+            sortedData = data.sortedWith(comparator)
+        }
+
+        with(field) {
+             return sortedData[j-1] + gamma*(sortedData[j]-sortedData[j-1])
+        }
+    }
+
+    public companion object {
+        public fun evaluate(p: Float64, buffer: Buffer<Float64>): Double = Float64Field.quantile(p).evaluateBlocking(buffer)
+    }
+
+}
+
+public fun Float64Field.quantile(p: Float64): Quantile<Float64> = Quantile(Float64Field, p, false, naturalOrder())
+
+
--- a/kmath-stat/src/commonTest/kotlin/space/kscience/kmath/stat/QuantileBasicTest.kt
+++ b/kmath-stat/src/commonTest/kotlin/space/kscience/kmath/stat/QuantileBasicTest.kt
@ -0,0 +1,145 @@
+/*
+ * Copyright 2018-2025 KMath contributors.
+ * Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
+ */
+package space.kscience.kmath.stat
+
+import space.kscience.kmath.operations.DoubleField
+import space.kscience.kmath.operations.Float64Field
+import space.kscience.kmath.structures.Float64Buffer
+import kotlin.math.abs
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertFailsWith
+import kotlin.test.assertTrue
+
+internal class QuantileBasicTest {
+
+    @Test
+    fun testBasicQuantile() {
+        val data = Float64Buffer(1.0, 2.0, 3.0, 4.0)
+        assertEquals(2.5, Quantile.evaluate(0.5, data))
+    }
+
+    @Test
+    fun testSingleElement() {
+        val data = Float64Buffer(1.0)
+        assertEquals(1.0, Quantile.evaluate(0.5, data))
+    }
+
+    @Test
+    fun testTwoElements() {
+        val data = Float64Buffer(1.0, 3.0)
+        assertEquals(2.0, Quantile.evaluate(0.5, data))
+    }
+
+    @Test
+    fun testSortedInput() {
+        val data = Float64Buffer(101) { it.toDouble() } // 0.0 to 100.0
+        val quantile = Quantile(DoubleField, 0.1, sorted = true, comparator = naturalOrder())
+        assertEquals(10.0, quantile.evaluateBlocking(data), 1e-6)
+    }
+
+    @Test
+    fun testReverseSortedInput() {
+        val data = Float64Buffer(101) { (100 - it).toDouble() } // 100.0 to 0.0
+        assertEquals(10.0, Quantile.evaluate(0.1, data), 1e-6)
+    }
+
+//    @Test //todo try to fix
+//    fun testExtremeValues() {
+//        val data = Float64Buffer(Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY)
+//        assertTrue(Quantile.evaluate(0.5, data).isInfinite())
+//
+//        val data2 = Float64Buffer(Double.NEGATIVE_INFINITY, 1.0)
+//        assertTrue(Quantile.evaluate(0.5, data2).isInfinite())
+//    }
+
+    @Test
+    fun testVerySmallQuantile() {
+        val data = Float64Buffer(0.0, 1.0)
+        val result = Quantile.evaluate(1e-18, data)
+        assertTrue(abs(result) <= 1e-18)
+    }
+
+    @Test
+    fun testEmptyInput() {
+        val data = Float64Buffer(0) { 0.0 }
+        assertFailsWith<IllegalArgumentException> {
+            Quantile.evaluate(0.5, data)
+        }
+    }
+
+    @Test
+    fun testInvalidP() {
+        assertFailsWith<IllegalArgumentException> {
+            Quantile(Float64Field, -0.1, comparator = naturalOrder())
+        }
+        assertFailsWith<IllegalArgumentException> {
+            Quantile(Float64Field, 1.1, comparator = naturalOrder())
+        }
+    }
+
+    @Test
+    fun testAlphaBetaParameters() {
+        val v = Float64Buffer(2.0, 3.0, 4.0, 6.0, 9.0, 2.0, 6.0, 2.0, 21.0, 17.0)
+
+        // Tests against scipy.stats.mstats.mquantiles method
+        assertEquals(2.0, Quantile(Float64Field, 0.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v))
+        assertEquals(2.0, Quantile(Float64Field, 0.2, alpha = 1.0, beta = 1.0, comparator = naturalOrder()).evaluateBlocking(v))
+        assertEquals(3.4, Quantile(Float64Field, 0.4, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
+        assertEquals(6.0, Quantile(Float64Field, 0.6, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
+        assertEquals(15.4, Quantile(Float64Field, 0.8, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
+        assertEquals(21.0, Quantile(Float64Field, 1.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
+    }
+
+    @Test
+    fun testRounding() {
+        val data = Float64Buffer(10) { (it + 1).toDouble() } // 1.0 to 10.0
+        for (i in 0..9) {
+            val p = i / 9.0
+            val expected = (i + 1).toDouble()
+            assertEquals(expected, Quantile.evaluate(p, data), 1e-6) //todo try to avoid rounding
+        }
+    }
+
+//    @Test
+//    fun testNoOverflow() { //todo try to fix
+//        val data1 = Float64Buffer(-9000.0, 100.0)
+//        assertEquals(100.0, Quantile.evaluate(1.0, data1))
+//
+//        val data2 = Float64Buffer(-1e20, 100.0)
+//        assertEquals(100.0, Quantile.evaluate(1.0, data2))
+//    }
+
+    @Test
+    fun testIncreasingQuantiles() {
+        val data = Float64Buffer(1.0, 1.0, 1.0 + Double.MIN_VALUE, 1.0 + Double.MIN_VALUE)
+        val quantiles = (0..99).map { i ->
+            val p = i / 99.0
+            Quantile.evaluate(p, data)
+        }
+        assertTrue(quantiles.zipWithNext().all { (a, b) -> a <= b })
+    }
+
+    @Test
+    fun testUnsortedData() {
+        val data = Float64Buffer(4.0, 9.0, 1.0, 5.0, 7.0, 8.0, 2.0, 3.0, 5.0, 17.0, 11.0)
+        assertEquals(2.0, Quantile.evaluate(0.1, data))
+        assertEquals(3.0, Quantile.evaluate(0.2, data))
+        assertEquals(5.0, Quantile.evaluate(0.4, data))
+        assertEquals(11.0, Quantile.evaluate(0.9, data))
+    }
+
+    @Test
+    fun testMedianOdd() {
+        val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0))
+        assertEquals(2.0, res)
+    }
+
+    @Test
+    fun testMedianEven() {
+        val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0, 4.0))
+        assertEquals(2.5, res)
+    }
+}
--- a/kmath-stat/src/jvmTest/kotlin/space/kscience/kmath/stat/CommonsDistributionsTest.kt
+++ b/kmath-stat/src/jvmTest/kotlin/space/kscience/kmath/stat/CommonsDistributionsTest.kt
@ -14,19 +14,31 @@ import space.kscience.kmath.samplers.GaussianSampler
 internal class CommonsDistributionsTest {
    @Test
    fun testNormalDistributionSuspend() = runBlocking {
-        val distribution = GaussianSampler(7.0, 2.0)
+        val mu = 7.0; val sigma = 2.0
+        val distribution = GaussianSampler(mu, sigma)
        val generator = RandomGenerator.default(1)
        val sample = distribution.sample(generator).nextBuffer(1000)
-        Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
-        Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
+        Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
+        Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
+        //the first quartile (Q1)
+        // For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
+        Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
+        //the third quartile (Q3)
+        Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
    }

    @Test
    fun testNormalDistributionBlocking() {
-        val distribution = GaussianSampler(7.0, 2.0)
+        val mu = 7.0; val sigma = 2.0
+        val distribution = GaussianSampler(mu, sigma)
        val generator = RandomGenerator.default(1)
        val sample = distribution.sample(generator).nextBufferBlocking(1000)
-        Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
-        Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
+        Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
+        Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
+        //the first quartile (Q1)
+        // For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
+        Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
+        //the third quartile (Q3)
+        Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
    }
 }
--- a/kmath-stat/src/jvmTest/kotlin/space/kscience/kmath/stat/MeanStatisticTest.kt
+++ b/kmath-stat/src/jvmTest/kotlin/space/kscience/kmath/stat/MeanStatisticTest.kt
@ -59,4 +59,13 @@ internal class MeanStatisticTest {
        assertEquals(0.5, average, 1e-2)
    }

+    @Test
+    fun quantileMatchMean() = runTest {
+        val chunked = setupData()
+        val first = runBlocking { chunked.first() }
+        val mean = Float64Field.mean(first)
+        val quantile = Quantile.evaluate(p=0.5,first)
+        assertEquals(mean, quantile, 1e-2)
+    }
+
 }