Merge pull request 'STUD-13 add Quantile' (!528) from qwazer/kmath:STUD-13-quantile into dev
Reviewed-on: #528 Reviewed-by: Alexander Nozik <altavir@gmail.com>
This commit is contained in:
commit
d2a3fd4fa2
docs
kmath-stat/src
commonMain/kotlin/space/kscience/kmath/stat
commonTest/kotlin/space/kscience/kmath/stat
jvmTest/kotlin/space/kscience/kmath/stat
@ -11,24 +11,24 @@ There are two subinterfaces of the `Statistic` interface:
|
||||
|
||||
## Common statistics and Implementation Status
|
||||
|
||||
| Category | Statistic | Description | Implementation Status |
|
||||
|------------------|-------------------|-------------------------------------|--------------------------------|
|
||||
| **Basic** | Min | Minimum value | ✅ `ComposableStatistic` |
|
||||
| | Max | Maximum value | ✅ `ComposableStatistic` |
|
||||
| | Mean | Arithmetic mean | ✅ `ComposableStatistic` |
|
||||
| | Sum | Sum of all values | 🚧 Not yet implemented |
|
||||
| | Product | Product of all values | 🚧 Not yet implemented |
|
||||
| **Distribution** | Median | Median (50th percentile) | ✅ `BlockingStatistic` |
|
||||
| | Quantile | Arbitrary percentile (e.g., Q1, Q3) | 🚧 Not yet implemented |
|
||||
| | Variance | Unbiased sample variance | ✅ `BlockingStatistic` |
|
||||
| | StandardDeviation | Population standard deviation (σ) | ✅ `BlockingStatistic` |
|
||||
| | Skewness | Measure of distribution asymmetry | 🚧 *(Requires `ThirdMoment`)* |
|
||||
| | Kurtosis | Measure of distribution tailedness | 🚧 *(Requires `FourthMoment`)* |
|
||||
| **Advanced** | GeometricMean | Nth root of product of values | ✅ `ComposableStatistic` |
|
||||
| | SumOfLogs | Sum of natural logarithms | Does not planned |
|
||||
| | SumOfSquares | Sum of squared values | 🚧 *(Blocks `Variance`)* |
|
||||
| **Moments** | FirstMoment | Mean (same as `Mean`) | ✅ *(Alias for `Mean`)* |
|
||||
| | SecondMoment | Variance (same as `Variance`) | ✅ *(Alias for `Variance`)* |
|
||||
| | ThirdMoment | Used in skewness calculation | 🚧 Not yet implemented |
|
||||
| | FourthMoment | Used in kurtosis calculation | 🚧 Not yet implemented |
|
||||
| **Risk Metrics** | SemiVariance | Downside variance | 🚧 *(Depends on `Variance`)* |
|
||||
| Category | Statistic | Description | Implementation Status |
|
||||
|------------------|-------------------|------------------------------------|--------------------------------|
|
||||
| **Basic** | Min | Minimum value | ✅ `ComposableStatistic` |
|
||||
| | Max | Maximum value | ✅ `ComposableStatistic` |
|
||||
| | Mean | Arithmetic mean | ✅ `ComposableStatistic` |
|
||||
| | Sum | Sum of all values | 🚧 Not yet implemented |
|
||||
| | Product | Product of all values | 🚧 Not yet implemented |
|
||||
| **Distribution** | Median | Median (50th percentile) | ✅ `BlockingStatistic` |
|
||||
| | Quantile | Quantile and percentile | ✅ `BlockingStatistic` |
|
||||
| | Variance | Unbiased sample variance | ✅ `BlockingStatistic` |
|
||||
| | StandardDeviation | Population standard deviation (σ) | ✅ `BlockingStatistic` |
|
||||
| | Skewness | Measure of distribution asymmetry | 🚧 *(Requires `ThirdMoment`)* |
|
||||
| | Kurtosis | Measure of distribution tailedness | 🚧 *(Requires `FourthMoment`)* |
|
||||
| **Advanced** | GeometricMean | Nth root of product of values | ✅ `ComposableStatistic` |
|
||||
| | SumOfLogs | Sum of natural logarithms | Does not planned |
|
||||
| | SumOfSquares | Sum of squared values | 🚧 *(Blocks `Variance`)* |
|
||||
| **Moments** | FirstMoment | Mean (same as `Mean`) | ✅ *(Alias for `Mean`)* |
|
||||
| | SecondMoment | Variance (same as `Variance`) | ✅ *(Alias for `Variance`)* |
|
||||
| | ThirdMoment | Used in skewness calculation | 🚧 Not yet implemented |
|
||||
| | FourthMoment | Used in kurtosis calculation | 🚧 Not yet implemented |
|
||||
| **Risk Metrics** | SemiVariance | Downside variance | 🚧 *(Depends on `Variance`)* |
|
||||
|
@ -0,0 +1,97 @@
|
||||
/*
|
||||
* Copyright 2018-2025 KMath contributors.
|
||||
* Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
|
||||
*/
|
||||
|
||||
package space.kscience.kmath.stat
|
||||
|
||||
import space.kscience.kmath.misc.sortedWith
|
||||
import space.kscience.kmath.operations.*
|
||||
import space.kscience.kmath.structures.Buffer
|
||||
import space.kscience.kmath.structures.Float64
|
||||
import kotlin.math.floor
|
||||
|
||||
/**
|
||||
* Compute the quantile of a data [Buffer] at a specified probability [p] on the interval [0,1].
|
||||
* The class argument [sorted] indicates whether data [Buffer] can be assumed to be sorted;
|
||||
* if false (the default), then the elements of data [Buffer] will be partially sorted in-place using [comparator].
|
||||
*
|
||||
* Samples quantile are defined by `Q(p) = (1-γ)*x[j] + γ*x[j+1]`,
|
||||
* where `x[j]` is the j-th order statistic of `v`, `j = floor(n*p + m)`,
|
||||
* `m = alpha + p*(1 - alpha - beta)` and `γ = n*p + m - j`.
|
||||
*
|
||||
* By default (`alpha = beta = 1`), quantiles are computed via linear interpolation between the points
|
||||
* `((k-1)/(n-1), x[k])`, for `k = 1:n` where `n = length(v)`. This corresponds to Definition 7
|
||||
* of Hyndman and Fan (1996), and is the same as the R and NumPy default.
|
||||
*
|
||||
* The keyword arguments `alpha` and `beta` correspond to the same parameters in Hyndman and Fan,
|
||||
* setting them to different values allows to calculate quantiles with any of the methods 4-9
|
||||
* defined in this paper:
|
||||
* - Def. 4: `alpha=0`, `beta=1`
|
||||
* - Def. 5: `alpha=0.5`, `beta=0.5` (MATLAB default)
|
||||
* - Def. 6: `alpha=0`, `beta=0` (Excel `PERCENTILE.EXC`, Python default, Stata `altdef`)
|
||||
* - Def. 7: `alpha=1`, `beta=1` (Julia, R and NumPy default, Excel `PERCENTILE` and `PERCENTILE.INC`, Python `'inclusive'`)
|
||||
* - Def. 8: `alpha=1/3`, `beta=1/3`
|
||||
* - Def. 9: `alpha=3/8`, `beta=3/8`
|
||||
*
|
||||
* # References
|
||||
* - Hyndman, R.J and Fan, Y. (1996) "Sample Quantiles in Statistical Packages",
|
||||
* *The American Statistician*, Vol. 50, No. 4, pp. 361-365
|
||||
*
|
||||
* - [Quantile on Wikipedia](https://en.wikipedia.org/wiki/Quantile) details the different quantile definitions
|
||||
*
|
||||
* # Notes
|
||||
* - Performance can be improved by using a selection algorithm instead of a complete sort.
|
||||
* - As further improvement, API can be redesigned support "a batch mode" - scenarios when several different quantiles are desired.
|
||||
*
|
||||
*/
|
||||
public class Quantile<T>(
|
||||
private val field: Field<T>,
|
||||
private val p: Float64,
|
||||
private val sorted: Boolean = false,
|
||||
private val comparator: Comparator<T>, //todo provide default?
|
||||
private val alpha: Float64 = 1.0,
|
||||
private val beta: Float64 = alpha,
|
||||
) : BlockingStatistic<T, T> {
|
||||
|
||||
|
||||
init {
|
||||
require(p in 0.0..1.0) { "Quantile value must be in [0.0, 1.0] range. Got $p." }
|
||||
require(alpha in 0.0..1.0) { "alpha parameter must be in [0.0, 1.0] range. Got $alpha." }
|
||||
require(beta in 0.0..1.0) { "beta parameter must be in [0.0, 1.0] range. Got $beta." }
|
||||
}
|
||||
|
||||
override fun evaluateBlocking(data: Buffer<T>): T {
|
||||
|
||||
// adapted from https://github.com/JuliaStats/Statistics.jl/blob/master/src/Statistics.jl#L1045
|
||||
|
||||
require(data.size > 0) { "Can't compute percentile of an empty buffer" }
|
||||
|
||||
if (data.size == 1) {
|
||||
return data[0]
|
||||
}
|
||||
val n = data.size;
|
||||
val m = alpha + p * (1 - alpha - beta)
|
||||
val j = (floor(n*p + m).toInt()).coerceIn(1, n-1)
|
||||
val aleph = n*p + m;// todo use Math.fma in case of double or float for better accuracy
|
||||
val gamma = (aleph -j).coerceIn(0.0, 1.0)
|
||||
|
||||
var sortedData = data
|
||||
if (!sorted) {
|
||||
sortedData = data.sortedWith(comparator)
|
||||
}
|
||||
|
||||
with(field) {
|
||||
return sortedData[j-1] + gamma*(sortedData[j]-sortedData[j-1])
|
||||
}
|
||||
}
|
||||
|
||||
public companion object {
|
||||
public fun evaluate(p: Float64, buffer: Buffer<Float64>): Double = Float64Field.quantile(p).evaluateBlocking(buffer)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public fun Float64Field.quantile(p: Float64): Quantile<Float64> = Quantile(Float64Field, p, false, naturalOrder())
|
||||
|
||||
|
@ -0,0 +1,145 @@
|
||||
/*
|
||||
* Copyright 2018-2025 KMath contributors.
|
||||
* Use of this source code is governed by the Apache 2.0 license that can be found in the license/LICENSE.txt file.
|
||||
*/
|
||||
package space.kscience.kmath.stat
|
||||
|
||||
import space.kscience.kmath.operations.DoubleField
|
||||
import space.kscience.kmath.operations.Float64Field
|
||||
import space.kscience.kmath.structures.Float64Buffer
|
||||
import kotlin.math.abs
|
||||
import kotlin.test.Test
|
||||
import kotlin.test.assertEquals
|
||||
import kotlin.test.assertFailsWith
|
||||
import kotlin.test.assertTrue
|
||||
|
||||
internal class QuantileBasicTest {
|
||||
|
||||
@Test
|
||||
fun testBasicQuantile() {
|
||||
val data = Float64Buffer(1.0, 2.0, 3.0, 4.0)
|
||||
assertEquals(2.5, Quantile.evaluate(0.5, data))
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testSingleElement() {
|
||||
val data = Float64Buffer(1.0)
|
||||
assertEquals(1.0, Quantile.evaluate(0.5, data))
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testTwoElements() {
|
||||
val data = Float64Buffer(1.0, 3.0)
|
||||
assertEquals(2.0, Quantile.evaluate(0.5, data))
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testSortedInput() {
|
||||
val data = Float64Buffer(101) { it.toDouble() } // 0.0 to 100.0
|
||||
val quantile = Quantile(DoubleField, 0.1, sorted = true, comparator = naturalOrder())
|
||||
assertEquals(10.0, quantile.evaluateBlocking(data), 1e-6)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testReverseSortedInput() {
|
||||
val data = Float64Buffer(101) { (100 - it).toDouble() } // 100.0 to 0.0
|
||||
assertEquals(10.0, Quantile.evaluate(0.1, data), 1e-6)
|
||||
}
|
||||
|
||||
// @Test //todo try to fix
|
||||
// fun testExtremeValues() {
|
||||
// val data = Float64Buffer(Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY)
|
||||
// assertTrue(Quantile.evaluate(0.5, data).isInfinite())
|
||||
//
|
||||
// val data2 = Float64Buffer(Double.NEGATIVE_INFINITY, 1.0)
|
||||
// assertTrue(Quantile.evaluate(0.5, data2).isInfinite())
|
||||
// }
|
||||
|
||||
@Test
|
||||
fun testVerySmallQuantile() {
|
||||
val data = Float64Buffer(0.0, 1.0)
|
||||
val result = Quantile.evaluate(1e-18, data)
|
||||
assertTrue(abs(result) <= 1e-18)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testEmptyInput() {
|
||||
val data = Float64Buffer(0) { 0.0 }
|
||||
assertFailsWith<IllegalArgumentException> {
|
||||
Quantile.evaluate(0.5, data)
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testInvalidP() {
|
||||
assertFailsWith<IllegalArgumentException> {
|
||||
Quantile(Float64Field, -0.1, comparator = naturalOrder())
|
||||
}
|
||||
assertFailsWith<IllegalArgumentException> {
|
||||
Quantile(Float64Field, 1.1, comparator = naturalOrder())
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testAlphaBetaParameters() {
|
||||
val v = Float64Buffer(2.0, 3.0, 4.0, 6.0, 9.0, 2.0, 6.0, 2.0, 21.0, 17.0)
|
||||
|
||||
// Tests against scipy.stats.mstats.mquantiles method
|
||||
assertEquals(2.0, Quantile(Float64Field, 0.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v))
|
||||
assertEquals(2.0, Quantile(Float64Field, 0.2, alpha = 1.0, beta = 1.0, comparator = naturalOrder()).evaluateBlocking(v))
|
||||
assertEquals(3.4, Quantile(Float64Field, 0.4, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
|
||||
assertEquals(6.0, Quantile(Float64Field, 0.6, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
|
||||
assertEquals(15.4, Quantile(Float64Field, 0.8, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
|
||||
assertEquals(21.0, Quantile(Float64Field, 1.0, alpha = 0.0, beta = 0.0, comparator = naturalOrder()).evaluateBlocking(v),1e-6)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testRounding() {
|
||||
val data = Float64Buffer(10) { (it + 1).toDouble() } // 1.0 to 10.0
|
||||
for (i in 0..9) {
|
||||
val p = i / 9.0
|
||||
val expected = (i + 1).toDouble()
|
||||
assertEquals(expected, Quantile.evaluate(p, data), 1e-6) //todo try to avoid rounding
|
||||
}
|
||||
}
|
||||
|
||||
// @Test
|
||||
// fun testNoOverflow() { //todo try to fix
|
||||
// val data1 = Float64Buffer(-9000.0, 100.0)
|
||||
// assertEquals(100.0, Quantile.evaluate(1.0, data1))
|
||||
//
|
||||
// val data2 = Float64Buffer(-1e20, 100.0)
|
||||
// assertEquals(100.0, Quantile.evaluate(1.0, data2))
|
||||
// }
|
||||
|
||||
@Test
|
||||
fun testIncreasingQuantiles() {
|
||||
val data = Float64Buffer(1.0, 1.0, 1.0 + Double.MIN_VALUE, 1.0 + Double.MIN_VALUE)
|
||||
val quantiles = (0..99).map { i ->
|
||||
val p = i / 99.0
|
||||
Quantile.evaluate(p, data)
|
||||
}
|
||||
assertTrue(quantiles.zipWithNext().all { (a, b) -> a <= b })
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testUnsortedData() {
|
||||
val data = Float64Buffer(4.0, 9.0, 1.0, 5.0, 7.0, 8.0, 2.0, 3.0, 5.0, 17.0, 11.0)
|
||||
assertEquals(2.0, Quantile.evaluate(0.1, data))
|
||||
assertEquals(3.0, Quantile.evaluate(0.2, data))
|
||||
assertEquals(5.0, Quantile.evaluate(0.4, data))
|
||||
assertEquals(11.0, Quantile.evaluate(0.9, data))
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testMedianOdd() {
|
||||
val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0))
|
||||
assertEquals(2.0, res)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testMedianEven() {
|
||||
val res = Quantile.evaluate(0.5, Float64Buffer(1.0, 2.0, 3.0, 4.0))
|
||||
assertEquals(2.5, res)
|
||||
}
|
||||
}
|
@ -14,19 +14,31 @@ import space.kscience.kmath.samplers.GaussianSampler
|
||||
internal class CommonsDistributionsTest {
|
||||
@Test
|
||||
fun testNormalDistributionSuspend() = runBlocking {
|
||||
val distribution = GaussianSampler(7.0, 2.0)
|
||||
val mu = 7.0; val sigma = 2.0
|
||||
val distribution = GaussianSampler(mu, sigma)
|
||||
val generator = RandomGenerator.default(1)
|
||||
val sample = distribution.sample(generator).nextBuffer(1000)
|
||||
Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
|
||||
//the first quartile (Q1)
|
||||
// For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
|
||||
Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
|
||||
//the third quartile (Q3)
|
||||
Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun testNormalDistributionBlocking() {
|
||||
val distribution = GaussianSampler(7.0, 2.0)
|
||||
val mu = 7.0; val sigma = 2.0
|
||||
val distribution = GaussianSampler(mu, sigma)
|
||||
val generator = RandomGenerator.default(1)
|
||||
val sample = distribution.sample(generator).nextBufferBlocking(1000)
|
||||
Assertions.assertEquals(7.0, Mean.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(2.0, StandardDeviation.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(mu, Mean.evaluate(sample), 0.2)
|
||||
Assertions.assertEquals(sigma, StandardDeviation.evaluate(sample), 0.2)
|
||||
//the first quartile (Q1)
|
||||
// For a normal distribution with mean \(\mu \) and standard deviation \(\sigma \), the Q1 is approximately \(\mu -0.675\sigma \)
|
||||
Assertions.assertEquals(mu-0.675*sigma, Quantile.evaluate(p=0.25, sample), 1e-1)
|
||||
//the third quartile (Q3)
|
||||
Assertions.assertEquals(mu+0.675*sigma, Quantile.evaluate(p=0.75, sample), 1e-1)
|
||||
}
|
||||
}
|
||||
|
@ -59,4 +59,13 @@ internal class MeanStatisticTest {
|
||||
assertEquals(0.5, average, 1e-2)
|
||||
}
|
||||
|
||||
@Test
|
||||
fun quantileMatchMean() = runTest {
|
||||
val chunked = setupData()
|
||||
val first = runBlocking { chunked.first() }
|
||||
val mean = Float64Field.mean(first)
|
||||
val quantile = Quantile.evaluate(p=0.5,first)
|
||||
assertEquals(mean, quantile, 1e-2)
|
||||
}
|
||||
|
||||
}
|
||||
|
Loading…
x
Reference in New Issue
Block a user