====== Kolmogorov-Smirnov test ======

The Kolmogorov-Smirnov (KS) test is used to test whether a sample, summarized by its [[kb:probstat:empirical_cumulative_distribution|empirical cdf]], follows a particular distribution.

===== Glivenko-Cantelli Theorem (Fundamental theorem of statistics) =====

Let $X_1, ..., X_n \stackrel{iid}{\sim} X$ with true cdf $F(t)$, and let $F_n(t)$ be the empirical cdf of $X_1, ..., X_n$. Then,

$$\sup_{t\in \mathbb{R}}|F_n(t)-F(t)|\xrightarrow[n\rightarrow\infty]{a.s.}0$$

This tells us that as the number of samples increases, the empirical cdf converges to the true cdf uniformly in $t$: the largest gap between the two, taken over all values of $t$, goes to $0$.

===== Asymptotic normality =====

For a fixed $t$, $F_n(t)$ is the average of the i.i.d. Bernoulli variables $\mathbb{1}\{X_i\leq t\}$, each with mean $F(t)$ and variance $F(t)(1-F(t))$, so the central limit theorem gives:

$$\sqrt{n}(F_n(t)-F(t))\xrightarrow[n\to\infty]{(d)}\mathcal{N}(0,F(t)(1-F(t)))$$

===== Donsker's Theorem =====

If $F$ is continuous, then

$$\sqrt{n}\sup_{t\in\mathbb{R}}|F_n(t)-F(t)|\xrightarrow[n\to\infty]{(d)}\sup_{0<t'<1}|\mathbb{B}(t')|$$

where $\mathbb{B}$ is a Brownian bridge on $[0,1]$, a Gaussian process whose value at each $t'$ is distributed as:

$$\mathbb{B}(t') \sim \mathcal{N}(0,t'(1-t'))$$

This matches the variance $F(t)(1-F(t))$ in the asymptotic normality result, with $t'=F(t)$. It is called a Brownian bridge because the values at $t'=0$ and $t'=1$ are pinned at $0$.

===== Hypothesis test setup =====

Let $X_1, ..., X_n$ be i.i.d. random variables with cdf $F$, and let $F^0$ be a continuous cdf. The test tells us whether the sample is consistent with the cdf $F^0$:

$$H_0: F = F^0$$

$$H_1: F \neq F^0$$

Let $F_n$ be the [[kb:probstat:empirical_cumulative_distribution|empirical cdf]] of the sample $X_1, ..., X_n$. If $F=F^0$, then by the Glivenko-Cantelli theorem $F_n(t)\approx F^0(t)$ for all $t\in \mathbb{R}$. The test statistic is:

$$T_n = \sup_{t\in\mathbb{R}}|F_n(t)-F^0(t)|$$

By Donsker's theorem, if $H_0$ is true, then

$$\sqrt{n}T_n \xrightarrow [n\to \infty]{(d)} Z$$

where $Z = \sup_{0<t'<1}|\mathbb{B}(t')|$ is the supremum of the absolute value of a Brownian bridge from $t'=0$ to $t'=1$.
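The limit $Z$ can be explored numerically. The following is a minimal sketch, not a reference implementation. It assumes the standard construction $\mathbb{B}(t') = W(t') - t'W(1)$ from a Brownian motion $W$ sampled on a finite grid, and it compares the simulated tail of $Z$ with the known series $\mathbb{P}[Z > x] = 2\sum_{k\geq 1}(-1)^{k-1}e^{-2k^2x^2}$. Neither the construction nor the series is stated on this page, and the function names are illustrative only.

```python
import math
import random


def brownian_bridge_path(steps, rng):
    """Return B(k/steps) for k = 0..steps, built as B(t) = W(t) - t * W(1)."""
    sd = math.sqrt(1.0 / steps)  # standard deviation of one Brownian increment
    w = [0.0]
    for _ in range(steps):
        w.append(w[-1] + rng.gauss(0.0, sd))
    w1 = w[-1]
    # Subtracting t * W(1) pins the path at 0 for both t = 0 and t = 1.
    return [w[k] - (k / steps) * w1 for k in range(steps + 1)]


def simulate_sup_abs_bridge(paths, steps, seed=0):
    """Draw `paths` grid approximations of Z = sup |B(t)|."""
    rng = random.Random(seed)
    return [
        max(abs(b) for b in brownian_bridge_path(steps, rng))
        for _ in range(paths)
    ]


def kolmogorov_sf(x, terms=100):
    """Tail probability P[Z > x] from the alternating series."""
    if x < 0.2:
        # The series converges slowly here, and the tail is within
        # roughly 1e-12 of 1 anyway.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    x = 1.358  # close to the 95% quantile of Z
    sups = simulate_sup_abs_bridge(paths=2000, steps=200)
    frac = sum(s > x for s in sups) / len(sups)
    print(f"simulated P[Z > {x}] ~ {frac:.3f}")
    print(f"series    P[Z > {x}] = {kolmogorov_sf(x):.4f}")
```

The series gives a tail probability of about $0.05$ at $x = 1.358$. The simulated fraction tends to be slightly smaller, because the maximum over a finite grid underestimates the true supremum.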
The Kolmogorov-Smirnov test with asymptotic level $\alpha$ is defined as:

$$\delta_\alpha=\mathbb{1}\left\{T_n > \frac{q_\alpha}{\sqrt{n}}\right\}$$

where $q_\alpha$ is the ($1-\alpha$) quantile of $Z$. The asymptotic p-value, with $T_n$ taken at its observed value, is:

$$p = \mathbb{P}[Z > \sqrt{n}\,T_n]$$

===== Calculating the KS test statistic =====

Let $X_{(i)}$ be the $i$th smallest sample ($X_{(1)}\leq X_{(2)}\leq ... \leq X_{(n)}$). Then, the formula for the test statistic $T_n$ becomes:

$$T_n=\max_i\left\{\max\left(\left|\frac{i-1}{n}-F^0(X_{(i)})\right|, \left|\frac{i}{n}-F^0(X_{(i)})\right|\right)\right\}$$

This formula checks each sample point (the discontinuities in the [[kb:probstat:empirical_cumulative_distribution|empirical cdf]]), just before and just after the jump, and finds the maximum distance between the empirical cdf and the cdf we are checking against ($F^0$). Checking only these points is enough: $F_n$ is constant between samples and $F^0$ is nondecreasing, so the supremum is attained at a jump.

Under $H_0$, this test statistic is pivotal: because $F^0$ is continuous, the values $F^0(X_i)$ are i.i.d. uniform on $[0,1]$, so the distribution of $T_n$ does not depend on $F^0$. In other words, the same quantiles apply to every KS test, whichever distribution is being tested.

kb/kolmogorov-smirnov_test.txt · Last modified: 2024-04-30 04:03 by 127.0.0.1
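The order-statistic formula for $T_n$ and the asymptotic p-value $\mathbb{P}[Z > \sqrt{n}\,T_n]$ can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation. The names `ks_statistic`, `kolmogorov_sf` and `ks_test` are hypothetical. The tail of $Z$ is evaluated with the known series $2\sum_{k\geq 1}(-1)^{k-1}e^{-2k^2x^2}$, which is not derived on this page.

```python
import math
import random


def ks_statistic(sample, cdf):
    """T_n = max_i max(|(i-1)/n - F0(X_(i))|, |i/n - F0(X_(i))|)."""
    xs = sorted(sample)  # order statistics X_(1) <= ... <= X_(n)
    n = len(xs)
    t_n = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        # Compare F0 with the empirical cdf just before and at the jump.
        t_n = max(t_n, abs((i - 1) / n - f0), abs(i / n - f0))
    return t_n


def kolmogorov_sf(x, terms=100):
    """Tail probability P[Z > x] from the alternating series."""
    if x < 0.2:
        # The series converges slowly here, and the tail is within
        # roughly 1e-12 of 1 anyway.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def ks_test(sample, cdf, alpha=0.05):
    """Return (T_n, asymptotic p-value, reject H0 at level alpha?)."""
    t_n = ks_statistic(sample, cdf)
    p_value = kolmogorov_sf(math.sqrt(len(sample)) * t_n)
    # Rejecting when p < alpha corresponds to T_n > q_alpha / sqrt(n).
    return t_n, p_value, p_value < alpha


def standard_normal_cdf(x):
    """Continuous F0 used for the demo: the N(0, 1) cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
    t_n, p_value, reject = ks_test(sample, standard_normal_cdf)
    print(f"T_n = {t_n:.4f}, p = {p_value:.4f}, reject H0: {reject}")
```

As a hand check, for the sample $0.1, 0.4, 0.7$ tested against the uniform cdf $F^0(t)=t$ on $[0,1]$, the largest gap occurs at $X_{(3)}=0.7$, where $|3/3 - 0.7| = 0.3$, so $T_n = 0.3$. Because the p-value is asymptotic, it is only a rough guide for small $n$.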