
Probability for Data-Science ⑧

F distribution

$$ X_1 \sim \chi^2(n_1),\;\; X_2 \sim \chi^2(n_2) $$

Suppose we have two independent random variables that each follow a chi-squared distribution, as above.

Let $x_1$ and $x_2$ be samples from each distribution. Dividing each by its degrees of freedom, $n_1$ and $n_2$, and taking the ratio gives a value that follows the $F(n_1, n_2)$ distribution.

$$ \dfrac{x_1 / n_1}{x_2/ n_2} \sim F(n_1, n_2) $$
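The definition can be checked numerically with a minimal sketch. The degrees of freedom (5 and 8), the seed, and the number of draws are arbitrary values chosen for illustration; they do not come from the text above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (arbitrary) for reproducibility
n1, n2 = 5, 8                   # example degrees of freedom (assumed values)
N = 20_000                      # number of simulated ratios

# Independent chi-squared draws
x1 = stats.chi2(n1).rvs(size=N, random_state=rng)
x2 = stats.chi2(n2).rvs(size=N, random_state=rng)

# Divide each draw by its degrees of freedom, then take the ratio
ratio = (x1 / n1) / (x2 / n2)

# Kolmogorov-Smirnov distance between the simulated ratios and F(n1, n2);
# a small value means the empirical CDF sits close to the F CDF
ks = stats.kstest(ratio, stats.f(n1, n2).cdf)
print(ks.statistic)
```

A small KS statistic here is only a sanity check that the ratio behaves like an F variable, not a proof.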

Parameters: degrees of freedom (df) $n_1$, $n_2$

Probability density function (pdf) $$ f(x; n_1,n_2) = \dfrac{\sqrt{\dfrac{(n_1\,x)^{n_1}\,\,n_2^{n_2}} {(n_1\,x+n_2)^{n_1+n_2}}}} {x\,\text{B}\!\left(\frac{n_1}{2},\frac{n_2}{2}\right)} $$

$$ \text{for } \; 0 < x < \infty $$

Mean $$\frac{n_2}{n_2-2} \quad (n_2 > 2)$$

Variance
$$ \frac{2\,n_2^2\,(n_1+n_2-2)}{n_1 (n_2-2)^2 (n_2-4)} \quad (n_2 > 4) $$
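The two moment formulas can be compared with the values SciPy reports. The function names `f_mean` and `f_var` and the choice $n_1 = 5$, $n_2 = 10$ are assumptions made for this example. The mean is finite only for $n_2 > 2$ and the variance only for $n_2 > 4$.

```python
from scipy import stats

def f_mean(n2):
    """Mean of F(n1, n2); it does not depend on n1. Requires n2 > 2."""
    return n2 / (n2 - 2)

def f_var(n1, n2):
    """Variance of F(n1, n2). Requires n2 > 4."""
    return 2 * n2**2 * (n1 + n2 - 2) / (n1 * (n2 - 2)**2 * (n2 - 4))

n1, n2 = 5, 10  # example degrees of freedom (assumed values)
mean_scipy, var_scipy = stats.f(n1, n2).stats(moments="mv")

print(f_mean(n2), float(mean_scipy))
print(f_var(n1, n2), float(var_scipy))
```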

Uses of the F distribution

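Used to test whether two distributions have the same variance.
From two independent normal distributions with equal variance, draw $n_1$ and $n_2$ samples, $x_{1,1}, \cdots, x_{1,n_1}$ and $x_{2,1}, \cdots, x_{2,n_2}$. The ratio of the two unbiased sample variances follows $F(n_1 - 1, n_2 - 1)$, because each sample loses one degree of freedom when its mean is estimated. If the population means are known and used in place of the sample means, the ratio follows $F(n_1, n_2)$.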
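A minimal sketch of such a variance test follows. The function name `f_test_equal_variance`, the sample sizes, and the simulated data are assumptions for illustration. It uses unbiased sample variances, so the reference distribution is $F(n_1 - 1, n_2 - 1)$, and it reports a two-sided p-value as twice the smaller tail probability.

```python
import numpy as np
from scipy import stats

def f_test_equal_variance(a, b):
    """Two-sided F test of H0: var(a) == var(b), assuming normal data."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Ratio of unbiased sample variances (ddof=1)
    f_stat = a.var(ddof=1) / b.var(ddof=1)
    # Each sample loses one degree of freedom to its estimated mean
    dist = stats.f(a.size - 1, b.size - 1)
    # Twice the smaller tail probability, capped at 1
    p_value = min(1.0, 2 * min(dist.cdf(f_stat), dist.sf(f_stat)))
    return f_stat, p_value

rng = np.random.default_rng(0)                # arbitrary seed
a = rng.normal(loc=1, scale=4, size=30)       # simulated sample 1
b = rng.normal(loc=2, scale=4, size=35)       # simulated sample 2, same scale
f_stat, p_value = f_test_equal_variance(a, b)
print(f_stat, p_value)
```

The test is sensitive to departures from normality, so in practice it is worth checking that assumption before relying on the p-value.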

# -*- coding: utf-8 -*-

import seaborn as sns
import pandas as pd
import scipy as sp
import scipy.stats  # load the submodule explicitly so that sp.stats is available
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

xx = np.linspace(0.03, 3, 1000)

plt.figure(figsize=(12,6))

# Successive plot calls draw on the same axes by default, so no hold() call is needed.
plt.plot(xx, sp.stats.f(1,1).pdf(xx), label="F(1,1)")
plt.plot(xx, sp.stats.f(2,1).pdf(xx), label="F(2,1)")
plt.plot(xx, sp.stats.f(5,2).pdf(xx), label="F(5,2)")
plt.plot(xx, sp.stats.f(10,1).pdf(xx), label="F(10,1)")
plt.plot(xx, sp.stats.f(20,20).pdf(xx), label="F(20,20)")

# Show the legend.
plt.legend()

[Figure: pdf curves of F(1,1), F(2,1), F(5,2), F(10,1), and F(20,20) for 0.03 ≤ x ≤ 3]

Simulating the F distribution

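Example:

Draw 30 and 35 samples from two independent normal distributions and call them X_1 and X_2.

Compute the ratio of the variances of the two samples. The code below measures each variance around the known population mean and scales by the known standard deviation, so the two sums of squares have exactly 30 and 35 degrees of freedom.

Repeat this 10,000 times to obtain 10,000 F values.

Compare the pdf of F(30,35) with the histogram of the 10,000 F values and the curve fitted to that histogram.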

# Draw samples from the normal distributions
df_1 = 30
df_2 = 35
N = 10000

x_1 = sp.stats.norm(1,4)
x_2 = sp.stats.norm(2,5)

X_1 = x_1.rvs(size=(df_1, N))
X_2 = x_2.rvs(size=(df_2, N))

# Standardize with the known mean and standard deviation (step toward chi-squared values)
X_1_standard = (X_1 - 1) / 4
X_2_standard = (X_2 - 2) / 5

# Square each element of the matrices
X_1_standard_square = X_1_standard**2
X_2_standard_square = X_2_standard**2

# Sum each column to get one chi-squared value per repetition
Chi_1 = X_1_standard_square.sum(axis=0)
Chi_2 = X_2_standard_square.sum(axis=0)

# Compute the F values
f = (Chi_1 / df_1) / (Chi_2 / df_2)
print(f, len(f))

[ 0.65323939  1.08970071  0.60637228 ...,  1.22957238  1.28550739
  0.90096205] 10000

xx = np.linspace(0.01, 5, 1000)  # start just above 0 to avoid log(0) warnings
random_variable = sp.stats.f(df_1, df_2)

# Fit the two degrees of freedom to the simulated values (loc and scale held fixed,
# df_1 and df_2 passed as starting guesses)
dfn_fit, dfd_fit, _, _ = sp.stats.f.fit(f, df_1, df_2, floc=0, fscale=1)

plt.figure(figsize=(12,6))
plt.hist(f, bins=50, density=True, alpha=0.4, label="Histogram")
plt.plot(xx, random_variable.pdf(xx), linewidth=5, label="F({},{})".format(df_1, df_2))
plt.plot(xx, sp.stats.f(dfn_fit, dfd_fit).pdf(xx), label="Fitted F")
plt.legend()

[Figure: histogram of the 10,000 simulated F values with the F(30,35) pdf and the fitted F pdf overlaid]
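The simulation above standardizes with the known population mean. As a hedged variation, the sketch below estimates each mean from the sample instead (unbiased sample variance, `ddof=1`), in which case the ratio should follow $F(n_1 - 1, n_2 - 1)$ when the two population variances are equal. Small sample sizes (3 and 4, chosen only for this illustration) make the difference between the two candidate distributions visible; with 30 and 35 samples the two are nearly indistinguishable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed
n1, n2, N = 3, 4, 20_000         # small sample sizes (assumed values) and repetitions

# Each column is one repetition; both populations share the same scale
X1 = rng.normal(loc=1, scale=4, size=(n1, N))
X2 = rng.normal(loc=2, scale=4, size=(n2, N))

# Ratio of unbiased sample variances, with means estimated from the data
ratio = X1.var(axis=0, ddof=1) / X2.var(axis=0, ddof=1)

# KS distance to F(n1-1, n2-1) and to F(n1, n2)
d_minus_one = stats.kstest(ratio, stats.f(n1 - 1, n2 - 1).cdf).statistic
d_plain = stats.kstest(ratio, stats.f(n1, n2).cdf).statistic
print(d_minus_one, d_plain)
```

The first distance should come out clearly smaller than the second, which is the numerical trace of the one degree of freedom lost per sample.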

t distribution, chi-squared distribution, F distribution

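Chi-squared distribution

Used to build a confidence interval for the variance of a normally distributed variable.

t distribution

Used to test a hypothesis about the mean of a normally distributed population.
Used to test the difference between the means of two normally distributed populations.

If the population variance is known, use the z distribution (standard normal).

If the population variance is unknown, use the t distribution.

F distribution

Expressed as the ratio of two independent chi-squared variables, each divided by its degrees of freedom.
Used to test hypotheses about the variances of two normally distributed populations.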
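The chi-squared and t/z use cases can be sketched in a few lines. The function names `variance_ci` and `mean_ci`, the 95% level, and the simulated data are assumptions for this illustration. `mean_ci` switches between the z and t distributions depending on whether the population standard deviation `sigma` is supplied.

```python
import numpy as np
from scipy import stats

def variance_ci(x, confidence=0.95):
    """Chi-squared confidence interval for the variance of normal data."""
    x = np.asarray(x, dtype=float)
    df = x.size - 1
    s2 = x.var(ddof=1)                      # unbiased sample variance
    alpha = 1 - confidence
    lower = df * s2 / stats.chi2(df).ppf(1 - alpha / 2)
    upper = df * s2 / stats.chi2(df).ppf(alpha / 2)
    return lower, upper

def mean_ci(x, confidence=0.95, sigma=None):
    """Confidence interval for the mean: z if sigma is known, otherwise t."""
    x = np.asarray(x, dtype=float)
    n = x.size
    alpha = 1 - confidence
    if sigma is None:
        # Population variance unknown: t distribution with n - 1 degrees of freedom
        half = stats.t(n - 1).ppf(1 - alpha / 2) * x.std(ddof=1) / np.sqrt(n)
    else:
        # Population variance known: standard normal (z) distribution
        half = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    return x.mean() - half, x.mean() + half

rng = np.random.default_rng(0)              # arbitrary seed
x = rng.normal(loc=1, scale=4, size=30)     # simulated data
print(variance_ci(x))
print(mean_ci(x))            # sigma unknown, t interval
print(mean_ci(x, sigma=4))   # sigma known, z interval
```

For the same standard deviation, the t interval is wider than the z interval, which reflects the extra uncertainty from estimating the variance.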