수식이 깨질 경우 새로고침을 눌러주세요.¶

Probability for Data-Science ④¶

Binomial distribution 이항 분포¶

성공확률이 $\theta$ 인 베르누이 시도를 N번 하는 경우를 생각해 보자.

가장 운이 좋을 때에는 N번 모두 성공할 것이고 가장 운이 나쁜 경우에는 한 번도 성공하지 못할 것이다.

N번 중 성공한 횟수를 확률 변수(X)라고 한다면 X의 값은 0부터 N까지의 정수 중 하나가 될 것이다.

이러한 확률 변수가 따르는 분포는 이항 분포(Binomial distribution)라고 하면 다음과 같이 표시한다.

$$X∼Bin(x;N,\theta)$$

0 또는 1이 나오는 베르누이 확률 분포를 따르는 확률 변수 Y를 가정¶

$$Y ∼ Bern(y;\theta) \;\; \rightarrow \;\; \sum_{i=1}^{N} y_{i}(\theta) ∼ Bin(x;N,\theta)$$$$ x = \sum_{i=1}^{N} y_i (\theta)$$$$X∼Bin(x;N,\theta) \;=\; \binom N x \theta^x (1-\theta)^{N-x}, \;\; (x = 0, 1, \cdots, N)$$

$$ E[x] = N\theta$$$$ Var[x] = N\theta(1-\theta)$$

pf)
$$ \begin{eqnarray*} E[x] &=& E[\sum_{i=1}^N y_i] \\ &=& \sum_{i=1}^N E[y_i] \\ &=& \sum_{i=1}^N \theta \\ &=& N\theta \end{eqnarray*} $$

$$ \begin{eqnarray*} Var[x] &=& Var[\sum_{i=1}^N y_i] \\ &=& \sum_{i=1}^N Var[y_i] \;\; ∵)\;\; 단, y_i들은 \;\; 서로\;\; 독립이므로 \;\;성립함.\\ &=& \sum_{i=1}^N \theta(1-\theta) \\ &=& N \theta(1-\theta) \end{eqnarray*} $$

이항분포의 시뮬레이션¶

# -*- coding: utf-8 -*-

import seaborn as sns
import pandas as pd
import sympy
from sympy.stats import *
import scipy
import matplotlib as mpl
import matplotlib.pylab as plt
import numpy as np
%matplotlib inline

$$X∼Bin(x;10,0.6)$$

# pmf(확률질량함수) 정의
N = 10
p = 0.6
D = sympy.stats.Binomial('x', N, p)
f = density(D).dict
f

{0: 0.000104857600000000,
 1: 0.00157286400000000,
 2: 0.0106168320000000,
 3: 0.0424673280000000,
 4: 0.111476736000000,
 5: 0.200658124800000,
 6: 0.250822656000000,
 7: 0.214990848000000,
 8: 0.120932352000000,
 9: 0.0403107840000000,
 10: 0.00604661760000000}

# 스타일
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(8,5))
    
    ## dict를 막대그래프로 시각화
    plt.bar(range(len(f)), f.values(), align="center")
    
    ## 축 범위 지정
    plt.xlim(-1, 12)
    plt.ylim(0, 0.3)
    
    ## y축 라벨링
    plt.ylabel("P(x)")
    
    ## 타이틀
    plt.title("pmf of Binomial distribution")
    plt.show()

# sympy.stats의 sample 메소드로 100개를 샘플링하
simul = []
for i in range(5000):
    sample(D)
    s = sample(D)
    simul.append(s)

simulation = np.array(simul)
simulation = simulation.astype('int64')
simulation

array([4, 6, 3, ..., 6, 5, 5])

plt.figure(figsize=(8,5))
sns.countplot(simulation)
plt.show()

이론 vs 시뮬레이션¶

# np.bincount → 나오지 않은 숫자까지도 다 포함하기 때문에 사용하는 것이 좋다. / np.unique() → 나오지 않으면 세지 않는다.
## minlength → x값 몇개?(0부터 N까지 이므로 N+1개)
y = np.bincount(simulation, minlength=N+1) / float(len(simulation))
y

array([ 0.    ,  0.0022,  0.0096,  0.0412,  0.1238,  0.2036,  0.2378,
        0.226 ,  0.1136,  0.0378,  0.0044])

# y를 df로
df_simulation = pd.DataFrame({"simulation": y})

# dict 차원이 1인경우 시리즈로 변경후 판다스씌울 것
df_theory = pd.DataFrame(pd.Series(f, name='Theory'), dtype='float64')

# stack() → 컬럼을 인덱스로 내려줘서 아래와 같이 컬럼을 재 조합할 때 사용한다!
df = pd.concat([df_theory, df_simulation], axis=1).stack()
df = df.reset_index()
df.columns = ["value", "type", "ratio"]
df.pivot("value", "type", "ratio")

시각화를 하면 다음과 같다.¶

plt.figure(figsize=(12,8))
sns.barplot(x="value", y="ratio", hue="type", data=df)
plt.show()

scipy를 사용하여 시뮬레이션을 해보자.¶

N = 10
p = 0.6
f = scipy.stats.binom(N, p)
x = np.arange(N + 1)

plt.bar(x, f.pmf(x), align="center")
plt.ylabel("P(x)")
plt.title("pmf of binomial distribution")
plt.show()

np.random.seed(0)
sample = f.rvs(100)
sample

array([ 6,  5,  6,  6,  6,  5,  6,  4,  3,  6,  5,  6,  6,  4,  8,  8,  9,
        5,  5,  4,  3,  5,  6,  5,  8,  5,  8,  4,  6,  6,  7,  5,  6,  6,
        9,  6,  6,  6,  4,  5,  7,  6,  5,  8,  5,  5,  7,  8,  7,  7,  6,
        6,  2,  8,  7,  8,  5,  7,  6,  7,  8,  8,  5,  8,  7,  7,  5,  8,
        4,  8,  3,  6,  3,  6,  5,  9,  7,  8,  7,  8,  7,  6,  8,  5,  6,
        7,  6,  8,  6,  4,  7,  5,  8,  5,  7,  7,  6,  9,  5, 10])

sns.countplot(sample)
plt.show()

scipy가 훨씬 편리하다는 것을 알 수 있다. scipy의 stats를 사용하자!¶

Gaussian Normal Distribution 가우시안 정규 분포¶

연속형 확률 분포 함수

실수의 랜덤 변수 모형에서 가장 일반적으로 사용되는 확률 모형

평균 $\mu$와 분산 $\sigma^2$으로 정의

모든 확률 변수의 샘플평균은 가우시안 정규 분포로 수렴 (Central Limit Theorem)

동일한 표준편차를 가진 분포 중 최대 엔트로피를 가지는 확률 분포

모든 자연 법칙에서 엔트로피는 커지는 경향이 있다. $$ X \sim N(x; \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right) $$

정규 분포의 시뮬레이션¶

# pdf(확률밀도함수) 정의
## 하이퍼 파라미터
mu = 0
std = 1

## pdf 생성
x = np.linspace(-5, 5, 100)
f = scipy.stats.norm(mu, std)

plt.plot(x, f.pdf(x))
plt.ylabel("p(x)")
plt.title("pdf of normal distribution")
plt.show()

sample = f.rvs(1000)
sample;

sns.distplot(sample, kde=False, fit=scipy.stats.norm)
plt.show()

mu_1 = 1
std_1 = 2
x_1 = np.linspace(-7, 9, 1000)
f_1 = scipy.stats.norm(mu_1, std_1)

# 면적을 칠해라.
plt.fill(x_1, f_1.pdf(x_1))
plt.ylim(0, 0.22);

Q-Q Plot¶

Quantile-Quantile Plot

구체적인 정규 분포 검정(normality test)을 사용하기에 앞서 시작적으로 간단하게 정규 분포를 확인할 수 있는 방법

동일 분위수에 해당하는 정상분포의 값과 주어진 분포의 값을 한 쌍으로 만들어 scatter plot을 그린다.

직선의 형태를 띌수록 정상 분포와 가깝다.

Q-Q 플롯은 분석하고자 하는 샘플의 분포과 정규 분포의 분포 형태를 비교하는 시각적 도구

Q-Q 플롯은 동일 분위수에 해당하는 정상 분포의 값과 주어진 분포의 값을 한 쌍으로 만들어 스캐터 플롯(scatter plot)으로 그린 것

대상 샘플을 크기에 따라 정렬(sort)한다.

각 샘플의 분위수(quantile number)를 구한다.

각 샘플의 분위수와 일치하는 분위수를 가지는 정규 분포 값을 구한다.

대상 샘플과 정규 분포 값을 하나의 쌍으로 생각하여 2차원 공간에 하나의 점(point)으로 그린다.

모든 샘플에 대해 2부터 4까지의 과정을 반복하여 스캐터 플롯과 유사한 형태의 플롯을 완성한다.

비교를 위한 45도 직선을 그린다.

Q-Q Plot을 그려보자.¶

# 정규분포에서 sampling(100)
np.random.seed(0)
x = np.random.randn(100)

plt.figure(figsize=(7,7))

## scipy에서 q-q plot을 그려주는 method
scipy.stats.probplot(x, plot=plt)

## x,y축의 간격 똑같게
plt.axis("equal")
plt.show()

위는 시각적으로 정규분포를 따르는 것으로 판단된다.¶

이후 정규성 검정을 실시(Normality Test)

np.random.seed(0)
# 위와 다른점은 정규분포에서 샘플링하지 않았다.
x = np.random.rand(100)

plt.figure(figsize=(7,7))
scipy.stats.probplot(x, plot=plt)
plt.ylim(-0.5, 1.5)
plt.show()

x = scipy.stats.norm(5, 1).rvs(1000)
scipy.stats.probplot(x, plot=plt);
plt.axis("equal")

(-3.5173484449224093,
 3.5173484449224093,
 1.4981502363249461,
 8.488728322669477)

x = scipy.stats.t(3).rvs(1000)
scipy.stats.probplot(x, plot=plt);

위 처럼 시각적으로 판단하고, 정규성 검정을 하도록 하자.¶

중심 극한 정리(Central Limit Theorem)¶

중심 극한 정리는 어떤 분포를 따르는 확률 변수든 지 상관없이 표본의 수가 커지면 이 분포의 합은 정규 분포와 비슷한 분포를 이루는 현상을 말한다.

중심 극한 정리를 수학적인 용어로 쓰면 다음과 같다.

$$ X_1,\;X_2,\;\cdots, X_n의 \;기댓값과 \;\;분산이 \;\mu, \;\sigma^2으로 \;동일한 \;값을 \;갖으며, \;서로 \;독립인 \;확률 \;변수들이라고\; 하자.\; 분포가 \;어떤 \;분포인지는 \;상관없다. $$

이 분포의 합은 다음과 같다.

$$S_n\;=\;X_1+\cdots+X_n, \;\; 여기서 S_n도 \; 확률 \;변수이다.$$$$S_n의\; 분포는\; n이 \;증가할 \;수록 \;다음과\; 같은\; 정규\; 분포에\; 수렴한다.$$$$\frac{S_n - n\mu}{\sqrt{n}} \; \xrightarrow{d} \; N(0, \sigma^2)$$

중심극한 정리가 성립하는지 살펴보자.¶

x = np.linspace(-2, 2, 100)

plt.figure(figsize=(6,9))

# enumerate → 인덱스를 자동으로 넣어줌 / 아래는 (0,1), (1,2), (2,10)의 for 문이 돌아간다.
for i, N in enumerate([1, 2, 10]):
    X = np.random.rand(1000, N) - 0.5
    S = X.sum(axis=1)/np.sqrt(N)
    plt.subplot(3, 2, 2*i+1)
    sns.distplot(S, bins=10, kde=False, norm_hist=True)
    plt.xlim(-2, 2)
    plt.yticks([])
    plt.subplot(3, 2, 2*i+2)
    scipy.stats.probplot(S, plot=plt)
    
plt.tight_layout()
plt.show()

type	Theory	simulation
value
0	0.000105	0.0000
1	0.001573	0.0022
2	0.010617	0.0096
3	0.042467	0.0412
4	0.111477	0.1238
5	0.200658	0.2036
6	0.250823	0.2378
7	0.214991	0.2260
8	0.120932	0.1136
9	0.040311	0.0378
10	0.006047	0.0044