Visualization with Python - seaborn¶

seaborn¶

Seaborn is a visualization package built on matplotlib that adds color themes, statistical charts, and related features.
It depends on the matplotlib package.
Its statistical features depend on the statsmodels package.
documents
- http://stanford.edu/~mwaskom/software/seaborn/index.html

Style¶

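In older seaborn releases (before 0.8), merely importing seaborn replaced matplotlib's default style with seaborn's own default style set; newer releases apply it only when `sns.set()` is called.
As a result, the same matplotlib command can produce a plot that looks different depending on whether the seaborn style is active.

`set` changes the style, and `set_color_codes` remaps matplotlib's shorthand color codes (such as "b" and "r") to the seaborn palette.¶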

import seaborn as sns
sns.set()
sns.set_color_codes()

1D data plots¶

Real values → plots such as histograms
Categorical values → count plot

Load the practice datasets

iris = sns.load_dataset("iris")
titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
flights = sns.load_dataset("flights")

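Plots for real-valued distributions¶

These plots describe how data are distributed. Beyond matplotlib's plain histogram, they offer kernel density curves, rug marks, and multidimensional joint distributions.
The following commands plot real-valued distributions.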
- rugplot
- kdeplot
- distplot
- jointplot
- pairplot

rugplot¶

Draws a short tick on the x-axis at the position of each data point, showing where the actual data lie.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

np.random.seed(0)
x = np.random.randn(100)

sns.rugplot(x)
plt.show()

kdeplot¶

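Kernel density estimation builds a curve by summing small unit curves, called kernels, one centered on each data point.
The result is a smoother distribution curve than a histogram.
For details on kernel density estimation, see the scikit-learn documentation.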
- http://scikit-learn.org/stable/modules/density.html
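
To make the idea of summing kernels concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The helper name `gaussian_kde` and the bandwidth of 0.4 are assumptions chosen for illustration; seaborn picks its bandwidth automatically, so its curve will differ in detail.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point t."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        # One Gaussian bump per sample, all added together.
        return norm * sum(
            math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in samples
        )

    return density


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100)]
density = gaussian_kde(samples, bandwidth=0.4)  # bandwidth picked by hand (assumption)

# Riemann-sum check: a density should integrate to roughly 1.
step = 0.01
grid = [-8.0 + i * step for i in range(1601)]
area = sum(density(t) for t in grid) * step
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths them away.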

sns.kdeplot(x)
plt.show()

distplot¶

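seaborn's distplot is widely used as a replacement for matplotlib's histogram command.
It can also draw the rug and the kernel density curve. Note that seaborn 0.11 and later deprecate distplot in favor of histplot and displot.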
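
When a histogram and a density curve share one axis, the bars have to be on a density scale, so that their total area is 1. The sketch below shows that normalization in plain Python; `density_histogram` is a hypothetical helper written for this note, not a seaborn function.

```python
import random


def density_histogram(values, n_bins):
    """Return (bin_width, heights) where the bar areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value sits on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    # Dividing by (n * width) turns raw counts into densities.
    heights = [c / (len(values) * width) for c in counts]
    return width, heights


random.seed(0)
values = [random.gauss(0.0, 1.0) for _ in range(100)]
width, heights = density_histogram(values, n_bins=10)
total_area = sum(h * width for h in heights)
print(round(total_area, 6))
```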

sns.distplot(x, kde=True, rug=True)
plt.show()

distplot with subplot¶

d = np.random.normal(size=100)
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)

sns.distplot(d, kde=False, color="b", ax=axes[0, 0])
sns.distplot(d, hist=False, rug=True, color="r", ax=axes[0, 1])
# kde_kws={"shade": True} fills the area under the KDE curve.
sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])
sns.distplot(d, color="m", ax=axes[1, 1])
plt.show()

countplot¶

countplot counts the number of rows in each category and draws the counts as bars.
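
The bar heights in a count plot are plain frequency counts. A minimal sketch with the standard library, using made-up day labels rather than the real `tips` data:

```python
from collections import Counter

# Made-up labels standing in for a categorical column such as tips["day"].
days = ["Sun", "Sat", "Sun", "Thur", "Fri", "Sat", "Sun"]

counts = Counter(days)
for day, n in counts.most_common():
    print(day, n)
```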

titanic.head()

sns.countplot(x="class", data=titanic)
plt.show()

sns.countplot(x="day", data=tips)
plt.show()

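Multidimensional data¶

Multidimensional data, with several variables, falls into one of the following cases depending on the variable types.¶

All variables to analyze are real-valued
All variables to analyze are categorical
The variables to analyze are a mix of real-valued and categorical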

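2D real-valued data → jointplot¶

If the data are two-dimensional and both variables are continuous real values, a scatter plot is the natural choice.
seaborn's jointplot command draws one.
Besides the scatter plot, jointplot also draws a histogram of each variable along the margins.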
documents
- http://seaborn.pydata.org/generated/seaborn.jointplot.html

sns.jointplot(x="sepal_length", y="sepal_width", data=iris)
plt.show()

Changing the arguments shows the same data as a kernel density estimate.¶

sns.jointplot(x="sepal_length", y="sepal_width", data=iris, kind="kde", space=0, zorder=0, n_levels=6)
plt.show()

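Multidimensional real-valued data → pairplot¶

For data with three or more dimensions, use seaborn's pairplot command.
pairplot lays out a grid with one cell per pair of variables: scatter plots off the diagonal and a histogram of each variable on the diagonal.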
documents
- http://seaborn.pydata.org/generated/seaborn.pairplot.html
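
The grid layout can be sketched without any plotting at all. The snippet below only enumerates which kind of plot each cell would hold for the four numeric iris columns; the `layout` dictionary is an illustration, not something seaborn exposes.

```python
# The four numeric columns of the iris dataset.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

layout = {}
for row_var in columns:
    for col_var in columns:
        # A variable paired with itself gets a histogram, every other pair a scatter plot.
        layout[(row_var, col_var)] = "histogram" if row_var == col_var else "scatter"

n_hist = sum(1 for kind in layout.values() if kind == "histogram")
n_scatter = len(layout) - n_hist
print(len(layout), n_hist, n_scatter)
```

With d variables the grid has d × d cells, so the figure grows quickly as columns are added.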

sns.pairplot(iris)
plt.show()

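If categorical data are mixed with real-valued data, the `hue` argument colors the points by category.¶

On the diagonal, where a variable is paired with itself, the plot shows that variable's distribution instead of a scatter plot.

# hue = "name of the categorical column"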
sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
plt.show()

2D categorical data → heatmap¶

If the data are two-dimensional and both variables are categorical, use the heatmap command on a table of counts.
documents
- http://seaborn.pydata.org/generated/seaborn.heatmap.html
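
A heatmap of two categorical variables is a picture of a contingency table: each cell holds the number of rows with that pair of categories. The sketch below builds such a table by hand from made-up records, as a stand-in for what a size-style pivot produces.

```python
from collections import Counter

# Made-up (class, embark_town) pairs, one per passenger.
passengers = [
    ("First", "Cherbourg"),
    ("Third", "Southampton"),
    ("Third", "Southampton"),
    ("Second", "Southampton"),
    ("First", "Southampton"),
    ("Third", "Queenstown"),
]

cell_counts = Counter(passengers)
rows = sorted({cls for cls, _ in passengers})
cols = sorted({town for _, town in passengers})

# Missing combinations count as 0 because Counter returns 0 for unseen keys.
table = [[cell_counts[(r, c)] for c in cols] for r in rows]
for r, line in zip(rows, table):
    print(r, line)
```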

titanic.head()

# aggfunc="size" counts the rows in each (class, embark_town) cell.
titanic_size = titanic.pivot_table(index="class", columns="embark_town", aggfunc="size")
titanic_size

sns.heatmap(titanic_size, annot=True, fmt="d")
plt.show()

2D mixed data¶

If the data are two-dimensional with one real-valued and one categorical variable, the following distribution plots are available in addition to the plots above.
- barplot
- boxplot
- pointplot
- violinplot
- stripplot
- swarmplot

`barplot`¶

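Draws a basic bar chart of the mean of a real-valued variable for each category. By default the error bars show a 95% bootstrap confidence interval for the mean, not the standard deviation.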
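
As a rough sketch of what one bar and its error bar represent, the code below computes a mean and a percentile bootstrap interval in plain Python. `bootstrap_ci` is a hypothetical helper, and the resample count and seed are assumptions; seaborn's own resampling will give slightly different numbers.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_boot)]
    upper = means[int((1.0 - tail) * n_boot) - 1]
    return lower, upper


# The first five total_bill values shown by tips.head().
bills = [16.99, 10.34, 21.01, 23.68, 24.59]
mean_bill = statistics.mean(bills)
lower, upper = bootstrap_ci(bills)
print(round(mean_bill, 3), round(lower, 3), round(upper, 3))
```

The bar height corresponds to `mean_bill`, and the error bar runs from `lower` to `upper`.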

tips.head()

sns.barplot(x="day", y="total_bill", data=tips)
plt.show()

`boxplot`¶

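Draws a box-and-whisker plot.
The box spans the first quartile (Q1) to the third quartile (Q3) of the real-valued distribution.
The line inside the box is the median. Each whisker extends to the farthest data point within 1.5 times the interquartile range (Q3 - Q1) of the box.
Points beyond the whiskers are outliers.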
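
The quantities behind a box plot can be computed directly. This sketch uses the standard library on made-up numbers; the "inclusive" quantile method is chosen here because it interpolates linearly between data points, and other quantile conventions give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up values with one extreme point

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences mark how far a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```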

sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

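Other plots¶

boxplot shows only summary statistics of a distribution, such as the median and quartiles.
violinplot, stripplot, and swarmplot have the advantage of showing the actual data points or the full shape of each distribution per category.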

sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()

sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)
plt.show()

sns.swarmplot(x="day", y="total_bill", data=tips)
plt.show()

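Multidimensional mixed data¶

Most of the plots introduced so far can also handle data with more than two dimensions.
For example, barplot, violinplot, and boxplot accept a hue argument that shows how a real-valued variable changes across two categorical variables.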
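
Adding hue amounts to grouping by two categorical keys instead of one. The sketch below groups a few rows by (day, sex) and averages the bill, which is the number a hue-split bar would show. The rows are the five shown by `tips.head()`; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (day, sex, total_bill) for the five rows shown by tips.head().
rows = [
    ("Sun", "Female", 16.99),
    ("Sun", "Male", 10.34),
    ("Sun", "Male", 21.01),
    ("Sun", "Male", 23.68),
    ("Sun", "Female", 24.59),
]

groups = defaultdict(list)
for day, sex, total_bill in rows:
    # The x category and the hue category together form the group key.
    groups[(day, sex)].append(total_bill)

bar_heights = {key: mean(values) for key, values in groups.items()}
for key, height in sorted(bar_heights.items()):
    print(key, round(height, 2))
```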

tips = sns.load_dataset("tips")
sns.barplot(x="day", y="total_bill", hue="sex", data=tips)
plt.show()

sns.boxplot(x="day", y="total_bill", hue="sex", data=tips)
plt.show()

tips = sns.load_dataset("tips")

# A palette such as palette="PRGn" sets the colors.
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips, palette="PRGn")

# Offset the axis spines from the plot and trim them to the tick range
sns.despine(offset=30, trim=True)
plt.show()

sns.violinplot(x="day", y="total_bill", hue="sex", data=tips)
plt.show()

sns.stripplot(x="day", y="total_bill", hue="sex", data=tips, jitter=True)
plt.show()

sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips)
plt.show()

`violinplot` takes a `split` option, and `stripplot` and `swarmplot` take a `dodge` option (formerly `split`), to change how the hue groups are drawn.¶

sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True)
plt.show()

sns.stripplot(x="day", y="total_bill", hue="sex", data=tips, jitter=True, dodge=True)
plt.show()

sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips, dodge=True)
plt.show()

`heatmap` can also show how a real-valued variable changes across two categorical variables.¶

flights_passengers = flights.pivot(index="month", columns="year", values="passengers")
# annot=True writes the value in each cell.
sns.heatmap(flights_passengers, annot=True, fmt="d", linewidths=1)
plt.show()

`factorplot`¶

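Uses color (hue), rows (row), and so on together to show how a distribution changes across three or more categorical variables. In seaborn 0.9 and later this function is named `catplot`, and its `size` argument is named `height`.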

sns.factorplot(x="age", y="embark_town", hue="sex", row="class", data=titanic[titanic.embark_town.notnull()],
               size=2, aspect=3.5, kind="violin", split=True)
plt.show()

Overlaying several chart types¶

Drawing several plot types on the same axes can make a visualization more informative.

sns.boxplot(x="tip", y="day", data=tips, whis=np.inf)
sns.stripplot(x="tip", y="day", data=tips, jitter=True, color="0.4")
plt.show()

sns.violinplot(x="day", y="total_bill", data=tips, inner=None)
sns.swarmplot(x="day", y="total_bill", data=tips, color="0.9")
plt.show()

Aesthetic Parameters¶

set(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1, color_codes=False)
- sns.set_context('notebook')
  - notebook, paper, talk, poster
- sns.set_style('darkgrid')
  - darkgrid, whitegrid, dark, white, ticks
- sns.set_palette('deep')
  - deep, muted, pastel, bright, dark, colorblind

t = np.arange(0., 5., 0.2)
sns.set_palette('deep')
plt.plot(t, t)
plt.plot(t, t**2)
plt.plot(t, t**3)
plt.show()

Bokeh¶

http://bokeh.pydata.org/en/latest/
Bokeh renders interactive plots in the web browser.

from bokeh.plotting import figure, output_notebook, show
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
output_notebook()
p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
p.line(x, y, legend_label="Temp.", line_width=2)
show(p)

Output of titanic.head():

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Output of titanic_size:

embark_town	Cherbourg	Queenstown	Southampton
class
First	85	2	127
Second	17	3	164
Third	66	72	353

Output of tips.head():

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
