Hands-On Machine Learning

Hands-On Machine Learning - 3

Choi재혁

2024. 6. 15. 23:14

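Reference: Hands-On Machine Learning, 2nd edition

Binary classifier

A "5 vs. not 5" classifier trained on MNIST
Uses sklearn's SGDClassifier model
- SGD (stochastic gradient descent) has the advantage of handling very large datasets efficiently
- SGD processes training samples independently, one at a time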

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd_clf.fit(X_train, y_train_5)

SGDClassifier relies on randomness during training, so random_state is fixed to get reproducible results
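The fit call above assumes a boolean target vector y_train_5 that is True for every 5 and False otherwise. A minimal sketch of how such a target might be built, using a small made-up label array in place of the real MNIST labels:

```python
import numpy as np

# Toy stand-in for the MNIST labels (made up for this illustration)
y_train_toy = np.array([5, 0, 4, 1, 9, 5, 3], dtype=np.uint8)

# Element-wise comparison gives a boolean array: True for 5s, False otherwise
y_train_5_toy = (y_train_toy == 5)

print(y_train_5_toy.tolist())  # [True, False, False, False, False, True, False]
print(int(y_train_5_toy.sum()))  # 2
```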

Performance measurement

Cross-validation
- An example that does roughly the same job as sklearn's cross_val_score function

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# shuffle=False is the default, so a warning asks you to either drop random_state or set shuffle=True.
# From version 0.24 onward this raises an error, so shuffle=True is set for future versions.
skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

The same procedure expressed with cross_val_score

Evaluate the SGDClassifier with cross_val_score(), using k-fold cross-validation with 3 folds

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

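A dummy classifier that assigns every image to the "not 5" class

import numpy as np
from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        return self  # nothing to learn
    def predict(self, X):
        # always predict False, i.e. "not 5"
        return np.zeros((len(X), 1), dtype=bool)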

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
>>> array([0.91125, 0.90855, 0.90915])

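Accuracy exceeds 90% without any learning at all
Only about 10% of the images are 5s, so always guessing "not 5" is right about 90% of the time; accuracy is a misleading metric on imbalanced datasets like this one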
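The 90% figure follows directly from the class balance. As a rough check, taking the class counts from the confusion matrix reported later in this post, a classifier that always answers "not 5" would score:

```python
# Class counts taken from the confusion matrix shown later in this post
n_not_5 = 53892 + 687   # actual "not 5" images
n_5 = 1891 + 3530       # actual 5 images

# A classifier that always answers "not 5" is right on every non-5 image
baseline_accuracy = n_not_5 / (n_not_5 + n_5)
print(round(baseline_accuracy, 5))  # 0.90965
```

This matches the mean of the three fold scores printed above.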

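Confusion matrix

A better way to evaluate a classifier's performance
It needs a set of predictions, which cross_val_predict() provides
Like cross_val_score, cross_val_predict performs k-fold cross-validation, but instead of returning evaluation scores it returns the predictions made on each test fold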

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

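Confusion matrix

Each row of the confusion matrix represents an actual class, and each column a predicted class
- In the result below, the first row covers the "not 5" images: 53,892 were correctly classified as "not 5", and 687 were wrongly classified as 5
- The second row covers the images of 5s: 1,891 were wrongly classified as "not 5", and 3,530 were correctly classified as 5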

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

>> array([[53892,   687],
           [ 1891,  3530]])

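The confusion matrix carries a lot of information, but sometimes a more concise summary metric is needed for a quick judgment

Precision

The fraction of samples predicted Positive (FP + TP) whose actual value is also Positive (TP)
TP / (FP + TP) (TP: number of true positives, FP: number of false positives)

Recall

The fraction of samples that are actually Positive (FN + TP) that were also predicted Positive (TP)
TP / (FN + TP) (FN: number of false negatives)
Also called sensitivity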
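As a worked example, both formulas can be applied to the counts in the confusion matrix above (TP = 3530, FP = 687, FN = 1891). This is only arithmetic on the reported numbers, not a replacement for the sklearn functions:

```python
# Counts read off the confusion matrix above (positive class = "5")
tp, fp, fn = 3530, 687, 1891

precision = tp / (fp + tp)   # share of predicted 5s that really are 5s
recall = tp / (fn + tp)      # share of real 5s that were detected

print(f"precision={precision:.4f}, recall={recall:.4f}")
# precision=0.8371, recall=0.6512
```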

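Precision and recall

from sklearn.metrics import precision_score, recall_score

# Precision
precision_score(y_train_5, y_train_pred)
>> 0.8370879772350012

# Recall
recall_score(y_train_5, y_train_pred)
>> 0.6511713705958311

When the classifier claims an image is a 5, it is right about 83.7% of the time
It detects only about 65.1% of all actual 5s
Precision and recall can be combined into a single value, the F1 score, which is the harmonic mean of precision and recall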

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

>> 0.7325171197343846
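Written out, the harmonic mean is F1 = 2 x precision x recall / (precision + recall). A small sketch that recomputes the score from the precision and recall values reported above; the helper name f1_from is made up for this illustration:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)

# Values reported by precision_score and recall_score above
f1 = f1_from(0.8370879772350012, 0.6511713705958311)
print(round(f1, 4))  # 0.7325

# The harmonic mean is pulled toward the lower value:
# high precision with poor recall still gives a low F1
print(round(f1_from(0.99, 0.10), 4))  # 0.1817
```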

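Whether precision or recall matters more depends on the situation

Precision/recall trade-off

Lowering the threshold raises recall and lowers precision
Raising the threshold lowers recall and (in general) raises precision
Calling the classifier's decision_function instead of predict returns a score for each sample, which makes it possible to apply any threshold you want

y_scores = sgd_clf.decision_function([some_digit]) # some_digit is an image of a 5
y_scores
>> array([2164.22030239])
# A threshold could be picked by hand from a score like this one, but cross_val_predict() can produce the scores for the whole training set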
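To see the trade-off on a small scale, the sketch below thresholds a handful of made-up decision scores (not the MNIST scores) and recomputes precision and recall by hand:

```python
import numpy as np

# Made-up decision scores and true labels, for illustration only
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
labels = np.array([False, False, True, False, True, True])

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then count outcomes."""
    pred = scores >= threshold
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    return float(tp / (tp + fp)), float(tp / (tp + fn))

low = precision_recall_at(0.0)    # low threshold: precision 0.75, recall 1.0
high = precision_recall_at(1.5)   # high threshold: precision 1.0, recall about 0.67
print(low, high)
```

Raising the threshold from 0.0 to 1.5 removes the one false positive (precision goes up) but also drops one true positive (recall goes down).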

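cross_val_predict()

Gets the scores for every sample in the training set
Set method="decision_function" so that decision scores are returned instead of predictions

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

With these y_scores, precision and recall can be computed for all possible thresholds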

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plotting the trade-off from the precision_recall_curve output

import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16) # Not shown in the book
    plt.xlabel("Threshold", fontsize=16)        # Not shown
    plt.grid(True)                              # Not shown
    plt.axis([-50000, 50000, 0, 1])             # Not shown


plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

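The precision curve is bumpier than the recall curve because precision can occasionally drop when the threshold is raised, whereas recall can only go down

Another option is to plot precision directly against recall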

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

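# recall at the first point where precision reaches 90% (used by the red marker lines)
recall_90_precision = recalls[np.argmax(precisions >= 0.90)]

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "r:")
plt.plot([0.0, recall_90_precision], [0.9, 0.9], "r:")
plt.plot([recall_90_precision], [0.9], "ro")
save_fig("precision_vs_recall_plot")  # save_fig is a helper from the book's notebook
plt.show()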

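If the target is 90% precision, use np.argmax() to find the first threshold that reaches it

# threshold at the first point where precision reaches 90%
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
# keep only the samples whose score is at or above that threshold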
y_train_pred_90 = (y_scores >= threshold_90_precision)


precision_score(y_train_5, y_train_pred_90)
>> 0.9
recall_score(y_train_5, y_train_pred_90)
>> 0.47
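The np.argmax() trick works because the comparison yields a boolean array and np.argmax() returns the index of the first maximum, i.e. the first True. A sketch with made-up arrays (precision_recall_curve returns one more precision value than thresholds, which the toy arrays mimic):

```python
import numpy as np

# Made-up values; precisions has one more element than thresholds
precisions = np.array([0.50, 0.70, 0.92, 0.95, 1.00])
thresholds = np.array([-2.0, -1.0, 0.5, 1.2])

mask = precisions >= 0.90             # False, False, True, True, True
first_idx = int(np.argmax(mask))      # index of the first True
threshold_90 = float(thresholds[first_idx])
print(first_idx, threshold_90)        # 2 0.5
```

If no precision value reaches the target, the mask is all False and np.argmax() returns 0, so checking mask.any() first is a sensible guard.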

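ROC curve

A curve used with binary classifiers, very similar to the precision/recall curve
Instead of plotting precision against recall, the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR)
Equivalently, the ROC curve plots sensitivity (recall) against 1 − specificity
To draw the ROC curve, compute the TPR and FPR at many threshold values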
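For a single threshold, both rates can be read off a confusion matrix. Using the counts from the matrix shown earlier in this post (TN = 53892, FP = 687, FN = 1891, TP = 3530) as an example:

```python
# Counts from the confusion matrix shown earlier in this post
tn, fp, fn, tp = 53892, 687, 1891, 3530

tpr = tp / (tp + fn)          # true positive rate = recall = sensitivity
fpr = fp / (fp + tn)          # false positive rate
specificity = tn / (tn + fp)  # true negative rate, so fpr = 1 - specificity

print(round(tpr, 4), round(fpr, 4))  # 0.6512 0.0126
```

This gives one point on the ROC curve; sweeping the threshold produces the rest.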

from sklearn.metrics import roc_curve

# get the fpr and tpr values
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)


def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # dashed diagonal
    plt.axis([0, 1, 0, 1])                                    # Not shown in the book
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) # Not shown
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    # Not shown
    plt.grid(True)                                            # Not shown

plt.figure(figsize=(8, 6))                                    # Not shown
plot_roc_curve(fpr, tpr)
fpr_90 = fpr[np.argmax(tpr >= recall_90_precision)]           # Not shown
plt.plot([fpr_90, fpr_90], [0., recall_90_precision], "r:")   # Not shown
plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], "r:")  # Not shown
plt.plot([fpr_90], [recall_90_precision], "ro")               # Not shown
save_fig("roc_curve_plot")                                    # Not shown
plt.show()

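A good classifier stays as far from the dashed diagonal as possible (toward the top-left corner); the area under the curve (AUC) summarizes this in a single number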

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
>> 0.96
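One way to read the AUC: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores, which is fine for tiny examples but far slower than roc_auc_score on real data:

```python
from itertools import product

# Made-up scores, for illustration only
pos_scores = [0.5, 2.0, 4.0]    # scores of actual positives
neg_scores = [-3.0, -1.0, 1.0]  # scores of actual negatives

# Count positive/negative pairs ranked correctly (ties count as half)
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos_scores, neg_scores)
)
auc = wins / (len(pos_scores) * len(neg_scores))
print(round(auc, 3))  # 0.889
```

Eight of the nine pairs are ordered correctly; a perfect ranking gives 1.0 and random scores give about 0.5.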

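Comparing ROC curves and ROC AUC with a RandomForestClassifier

RandomForestClassifier has no decision_function method; it has predict_proba instead
- It returns an array with one row per sample and one column per class, holding the probability that the sample belongs to each class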

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

# use the positive-class probability as the score, then get fpr and tpr from roc_curve
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()

roc_auc_score(y_train_5, y_scores_forest)
>> 0.99
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)
>> 0.99
recall_score(y_train_5, y_train_pred_forest)
>> 0.86

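Multiclass classification

Distinguishes between more than two classes
SGD, random forest, and naive Bayes classifiers can handle multiple classes natively
Logistic regression and support vector machine classifiers are strictly binary classifiers

Several binary classifiers can be combined to perform multiclass classification

-> OvR (one-versus-the-rest: train one binary classifier per class and pick the class whose classifier outputs the highest decision score)

-> OvO (one-versus-one: train a binary classifier for every pair of classes; with N classes, N x (N-1) / 2 classifiers are needed)

When a binary classification algorithm is used for a multiclass task, sklearn automatically runs OvR or OvO depending on the algorithm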
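The OvO count is simply the number of unordered class pairs. A quick check for the 10 MNIST digits:

```python
from itertools import combinations

classes = range(10)                      # digits 0 through 9
pairs = list(combinations(classes, 2))   # one OvO classifier per pair

print(len(pairs))   # 45, i.e. 10 * 9 / 2
print(pairs[:3])    # [(0, 1), (0, 2), (0, 3)]
```

OvR, by contrast, needs only one classifier per class, so 10 for MNIST.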

from sklearn.svm import SVC

svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:1000], y_train[:1000]) 
svm_clf.predict([some_digit])
>> array([5], dtype=uint8)

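Because a binary classifier was used for a multiclass task here, sklearn applied the OvO strategy under the hood: with 10 classes it trained 10 x 9 / 2 = 45 binary classifiers, collected their decision scores, and selected the class with the highest score
decision_function returns the scores for each sample (10 values, one per class)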

some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores

>> array([[ 2.81585438,  7.09167958,  3.82972099,  0.79365551,  5.8885703 ,
         9.29718395,  1.79862509,  8.10392157, -0.228207  ,  4.83753243]])

# index of the highest score
np.argmax(some_digit_scores)
>> 5

# list of classes
svm_clf.classes_
>> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

# the highest score is at index 5, i.e. class 5 scores highest (some_digit is a 5)

To force OvR or OvO, use the OneVsRestClassifier or OneVsOneClassifier class
Building a multiclass classifier that uses the OvR strategy on top of SVC

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
>> array([5], dtype=uint8)
# number of classifiers trained
len(ovr_clf.estimators_)
>> 10

Training an SGDClassifier

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
# decision score for each class
sgd_clf.decision_function([some_digit])
>> array([[-31893.03095419, -34419.69069632,  -9530.63950739,
          1823.73154031, -22320.14822878,  -1385.80478895,
        -26188.91070951, -16147.51323997,  -4604.35491274,
        -12050.767298  ]])

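SGDClassifier handles multiclass classification directly, so there is no need to apply OvR or OvO separately
In the scores above only class 3 is positive (about 1824) and class 5 is the runner-up (about -1386), so use cross_val_score() to evaluate accuracy more thoroughly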

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
>> array([0.87365, 0.85835, 0.8689])

Every fold scores above 85%; scaling the inputs can push accuracy higher

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

>> array([0.8983, 0.891 , 0.9018])
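StandardScaler standardizes each feature (column) by subtracting its mean and dividing by its standard deviation. A NumPy sketch of that computation on a tiny made-up matrix; dividing constant columns by 1 is this sketch's own way of avoiding division by zero:

```python
import numpy as np

# Tiny made-up feature matrix: 3 samples, 2 features
X = np.array([[0.0, 10.0],
              [2.0, 10.0],
              [4.0, 10.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0   # constant column: avoid dividing by zero (assumption)

X_scaled = (X - mean) / std
print(X_scaled[:, 0])  # first column now has mean 0 and unit variance
```

Putting all pixels on a comparable scale like this tends to help gradient-based models such as SGDClassifier, which is consistent with the improved scores above.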

Error analysis

As with binary classification, make predictions with cross_val_predict()
Then call confusion_matrix()

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

>> array([[5577,    0,   22,    5,    8,   43,   36,    6,  225,    1],
           [   0, 6400,   37,   24,    4,   44,    4,    7,  212,   10],
           [  27,   27, 5220,   92,   73,   27,   67,   36,  378,   11],
           [  22,   17,  117, 5227,    2,  203,   27,   40,  403,   73],
           [  12,   14,   41,    9, 5182,   12,   34,   27,  347,  164],
           [  27,   15,   30,  168,   53, 4444,   75,   14,  535,   60],
           [  30,   15,   42,    3,   44,   97, 5552,    3,  131,    1],
           [  21,   10,   51,   30,   49,   12,    3, 5684,  195,  210],
           [  17,   63,   48,   86,    3,  126,   25,   10, 5429,   44],
           [  25,   18,   30,   64,  118,   36,    1,  179,  371, 5107]])

Viewing the confusion matrix as an image

def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)

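Most images fall on the main diagonal, which confirms they were classified correctly
To focus on the errors, divide each value in the confusion matrix by the number of images in the corresponding actual class, so that error rates are compared instead of absolute counts
Otherwise classes with more images would look unfairly bad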

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

# keep the other entries, fill only the main diagonal with 0, and plot the result
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
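Why divide by the row sums? A toy two-class matrix (made-up numbers) shows how raw counts and error rates can tell different stories:

```python
import numpy as np

# Made-up confusion matrix: class 0 has 100 images, class 1 has only 10
conf = np.array([[90, 10],
                 [ 5,  5]])

row_sums = conf.sum(axis=1, keepdims=True)   # images per actual class
rates = conf / row_sums                      # each row now sums to 1

print(conf[0, 1], conf[1, 0])    # 10 5    -> class 0 has more raw errors
print(rates[0, 1], rates[1, 0])  # 0.1 0.5 -> class 1 has the higher error rate
```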

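Rows represent actual classes, and columns represent predicted classes

Looking at class 8 in the plot: the row for actual class 8 shows that real 8s are mostly classified correctly, but the column for predicted class 8 is poor, meaning many images of other digits were misclassified as 8
- This shows that effort should go into reducing the false 8s
- For example, add more training images related to 8, or engineer additional features that help identify 8s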

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
save_fig("error_analysis_digits_plot")
plt.show()

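The two 5x5 blocks on the left show images classified as 3, and the two 5x5 blocks on the right show images classified as 5
- The main cause is the use of SGDClassifier, a linear model: a linear classifier assigns a weight per class to each pixel and, for a new image, simply sums the weighted pixel intensities to get a score for each class

Multilabel classification

A classification system that outputs multiple binary labels is called multilabel classification
KNeighborsClassifier supports multilabel classification
- Each prediction outputs two labels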

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
>> array([[False,  True]]) # not 7 or greater, and odd

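Compute the F1 score across all labels
- average="macro" treats every label as equally important; if some labels have many more samples, weight each label by its support (the number of samples with that target label) by setting average="weighted"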

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
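The difference between the two averaging modes, sketched with made-up per-label F1 scores and supports:

```python
# Made-up per-label F1 scores and supports (samples per label)
f1_per_label = [0.9, 0.6]
support = [100, 300]

# "macro": plain mean, every label counts the same
macro = sum(f1_per_label) / len(f1_per_label)

# "weighted": mean weighted by each label's support
weighted = sum(f * s for f, s in zip(f1_per_label, support)) / sum(support)

print(round(macro, 3), round(weighted, 3))  # 0.75 0.675
```

The weighted average is pulled toward the score of the label with more samples.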

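Multioutput classification

A generalization of multilabel classification in which each label can be multiclass (i.e. it can take more than two possible values)
Demonstrated with a system that removes noise from images

# add noise with randint()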
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

# visualization
some_index = 0
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
plt.show()

Train a classifier that cleans up noisy images

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

Training uses the noisy images as X and the clean images as y

As a result, the prediction is an image with the noise removed

Exercises

1. An MNIST classifier with 97% accuracy

from sklearn.model_selection import GridSearchCV

param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)

grid_search.best_params_
>> {'n_neighbors': 4, 'weights': 'distance'}

grid_search.best_score_
>> 0.97

from sklearn.metrics import accuracy_score

y_pred = grid_search.predict(X_test)
accuracy_score(y_test, y_pred)
>> 0.9714

2. Data augmentation

from scipy.ndimage import shift  # scipy.ndimage.interpolation is a deprecated import path

def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])


image = X_train[1000]
shifted_image_down = shift_image(image, 0, 5)
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()
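For whole-pixel offsets, the effect of shift can be mimicked with plain NumPy slicing. This is a simplified stand-in, not what scipy does internally (scipy's shift also handles fractional offsets through interpolation); the name shift_image_np is made up for this sketch:

```python
import numpy as np

def shift_image_np(image, dx, dy):
    """Shift a 2-D array by whole pixels, filling vacated cells with 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Matching source/destination slices for rows (dy) and columns (dx)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

toy = np.arange(9).reshape(3, 3)
down = shift_image_np(toy, 0, 1)    # positive dy moves content down
left = shift_image_np(toy, -1, 0)   # negative dx moves content left
print(down)
print(left)
```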

# augmentation: add four shifted copies of every training image
X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

# shuffle
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]


knn_clf = KNeighborsClassifier(**grid_search.best_params_)
knn_clf.fit(X_train_augmented, y_train_augmented)


y_pred = knn_clf.predict(X_test)
accuracy_score(y_test, y_pred)
>> 0.9763

3. Titanic dataset

The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation
Download and load the data

import os
import urllib.request

TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/rickiepark/handson-ml2/master/datasets/titanic/"


# download the data
def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print("Downloading", filename)
            urllib.request.urlretrieve(url + filename, filepath)

fetch_titanic_data()

# load the data

import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

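Column descriptions

- PassengerId: a unique identifier for each passenger.
- Survived: the target. 0 means the passenger did not survive, 1 means the passenger survived.
- Pclass: passenger class (1st, 2nd, or 3rd).
- Name, Sex, Age: self-explanatory.
- SibSp: number of siblings and spouses aboard.
- Parch: number of children and parents aboard.
- Ticket: ticket ID
- Fare: ticket fare (in pounds)
- Cabin: cabin number
- Embarked: where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton)

Set the PassengerId column as the index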

train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

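Checking missing values and summary statistics

Some values are missing (for instance, the Embarked counts below add up to 889 of the 891 rows); about 38% of passengers survived, the mean Fare is 32.20, and the mean age is under 30

Checking the target and the categorical attributes

# the target takes the values 0 and 1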
train_data["Survived"].value_counts() 
>>  0    549
    1    342

# check the categorical attributes
train_data["Pclass"].value_counts() # passenger class
>>  3    491
    1    216
    2    184
train_data["Sex"].value_counts() # sex
>> male      577
   female    314
train_data["Embarked"].value_counts() # port of embarkation: C=Cherbourg, Q=Queenstown, S=Southampton.
>>  S    644
    C    168
    Q     77

Pipeline for the numerical attributes

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])

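# replace NaN with the median, then apply StandardScaler

Pipeline for the categorical attributes

cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # on scikit-learn 1.2+ this argument is named sparse_output instead of sparse
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

# replace NaN with the most frequent value, then convert to numeric columns with OneHotEncoder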
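What the two steps of this pipeline do can be sketched in plain Python on a made-up Embarked column. This only illustrates the idea; it is not how SimpleImputer or OneHotEncoder are implemented:

```python
from collections import Counter

# Made-up column with one missing value
embarked = ["S", "C", None, "S", "Q"]

# Step 1: replace missing values with the most frequent category
most_frequent = Counter(v for v in embarked if v is not None).most_common(1)[0][0]
filled = [v if v is not None else most_frequent for v in embarked]

# Step 2: one-hot encode, one column per category in sorted order
categories = sorted(set(filled))   # ['C', 'Q', 'S']
one_hot = [[1.0 if v == c else 0.0 for c in categories] for v in filled]

print(filled)      # ['S', 'C', 'S', 'S', 'Q']
print(one_hot[0])  # [0.0, 0.0, 1.0]
```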

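Apply a different preprocessing step per column (the categorical pipeline for categorical columns, the numerical pipeline for numerical ones)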

from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])


# ColumnTransformer applies a different preprocessing pipeline to each group of columns


# final preprocessed result
X_train = preprocess_pipeline.fit_transform(train_data[num_attribs + cat_attribs])
X_train

>> array([[-0.56573646,  0.43279337, -0.47367361, ...,  0.        ,
             0.        ,  1.        ],
           [ 0.66386103,  0.43279337, -0.47367361, ...,  1.        ,
             0.        ,  0.        ],
           [-0.25833709, -0.4745452 , -0.47367361, ...,  0.        ,
             0.        ,  1.        ],
           ...,
           [-0.1046374 ,  0.43279337,  2.00893337, ...,  0.        ,
             0.        ,  1.        ],
           [-0.25833709, -0.4745452 , -0.47367361, ...,  1.        ,
             0.        ,  0.        ],
           [ 0.20276197, -0.4745452 , -0.47367361, ...,  0.        ,
             1.        ,  0.        ]])

# target 
y_train = train_data["Survived"]

Training a RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X_train, y_train)

# predict
X_test = preprocess_pipeline.transform(test_data[num_attribs + cat_attribs])
y_pred = forest_clf.predict(X_test)

# evaluate performance
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
forest_scores.mean()

>> 0.809


from sklearn.svm import SVC

svm_clf = SVC(gamma="auto")
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()

>> 0.824

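Visualizing the two models' scores with boxplot()
- If the first quartile is Q1 and the third quartile is Q3, the interquartile range is IQR = Q3 − Q1 (the height of the box)
- Scores below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers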

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot([1]*10, svm_scores, ".")
plt.plot([2]*10, forest_scores, ".")
plt.boxplot([svm_scores, forest_scores], labels=("SVM","Random Forest"))
plt.ylabel("Accuracy", fontsize=14)
plt.show()
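The outlier rule described for the box plot can be checked by hand. A sketch on ten made-up accuracy scores (not the scores from the models above):

```python
import numpy as np

# Ten made-up cross-validation scores, one of them unusually low
scores = np.array([0.70, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)  # [0.7]
```

Here Q1 is about 0.8125 and Q3 about 0.8575, so anything below roughly 0.745 is flagged, which catches only the 0.70 score.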

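Further ways to improve the result: grid search for hyperparameters, comparing more models, and binning attributes such as age into categorical ranges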
