카테고리 없음

[ML] ML Models 3 - 실습(with Boosting Model)

dohyeon2 2025. 4. 2. 15:27

# 필요한 라이브러리 불러오기
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

In [52]:

! pwd

/Users/dohyeon

1. 물고기 데이터 불러오기¶

In [53]:

# 1. 데이터 불러오기(제가 풀고자하는 문제와 가장 유사한 Fish Market Dataset을 사용하였습니다.
# 데이터셋은 kaggle에서 받을 수 있습니다.
# https://www.kaggle.com/datasets/vipullrathod/fish-market?resource=download) 
data = pd.read_csv('/Users/dohyeon/Fish.csv')

2. 데이터 확인¶

In [54]:

# 2. 데이터 확인
print(data.head())
print(data.info())
print(data['Species'].unique())

  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
None
['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']

해석¶

데이터는 총 6개의 독립변수(범주형 1개,연속형 5개)와 1개의 종속변수(weight)로 이루어져 있습니다.
Dataset은 모든 feature가 159개, null값 없이 이루어져 있습니다.
범주형 데이터(Species)는 object 형식, 연속형 데이터(Length1, Length2, Length3, Height, Width, Weight)는 float64 형식입니다.
물고기 종(범주형 데이터)은 총 7종이 존재합니다.

각 feature가 뜻하는 바는 다음과 같습니다.¶

Length1: 머리부터 꼬리뿌리까지 (꼬리 제외)
Length2: 머리부터 꼬리가 갈라지는 지점까지
Length3: 머리부터 꼬리 끝까지
Height: 몸통 높이
Width: 몸통 두께

In [55]:

sns.pairplot(data)

Out[55]:

<seaborn.axisgrid.PairGrid at 0x3111d1950>

No description has been provided for this image

해석¶

이 그래프를 해석하는건 "매우" 중요합니다.
각 독립변수(length 1,2,3, width, height)와 종속변수(weight)와의 관계를 살펴보겠습니다.

1. Weight - length1/length2/length3 분석¶

각 독립변수는 weight와 log함수 형태를 띕니다.
length 1,2,3은 서로 매우 큰 상관관계를 가지고 있습니다.
polynomial, tree 기반 모델에서는 큰 문제가 없겠지만 선형회귀 모델에서는 다중 공선성 문제가 발생할 수 있습니다.
따라서 셋중 하나의 feature만 선택하는 방법과 length1/length3 처럼 체형의 비율을 새로운 feature로 도입할 수 있습니다.

2. Weight - height/width¶

weight-height 그래프는 형태는 length들과 유사하지만 분산이 더 큼
weight-width 그래프는 형태는 length들과 유사하지만 분산이 더 큼

3. 독립변수 간 관계¶

Length1 vs Length2 vs Length3¶

Length1, Length2, Length3 거의 완벽한 선형관계
그말은 즉 다중 공선성 문제가 발생할 확률이 큼

Height vs Length¶

Height도 Length 계열과 약간의 양의 관계
Height가 Length 계열 feature의 보조적인 느낌

Width vs Length¶

Widht는 Length 계열과 약한 양의 관계
Width는 다른 feature들에 비해 독립적인 정보 제공할 듯

In [56]:

# 독립변수 간 상관관계 분석
# 사용할 feature 선택
features = ['Length1', 'Length2', 'Length3', 'Height', 'Width']

# 상관계수 행렬 계산
corr_matrix = data[features].corr()

# 히트맵 시각화
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Heatmap of Features', fontsize=14)
plt.tight_layout()
plt.show()

해석¶

Length1, Length2, Length3 는 거의 동일한 정보 (다중공선성 위험 매우 큼)
Width 와 Length3 간의 높은 상관관계: Width 는 부피적 feature 인데 길이 기반 feature 와 연관됨
Height 는 다른 feature 들에 비해 상대적으로 독립성이 더 있음

In [57]:

# 다중공선성 검증을 위한 VIF 분석
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 사용할 feature 선택
features = ['Length1', 'Length2', 'Length3', 'Height', 'Width']

# 데이터 스케일링 (VIF 분석 전에는 스케일 맞추는 것이 좋음)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[features])

# VIF 계산
vif_data = pd.DataFrame()
vif_data['Feature'] = features
vif_data['VIF'] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]

# 출력
print(vif_data)

   Feature          VIF
0  Length1  1681.496487
1  Length2  2084.257828
2  Length3   422.468251
3   Height    14.570087
4    Width    12.275361

일반적으로 VIF 기준:

1 미만: 상관관계 거의 없음
1 ~ 5: 보통 수준
5 ~ 10: 다중공선성 의심
10 이상: 다중공선성 심각
100 이상: 정말 심각 (특히 1000 이상은 거의 완전히 종속적)

해석¶

Length1, Length2, Length3 전부 400 이상 ~ 2000 이상 → 거의 완전히 종속된 feature임
Height, Width도 다중공선성이 존재함. 그러나 Length만큼 심하지는 않음

결론¶

Length는 Length3 빼고 나머지 삭제
Height, Width는 남겨두기

In [58]:

import matplotlib.pyplot as plt
import seaborn as sns

# Set the style
sns.set(style="whitegrid")

# Define consistent color mapping
species_counts = data['Species'].value_counts()
species = species_counts.index.tolist()

# Define a color palette manually
palette = sns.color_palette('viridis', n_colors=len(species))
color_mapping = dict(zip(species, palette))

# Prepare the figure
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Species count bar plot (use manual colors)
sns.countplot(x='Species', data=data, palette=color_mapping, ax=axes[0])
axes[0].set_title('Count of Each Fish Species')
axes[0].set_xlabel('Fish Species')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Pie chart of species distribution (use same colors)
axes[1].pie(
    species_counts,
    labels=species_counts.index,
    autopct='%1.1f%%',
    colors=[color_mapping[sp] for sp in species]
)
axes[1].set_title('Species Distribution (Pie Chart)')

# ---- Numeric variables in grid ----

numeric_cols = ['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width']

# Prepare the figure: 총 2행 3열 그리드
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for ax, col in zip(axes.flatten(), numeric_cols):
    sns.histplot(data[col], kde=True, color='steelblue', ax=ax)
    ax.set_title(f'Histogram of {col}')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

/var/folders/lw/v0tmr64s129cr6tf3cy8skqc0000gn/T/ipykernel_10705/2274834865.py:19: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(x='Species', data=data, palette=color_mapping, ax=axes[0])

해석¶

1. 물고기 종류별 샘플 수¶

Perch 종이 샘플수가 가장 많음
데이터 불균형이 존재하기 때문에 모델링 시 종별 편향 고려해야 함

2. Weight¶

오른쪽으로 꼬리가 긴 right-skewed 분포
대부분의 물고기가 상대적으로 가벼움(0~500g).
극소수 데이터가 1000g 넘음

3. Length1,2,3¶

세 feature 모두 매우 유사한 분포를 띔

4. Height¶

약간의 right-skewed 분포
대부분 5~7.5 cm임

5. Width¶

약간의 right-skewed 분포지만 다른 feature들 보다 평탄함
대부분이 2.5~4.5cm 집중

결론¶

대부분 right-skewed 분포를 따르므로 log변환 or 정규화 필요
Length 1,2,3은 중복 feature이므로 하나 선택할 것
Height,Width는 ㄱㅊ
Weight, Height, Length3에 이상치 있음 -> 모델 학습시 robust 방법 or 이상치 제거 필요

In [59]:

import matplotlib.pyplot as plt
import seaborn as sns

# 스타일 세팅
sns.set(style="whitegrid")

# 시각화할 numeric feature 리스트
numeric_cols = ['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width']

# Species 컬러 매핑
species = data['Species'].unique()
palette = sns.color_palette('viridis', n_colors=len(species))
color_mapping = dict(zip(species, palette))

# Boxplot
plt.figure(figsize=(20, 18))
for idx, col in enumerate(numeric_cols, 1):
    plt.subplot(3, 2, idx)
    sns.boxplot(x='Species', y=col, hue='Species', data=data, palette=color_mapping, legend=False)
    plt.title(f'Boxplot of {col} by Species')
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Violinplot
plt.figure(figsize=(20, 18))
for idx, col in enumerate(numeric_cols, 1):
    plt.subplot(3, 2, idx)
    sns.violinplot(x='Species', y=col, hue='Species', data=data, palette=color_mapping, legend=False)
    plt.title(f'Violinplot of {col} by Species')
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Species별 numeric feature 분포를 시각화하는 이유¶

Species 별로 물고기의 특징이 다르면 Weight 예측할 때 Species 를 feature 로 넣어야 하는지 결정할 수 있습니다.
Species 별로 어떤 feature 가 더 중요하게 작용하는지 보면 모델링 전략도 달라집니다.

해석¶

Weight 그래프를 보면 종별 weight 차이가 극명함
따라서 species는 weight 예측에 필수적인 feature임
Length 그래프를 보면 Length3 하나만으로 Weight를 잘 설명할 수 있음
Height 그래프를 보면 종별 높이 차이가 극명함
Width 그패르를 보면 종별 너비 차이가 극명함

결론¶

Species feature는 필수적으로 사용해야 함.
다만 one-hot 인코딩, Target 인코딩 고민 필요

3. 전처리¶

In [60]:

# Species 범주형 변수 처리
fish_data = pd.get_dummies(data, columns=['Species'], drop_first=True)
fish_data

Out[60]:

	Weight	Length1	Length2	Length3	Height	Width	Species_Parkki	Species_Perch	Species_Pike	Species_Roach	Species_Smelt	Species_Whitefish
0	242.0	23.2	25.4	30.0	11.5200	4.0200	False	False	False	False	False	False
1	290.0	24.0	26.3	31.2	12.4800	4.3056	False	False	False	False	False	False
2	340.0	23.9	26.5	31.1	12.3778	4.6961	False	False	False	False	False	False
3	363.0	26.3	29.0	33.5	12.7300	4.4555	False	False	False	False	False	False
4	430.0	26.5	29.0	34.0	12.4440	5.1340	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...
154	12.2	11.5	12.2	13.4	2.0904	1.3936	False	False	False	False	True	False
155	13.4	11.7	12.4	13.5	2.4300	1.2690	False	False	False	False	True	False
156	12.2	12.1	13.0	13.8	2.2770	1.2558	False	False	False	False	True	False
157	19.7	13.2	14.3	15.2	2.8728	2.0672	False	False	False	False	True	False
158	19.9	13.8	15.0	16.2	2.9322	1.8792	False	False	False	False	True	False

159 rows × 12 columns

In [64]:

# One-hot Encoding
encoder = OneHotEncoder(sparse_output=False, drop='first')
species_encoded = encoder.fit_transform(data[['Species']])
species_encoded_df = pd.DataFrame(species_encoded, columns=encoder.get_feature_names_out(['Species']))

# 원본 데이터에 인코딩된 Species 추가 (중복 feature 제외)
data_encoded = pd.concat([data.drop('Species', axis=1), species_encoded_df], axis=1)

# Feature set 1: 모든 feature 사용
# data.columns 에는 원래 feature 가 있으므로 species_encoded_df.columns 만 붙이면 됨!
features_full = list(data.columns.drop('Species')) + list(species_encoded_df.columns)

# Feature set 2: 핵심 feature만 사용
features_selected = ['Length3', 'Height', 'Width'] + list(species_encoded_df.columns)

print("features_full =", features_full)
print("features_selected =", features_selected)

features_full = ['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Species_Parkki', 'Species_Perch', 'Species_Pike', 'Species_Roach', 'Species_Smelt', 'Species_Whitefish']
features_selected = ['Length3', 'Height', 'Width', 'Species_Parkki', 'Species_Perch', 'Species_Pike', 'Species_Roach', 'Species_Smelt', 'Species_Whitefish']

In [67]:

# 결과 저장용 딕셔너리
results_comparison = {}

def train_and_evaluate(features, label):
    # 데이터 분할
    X_train, X_test, y_train, y_test = train_test_split(
        data_encoded[features], data_encoded['Weight'], test_size=0.2, random_state=42
    )

    # 스케일링
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model_results = {}

    # (1) Linear Regression
    lr = LinearRegression()
    lr.fit(X_train_scaled, y_train)
    y_pred_lr = lr.predict(X_test_scaled)
    model_results['Linear Regression'] = {
        'MSE': mean_squared_error(y_test, y_pred_lr),
        'R2': r2_score(y_test, y_pred_lr)
    }

    # (2) Polynomial Regression (degree=2)
    poly_pipeline = Pipeline([
        ('poly_features', PolynomialFeatures(degree=2)),
        ('scaler', StandardScaler()),
        ('linear_regression', LinearRegression())
    ])
    poly_pipeline.fit(X_train, y_train)
    y_pred_poly = poly_pipeline.predict(X_test)
    model_results['Polynomial Regression (Degree 2)'] = {
        'MSE': mean_squared_error(y_test, y_pred_poly),
        'R2': r2_score(y_test, y_pred_poly)
    }

    # (3) XGBoost (빠른 기본 파라미터)
    xgb = XGBRegressor(random_state=42, verbosity=0)
    xgb.fit(X_train, y_train)
    y_pred_xgb = xgb.predict(X_test)
    model_results['XGBoost'] = {
        'MSE': mean_squared_error(y_test, y_pred_xgb),
        'R2': r2_score(y_test, y_pred_xgb)
    }

    # 결과 저장
    results_comparison[label] = model_results

# Feature set 1: 모든 feature 사용
train_and_evaluate(features_full, 'All Features')

# Feature set 2: 핵심 feature만 사용
train_and_evaluate(features_selected, 'Selected Features (Length3, Height, Width)')

# 결과 정리
results_df_comparison = pd.concat({k: pd.DataFrame(v).T for k, v in results_comparison.items()})
results_df_comparison.index.names = ['Feature Set', 'Model']

# 결과 출력
print(results_df_comparison)

                                                                                      MSE  \
Feature Set                                Model                                            
All Features                               Linear Regression                 1.016185e-24   
                                           Polynomial Regression (Degree 2)  4.306167e-25   
                                           XGBoost                           6.150184e+02   
Selected Features (Length3, Height, Width) Linear Regression                 7.888579e+03   
                                           Polynomial Regression (Degree 2)  1.934711e+03   
                                           XGBoost                           5.672772e+03   

                                                                                   R2  
Feature Set                                Model                                       
All Features                               Linear Regression                 1.000000  
                                           Polynomial Regression (Degree 2)  1.000000  
                                           XGBoost                           0.995676  
Selected Features (Length3, Height, Width) Linear Regression                 0.944540  
                                           Polynomial Regression (Degree 2)  0.986398  
                                           XGBoost                           0.960118

해석¶

핵심 feature (Length3, Height, Width) 만으로도 상당히 좋은 예측 성능. 다만, full feature set 보다는 약간 부족함
즉, Length1, Length2 는 Weight 예측에 기여하고 있음
Full feature 에서 Linear, Polynomial 모델 모두 MSE e-24 → 거의 머신 오차 수준
데이터가 단순하고 수가 적을 때 선형 모델이 거의 완벽하게 맞춤.
Polynomial Regression (Degree 2) 가 가장 안정적
특히 핵심 feature 만 사용했을 때 Polynomial Regression 성능이 뛰어남.
feature 간 상호작용(곱, 제곱 term) 이 Weight 예측에 효과적이라는 것을 입증.
XGBoost 는 전반적으로 Polynomial Regression 에 비해 살짝 아쉬움
XGBoost 가 보통 복잡하고 큰 데이터셋에서 장점을 보이는데, 지금 데이터는 상대적으로 단순하고 작기 때문에 성능이 떨어지는 것으로 보임

최종 결론¶

제가 사용한 데이터셋은 6개의 독립변수(범주형 1, 연속형 5)로 1개의 종속변수(체중)로 이루어져 있고, 총 159개의 샘플이 존재합니다. 따라서 XGBoost와 같은 Boosting model을 사용하기에는 데이터셋 크기가 매우 작습니다. 각 모델을 비교해본 결과 전체 feature를 사용하는 것이 핵심 feature를 사용하는 것보다 좋은 성능을 보였고, Polynomial regression model이 가장 좋은 성능을 보였습니다.

In [ ]:

Reference

[1] https://www.kaggle.com/code/munmun2004/house-prices-for-begginers

현재글[ML] ML Models 3 - 실습(with Boosting Model)

dohyeon's log