[Porto Seguro’s Safe Driver Prediction] Exploring the Data

Porto Seguro’s Safe Driver Prediction

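The data provided for the competition has a few notable characteristics.

1. Columns can be grouped by the tag in their name (ind, reg, car, calc).

2. A '_bin' suffix in a column name marks a binary feature, and '_cat' marks a categorical feature.

3. Columns with no suffix are continuous or ordinal features.

4. A value of -1 means the value is missing (null).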
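
The naming rules above can be turned into a small helper. This is a minimal sketch that assumes every feature name follows the `ps_<group>_<number>[_bin|_cat]` pattern; the column list, `describe_column`, and `to_missing` are illustrative names, not part of the competition files.

```python
import re
from collections import defaultdict

# Hypothetical subset of column names, for illustration only.
columns = [
    "ps_ind_01", "ps_ind_02_cat", "ps_ind_06_bin",
    "ps_reg_03", "ps_car_11_cat", "ps_car_13", "ps_calc_15_bin",
]

def describe_column(name):
    """Return (group, kind) inferred purely from the column name."""
    match = re.match(r"ps_(ind|reg|car|calc)_\d+(?:_(bin|cat))?$", name)
    if match is None:
        raise ValueError(f"unexpected column name: {name}")
    group, suffix = match.groups()
    kind = {"bin": "binary", "cat": "categorical", None: "continuous_or_ordinal"}[suffix]
    return group, kind

def to_missing(value):
    """Map the -1 sentinel to None; leave every other value untouched."""
    return None if value == -1 else value

# Group the columns by their tag (ind, reg, car, calc).
by_group = defaultdict(list)
for col in columns:
    group, kind = describe_column(col)
    by_group[group].append((col, kind))

for group, members in by_group.items():
    print(group, members)
```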

The data provides 57 feature (X) columns in total, excluding id and target.

They break down into Interval (10) / Ordinal (16) / Binary (17) / Nominal (14) variables.

1. Interval variables: ps_reg_03, ps_car_12, and ps_car_14 have missing values.

2. Ordinal variables: ps_car_11 has missing values.

3. Binary variables: four columns are 0 in more than 99% of rows.

(ps_ind_10_bin / ps_ind_11_bin / ps_ind_12_bin / ps_ind_13_bin)

The target is strongly imbalanced: only 3.645% of rows have target = 1.
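
A positive rate of 3.645% corresponds to roughly 26 negatives for every positive. The sketch below shows how that rate and ratio might be computed; the toy `target` series is made up, and the ratio is the kind of value often passed to class-weight parameters such as XGBoost's `scale_pos_weight`.

```python
import pandas as pd

# Toy target column (made up): 2 positives out of 50 rows.
target = pd.Series([1] * 2 + [0] * 48, name="target")

positive_rate = target.mean()          # share of rows with target == 1
class_counts = target.value_counts()   # number of rows per class

# How many negatives there are for each positive.
neg_to_pos_ratio = class_counts.loc[0] / class_counts.loc[1]

print(f"positive rate: {positive_rate:.3%}")
print(f"negative/positive ratio: {neg_to_pos_ratio:.1f}")
```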

[ Data Quality Checks ]

1. Missing values

- 12 columns in total contain missing values.

- Impute continuous variables with the mean and ordinal variables with the mode.

- Drop the columns with too many missing values: 'ps_car_03_cat' and 'ps_car_05_cat'.
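
The imputation plan above could look like the following. This is a sketch on a made-up frame, assuming scikit-learn's `SimpleImputer` with `-1` declared as the missing marker; only one of the two dropped columns is included to keep the example short.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values; -1 marks a missing entry.
train = pd.DataFrame({
    "ps_reg_03": [0.5, -1.0, 1.5, 1.0],   # continuous
    "ps_car_11": [2, 3, -1, 3],           # ordinal
    "ps_car_03_cat": [-1, -1, 1, -1],     # mostly missing
})

# Drop the column that is mostly missing.
train = train.drop(columns=["ps_car_03_cat"])

mean_imp = SimpleImputer(missing_values=-1, strategy="mean")
mode_imp = SimpleImputer(missing_values=-1, strategy="most_frequent")

# Continuous column: replace -1 with the mean of the observed values.
train["ps_reg_03"] = mean_imp.fit_transform(train[["ps_reg_03"]]).ravel()
# Ordinal column: replace -1 with the most frequent observed value.
train["ps_car_11"] = mode_imp.fit_transform(train[["ps_car_11"]]).ravel()

print(train)
```

When a separate test set is imputed, the imputers fitted on the training data would normally be reused through `transform` rather than refitted.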

2. Checking the cardinality of the categorical variables

- These have to be converted to dummy variables, but too many categories causes problems such as overfitting and slow computation.

- Only ps_car_11_cat has many distinct values, although it is still reasonable.

- There is separate code for handling ps_car_11_cat (target encoding):

# Script by https://www.kaggle.com/ogrellier

# Code: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
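
To make the idea of target encoding concrete, here is a simplified sketch. It is not the linked script: it uses plain additive (m-estimate) smoothing, adds no noise, and the function name and `weight` parameter are hypothetical. Each category is replaced by a blend of its own target mean and the global mean, so rare categories are pulled toward the prior.

```python
import pandas as pd

def smoothed_target_encode(train_cat, train_target, apply_to, weight=10.0):
    """Encode categories by a smoothed mean of the target (m-estimate smoothing)."""
    prior = train_target.mean()
    stats = train_target.groupby(train_cat).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + weight * prior) / (stats["count"] + weight)
    # Categories never seen in training fall back to the global prior.
    return apply_to.map(smoothed).fillna(prior)

# Made-up category codes and targets.
cat = pd.Series([104, 104, 104, 7])
y = pd.Series([1, 0, 1, 0])
new_rows = pd.Series([104, 7, 55])   # 55 does not occur in the training data

encoded = smoothed_target_encode(cat, y, new_rows, weight=2.0)
print(encoded.tolist())
```

Encoding the training rows themselves with statistics computed on those same rows leaks the target, so in practice the encoding is usually computed out-of-fold.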

Exploratory Data Visualization

[ Categorical variables ]

- They contain many missing values. It is wiser to treat missing as a category of its own.

- Both the column names and the values are encoded (anonymized), so they are not easy to interpret.

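[ Interval variables ]

- Check the correlations between variables; a heatmap works well for this.

- In general, variables within reg and within car are correlated with each other, while the calc variables show no correlation with one another at all.

- In the reg group, ps_reg_02 and ps_reg_03 are correlated.

- In the car group, ps_car_12 and ps_car_13 are strongly correlated, as are 12-14 and 13-15. Most of these seem to involve ps_car_12.

-> PCA can be used to reduce the dimensionality, since it removes the correlation between variables. Worth testing.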
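
The correlation check and the PCA idea can be tried on synthetic data first. The sketch below assumes made-up stand-ins for two correlated car columns and one independent calc column; it computes the Pearson correlation matrix (which could be passed to `seaborn.heatmap`) and then projects the standardized data onto two principal components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)

# Synthetic stand-ins, not the real competition values.
df = pd.DataFrame({
    "ps_car_12": base + 0.5 * rng.normal(size=n),
    "ps_car_13": base + 0.5 * rng.normal(size=n),
    "ps_calc_01": rng.normal(size=n),
})

corr = df.corr()  # Pearson correlation matrix

# Standardize first so that no column dominates because of its scale.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2).fit(scaled)
components = pca.transform(scaled)

print(corr.round(2))
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
```

The resulting component scores are uncorrelated with each other by construction, which is the sense in which PCA breaks the relationship between the original columns.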

Feature engineering

[ Creating dummy variables ]

- pd.get_dummies(train, columns=v, drop_first=True)
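
A small made-up example shows what the `get_dummies` call above produces. With `drop_first=True` the lowest level of each categorical column is dropped and becomes the reference level; in this toy frame that level is `-1`, so a missing value is represented by all dummy columns being zero.

```python
import pandas as pd

# Toy frame with made-up codes; -1 (missing) is treated as just another level.
train = pd.DataFrame({
    "ps_car_02_cat": [0, 1, -1, 1],
    "ps_reg_01": [0.1, 0.4, 0.7, 0.9],
})

encoded = pd.get_dummies(train, columns=["ps_car_02_cat"], drop_first=True)
print(list(encoded.columns))
print(encoded)
```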

[Creating interaction variables]

- PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

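import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# meta is the metadata DataFrame (one row per column) built in the kernel referenced below
v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# index=train.index keeps the rows aligned when concatenating below
interactions = pd.DataFrame(data=poly.fit_transform(train[v]), columns=poly.get_feature_names_out(v), index=train.index)
interactions.drop(v, axis=1, inplace=True)  # Remove the original columns

# Concat the interaction variables to the train data
print('Before creating interactions we have {} variables in train'.format(train.shape[1]))
train = pd.concat([train, interactions], axis=1)
print('After creating interactions we have {} variables in train'.format(train.shape[1]))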


# from https://www.kaggle.com/bertcarremans/data-preparation-exploration

Feature Selection

[ Removing features with low or zero variance ]

- sklearn has a handy class for removing features with no or very low variance: VarianceThreshold. By default it removes only features with zero variance.

- With threshold=.01, 28 variables have too low variance.

from sklearn.feature_selection import VarianceThreshold

X = train.drop(['id', 'target'], axis=1)  # Fit without the id and target columns
selector = VarianceThreshold(threshold=.01)
selector.fit(X)

# get_support() is True for kept columns, so invert it to list the removed ones
v = X.columns[~selector.get_support()]
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))
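
It helps to know what a threshold of 0.01 means for 0/1 columns. A binary feature that is 1 in a share p of rows has variance p(1 - p), so the threshold removes flags that are set in fewer than about 1% of rows (or in more than about 99%). The sketch below checks this on synthetic flags; the probabilities are made up.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 10_000

X = np.column_stack([
    rng.random(n) < 0.005,   # rare flag: variance near 0.005 * 0.995
    rng.random(n) < 0.30,    # common flag: variance near 0.30 * 0.70
]).astype(float)

selector = VarianceThreshold(threshold=0.01).fit(X)
keep = selector.get_support()

print("variances:", X.var(axis=0).round(4))
print("kept:", keep.tolist())
```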

[ Selecting features with a Random Forest and SelectFromModel ]

- Use SelectFromModel to select columns based on the feature importances of a fitted random forest.

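from sklearn.feature_selection import SelectFromModel

# rf is an already-fitted RandomForestClassifier; feat_labels holds the column names of X_train
sfm = SelectFromModel(rf, threshold='median', prefit=True)  # keep features with importance >= median
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = sfm.transform(X_train).shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])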
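
The snippet above relies on `rf`, `X_train`, and `feat_labels` being defined elsewhere. A self-contained sketch on synthetic data, with a made-up target that depends only on the first two columns, might look like this; the forest settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
n = 500

# Synthetic features f0..f5; only f0 and f1 carry signal.
X_train = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
y_train = (X_train["f0"] + X_train["f1"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# prefit=True reuses the fitted forest instead of refitting it.
sfm = SelectFromModel(rf, threshold="median", prefit=True)
selected_vars = list(X_train.columns[sfm.get_support()])

print("importances:", dict(zip(X_train.columns, rf.feature_importances_.round(3))))
print("selected:", selected_vars)
```

With `threshold="median"` roughly half of the features are kept regardless of how informative they are, so the cutoff is a convenience rather than a statistical test.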

* Mutual Information plots

https://bab2min.tistory.com/546
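
Mutual information between each feature and the target can also be estimated directly. The sketch below assumes scikit-learn's `mutual_info_classif` on synthetic binary data: one feature agrees with the target about 90% of the time and the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

y = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
informative = np.where(flip, 1 - y, y)   # agrees with y in about 90% of rows
noise = rng.integers(0, 2, size=n)       # independent of y
X = np.column_stack([informative, noise])

# discrete_features=True because both columns are 0/1 flags.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print("mutual information (nats):", mi.round(4))
```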

* Reference

https://www.kaggle.com/bertcarremans/data-preparation-exploration

http://seaborn.pydata.org/examples/many_pairwise_correlations.html

https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features

https://www.kaggle.com/arthurtok/interactive-porto-insights-a-plot-ly-tutorial

https://www.kaggle.com/aharless/xgboost-cv-lb-284?scriptVersionId=1682522
