Results of data preprocessing

Posted on 2021-11-08

Preprocessing results

I analyzed the results of preprocessing the movie_train data and summarize them here.

The num_actor variable has a low correlation, so I dropped that column first and then went ahead with preprocessing.
The columns holding categorical data were encoded (label encoding and one-hot encoding). Once everything had been converted to numeric data, I checked the correlations and removed the date-related columns.
The 'scaled_director', 'scaled_genre', and 'scaled_distributor' columns were also removed because of their low correlation.
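
The steps above can be sketched roughly as follows. This is only an illustration, not the exact code used for this post: the DataFrame is a tiny synthetic stand-in for movie_train with made-up values, the correlation is assumed to be measured against box_off_num, and the THRESHOLD cutoff is a hypothetical value chosen for the example.

```python
import pandas as pd

# Synthetic stand-in for movie_train; every value here is made up.
df = pd.DataFrame({
    "genre": ["drama", "action", "drama", "comedy", "action", "comedy"],
    "screening_rat": ["12", "15", "all", "15", "12", "all"],
    "time": [96, 113, 101, 108, 120, 95],
    "num_staff": [91, 387, 343, 20, 251, 40],
    "num_actor": [2, 3, 4, 6, 2, 3],
    "box_off_num": [23398, 7072501, 6959083, 217866, 483387, 6590],
})

# Step 1: drop num_actor before any other preprocessing.
df = df.drop(columns=["num_actor"])

# Step 2: label-encode genre and one-hot encode screening_rat.
df["genre"] = df["genre"].astype("category").cat.codes
df = pd.get_dummies(df, columns=["screening_rat"], dtype=int)

# Step 3: correlation of every remaining column with the assumed target.
corr = df.corr()["box_off_num"].drop("box_off_num")

# Step 4: drop columns whose absolute correlation falls below the cutoff.
THRESHOLD = 0.1  # hypothetical cutoff, not a value from this post
low_corr = corr[corr.abs() < THRESHOLD].index.tolist()
selected = df.drop(columns=low_corr)

print(corr.sort_values(key=abs, ascending=False))
print("dropped:", low_corr)
```

With only six synthetic rows the correlations themselves mean nothing; the point is the order of operations: drop, encode, check correlation, then drop again.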
The columns that can ultimately be used are box_off_num, dir_prve_bfnum, dir_prev_num, num_staff, screening_rat (one-hot encoded columns), and time. (Of course, I plan to recheck all of the columns for multicollinearity later, during linear regression modeling.)
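
One common way to do that multicollinearity check is the variance inflation factor (VIF). The post does not say which method will be used, so treat the sketch below as an assumption. The `vif_table` helper is a hypothetical name, and the feature values are randomly generated, with time deliberately built to be almost a linear function of num_staff so that the effect is visible.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor of each column.

    VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
    inverse of the feature correlation matrix.
    """
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF")


# Synthetic features (made-up values, not the real movie_train data).
rng = np.random.default_rng(0)
n = 200
num_staff = rng.normal(150, 50, size=n)
features = pd.DataFrame({
    "dir_prev_num": rng.integers(0, 6, size=n).astype(float),
    "num_staff": num_staff,
    # Deliberately almost collinear with num_staff.
    "time": 0.5 * num_staff + rng.normal(0, 1, size=n),
})

vif = vif_table(features)
print(vif.round(2))
```

A VIF near 1 means a column is barely explained by the others, while values above roughly 10 are a common rule-of-thumb warning sign. Note that a full set of one-hot columns always sums to 1, so the correlation matrix becomes singular if every level is included; dropping one level per encoded variable before calling the helper avoids that.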