EDA, Part 1

After preprocessing the missing values

With the missing values in the DataFrame handled, the next step is exploratory data analysis (EDA).

First I look for the columns with the lowest correlation, then drop the ones that are only weakly correlated.
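As a rough illustration of that idea, here is a minimal sketch that ranks numeric columns by their correlation with a target and drops the weak ones. The helper name `drop_weakly_correlated`, the toy data, and the target column `box_off_num` are assumptions made for this example. The 0.3 cutoff mirrors the threshold used in the code below, although this sketch compares absolute values so that strong negative correlations are kept as well.

```python
import pandas as pd

def drop_weakly_correlated(df, target, threshold=0.3):
    """Drop numeric columns whose |correlation| with `target` is below `threshold`.

    Hypothetical helper for illustration; not part of the original post.
    """
    # Only numeric columns can be correlated; text columns are left untouched.
    numeric = df.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].drop(target)
    weak = corr_with_target[corr_with_target.abs() < threshold].index.tolist()
    return df.drop(columns=weak), weak

# Toy data; `box_off_num` is an assumed target name, not taken from the post.
toy = pd.DataFrame({
    "box_off_num": [10, 20, 30, 40, 50, 60],
    "num_staff":   [1, 2, 3, 4, 5, 7],
    "num_actor":   [3, 1, 2, 3, 1, 2],
    "genre":       ["a", "b", "a", "c", "b", "a"],
})

reduced, dropped = drop_weakly_correlated(toy, target="box_off_num")
print(dropped)
```

On real data the threshold is a judgment call, so it is worth looking at the full correlation matrix before dropping anything.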

The following code performs the EDA.

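import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, drawn as an annotated heatmap.
# numeric_only=True is required on pandas 2.x when text columns are present.
corr = movie_train.corr(numeric_only=True)
sns.heatmap(data=corr, annot=True)

# Show only coefficients of 0.3 or higher (everything else becomes NaN)
corr[corr >= 0.3]

# Drop num_actor as a weakly correlated column, then redraw the heatmap
movie_train.drop(columns='num_actor', inplace=True)
sns.heatmap(data=movie_train.corr(numeric_only=True), annot=True)

# Correlations and pairwise scatter plots for every column except num_staff
col_mt = movie_train.columns.difference(['num_staff'])
movie_train[col_mt].corr(numeric_only=True)
sns.pairplot(movie_train[col_mt])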

### title, distributor, genre, release_time, screening_rat, director
# Count movies per distributor, keeping distributors with more than 2 movies
distributor = movie_train.distributor.value_counts().sort_values(ascending=True)
distributor = distributor[distributor > 2]
distributor_name = distributor.index.tolist()
distributor_cnt = distributor.values.tolist()

plt.rc('font', family="Malgun Gothic")  # font that can render Korean category names
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=distributor_name, y=distributor_cnt, ax=ax)
plt.xlabel('Distributor', fontsize=15)
plt.title('Number of movies by distributor', fontsize=20)
ax.set_xticklabels(distributor_name, rotation=75)
plt.show()
#------------------
# Count movies per genre, keeping genres with more than 1 movie
genre = movie_train.genre.value_counts().sort_values(ascending=True)
genre = genre[genre > 1]
genre_name = genre.index.tolist()
genre_cnt = genre.values.tolist()

plt.rc('font', family="Malgun Gothic")  # font that can render Korean category names
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=genre_name, y=genre_cnt, ax=ax)
plt.xlabel('Genre', fontsize=15)
plt.title('Number of movies by genre', fontsize=20)
ax.set_xticklabels(genre_name, rotation=75)
plt.show()
#--------------------
# Count movies per screening rating, keeping ratings with more than 1 movie
rate = movie_train.screening_rat.value_counts().sort_values(ascending=True)
rate = rate[rate > 1]
rate_name = rate.index.tolist()
rate_cnt = rate.values.tolist()

plt.rc('font', family="Malgun Gothic")  # font that can render Korean category names
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=rate_name, y=rate_cnt, ax=ax)
plt.xlabel('Rating', fontsize=15)
plt.title('Number of movies by rating', fontsize=20)
ax.set_xticklabels(rate_name, rotation=75)
plt.show()
#----------------------
# Count movies per director, keeping directors with more than 1 movie
director = movie_train.director.value_counts().sort_values(ascending=True)
director = director[director > 1]
director_nm = director.index.tolist()
director_cnt = director.values.tolist()

plt.rc('font', family="Malgun Gothic")  # font that can render Korean category names
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)
# seaborn 0.12+ requires x and y to be passed as keyword arguments
sns.barplot(x=director_nm, y=director_cnt, ax=ax)
plt.xlabel('Director', fontsize=15)
plt.title('Number of movies by director', fontsize=20)
ax.set_xticklabels(director_nm, rotation=75)
plt.show()

I also added visualization code for the nominal (categorical) columns.
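The four bar-chart blocks above repeat the same count, filter, and plot pattern, so they could be folded into one helper. The sketch below is one possible way to do that. The function names `category_counts` and `plot_category_counts` and the toy DataFrame are assumptions made for this example; the `min_count` filter follows the `> 2` and `> 1` cutoffs used above.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def category_counts(df, column, min_count=1):
    """Return value counts for `column`, ascending, keeping counts above `min_count`."""
    counts = df[column].value_counts().sort_values(ascending=True)
    return counts[counts > min_count]

def plot_category_counts(df, column, min_count=1, xlabel=None, title=None):
    """Draw a bar chart of the filtered counts and return the Axes (hypothetical helper)."""
    counts = category_counts(df, column, min_count)
    fig, ax = plt.subplots(figsize=(15, 10))
    # x and y are passed by keyword, as recent seaborn versions require
    sns.barplot(x=counts.index.tolist(), y=counts.values.tolist(), ax=ax)
    ax.set_xlabel(xlabel or column, fontsize=15)
    ax.set_title(title or f"Number of movies by {column}", fontsize=20)
    ax.tick_params(axis="x", labelrotation=75)
    return ax

# Toy data standing in for movie_train
toy = pd.DataFrame({"genre": ["drama", "drama", "drama", "action", "action", "horror"]})
counts = category_counts(toy, "genre", min_count=1)
print(counts)
```

With this in place, a call such as `plot_category_counts(movie_train, "distributor", min_count=2, xlabel="Distributor")` followed by `plt.show()` would stand in for the first block, and the other three columns would differ only in their arguments.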

I am not done yet!

This code is just the first step of the EDA.