My Blog for Coding and Notes

https://github.com/hesthers


2nd Day of Toy Project

  • Crawl the Kinolights site: dates and the names of content aired on Netflix and TVING
  • Save the results as lists, then convert them into data frames, etc.
  • Then crawl SNS for hashtag counts and similar data using those content names (to gauge how much buzz they generate)

BEGIN toy project

Raw meeting report

Requirements

Planning meeting report

<<Ideation>>

Topic and hypothesis (goal setting) — sites, whether the data can be collected, extraction methods, ideas, and difficulties

What to write in the project report:

Goal — written in one line
Reason for selection (motivation and rationale for the analysis)
Data collection — collection methods/sources

  • Issues (data preprocessing, difficulties, etc…)

Data check — handling missing values, etc.
Data distribution

Unify variable names

Toy project topic and direction

  • Comparative analysis of content buzz
    Analyze the influence of OTT platforms' original content on the OTT market (gauging the buzz around digital media content)

  • Related sites:
    Collect data by crawling related sites (related information — distributor, release or airing year, etc…)
    Identify key information on trend sites and collect it by crawling
    Identify buzz factors on SNS using hashtags, etc.

Python Crawling Practice

  • Example of crawling Instagram with the Selenium module in Python:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import urllib.request

driver = webdriver.Chrome('./chromedriver_win32/chromedriver.exe')
url = 'https://www.instagram.com/'
driver.get(url)
time.sleep(4)

ins_id = '[Enter your Instagram id (email)]'  # or use input() here!
pw = '[Enter your Instagram password]'

# log in: fill in the id/password fields and click the login button
driver.find_element_by_xpath('[copy xpath here]').send_keys(ins_id)
driver.find_element_by_xpath('[copy xpath here]').send_keys(pw)
driver.find_element_by_xpath('[copy xpath here]').click()
time.sleep(3)

# dismiss the pop-ups that appear after logging in
driver.find_element_by_xpath('[copy xpath here]').click()
time.sleep(3)

driver.find_element_by_xpath('[copy xpath here]').click()
time.sleep(4)

# search for a keyword and open the first post
key_word = input('키워드를 입력하세요 : ')  # prompt: "Enter a keyword"
driver.find_element_by_xpath('[copy xpath here]').send_keys(key_word)
time.sleep(4)

driver.find_element_by_xpath('[copy xpath here]').click()

for i in range(50):
    if i == 0:
        driver.find_element_by_xpath('[copy xpath here]').click()

        html = BeautifulSoup(driver.page_source, 'html.parser')
        html.select('div > div.C4VMK > span')

        # hashtags, comments, and like counts of the first post
        tag = [tag.text.strip('#') for tag in html.select('a.xil3i')]
        comment = [comment.text for comment in html.select('div > div.C4VMK > span')]
        like = [l.text for l in html.select('section.EDfFK.ygqzn > div > div > a > span')]

        # download the post image
        image = driver.find_element_by_css_selector('[copy selector here related to the image]')
        image_url = image.get_attribute('src')
        urllib.request.urlretrieve(image_url, './test_image.jpg')
        time.sleep(5)
        print('크롤링 시작')  # "crawling started"
        # move to the next post
        driver.find_element_by_xpath('[copy xpath here]').click()

    else:
        html = BeautifulSoup(driver.page_source, 'html.parser')
        html.select('div > div.C4VMK > span')

        total_tag = [tag.text.strip('#') for tag in html.select('a.xil3i')]
        extra_comment = [comment.text for comment in html.select('div > div.C4VMK > span')]
        total_like = [l.text for l in html.select('section.EDfFK.ygqzn > div > div > a > span')]

        try:
            image = driver.find_element_by_css_selector('[copy selector here related to the image]')
            image_url = image.get_attribute('src')
            urllib.request.urlretrieve(image_url, f'./test_image{i}.jpg')

        except Exception as error:
            # if the first attempt fails, try the same download once more
            image = driver.find_element_by_css_selector('[copy selector here related to the image]')
            image_url = image.get_attribute('src')
            urllib.request.urlretrieve(image_url, f'./test_image{i}.jpg')

        # move to the next post
        driver.find_element_by_xpath('[copy xpath here]').click()
        time.sleep(10)
        print(f'크롤링 중 {1+i}')  # "crawling in progress"
  • No unauthorized use or copying (this is prohibited!)

Crawling pt.3

Today's topic on this GitHub blog is crawling without the BeautifulSoup module. Generally, when I crawl, I use the BeautifulSoup module. Unfortunately, many websites treat crawling as an information security problem and block it. To avoid being blocked, use the crawling technique for dynamic web pages.

  • Press the F12 key and find the referer address and user agent on the Network panel of the developer tools.

  • Write the Python code for the crawl as follows:

import requests
import json

url = '[web address to crawl]'
# headers copied from the Network panel of the developer tools
info = {
    'referer': '[main webpage address]',
    'user-agent': '[user agent on the network panel of the developer tools]'
}
response = requests.get(url, headers=info)
# response.text

# parse the JSON response from the dynamic page
data = json.loads(response.text)
data

This way you can access data on dynamic web pages that is rendered by JavaScript (JS) and not available through simple static crawling.

Tomorrow Plans

  • This is just a personal planning note.

Things to do tomorrow

  1. Solve the mock tests in a book (for the certificate) — a series of 3~4 mock exams

  2. Study machine learning through the online lecture and write a new worksheet.

  3. Review Ch.1 and solve the practice questions

Study much harder and keep going! Only 6 days left until D-day… 加油!!

Today What I Studied (TIL)

  • Big Data Analysis Engineer Certificate: solved the practice questions and reviewed what I studied before (Pt.2, the statistics part)

I bought a new practice question book yesterday, so I started solving the questions today!

This post is just a personal note.

The Big Data Analysis Engineer exam is coming up soon. One week and two days?? About nine days left in total. So I thought I would briefly jot down only the topics I feel are important for each subject. Honestly, it's because I don't have much else to write about.. I've been posting only in English, so I wanted to write in Korean as well.

Subject 1: data architecture (especially the HDFS parts), software collection technologies, personal data and de-identification, social IT issues, data governance

Subject 2: descriptive and inferential statistics, derived variables, probability calculation, EDA, data classification

Subject 3: analysis techniques (artificial neural networks, regression analysis, SVM, unstructured data, time series, ensemble methods, random forest, cluster analysis)

Subject 4: evaluation metrics (confusion matrix), cross-validation, goodness-of-fit tests, model selection and evaluation, visualization (types and characteristics), using analysis results, evaluating business contribution (evaluation methods), distinguishing between model monitoring and remodeling

Interesting Article: Metaverse

I found an interesting news article about the metaverse.

This is the link: News Link

The metaverse is one of the topics I have been interested in. In the financial field, especially digital finance, many companies have been interested in this topic.

If the metaverse takes off in various fields, businesses would gain a new type of market.

Today What I Studied

I studied for the nationally authorized certificate (Big Data Engineer Certificate)

  • Solved Pt.3 questions (about 50 questions)
  • Studied Pt.4: evaluating analysis models (a small sketch of two of these ideas follows this list)
  1. confusion matrix
  2. ROC curve
  3. evaluation criteria for analysis models
  4. cross-validation — holdout, K-Fold, LOOCV
  5. parametric significance tests
  6. goodness-of-fit tests
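Below is a minimal scikit-learn sketch of two of these ideas, the confusion matrix and K-Fold cross-validation; the iris data and the logistic regression model are placeholders I chose for illustration, not what the certificate book uses.

# Minimal sketch of a confusion matrix and K-Fold cross-validation with scikit-learn.
# The iris dataset and logistic regression are just illustrative placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix: rows = actual classes, columns = predicted classes
print(confusion_matrix(y_test, model.predict(X_test)))

# 5-fold cross-validation: fit and score on 5 different train/validation splits
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())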

I also watched online lecture videos about machine learning (linear regression); a rough sketch follows below.

  • Topic: AirBnB
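As a rough sketch of what such a linear regression looks like in scikit-learn — the CSV path and the column names below are hypothetical placeholders, not the lecture's actual Airbnb data:

# Minimal linear regression sketch with scikit-learn.
# The CSV path and the columns 'minimum_nights', 'number_of_reviews', and 'price'
# are hypothetical placeholders, not the lecture's actual dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

listings = pd.read_csv('./airbnb_listings.csv')        # hypothetical file
X = listings[['minimum_nights', 'number_of_reviews']]   # hypothetical features
y = listings['price']                                   # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print(reg.coef_, reg.intercept_)
print(reg.score(X_test, y_test))  # R^2 on the held-out split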

Today What I Studied (TIS)

  1. I studied for the nationally authorized certificate (Big Data Engineer Certificate)
  • Pt.3 Big data modeling — analysis techniques (from unstructured data to non-parametric statistics)
  2. I also watched online lecture videos about data visualization with Python's seaborn module and the web crawling part.

  3. I practiced time series data analysis in Python. (posted it on my Naver blog; a short generic sketch follows below)
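The actual time series practice is on my Naver blog; the snippet below is only a generic pandas sketch of that kind of work (resampling and a rolling mean) on made-up data.

# Generic pandas time series sketch on made-up data (not the Naver blog post itself):
# build a DatetimeIndex, resample daily values to monthly means, add a rolling mean.
import pandas as pd
import numpy as np

idx = pd.date_range('2021-01-01', periods=180, freq='D')
ts = pd.Series(np.random.randn(180).cumsum(), index=idx)

monthly = ts.resample('M').mean()      # downsample daily values to monthly means
rolling = ts.rolling(window=7).mean()  # 7-day moving average
print(monthly.head())
print(rolling.tail())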

Crawling Practice pt.2

Today I'm going to take on a really, really fun topic… and that is !!!!! crawling!!

This time, I'm going to enter a topic keyword, connect to a portal site, and fetch information about the related topic. (I took the content and code I learned in the crawling class, turned it into a loop, and tried a bit more on top of it.)

The topic is dramas… I'll take a drama title as input, print the related information, store it in lists, and then analyze it.. (Actually, to prevent any issues with this information, it seems necessary to prevent copying of this post and so on… Including the ethical side, I gain no personal profit whatsoever from this post!!!!)

Before crawling, you obviously need the BeautifulSoup package, and you have to check the code that prepares for crawling..

For this analysis, when a drama title is entered, I'll print the drama's plot summary and cast information, then extract links to Naver cafes and blogs that contain the related keywords. I'll only store and extract 10 dramas, and save all of this information to CSV files.

import requests 
from bs4 import BeautifulSoup
import time
import pandas as pd

cnt = 0
drama_nm = []
drama_txt = []
drama_info = []
drama_link = []

while True:
    cnt += 1

    drama = input('드라마 이름을 입력하세요 : ')  # prompt: "Enter a drama title"
    drama_nm.append(drama)

    # search for the drama on Daum and parse the result page
    url = f'https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q={drama}'
    response = requests.get(url)
    html = BeautifulSoup(response.text, 'html.parser')
    time.sleep(2)

    # plot summary
    drama_summary = html.select('dd.cont')[0].text
    drama_info.append(drama_summary)

    print(f'드라마 줄거리 : {drama_summary}')  # "drama plot summary"
    print()

    # cast list (role / actor)
    print(' 배역 배우')
    print(' ------------------')
    for cont in html.select('div#tv_casting li')[1:]:
        drama_info.append(cont.text)
        print(cont.text)
    print()

    # search Naver and keep only cafe/blog links whose text contains the title
    url1 = f'https://search.naver.com/search.naver?where=nexearch&sm=top_sug.pre&fbm=1&acr=1&acq=%EA%B2%80%EC%9D%80&qdt=0&ie=utf8&query={drama}'
    res = requests.get(url1)
    html1 = BeautifulSoup(res.text, 'html.parser')
    html1.select('a')

    for txt in html1.select('a'):
        if drama in txt.text:
            if ('cafe' in txt.attrs['href']) or ('blog' in txt.attrs['href']):
                drama_txt.append(txt.text)
                drama_link.append(txt.attrs['href'])
                print(f"{txt.text}: 링크 => {txt.attrs['href']}")
            else:
                pass
        else:
            pass

    time.sleep(5)
    # save each accumulated list to its own CSV file
    dr_l = {'drama_nm': drama_nm, 'drama_txt': drama_txt, 'drama_info': drama_info, 'drama_link': drama_link}
    for l, k in zip(dr_l.values(), dr_l.keys()):
        dr = pd.DataFrame(l)
        dr.to_csv(f'{k}.csv', index=False, encoding = 'utf8')

    if cnt > 10:
        break
    print()
    time.sleep(2)

print(drama_nm)
print()
print(drama_info)
print()
print(drama_txt)
print()
print(drama_link)
  • No unauthorized copying or redistribution.

Crawling Practice with the topic Drama

import requests
from bs4 import BeautifulSoup

kw = input('키워드를 입력하세요 : ')  # prompt: "Enter a keyword"
url = f'https://search.naver.com/search.naver?where=nexearch&sm=top_sug.pre&fbm=1&acr=1&acq=%EA%B2%80%EC%9D%80&qdt=0&ie=utf8&query={kw}'
response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')
html.select('a')

# keep only cafe/blog links whose text contains the keyword
for txt in html.select('a'):
    if kw in txt.text:
        if ('cafe' in txt.attrs['href']) or ('blog' in txt.attrs['href']):
            print(txt.text, txt.attrs['href'])
        else:
            pass
    else:
        pass

drama = input('드라마 이름을 입력하세요 : ')  # prompt: "Enter a drama title"
url = f'https://search.daum.net/search?w=tot&DA=YZR&t__nil_searchbox=btn&sug=&sugo=&sq=&o=&q={drama}'
response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')

# plot summary
drama_summary = html.select('dd.cont')[0].text
print(f'드라마 줄거리 : {drama_summary}')  # "drama plot summary"
print()

# cast list (role / actor)
print(' 배역 배우')
print(' ------------------')
for cont in html.select('div#tv_casting li')[1:]:
    print(cont.text)
  • I used the Python crawling code above to print drama info.
  • No unauthorized copying (무단복제 금지)

VISUALIZATION

Python provides various visualization tools, and you can draw many kinds of graphs with it.
Before you draw graphs or plots, you have to import the visualization modules.

import matplotlib.pyplot as plt
import seaborn as sns

Difference between matplotlib and seaborn

Matplotlib is not as aesthetically pleasing as the seaborn module. If you want to make your graphs more beautiful, I recommend using seaborn.
Moreover, seaborn also provides various graph types that the matplotlib module does not.

TYPES of graphs

  1. Line Plot
  2. Bar Plot
  3. Histogram
  4. Scatter Plot
  5. Box Plot
  • Extra graphs that the seaborn module provides
  1. countplot
  2. distplot
  3. jointplot
  4. pairplot

AND … kdeplot, etc.… more graphs!!!

FIND HERE! Matplotlib website & Plotly website & Seaborn website

Using graphs

It is important to choose graphs that fit the purpose so you can analyze the data better.

For example, if you want to show density (where values appear more or less often on the graph), you should use a histogram. Or when you want to show the purchase amount by the language that international consumers use, you should use a barplot or seaborn's countplot.
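As a rough sketch of those two cases, using seaborn's built-in tips dataset purely as a stand-in for the purchase data (so the column names are only illustrative):

# Rough sketch of the two cases above; seaborn's built-in 'tips' dataset
# stands in for the purchase data, so the column names are only illustrative.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')

# density-style view of where values concentrate: histogram
sns.histplot(data=tips, x='total_bill')
plt.show()

# amount by category (e.g. purchases by language): countplot / barplot
sns.countplot(data=tips, x='day')
plt.show()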

Fancy Indexing (Masking) of Python

  • Index a multidimensional array with a boolean array
  • Used when extracting the data samples needed for the analysis
  • You will use this method a lot in data analysis (a minimal sketch follows below)
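A minimal NumPy sketch of this kind of boolean masking:

# Minimal sketch of boolean masking (fancy indexing) with NumPy.
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

mask = arr > 5       # boolean array of the same shape
print(mask)
print(arr[mask])     # keeps only the elements where the mask is True -> [6 7 8 9]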

Combining two tables

  • Use merge or concat, and create pivot tables
  • merge joins tables based on a key (SQL: JOIN)
  1. inner/outer join
  2. left_on/right_on: when the key names differ
  3. left/right join
  • concat: simply stacks two tables along an axis (axis = 0: concatenate rows / axis = 1: concatenate columns) — a small sketch follows below
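A small sketch of merge, concat, and a pivot table; the table and column names are just made-up placeholders:

# Small sketch of merge (SQL-style join on a key) and concat, with made-up tables.
import pandas as pd

orders = pd.DataFrame({'user_id': [1, 2, 3], 'amount': [100, 200, 150]})
users = pd.DataFrame({'id': [1, 2, 4], 'name': ['Kim', 'Lee', 'Park']})

# merge: join on a key; left_on/right_on handle different key names,
# how= chooses inner/outer/left/right
joined = pd.merge(orders, users, left_on='user_id', right_on='id', how='inner')
print(joined)

# concat: just stack tables along an axis (axis=0 rows, axis=1 columns)
stacked = pd.concat([orders, orders], axis=0, ignore_index=True)
print(stacked)

# pivot table built from the merged result
print(joined.pivot_table(index='name', values='amount', aggfunc='sum'))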

Editing Indexes/Columns

  • Editing indexes: use .reset_index(drop = True(or False), inplace = True(or False))
  • Editing columns: use .drop('[column name]', axis = 1) or del (data_name)['column name']
  • Changing a column name: .rename(columns = {'original column name':'column name to change'}, inplace = True(or False))

If you do not want to reassign the result to a variable name, use inplace = True! (a small sketch follows below)
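A small sketch of these index/column edits on a made-up DataFrame (the column names are placeholders):

# Small sketch of the index/column edits above, on a made-up DataFrame.
import pandas as pd

df = pd.DataFrame({'old_name': [1, 2, 3], 'extra': [4, 5, 6]})

df = df.reset_index(drop=True)    # rebuild the index and drop the old one
df = df.drop('extra', axis=1)     # drop a column (or: del df['extra'])
df.rename(columns={'old_name': 'new_name'}, inplace=True)  # renamed in place, no reassignment needed
print(df)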

Data Sampling and Analysis

  • When you use basic indexing, slicing, and conditional sampling (masking/fancy indexing) together, the data analysis goes nicely!!
  • You can also extract meaningful information from the data. (a small sketch follows below)
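A small sketch of conditional sampling combined with basic indexing, again on made-up data:

# Sketch of conditional sampling (masking) combined with basic indexing,
# on a made-up DataFrame.
import pandas as pd

scores = pd.DataFrame({'name': ['Kim', 'Lee', 'Park', 'Choi'],
                       'score': [85, 42, 91, 66]})

passed = scores[scores['score'] >= 70]                 # boolean mask as a row filter
top_names = scores.loc[scores['score'] >= 70, 'name']  # mask + column selection
print(passed)
print(top_names.tolist())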

Pandas for higher-quality analysis

  • Use the apply function with a lambda function
  • Define a function with def and apply it to columns via apply
  • Broadcasting and data masking (a small sketch follows below)
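A small sketch of apply with a lambda, apply with a def-defined function, and broadcasting plus masking, on a made-up DataFrame:

# Sketch of apply with a lambda, apply with a named function, and broadcasting/masking.
import pandas as pd

sales = pd.DataFrame({'price': [1000, 2000, 3000], 'qty': [1, 3, 2]})

# apply + lambda on one column
sales['price_krw'] = sales['price'].apply(lambda x: f'{x:,} KRW')

# apply with a function defined via def, row by row
def total(row):
    return row['price'] * row['qty']

sales['total'] = sales.apply(total, axis=1)

# broadcasting + masking: scale every total, then flag the large orders
sales['total_k'] = sales['total'] / 1000
sales['big_order'] = sales['total'] > 3000
print(sales)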

Python Pandas

What it is …

  • A Python package for handling data in table format
  • This package is good for data analysis.
  • Used mostly by data scientists

Importing this package

  • Command: import pandas as pd

DataFrame

  • one of the data types in Python
  • table format
  • provides various statistical and visualization functions
  • used to read and save data as files — supports various file formats (e.g. csv, xlsx …)
    df = pd.read_csv('./[file_name]', encoding = 'utf8')

    df.to_csv('./[file_name]', encoding = 'utf8', index = False)
    If you are using a MacBook, you do not have to specify 'utf8' when encoding the files.

Indexing, Slicing data

  • This works the same as for the list type and NumPy.
  • iloc and loc are added for the DataFrame type.
  • When you select a column or a row, you can see that the vector-like data comes back as the Series type.
  • You can also look up data based on columns (or a single column).
  • Fancy indexing can be applied here too!! (a small sketch follows below)
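A small sketch of iloc, loc, Series vs DataFrame selection, and fancy indexing on a made-up DataFrame:

# Sketch of iloc (position-based) vs loc (label-based) indexing,
# and how single row/column selections come back as a Series.
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]}, index=['x', 'y', 'z'])

print(df.iloc[0])         # first row by position -> a Series
print(df.loc['y', 'b'])   # row 'y', column 'b' by label -> 2
print(df['a'])            # one column -> a Series
print(df[['a']])          # list of columns -> still a DataFrame
print(type(df.loc['x']))  # <class 'pandas.core.series.Series'>
print(df.loc[df['a'] > 10, ['a', 'b']])  # fancy indexing (masking) works here too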

To be continued…