Crawling without BeautifulSoup

Crawling pt.3

Today's GitHub blog post is about crawling without the BeautifulSoup module. I usually rely on BeautifulSoup when crawling. Unfortunately, many websites restrict crawlers for security reasons, so a plain request is often blocked. To get around the block, use the technique for crawling dynamic web pages: send the request with the same referer and user agent that a browser sends.

  • Press F12 to open the browser's developer tools, then find the referer address and the user agent of the request in the Network panel.

  • Write the Python crawling code as follows:

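import json
import requests

# Address of the request that returns the data
url = '[web address to crawl]'

# Request headers copied from the Network panel
info = {
    'referer': '[main webpage address]',
    'user-agent': '[user agent on the network panel of the developer webpage]'
}
response = requests.get(url, headers=info)
# response.text  # raw response body

# The body is JSON, so parse it into Python objects
data = json.loads(response.text)
data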

With these headers, you can access a dynamic web page that would otherwise be blocked by JavaScript (JS).
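
The steps above can also be wrapped in a small function that fails loudly when the site still blocks the request. This is only a minimal sketch, not the one right way to do it: the name fetch_json, the get parameter (used to swap in a stub for testing), the 10-second timeout, and the two error checks are my own assumptions, not part of the original code.

```python
import json


def fetch_json(url, referer, user_agent, get=None, timeout=10):
    """Request `url` with browser-like headers and return the parsed JSON.

    `get` is a hypothetical hook: any callable with the same call shape as
    requests.get. When it is omitted, requests.get is used.
    """
    if get is None:
        # Imported here so the function can also be run with a stub `get`.
        import requests
        get = requests.get

    # Same two headers as in the code above, copied from the Network panel.
    headers = {"referer": referer, "user-agent": user_agent}
    response = get(url, headers=headers, timeout=timeout)

    # Assumption: anything other than 200 is treated as a blocked request.
    if response.status_code != 200:
        raise RuntimeError(f"request failed with status {response.status_code}")

    try:
        return json.loads(response.text)
    except json.JSONDecodeError as exc:
        # The server answered, but with something other than JSON.
        raise ValueError("response body is not JSON") from exc
```

Call it with the same values as before, for example fetch_json('[web address to crawl]', '[main webpage address]', '[user agent]'). If the call raises RuntimeError, the headers were probably not enough; if it raises ValueError, the address probably returns a page rather than data.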