Web Scraping
Analyzing Example Code
Static Pages
An example of scraping Bilibili keyword search results.
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent so the request is not rejected as a bot.
ua = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
r = requests.get('https://search.bilibili.com/all?keyword=python', headers=ua)
html = BeautifulSoup(r.text, 'html.parser')
# Each search hit is rendered as an <li class="video-item matrix"> element.
video_list = html.select('li.video-item.matrix')
result = []
for video in video_list:
    video_info = {}
    # Title and link come from the headline anchor.
    url_element = video.select('div.info > div.headline.clearfix > a')
    video_info['title'] = url_element[0].text
    video_info['url'] = url_element[0]['href']
    # Play count, danmaku count, and upload date sit in the tag spans.
    play_count_element = video.select('div.info > div.tags > span.watch-num')
    video_info['play_count'] = play_count_element[0].text.strip()
    danmu_element = video.select('div.info > div.tags > span.hide')
    video_info['danmu_count'] = danmu_element[0].text.strip()
    upload_time_element = video.select('div.info > div.tags > span.time')
    video_info['upload_date'] = upload_time_element[0].text.strip()
    # Uploader name and profile link.
    up_url_element = video.select('div.info > div.tags > span > a.up-name')
    video_info['author'] = up_url_element[0].text
    video_info['author_url'] = up_url_element[0]['href']
    result.append(video_info)
print(result)
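Indexing [0] into every select() result raises an IndexError as soon as one expected element is missing, for example when an ad card appears or the page layout changes. A minimal defensive variant of the loop body, reusing the selectors above as assumptions, skips items that fail to match:

# Hypothetical defensive rewrite of the loop body above;
# the selectors are assumptions carried over from the original example.
for video in video_list:
    link = video.select_one('div.info > div.headline.clearfix > a')
    if link is None:
        continue  # layout changed or non-video card: skip it
    result.append({'title': link.text, 'url': link['href']})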
JavaScript Rendering
An example of scraping NetEase Cloud Music playlists.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://music.163.com/#/discover/playlist')
# The playlist grid is rendered inside an iframe, so switch into it first.
driver.switch_to.frame("contentFrame")
result = []
try:
    # Wait up to 15 seconds for JavaScript to render the playlist container.
    ul = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, "m-pl-container")))
    li_list = ul.find_elements(By.CSS_SELECTOR, 'li')
    for li in li_list:
        song_list = {}
        # The cover-mask anchor carries both the playlist title and its URL.
        a = li.find_elements(By.CSS_SELECTOR, 'div.u-cover.u-cover-1 > a.msk')
        song_list['title'] = a[0].get_attribute('title')
        song_list['url'] = a[0].get_attribute('href')
        result.append(song_list)
finally:
    # Always close the browser, even if the wait times out.
    driver.quit()
print(result)
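When no display is available, for example on a server or in CI, the same flow works with headless Chrome. A minimal sketch, assuming a recent Chrome and a matching chromedriver; the rest of the example above stays unchanged:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')              # run without opening a window
options.add_argument('--window-size=1280,800')  # give pages a normal viewport
driver = webdriver.Chrome(options=options)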
Data APIs
An example of scraping the essence (top-voted) answers of a Zhihu topic.
import requests
import json

ua = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
# Call the JSON API that the page itself uses, instead of parsing HTML.
r = requests.get(
    'https://www.zhihu.com/api/v4/topics/19551137/feeds/essence?limit=10&offset=0',
    headers=ua)
data = json.loads(r.text)
print(data)
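Printing the raw JSON is only a first step. The sketch below collects answers across a few pages; the field names (data, paging, is_end, next) are assumptions based on what this endpoint typically returns, so inspect the real payload before relying on them:

# Continues the example above (requests and ua already defined).
# Field names 'data', 'paging', 'is_end', 'next' are assumptions.
url = ('https://www.zhihu.com/api/v4/topics/19551137/feeds/essence'
       '?limit=10&offset=0')
answers = []
for _ in range(3):  # cap at 3 pages to stay polite
    page = requests.get(url, headers=ua).json()
    answers.extend(page.get('data', []))   # assumed list of answer items
    paging = page.get('paging', {})
    if paging.get('is_end') or not paging.get('next'):
        break
    url = paging['next']                   # assumed next-page URL
print(len(answers))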
Project 2 - Web Scraping
Scrape a site you are interested in (preferably with Scrapy) and save the data to a CSV file; a starter sketch follows below.
Assignment link: https://classroom.github.com/a/P4PAr-MC
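As a starting point, here is a minimal Scrapy spider sketch; the target site (the quotes.toscrape.com practice sandbox) and the CSS selectors are placeholders for whatever site you choose. Each yielded dict becomes one row in the exported CSV:

# quotes_spider.py -- a hypothetical minimal spider.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the "next page" link until there is none.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with "scrapy runspider quotes_spider.py -o result.csv" to produce the CSV file the assignment asks for.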