Crawler approach analysis
Target URL: http://www.kugou.com/yy/rank/home/1-8888.html
The web version of Kugou does not let you click through to the next page manually, but compare the URLs of the first two pages:
http://www.kugou.com/yy/rank/home/1-8888.html
http://www.kugou.com/yy/rank/home/2-8888.html
Changing the 1 to a 2 returns exactly the second page, so scraping the TOP500 songs only requires changing the number after home/.
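The page URLs can therefore be generated directly from this pattern. A minimal sketch (23 pages, matching the range used in the complete code below, which covers the 500 songs):

import pprint

# Only the number before "-8888" changes from page to page.
urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i) for i in range(1, 24)]
print(urls[0])  # http://www.kugou.com/yy/rank/home/1-8888.html
print(urls[1])  # http://www.kugou.com/yy/rank/home/2-8888.html
pprint.pprint(len(urls))  # 23 pages in total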
Installing the third-party libraries
The environment used here is PyCharm.
- Open File -> Default Settings.
- As shown in the figure, open Project Interpreter, select the project, then click the plus sign and add the beautifulsoup4, lxml, and requests libraries (a quick import check follows this list).
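The same packages can also be installed from the command line with pip (pip install beautifulsoup4 lxml requests). A quick way to confirm the installation succeeded is to import the three libraries and print their versions; this sketch assumes nothing beyond the packages listed above:

import requests
import bs4
from lxml import etree

# If any import fails, the package is not installed in the interpreter
# that PyCharm is using for this project.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("lxml", etree.__version__)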
Getting the request headers
A crawler sometimes needs to send request headers to disguise itself as a browser and fetch data more reliably. Press F12 in the browser to open the developer tools, refresh the page, find the User-Agent value, and copy it, as shown in the figure.
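The copied string goes into a dictionary that is passed to requests through the headers argument, exactly as the complete code below does. A minimal usage sketch:

import requests

# The User-Agent copied from the developer tools identifies us as a browser.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
r = requests.get("http://www.kugou.com/yy/rank/home/1-8888.html", headers=headers)
print(r.status_code)  # 200 means the page was returned normally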
Complete code
import requests
from bs4 import BeautifulSoup
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}

# Fetch one chart page and print rank, singer, song and duration for each entry
def get_info(url):
    r = requests.get(url, headers=headers)
    print(r.status_code)
    soup = BeautifulSoup(r.text, "lxml")
    ranks = soup.select("span.pc_temp_num")      # chart positions
    # for rank in ranks:
    #     print(rank.text.strip())
    titles = soup.select("a.pc_temp_songname")   # "singer - song" strings
    # for title in titles:
    #     singer = title.text.split("-")[0].strip()
    #     song = title.text.split("-")[1].strip()
    #     print(singer, song)
    times = soup.select("span.pc_temp_time")     # durations
    # for t in times:
    #     print(t.text.strip())
    # song_time is used as the loop variable so the time module is not shadowed
    for rank, title, song_time in zip(ranks, titles, times):
        data = {
            'rank': rank.text.strip(),
            'singer': title.text.split("-")[0].strip(),
            'song': title.text.split("-")[1].strip(),
            'time': song_time.text.strip()
        }
        print(data)

if __name__ == "__main__":
    # Build the request URLs for all pages; 23 pages cover the TOP500 chart
    urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i) for i in range(1, 24)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause between requests to avoid hammering the server
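The script above only prints each record. As an optional extension (not part of the original code), the same parsing loop could collect the dictionaries and write them to a CSV file; the following is a sketch under that assumption, with kugou_top500.csv as a hypothetical output filename:

import csv
import time
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}

def get_page_data(url):
    # Same selectors as get_info above, but the records are returned instead of printed.
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "lxml")
    rows = zip(soup.select("span.pc_temp_num"),
               soup.select("a.pc_temp_songname"),
               soup.select("span.pc_temp_time"))
    return [{'rank': r.text.strip(),
             'singer': t.text.split("-")[0].strip(),
             'song': t.text.split("-")[1].strip(),
             'time': d.text.strip()} for r, t, d in rows]

if __name__ == "__main__":
    records = []
    for i in range(1, 24):
        records += get_page_data("http://www.kugou.com/yy/rank/home/{}-8888.html".format(i))
        time.sleep(1)
    # Write all collected records to a CSV file with one row per song.
    with open("kugou_top500.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=['rank', 'singer', 'song', 'time'])
        writer.writeheader()
        writer.writerows(records)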