
Scraping Kugou TOP500 Data with Python

2019-01-31


Scraping Approach

Target URL: http://www.kugou.com/yy/rank/home/1-8888.html

The web version of Kugou doesn't let you page through the chart manually, but compare the URLs of the first two pages:

http://www.kugou.com/yy/rank/home/1-8888.html
http://www.kugou.com/yy/rank/home/2-8888.html

Changing the 1 to a 2 returns exactly the second page of results, so scraping all of the TOP500 songs only requires changing the number after home/.
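
Based on this pattern, all of the page URLs can be generated up front. A minimal sketch (the full code below uses the same line; range(1, 24) covers the chart's 23 pages):

urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i) for i in range(1, 24)]
print(urls[0])  # http://www.kugou.com/yy/rank/home/1-8888.html
print(urls[1])  # http://www.kugou.com/yy/rank/home/2-8888.html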


Installing Third-Party Libraries

The development environment here is PyCharm.

    1. Open File -> Default Settings.

[Screenshot: k1aWtg.png]

    2. As shown below, open Project Interpreter, select the project, then click the plus sign and add the beautifulsoup4, lxml, and requests libraries (equivalently, run pip install beautifulsoup4 lxml requests from a terminal).

[Screenshot: k1a4pj.png]
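
Once the installation finishes, a quick way to confirm all three libraries are available is simply to import them:

import requests, bs4, lxml  # an ImportError here means the corresponding library is missing
print("all three libraries installed")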


Obtaining the Request Headers

A scraper sometimes needs to send request headers to disguise itself as a browser and fetch data more reliably. Press F12 in the browser to open the developer tools, refresh the page, then find and copy the User-Agent value, as shown below.

[Screenshot: k1wfQs.png]
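
As a quick sanity check that the copied User-Agent works, send a single request with it and inspect the status code; a minimal sketch using the first chart page:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
r = requests.get("http://www.kugou.com/yy/rank/home/1-8888.html", headers=headers)
print(r.status_code)  # 200 means the server returned the page normally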


Complete Code

import requests
from bs4 import BeautifulSoup
import time

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}


# Fetch one chart page and print each song's rank, singer, name, and duration
def get_info(url):
    r = requests.get(url, headers=headers)
    print(r.status_code)  # 200 means the page was fetched successfully
    soup = BeautifulSoup(r.text, "lxml")

    ranks = soup.select("span.pc_temp_num")        # chart positions
    titles = soup.select("a.pc_temp_songname")     # "singer - song" strings
    song_times = soup.select("span.pc_temp_time")  # track durations

    # song_time avoids shadowing the imported time module
    for rank, title, song_time in zip(ranks, titles, song_times):
        # split on the first "-" only, in case the song name itself contains a hyphen
        singer, song = title.text.split("-", 1)
        data = {
            'rank': rank.text.strip(),
            'singer': singer.strip(),
            'song': song.strip(),
            'time': song_time.text.strip()
        }
        print(data)


if __name__ == "__main__":
    urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i) for i in range(1, 24)]  # build the request URL for every chart page
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause between pages to keep the request rate polite
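
A few notes on the design: zip() walks the three parallel tag lists in step, so each printed dict describes a single song; range(1, 24) requests the 23 pages needed to cover all 500 entries; and time.sleep(1) pauses for a second between pages so the scraper doesn't hammer the server.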

[Screenshot of the script's output: k10CFO.png]

