Crawler approach analysis
Target URL: http://www.kugou.com/yy/rank/home/1-8888.html
The web version of Kugou does not let you page through the chart manually, but compare the URLs of the first two pages:
http://www.kugou.com/yy/rank/home/1-8888.html
http://www.kugou.com/yy/rank/home/2-8888.html
Changing the 1 to a 2 returns exactly the second page. So crawling the TOP500 only requires changing the number after home/; each page shows 22 songs, so pages 1 through 23 cover all 500 entries.
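As a quick sanity check before writing the full crawler, a minimal sketch like the following (the page number is chosen arbitrarily) requests page 2 and prints the HTTP status code:

import requests

# Sanity-check the pagination pattern: request page 2 directly.
# A 200 status code suggests the home/{page}-8888 pattern works.
r = requests.get("http://www.kugou.com/yy/rank/home/2-8888.html")
print(r.status_code)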
Installing third-party libraries
The environment used here is PyCharm.
- Open File -> Default Settings.
- As shown in the figure, open Project Interpreter, select the project, then click the plus sign and add the beautifulsoup4, lxml, and requests libraries.
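Outside PyCharm, the same libraries can also be installed from the command line with pip: pip install beautifulsoup4 lxml requests.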
Obtaining request headers
A crawler sometimes needs request headers to disguise itself as a browser, which makes it easier to fetch data reliably. Press F12 in the browser to open the developer tools, refresh the page, then find and copy the User-Agent value, as shown in the figure.
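A minimal sketch of how the copied value is used: it goes into a headers dict that requests sends along with the request (the User-Agent string below is only an example; use the one copied from your own browser):

import requests

# The copied User-Agent goes into a headers dict passed to requests.get.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'}  # example value
r = requests.get("http://www.kugou.com/yy/rank/home/1-8888.html", headers=headers)
print(r.request.headers['User-Agent'])  # confirm the header that was actually sent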
Complete code
import requests
from bs4 import BeautifulSoup
import time
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
# fetch one page of the chart and print rank / singer / song / duration
def get_info(url):
    r = requests.get(url, headers=headers)
    print(r.status_code)  # 200 indicates the page was fetched successfully
    soup = BeautifulSoup(r.text, "lxml")
    ranks = soup.select("span.pc_temp_num")      # chart positions
    titles = soup.select("a.pc_temp_songname")   # "singer - song" text
    times = soup.select("span.pc_temp_time")     # song durations
    for rank, title, song_time in zip(ranks, titles, times):  # song_time avoids shadowing the time module
        singer, _, song = title.text.partition("-")  # partition tolerates titles without a "-"
        data = {
            'rank': rank.text.strip(),
            'singer': singer.strip(),
            'song': song.strip(),
            'time': song_time.text.strip()
        }
        print(data)
if __name__ == "__main__":
    urls = ["http://www.kugou.com/yy/rank/home/{}-8888.html".format(i) for i in range(1, 24)]  # build the 23 page URLs
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause between requests to avoid hammering the server
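When run, the script prints each page's status code (200 on success) followed by one dict per song, something like {'rank': '1', 'singer': '...', 'song': '...', 'time': '4:02'} (values here are illustrative).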