Extracting the 《妖神记》 Audio from Ximalaya (喜马拉雅) with Python


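Listening to a book is more convenient than reading it, and Ximalaya happens to host an audio version of 《妖神记》, so let's download all of its tracks to the local disk in one go.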

1. Building the page URLs

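Starting from the album's home page, "http://www.ximalaya.com/6974731/album/3268363/", we can see that at the time of writing the album has 248 tracks spread over 3 pages.
We first define a function that builds the URLs of the 3 pages. The code is as follows: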

def get_urls():
    # The album has 3 pages, numbered 1 to 3
    for i in range(1, 4):
        url = 'http://www.ximalaya.com/6974731/album/3268363?page={}'.format(i)
        get_soundids(url)  # defined in step 2
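
As a variation on the loop above, here is a minimal sketch that returns the page URLs as a list, so they can be inspected before any request is sent. The helper name `build_page_urls` and its parameters are hypothetical and not part of the original script; the page count of 3 is the one observed on the album page.

```python
def build_page_urls(album_url, page_count):
    """Return the URL of every page of an album; pages are numbered from 1."""
    return ['{}?page={}'.format(album_url, page)
            for page in range(1, page_count + 1)]


# Album address from the article, without the trailing slash
urls = build_page_urls('http://www.ximalaya.com/6974731/album/3268363', 3)
for url in urls:
    print(url)
```

Keeping the page count as a parameter means only one argument has to change if the album grows to more pages.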

2. Extracting the sound_id

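Next, for each page we need to extract the sound_id of every track. Why the sound_id is needed is explained in step 3.
Pressing F12 shows that the page stores the track ids and related information in a list, with roughly the structure div > ul > li. Right-clicking the line that contains sound_id and choosing 'Copy -> Copy selector' gives the selector we need: '#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body > div.album_soundlist > ul > li:nth-child(1)'. Change it to '#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body > div.album_soundlist > ul > li' so that it matches every list item rather than only the first one, then use it in the Python code to locate each item and read its sound_id. The code is as follows: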

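def get_soundids(url):
    res = requests.get(url, headers=headers).text
    soup = BeautifulSoup(res, "lxml")
    # One <li> per track; each carries its id in the sound_id attribute
    sounds = soup.select('#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body'
                         ' > div.album_soundlist > ul > li')
    for soundinfo in sounds:
        soundid = soundinfo.get('sound_id')
        if soundid:  # skip any item that lacks the attribute
            get_soundlink(soundid)  # defined in step 3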
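
To check the attribute lookup without touching the network, the sketch below runs the same idea on a small HTML fragment. Two things here are assumptions: the fragment is a made-up imitation of the album_soundlist markup (the second id is invented), and the parsing uses the standard library's `html.parser` instead of BeautifulSoup so that the example has no third-party dependencies. The names `SoundIdParser` and `extract_sound_ids` are hypothetical.

```python
from html.parser import HTMLParser


class SoundIdParser(HTMLParser):
    """Collect the sound_id attribute of every <li> that carries one."""

    def __init__(self):
        super().__init__()
        self.sound_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'li':
            sound_id = dict(attrs).get('sound_id')
            if sound_id:
                self.sound_ids.append(sound_id)


# Hypothetical fragment imitating the track list; not copied from the site
SAMPLE_HTML = '''
<div class="album_soundlist">
  <ul>
    <li sound_id="10577888"><a href="#">track 1</a></li>
    <li sound_id="10577889"><a href="#">track 2</a></li>
    <li class="other">no id here</li>
  </ul>
</div>
'''


def extract_sound_ids(html):
    """Return the sound_id values found in the given HTML, in document order."""
    parser = SoundIdParser()
    parser.feed(html)
    return parser.sound_ids


sound_ids = extract_sound_ids(SAMPLE_HTML)
print(sound_ids)
```

Unlike the CSS selector used in the article, this sketch matches every `li` in the fragment regardless of its ancestors, which is enough for a quick offline check but looser than the real selector.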

3. Extracting the real address and downloading

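With the sound_id in hand, we use it to obtain the real address of the audio file. Why does the sound_id lead to the audio information? In the F12 developer tools, open the Network tab and select XHR, then click any track to play it. A file such as 10577888.json appears in the XHR list, and its name is exactly the sound_id of that track. Select it and click Headers to see the request address:
Request URL:
http://www.ximalaya.com/tracks/10577888.json
The Response tab shows the following JSON data: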

{
    "id":10577888,
    "play_path_64":"http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
    "play_path_32":"http://audio.xmcdn.com/group11/M00/D0/69/wKgDbVZk-XjgcN2kADqIlKzzbF0371.m4a",
    "play_path":"http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
    "duration":1242,
    "title":"\u5996\u795e\u8bb0001\uff08\u65b0\u4e66\u4e0a\u7ebf\uff01\uff01\u8bf7\u5404\u4f4d\u7ec6\u7ec6\u54c1\u5473\uff09",
    "nickname":"\u76d7\u8f9b",
    "uid":6974731,
    "waveform":"group16/M0B/C4/65/wKgDalZk-aLRh_1WAAAKuOUsL3k4974.js",
    "upload_id":"u_9339247",
    "cover_url":"http://fdfs.xmcdn.com/group13/M00/C5/10/wKgDXVZk4aCi7v6HAASg3eWo0_I829.jpg",
    "cover_url_142":"http://fdfs.xmcdn.com/group13/M00/C5/10/wKgDXVZk4aCi7v6HAASg3eWo0_I829_web_large.jpg",
    "formatted_created_at":"12\u67087\u65e5 11:14",
    "is_favorited":null,
    "play_count":347395,
    "comments_count":619,
    "shares_count":2,
    "favorites_count":1327,
    "album_id":3268363,
    "album_title":"\u5996\u795e\u8bb0",
    "intro":null,
    "have_more_intro":false,
    "time_until_now":"2\u5e74\u524d",
    "category_name":"book",
    "category_title":"\u6709\u58f0\u4e66",
    "played_secs":null,
    "is_paid":false,
    "is_free":null,
    "price":null,
    "discounted_price":null
}

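This JSON data contains the real address of the track: it is the value of play_path_64. So we write the following code to build the Request URL, extract the audio address from the returned JSON, and download the file from that address to the local disk.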

def get_soundlink(soundid):
    soundreq = 'http://www.ximalaya.com/tracks/{}.json'.format(soundid)
    res = requests.get(soundreq, headers=headers).text
    dic = json.loads(res)
    # Keep only the "妖神记NNN" prefix of the title as the file name
    title = re.match(r'妖神记\d*', dic.get('title')).group()
    m4aurl = dic.get('play_path_64')
    print(m4aurl, title)
    f = requests.get(m4aurl)
    # os.path.join keeps the path portable; the "妖神记" directory must already exist
    with open(os.path.join("妖神记", "{}.m4a".format(title)), "wb") as code:
        code.write(f.content)
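
The parsing step can be tried on its own, without downloading anything. The sketch below feeds a trimmed copy of the response shown above to a hypothetical helper, `parse_track`, which returns the file name and the audio address. The two fallbacks (using play_path when play_path_64 is missing, and using the track id when the title does not start with 妖神记) are assumptions added for robustness; the original function has neither.

```python
import json
import re

# Trimmed copy of the response shown above; only the fields used here
SAMPLE_RESPONSE = r'''{
    "id": 10577888,
    "play_path_64": "http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
    "play_path": "http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
    "title": "\u5996\u795e\u8bb0001\uff08\u65b0\u4e66\u4e0a\u7ebf\uff01\uff01\u8bf7\u5404\u4f4d\u7ec6\u7ec6\u54c1\u5473\uff09"
}'''


def parse_track(json_text):
    """Return (file name without extension, audio URL) for one track."""
    track = json.loads(json_text)
    # Assumption: fall back to play_path if play_path_64 is absent or null
    url = track.get('play_path_64') or track.get('play_path')
    title = track.get('title') or ''
    match = re.match(r'妖神记\d*', title)
    # Assumption: fall back to the track id if the title has another form
    name = match.group() if match else str(track.get('id'))
    return name, url


name, url = parse_track(SAMPLE_RESPONSE)
print(name, url)
```

Guarding the `re.match` result matters because `.group()` raises AttributeError on `None`, which would stop the whole download loop at the first track whose title is formatted differently.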

4. Complete code

The complete code is as follows:


# Extract the 《妖神记》 audio from Ximalaya with Python
import requests
import re
import json
import os
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/59.0.3071.112 Safari/537.36 Vivaldi/1.91.867.48'
}


def get_soundlink(soundid):
    soundreq = 'http://www.ximalaya.com/tracks/{}.json'.format(soundid)
    res = requests.get(soundreq, headers=headers).text
    dic = json.loads(res)
    # Keep only the "妖神记NNN" prefix of the title as the file name
    title = re.match(r'妖神记\d*', dic.get('title')).group()
    m4aurl = dic.get('play_path_64')
    print(m4aurl, title)
    f = requests.get(m4aurl)
    # os.path.join keeps the path portable across operating systems
    with open(os.path.join("妖神记", "{}.m4a".format(title)), "wb") as code:
        code.write(f.content)


def get_soundids(url):
    print(url)
    res = requests.get(url, headers=headers).text
    soup = BeautifulSoup(res, "lxml")
    # One <li> per track; each carries its id in the sound_id attribute
    sounds = soup.select('#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body'
                         ' > div.album_soundlist > ul > li')
    for soundinfo in sounds:
        soundid = soundinfo.get('sound_id')
        if soundid:  # skip any item that lacks the attribute
            get_soundlink(soundid)


def get_urls():
    # The album has 3 pages, numbered 1 to 3
    for i in range(1, 4):
        url = 'http://www.ximalaya.com/6974731/album/3268363?page={}'.format(i)
        get_soundids(url)


# Create the output directory on first run, then start the download
if not os.path.exists("妖神记"):
    os.mkdir("妖神记")
get_urls()