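Extracting the 《妖神记》 Audio from Ximalaya with Python
Listening to a book is more convenient than reading it, and Ximalaya happens to host the audio of 《妖神记》, so let's download all of it to local disk in one go.
1. Build the page URLs
Starting from the album's home page, "http://www.ximalaya.com/6974731/album/3268363/", we can see that the album currently has 248 audio tracks spread over 3 pages.
We first define a function that builds the URLs of the 3 pages. The code is as follows: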
def get_urls():
    for i in range(1, 4):
        url = 'http://www.ximalaya.com/6974731/album/3268363?page={}'.format(i)
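The page count is hardcoded to 3 in the function above. As a minimal sketch, assuming the album keeps the same `?page=N` URL pattern, the page URLs could also come from a generator that takes the page count as a parameter. The names `ALBUM_URL`, `build_page_urls` and `pages` are illustrative and not part of the original script.

```python
ALBUM_URL = 'http://www.ximalaya.com/6974731/album/3268363'


def build_page_urls(pages=3, base=ALBUM_URL):
    """Yield the URL of each album page, numbered from 1 to `pages`."""
    for i in range(1, pages + 1):
        yield '{}?page={}'.format(base, i)


if __name__ == '__main__':
    for url in build_page_urls():
        print(url)
```

Returning the URLs instead of consuming them inside the loop lets the caller decide what to do with each page, for example passing it to get_soundids.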
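2. Extract sound_id
Next, for each page we need to extract the sound_id of every track. Why we need the sound_id is explained in section 3.
Pressing F12 shows that the page stores the track IDs and related information in a list, nested roughly as div > ul > li. Right-click the line containing sound_id and choose 'Copy -> Copy selector' to copy the selector path we need: '#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body > div.album_soundlist > ul > li:nth-child(1)'. Change it to '#mainbox > div.mainbox_wrapper > div.mainbox_left > div > div.personal_body > div.album_soundlist > ul > li' so that it matches every list item rather than only the first, then use it in the Python code to locate each list item and read its sound_id. The code is as follows: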
def get_soundids(url):
    res = requests.get(url, headers=headers).text
    soup = BeautifulSoup(res, "lxml")
    sounds = soup.select('#mainbox > div.mainbox_wrapper > div.mainbox_left > div > '
                         'div.personal_body > div.album_soundlist > ul > li')
    for soundinfo in sounds:
        soundid = soundinfo.get('sound_id')
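To see what reading the sound_id attribute amounts to without hitting the site, here is a small sketch that runs on an inline HTML fragment. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it is not the method used above, and it collects the attribute from every `<li>` instead of applying the full selector. `SAMPLE_HTML`, the second ID in it, `SoundIdParser` and `extract_sound_ids` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for the album page markup.
SAMPLE_HTML = """
<div class="album_soundlist">
  <ul>
    <li sound_id="10577888"><a>track 1</a></li>
    <li sound_id="10577889"><a>track 2</a></li>
    <li class="no_id">separator</li>
  </ul>
</div>
"""


class SoundIdParser(HTMLParser):
    """Collect the sound_id attribute of every <li> that has one."""

    def __init__(self):
        super().__init__()
        self.sound_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'li':
            return
        sound_id = dict(attrs).get('sound_id')
        if sound_id:
            self.sound_ids.append(sound_id)


def extract_sound_ids(html):
    """Return the sound_id values found in `html`, in document order."""
    parser = SoundIdParser()
    parser.feed(html)
    return parser.sound_ids


if __name__ == '__main__':
    print(extract_sound_ids(SAMPLE_HTML))
```

Skipping list items without a sound_id avoids passing None on to the download step.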
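3. Extract the real address and download
Once we have the sound_id, we use it to obtain the real address of the audio file. Why does the sound_id give us this information? In the F12 developer tools, open the Network tab and select XHR. Then click any track to play it: a file such as 10577888.json appears in the XHR list, and its name is exactly the corresponding sound_id. Select it and click Headers to see the request address:
Request URL:
http://www.ximalaya.com/tracks/10577888.json
The Response tab shows the following JSON data: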
{
"id":10577888,
"play_path_64":"http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
"play_path_32":"http://audio.xmcdn.com/group11/M00/D0/69/wKgDbVZk-XjgcN2kADqIlKzzbF0371.m4a",
"play_path":"http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
"duration":1242,
"title":"\u5996\u795e\u8bb0001\uff08\u65b0\u4e66\u4e0a\u7ebf\uff01\uff01\u8bf7\u5404\u4f4d\u7ec6\u7ec6\u54c1\u5473\uff09",
"nickname":"\u76d7\u8f9b",
"uid":6974731,
"waveform":"group16/M0B/C4/65/wKgDalZk-aLRh_1WAAAKuOUsL3k4974.js",
"upload_id":"u_9339247",
"cover_url":"http://fdfs.xmcdn.com/group13/M00/C5/10/wKgDXVZk4aCi7v6HAASg3eWo0_I829.jpg",
"cover_url_142":"http://fdfs.xmcdn.com/group13/M00/C5/10/wKgDXVZk4aCi7v6HAASg3eWo0_I829_web_large.jpg",
"formatted_created_at":"12\u67087\u65e5 11:14",
"is_favorited":null,
"play_count":347395,
"comments_count":619,
"shares_count":2,
"favorites_count":1327,
"album_id":3268363,
"album_title":"\u5996\u795e\u8bb0",
"intro":null,
"have_more_intro":false,
"time_until_now":"2\u5e74\u524d",
"category_name":"book",
"category_title":"\u6709\u58f0\u4e66",
"played_secs":null,
"is_paid":false,
"is_free":null,
"price":null,
"discounted_price":null
}
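This JSON data contains the real address of the audio: the value of play_path_64 is the address of the audio file itself. So we write the following code to build the Request URL, extract the audio address from the returned JSON, and download the file from that address to local disk.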
def get_soundlink(soundid):
    soundreq = 'http://www.ximalaya.com/tracks/{}.json'.format(soundid)
    res = requests.get(soundreq, headers=headers).text
    dic = json.loads(res)
    # Keep only the leading "妖神记NNN" part of the title as the file name.
    title = re.match(r'妖神记\d*', dic.get('title')).group()
    m4aurl = dic.get('play_path_64')
    print(m4aurl, title)
    f = requests.get(m4aurl)
    # The "妖神记" directory must already exist; os.path.join keeps the path portable.
    with open(os.path.join("妖神记", "{}.m4a".format(title)), "wb") as code:
        code.write(f.content)
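The parsing step can be tried offline. The sketch below runs on an abbreviated copy of the response shown earlier and returns the play_path_64 address together with the shortened title. `SAMPLE_RESPONSE` and `parse_track` are illustrative names, and the fallback to the track id when the title does not start with 妖神记 is an added assumption, not something the original function does.

```python
import json
import re

# Abbreviated copy of the tracks/10577888.json response shown earlier.
SAMPLE_RESPONSE = r'''{
  "id": 10577888,
  "play_path_64": "http://audio.xmcdn.com/group16/M0B/C4/46/wKgDbFZk-YCh3zDAAJlXf_LYS0A511.m4a",
  "title": "\u5996\u795e\u8bb0001\uff08\u65b0\u4e66\u4e0a\u7ebf\uff01\uff01\u8bf7\u5404\u4f4d\u7ec6\u7ec6\u54c1\u5473\uff09"
}'''


def parse_track(text):
    """Return (audio_url, short_title) from a tracks/<sound_id>.json body."""
    data = json.loads(text)
    full_title = data.get('title') or ''
    match = re.match(r'妖神记\d*', full_title)
    # Assumption: fall back to the track id when the title has another prefix.
    short_title = match.group() if match else str(data.get('id'))
    return data.get('play_path_64'), short_title


if __name__ == '__main__':
    print(parse_track(SAMPLE_RESPONSE))
```

Checking the match before calling group() avoids an AttributeError on tracks whose titles are formatted differently.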
4. Complete code
The complete code is as follows:
# Extract the 《妖神记》 audio from Ximalaya with Python
import requests
import re
import json
import os
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/59.0.3071.112 Safari/537.36 Vivaldi/1.91.867.48'
}


def get_soundlink(soundid):
    # Fetch the track's JSON description and download the file it points to.
    soundreq = 'http://www.ximalaya.com/tracks/{}.json'.format(soundid)
    res = requests.get(soundreq, headers=headers).text
    dic = json.loads(res)
    # Keep only the leading "妖神记NNN" part of the title as the file name.
    title = re.match(r'妖神记\d*', dic.get('title')).group()
    m4aurl = dic.get('play_path_64')
    print(m4aurl, title)
    f = requests.get(m4aurl)
    with open(os.path.join("妖神记", "{}.m4a".format(title)), "wb") as code:
        code.write(f.content)


def get_soundids(url):
    # Collect the sound_id of every track listed on one album page.
    print(url)
    res = requests.get(url, headers=headers).text
    soup = BeautifulSoup(res, "lxml")
    sounds = soup.select('#mainbox > div.mainbox_wrapper > div.mainbox_left > div > '
                         'div.personal_body > div.album_soundlist > ul > li')
    for soundinfo in sounds:
        soundid = soundinfo.get('sound_id')
        get_soundlink(soundid)


def get_urls():
    # The album has 3 pages, numbered from 1.
    for i in range(1, 4):
        url = 'http://www.ximalaya.com/6974731/album/3268363?page={}'.format(i)
        get_soundids(url)


if not os.path.exists("妖神记"):
    os.mkdir("妖神记")
get_urls()
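The script above reads each audio file fully into memory through f.content before writing it. As a hedged alternative, the file could be written chunk by chunk. The sketch below shows only the writing half, using a list of byte strings as a stand-in for the chunks; in the real script they would presumably come from `requests.get(m4aurl, stream=True).iter_content(chunk_size=...)`. `save_chunks` is a hypothetical helper, not part of the original code.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, 'wb') as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


if __name__ == '__main__':
    # Stand-in for a streamed response body: three small byte chunks.
    target = os.path.join(tempfile.mkdtemp(), 'demo.m4a')
    print(save_chunks([b'abc', b'', b'de'], target))
```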