2017-12-30 36 views

Extracting data from a script tag with BeautifulSoup in Python

I'm using BeautifulSoup in Python to extract the values of "SNG_TITLE" and "ART_NAME" from the code inside a "script" tag.

<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276641","UPLOAD_ID":0,"SNG_TITLE":"Heathens","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots","ART_PICTURE":"259dcf52853363d79753ec301377645d","SMARTRADIO":"1","RANK":"487762","LOCALES":[],"__TYPE__":"artist"}],"ALB_ID":"13371165","ALB_TITLE":"Heathens","TYPE":0,"MD5_ORIGIN":"5cea723b83af1ff0a62d65d334b978d4","VIDEO":false,"DURATION":"195","ALB_PICTURE":"3dfc8c9e406cf1bba8ce0695a44a9b7e","ART_PICTURE":"259dcf52853363d79753ec301377645d","RANK_SNG":"967143","SMARTRADIO":"1","FILESIZE_AAC_64":0,"FILESIZE_MP3_64":"0","FILESIZE_MP3_128":"3135946","FILESIZE_MP3_256":0,"FILESIZE_MP3_320":"7839868","FILESIZE_FLAC":"21777150","FILESIZE":"3135946","GAIN":"-12","MEDIA_VERSION":"4","DISK_NUMBER":"1","TRACK_NUMBER":"1","VERSION":"","EXPLICIT_LYRICS":"0","RIGHTS":{"STREAM_ADS_AVAILABLE":true,"STREAM_ADS":"2000-01-01","STREAM_SUB_AVAILABLE":true,"STREAM_SUB":"2000-01-01"},"ISRC":"USAT21601930","DATE_ADD":1497886149,"HIERARCHICAL_TITLE":"","SNG_CONTRIBUTORS":{"mainartist":["Twenty One Pilots"],"engineer":["Adam Hawkins"],"mixer":["Adam Hawkins"],"masterer":["Chris Gehringer"],"drums":["Josh Dun"],"producer":["Mike Elizondo","Tyler Joseph"],"programmer":["Mike Elizondo","Tyler Joseph"],"vocals":["Tyler Joseph"],"writer":["Tyler Joseph"]},"LYRICS_ID":30553991,"__TYPE__":"song"},{"SNG_ID":"99976952","PRODUCT_TRACK_ID":"171067651","UPLOAD_ID":0,"SNG_TITLE":"Stressed Out","ART_ID":"647650","PROVIDER_ID":"3","ART_NAME":"Twenty One Pilots","ARTISTS":[{"ART_ID":"647650","ROLE_ID":"0","ARTISTS_SONGS_ORDER":"1","ART_NAME":"Twenty One Pilots", ...</script> 

(The whole script is too long to paste.) The idea of the code is to print the user name plus every song and artist name that can be found on a given page.

import requests 
from bs4 import BeautifulSoup 

base_url = 'https://www.deezer.com/en/profile/1589856782/loved' 

r = requests.get(base_url) 

soup = BeautifulSoup(r.text, 'html.parser') 

user_name = soup.find(class_='user-name') 
print(user_name.text) 

This prints the user name.

for script in soup.find_all('script'): 
    print(script.contents) 

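If I understand correctly, the script I need holds a dictionary, so I only have to find it and read its contents. The problem is that I don't know how to locate that specific "script" element: it has no unique attribute or anything else to identify it. So I tried a loop that finds every script on the page and prints its contents, but I don't know how to go further from there.

How can I find only this particular "script" on the page? Is there another way to access the values?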

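Extract the script element that contains "window.__DZR_APP_STATE__"? – RussellB

Count the scripts in the page source: they don't change position, so you can use an index to get the right one. For example, the third script is 'soup.find_all('script')[2]' – furas

BTW: the script is an ordinary string, so you can check it with standard string functions, i.e. 'script.contents:' – furas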

Answers


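The scripts don't change position in the page source, so you can count them and use an index to get the right one.

A script's text is a standard string, so you can also check it with standard string functions, e.g.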

if '{"loved"' in script.text: 

Code with both methods. I use [:100] to show only part of the string.

import requests 
from bs4 import BeautifulSoup 

base_url = 'https://www.deezer.com/en/profile/1589856782/loved' 

r = requests.get(base_url) 

soup = BeautifulSoup(r.text, 'html.parser') 

all_scripts = soup.find_all('script') 

print('--- first method ---') 
print(all_scripts[6].text[:100]) 

print('--- second method ---') 
for number, script in enumerate(all_scripts): 
    if '{"loved"' in script.text: 
        print(number, script.text[:100])

Result:

--- first method --- 
window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276 
--- second method --- 
6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276 
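The index 6 only holds as long as the page keeps its scripts in the same order, so the second method can be wrapped in a small helper that returns the matching script and its position. This is only a sketch: the function name pick_script and the sample strings are made up for illustration, and the list stands in for the texts of soup.find_all('script') so the example runs without a network request.

```python
def pick_script(script_texts, marker="__DZR_APP_STATE__"):
    """Return (index, text) of the first script containing marker, else None."""
    for index, text in enumerate(script_texts):
        if marker in text:
            return index, text
    return None


# Hypothetical stand-in for: [s.get_text() for s in soup.find_all('script')]
sample_texts = [
    "var a = 1;",
    "",
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[]}}}',
]

found = pick_script(sample_texts)
print(found[0] if found else "no matching script")
```

Returning None instead of raising lets the caller decide what to do when the page layout changes and no script matches.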

EDIT: Once you have the right script, you can use slicing to get just the JSON string, convert it to a Python dictionary with the json module, and then read the data from it.

import requests 
from bs4 import BeautifulSoup 
import json 

base_url = 'https://www.deezer.com/en/profile/1589856782/loved' 

r = requests.get(base_url) 

soup = BeautifulSoup(r.text, 'html.parser') 

all_scripts = soup.find_all('script') 

data = json.loads(all_scripts[6].get_text()[27:]) 

print('key:', data.keys()) 
print('key:', data['TAB'].keys()) 
print('key:', data['DATA'].keys()) 
print('---') 

for item in data['TAB']['loved']['data']: 
    print('ART_NAME:', item['ART_NAME']) 
    print('SNG_TITLE:', item['SNG_TITLE']) 
    print('---') 

Result:

key: dict_keys(['TAB', 'DATA']) 
key: dict_keys(['loved']) 
key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING']) 
--- 
ART_NAME: Twenty One Pilots 
SNG_TITLE: Heathens 
--- 
ART_NAME: Twenty One Pilots 
SNG_TITLE: Stressed Out 
--- 
ART_NAME: Linkin Park 
SNG_TITLE: Numb 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Animal I Have Become 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Painkiller 
--- 
ART_NAME: Slipknot 
SNG_TITLE: Before I Forget 
--- 
ART_NAME: Slipknot 
SNG_TITLE: Duality 
--- 
ART_NAME: Skrillex 
SNG_TITLE: Make It Bun Dem 
--- 
ART_NAME: Skrillex 
SNG_TITLE: Bangarang (feat. Sirah) 
--- 
ART_NAME: Limp Bizkit 
SNG_TITLE: Break Stuff 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: I Hate Everything About You 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Time of Dying 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: I Am Machine 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Riot 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: So What 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Pain 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Tell Me Why 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Chalk Outline 
--- 
ART_NAME: Three Days Grace 
SNG_TITLE: Gone Forever 
--- 
ART_NAME: Slipknot 
SNG_TITLE: The Devil In I 
--- 
ART_NAME: Linkin Park 
SNG_TITLE: No More Sorrow 
--- 
ART_NAME: Linkin Park 
SNG_TITLE: Bleed It Out 
--- 
ART_NAME: The Doors 
SNG_TITLE: Roadhouse Blues 
--- 
ART_NAME: The Doors 
SNG_TITLE: Riders On The Storm 
--- 
ART_NAME: The Doors 
SNG_TITLE: Break On Through (To The Other Side) 
--- 
ART_NAME: The Doors 
SNG_TITLE: Alabama Song (Whisky Bar) 
--- 
ART_NAME: The Doors 
SNG_TITLE: People Are Strange 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Welcome to the Black Parade 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Teenagers 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na] 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Famous Last Words 
--- 
ART_NAME: The Doors 
SNG_TITLE: Soul Kitchen 
--- 
ART_NAME: The Black Keys 
SNG_TITLE: Lonely Boy 
--- 
ART_NAME: Katy Perry 
SNG_TITLE: I Kissed a Girl 
--- 
ART_NAME: Katy Perry 
SNG_TITLE: Hot N Cold 
--- 
ART_NAME: Katy Perry 
SNG_TITLE: E.T. 
--- 
ART_NAME: Linkin Park 
SNG_TITLE: Given Up 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Dead! 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: Mama 
--- 
ART_NAME: My Chemical Romance 
SNG_TITLE: The Sharpest Lives 
--- 
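The slice [27:] works because "window.__DZR_APP_STATE__ = " is exactly 27 characters long, so it breaks if that prefix ever changes. A possible alternative, sketched here under the assumption that the first "{" in the script starts the JSON object, is to let json.JSONDecoder.raw_decode parse from that position. The helper name parse_app_state and the shortened sample string are illustrative, not taken from the page.

```python
import json


def parse_app_state(script_text):
    """Parse the first JSON object found in script_text.

    Assumes the first '{' starts the object; raw_decode ignores
    anything that follows it, such as a trailing semicolon.
    """
    start = script_text.index("{")  # raises ValueError if there is no '{'
    state, _end = json.JSONDecoder().raw_decode(script_text, start)
    return state


# Shortened, hypothetical stand-in for the real script text
sample = (
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
    '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"}]}}};'
)

state = parse_app_state(sample)
print(state["TAB"]["loved"]["data"][0]["SNG_TITLE"])
```

With the real page this would be called as parse_app_state(all_scripts[6].get_text()) in place of the json.loads(...[27:]) line.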
If I understand correctly, you want to find only the script element that has "SNG_TITLE" in it.

You can use re to get only the script elements you are interested in, as follows:

import requests 
from bs4 import BeautifulSoup 
import re 

base_url = 'https://www.deezer.com/en/profile/1589856782/loved' 

r = requests.get(base_url) 

soup = BeautifulSoup(r.text, 'html.parser') 

user_name = soup.find(class_='user-name') 
print(user_name.text) 

for script in soup(text=re.compile(r'SNG_TITLE')): 
    print(script.parent) 

EDIT:

@furas's answer is the complete solution: it uses json to find 'SNG_TITLE' and 'ART_NAME'. My answer only helps you find the script that contains 'SNG_TITLE'. You can combine both to get better code.
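One way the two answers might be combined: the string matched by the re search is passed to a small function that parses the JSON and yields the artist and title of each track. This is a rough sketch, not tested against the live page. The name iter_tracks and the shortened sample are made up, and it assumes the script is a single "name = {json}" assignment with no trailing semicolon.

```python
import json


def iter_tracks(script_text):
    """Yield (ART_NAME, SNG_TITLE) pairs from a 'name = {json}' script text."""
    # Everything after the first '=' is assumed to be the JSON payload.
    _name, _sep, payload = script_text.partition("=")
    state = json.loads(payload.strip())
    # .get() with defaults avoids a KeyError if the layout differs.
    tracks = state.get("TAB", {}).get("loved", {}).get("data", [])
    for item in tracks:
        yield item.get("ART_NAME"), item.get("SNG_TITLE")


# Shortened, hypothetical stand-in for the string found by
# soup(text=re.compile(r'SNG_TITLE'))
sample = (
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
    '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"},'
    '{"SNG_TITLE":"Stressed Out","ART_NAME":"Twenty One Pilots"}]}}}'
)

for artist, title in iter_tracks(sample):
    print(artist, "-", title)
```

In the loop above, each script found by the re search is already a string, so it could be passed directly as iter_tracks(script).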
