Python web crawler using BeautifulSoup: trouble getting URLs

I'm trying to build a dynamic web crawler that collects all the URL links found inside other links. So far I have been able to get all the chapter links, but when I try to fetch the section links for each chapter, nothing is printed.

The code I used:

#########################Chapters####################### 

import requests 
from bs4 import BeautifulSoup, SoupStrainer 
import re 


base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/" 

for title in range (1,4): 
url = base_url.format(title=title) 
r = requests.get(url) 

for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')): 
    if link.has_attr('href'): 
    if 'chapt' in link['href']: 
     href = "http://law.justia.com" + link['href'] 
     leveltwo(href) 

#########################Sections####################### 

def leveltwo(item_url): 
r = requests.get(item_url) 
soup = BeautifulSoup((r.content),"html.parser") 
section = soup.find('div', {'class': 'primary-content' }) 
for sublinks in section.find_all('a'): 
     sectionlinks = sublinks.get('href') 
     print (sectionlinks)

Source

2016-04-08 CHballer

With just a few minor changes to your code, I was able to get it to run and print the sections. Mainly, the indentation needed fixing, and the function has to be defined before it is called.
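The ordering point can be seen in isolation with a minimal sketch. The function name `greet` is made up for illustration and is not part of the crawler: calling a function at module level before its `def` statement has run raises `NameError`, while the same call after the definition works.

```python
# Calling a module-level function before its `def` has executed fails,
# because the name is not bound yet.
try:
    greet("early")  # `greet` is a hypothetical name, not defined at this point
except NameError as exc:
    error = str(exc)


def greet(name):
    return "hello " + name


# After the definition has run, the same call succeeds.
result = greet("late")
print(error)
print(result)
```

This is the same situation as calling `leveltwo(href)` in the chapter loop while `def leveltwo` only appears further down the file.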

#########################Chapters#######################
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re

def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup((r.content),"html.parser")
    section = soup.find('div', {'class': 'primary-content' })
    for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

for title in range (1,4):
    url = base_url.format(title=title)
    r = requests.get(url)

    for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        try:
            if 'chapt' in link['href']:
                href = "http://law.justia.com" + link['href']
                leveltwo(href)
            else:
                continue
        except KeyError:
            continue
#########################Sections#######################

Output:

/codes/alabama/2015/title-3/chapter-1/section-3-1-1/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-2/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-3/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-4/index.html
etc.

You don't need any try/except blocks: you can pass href=True to find or find_all to select only the anchor tags that have an href, or use the CSS selector a[href].

Source

2016-04-08 19:36:42 n1c9

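Thank you! I see what you did. Would it also work if I defined a function for my part 1 (the chapters) and a "level two" function for the sections that it refers to? – CHballer

Yeah, if I'm understanding you correctly. It would read better if both 'sections' were wrapped in functions ;) – n1c9

Haha, that's what I meant. Thanks again! – CHballer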
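The restructuring discussed in the comments above, with both stages wrapped in functions, might look roughly like the sketch below. The function names `chapter_links`, `section_links` and `crawl`, the `fetch` parameter, and the `FAKE_PAGES` data are assumptions made for illustration. They are not from the original code, and the fake pages only stand in for the real site so that the sketch runs offline.

```python
from bs4 import BeautifulSoup

SITE = "http://law.justia.com"


def chapter_links(title_html):
    """Level one: absolute URLs of the chapter links on a title page."""
    soup = BeautifulSoup(title_html, "html.parser")
    return [SITE + a["href"]
            for a in soup.find_all("a", href=True)
            if "chapt" in a["href"]]


def section_links(chapter_html):
    """Level two: hrefs inside the div with class primary-content."""
    soup = BeautifulSoup(chapter_html, "html.parser")
    section = soup.find("div", {"class": "primary-content"})
    if section is None:  # page without the expected container
        return []
    return [a["href"] for a in section.find_all("a", href=True)]


def crawl(title_urls, fetch):
    """Yield section hrefs for every chapter of every title page."""
    for title_url in title_urls:
        for chapter_url in chapter_links(fetch(title_url)):
            yield from section_links(fetch(chapter_url))


# Made-up pages (hypothetical URLs and markup) used instead of real requests.
FAKE_PAGES = {
    SITE + "/t1/": '<a href="/t1/chapter-1/">Ch 1</a>'
                   '<a href="/about">About</a><a name="x">no href</a>',
    SITE + "/t1/chapter-1/": '<div class="primary-content">'
                             '<a href="/t1/chapter-1/section-1/">S1</a></div>'
                             '<a href="/footer">Footer</a>',
}

found = list(crawl([SITE + "/t1/"], FAKE_PAGES.__getitem__))
print(found)
```

For a real run, `fetch` could be something like `lambda url: requests.get(url).content`. Passing the fetcher in as a parameter is only a design choice that keeps the parsing functions easy to try out without network access.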

The chapter links are in the first ul inside the article tag whose id is maincontent, so you don't need to filter at all, as shown below.

import requests
from bs4 import BeautifulSoup

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"


def leveltwo(item_url):
    # Print every section href inside the div with class primary-content.
    r = requests.get(item_url)
    soup = BeautifulSoup(r.content, "html.parser")
    section_links = [a["href"] for a in soup.select('div.primary-content a[href]')]
    print(section_links)


for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)
    # The chapter links sit in the first ul inside #maincontent.
    for link in BeautifulSoup(r.content, "html.parser").select("#maincontent ul:nth-of-type(1) a[href]"):
        href = "http://law.justia.com" + link['href']
        leveltwo(href)

If you would rather use find_all, just pass href=True, as in find_all(.., href=True), to filter the anchor tags down to those that actually have an href.
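As a small offline illustration of the two filtering options mentioned here, the sketch below runs both against a hand-written HTML snippet. The snippet is made up for the example and is not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two anchors with an href and one without.
html = """
<a href="/codes/alabama/2015/title-1/chapter-1/">Chapter 1</a>
<a name="top">anchor without href</a>
<a href="/codes/alabama/2015/title-1/chapter-2/">Chapter 2</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: find_all with href=True keeps only tags that have an href.
via_find_all = [a["href"] for a in soup.find_all("a", href=True)]

# Option 2: the CSS attribute selector a[href] does the same job.
via_select = [a["href"] for a in soup.select("a[href]")]

print(via_find_all)
print(via_select)
```

Either way, the anchor without an href is skipped, so indexing with `a["href"]` cannot raise `KeyError` and no try/except is needed.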

Source

2016-04-08 20:20:59
