2016-09-17 4 views
-2

Collecting text from the web pages of a specific site

There is a site I visit often to read its "Best Advice" entries. Here is how I extract the text I like from a single page:

import urllib2 
from bs4 import BeautifulSoup 

mylist=list() 

myurl='http://www.apartmenttherapy.com/carols-east-side-cottage-house-tour-194787' 
s=urllib2.urlopen(myurl) 
soup = BeautifulSoup(s) 

hello = soup.find(text='Best Advice: ') 
mylist.append(hello.next) 

But how do I collect the text snippets from all of the pages?


I can find all of the pages with this simple Google query:

...

site:http://www.apartmenttherapy.com

Does Google Search have an API that can be used from Python? I am looking for a simple, one-off solution to this problem, so I would rather not install too many packages just to complete this task.

+0

RTFM: BeautifulSoup and/or lxml will do the job. For example, [the latter supports XPath](http://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup).

+0

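The advice on that site is all site-specific, so a generic solution like Google does not exist. Who said the site keeps all of its advice in a publicly exposed DB at any given time? Explore the site, look for patterns, and see whether there is a way to extract the advice by external means.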

+0

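In other words, you can use Google to find the pages that hold the information, but nothing guarantees that the results are free of duplicates or that the page behind each result has not been updated since it was indexed.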
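To illustrate the duplicate concern raised in this comment, here is a minimal Python 3 sketch that drops repeated URLs while keeping their first-seen order. The `unique_urls` helper, the `#comments` fragment and the trailing slash in the sample data are hypothetical, used only for illustration. Staleness is a separate issue that de-duplication does not address: fetching each page again at scrape time is the only way to see its current content.

```python
from urllib.parse import urldefrag


def unique_urls(urls):
    """Drop repeated URLs while keeping the first-seen order."""
    seen = set()
    result = []
    for url in urls:
        # Treat "page", "page#fragment" and "page/" as the same page (an assumption).
        key = urldefrag(url).url.rstrip("/")
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical search hits: the second and third entries are variations added for illustration.
hits = [
    "http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080",
    "http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080#comments",
    "http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764/",
]
print(unique_urls(hits))
```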

Answers

1

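You should first read the BeautifulSoup manual and also learn to use your browser's web developer tools to inspect network traffic.

Once you have done that, you can see that the list of houses is returned by the GET request http://www.apartmenttherapy.com/search?page=1&q=House+Tour&type=all, so we can iterate from page 1 to X to retrieve all of the house index pages.

Each index page yields exactly 15 URLs, which you can append to a list.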
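The paging step can be sketched on its own, without any network access. This is a minimal Python 3 sketch that only builds the index-page URLs for pages 1 to X. The `index_page_urls` helper and its `last_page` parameter are hypothetical names; the endpoint and query parameters are the ones from the GET request above.

```python
from urllib.parse import urlencode

# Search endpoint taken from the GET request quoted above.
BASE_URL = "http://www.apartmenttherapy.com/search"


def index_page_urls(last_page, query="House Tour"):
    """Return the search-result URLs for pages 1..last_page (inclusive)."""
    urls = []
    for page in range(1, last_page + 1):
        # urlencode escapes the space in the query as "+".
        params = urlencode({"page": page, "q": query, "type": "all"})
        urls.append(BASE_URL + "?" + params)
    return urls


pages = index_page_urls(3)
for url in pages:
    print(url)
```

How large X should be is not stated here; one way to find out is to keep requesting pages until an index page returns no teaser links.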

Once you have the complete URL list, you can scrape each URL and extract the "Best Advice" text from each page.

import time
import requests
import random
from bs4 import BeautifulSoup

# build the list of all URLs to scrape
url_list = []
max_index = 2

for page_index in range(1, max_index):

    # fetch the index page
    html = requests.get("http://www.apartmenttherapy.com/search?page=" + str(page_index) + "&q=House+Tour&type=all").content

    # iterate over the teaser links
    for teaser in BeautifulSoup(html).findAll('a', {'class': 'SimpleTeaser'}):

        # add the link to the URL list
        url_list.append(teaser['href'])

    # sleep a little to respect server-side load
    time.sleep(random.random() / 2.)

    # stop after the first index page because this is just an example
    break  # comment out this break in production


# show the URL list
print url_list


# iterate over the URLs to collect the advice
mylist = []
for url in url_list:

    # fetch the house tour page
    html = requests.get(url).content

    # find the "Best Advice: " label
    hello = BeautifulSoup(html).find(text='Best Advice: ')

    # print a header for this URL (the advice itself is collected below)
    print "advice for", url, "\n", "=>",

    # add the node that follows the label to mylist
    try:
        mylist.append(hello.next)
    except AttributeError:
        # hello is None when the page has no "Best Advice: " label
        pass

    # sleep a little to respect server-side load
    time.sleep(random.random() / 2.)

# show the list of advice
print mylist

The output is:

['http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229', 'http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725', 'http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896', 'http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962', 'http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440', 'http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846', 'http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080', 'http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294', 'http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667', 'http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203', 'http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878', 'http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791', 'http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295', 'http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518', 'http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764'] 
advice for http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229 
=> advice for http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725 
=> advice for http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896 
=> advice for http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962 
=> advice for http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440 
=> advice for http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846 
=> advice for http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080 
=> advice for http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294 
=> advice for http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667 
=> advice for http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203 
=> advice for http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878 
=> advice for http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791 
=> advice for http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295 
=> advice for http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518 
=> advice for http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764 
=> [u"If you make a bad design choice or purchase, don't be afraid to change it. Try and try again until you love it.\n\t", u" Sisal rugs. They clean up easily and they're very understated. Start with very light colors and add colors later.\n", u"Bring in what you love, add dimension and texture to your walls. Decorate as an individual and not to please your neighbor or the masses. Trends are fun but I love elements of timeless interiors. Include things from any/every decade as well as mixing styles. I'm convinced it's the hardest way to decorate without looking like you are living in a flea market stall. Scale, color, texture, and contrast are what I focus on. For me it takes some toying around, and I always consider how one item affects the next. Consider space and let things stand out by limiting what surrounds them.", u'You don\u2019t need to invest in \u201cdecor\u201d and nothing needs to match. Just decorate with the special things (books, cards, trinkets, jars, etc.) that you\u2019ve collected over the years, and be organized. I honestly think half the battle of having good home design is keeping a neat house. The other half is just displaying stuff that is special to you. Stuff that has a story and/or reminds you of people, ideas, and places that you love. One more piece of advice - the best place to buy picture frames is Goodwill. Pick a frame in decent condition, and just paint it to complement your palette. One last piece of advice\u2014 decor need not be pricey. I ALWAYS shop consignment and thrift, and then I repaint and customize as I see fit.\n', u'From my sister \u2014 to use the second bedroom as my room, as it is dark and quiet, both of which I need in order to sleep.\n', u'Collect things that you love in your travels throughout life. 
I tend to purchase ceramics when travelling, sometimes a collection of bowls\u2026 not so easy transporting in the suitcase, but no breakages yet!\n\t', u'Keep things authentic to the character of your home and to the character of your family. Then, you can never go wrong!\n\t', u'Contemporary architecture does not require contemporary furnishings.\n'] 
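The output above has 15 URLs but only 8 pieces of advice, because pages without the exact 'Best Advice: ' label are skipped silently. The extraction step (find the label, take the text node that follows it) can be tried offline with a minimal Python 3 sketch. It swaps in the standard-library `html.parser` for BeautifulSoup so it runs without extra packages. The `BestAdviceParser` class, the `extract_advice` helper and the sample markup, including the "Biggest Challenge" label, are hypothetical; the real pages may be structured differently.

```python
from html.parser import HTMLParser


class BestAdviceParser(HTMLParser):
    """Collect the text node that directly follows a 'Best Advice:' label."""

    def __init__(self, label="Best Advice:"):
        super().__init__()
        self.label = label
        self.capture_next = False
        self.advice = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only text nodes
        if self.capture_next:
            self.advice.append(text)
            self.capture_next = False
        elif text == self.label:
            self.capture_next = True


def extract_advice(html):
    """Return every advice snippet found in the given HTML string."""
    parser = BestAdviceParser()
    parser.feed(html)
    parser.close()
    return parser.advice


# Made-up markup for illustration only.
SAMPLE_HTML = """
<p><strong>Best Advice: </strong>Keep things authentic to the character of your home.</p>
<p><strong>Biggest Challenge: </strong>Storage.</p>
"""
print(extract_advice(SAMPLE_HTML))
```

Returning an empty list for pages without the label makes the skipped pages easy to count, instead of hiding them behind an exception handler.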
0

You need to use a JS-enabled scraping approach, as described here: http://koaning.io/dynamic-scraping-with-python.html

+0

The content is flat in the search results page, so there is no need to use JS-enabled scraping here.
