How to evaluate whether an extracted link is a subpath

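I am using scrapy to crawl some pages, with Python 2.7. The spider returns a response object and I inspect the URLs found on the page. I want to restrict the spider so that it only follows URLs that are subpaths of a location I specify. How can I evaluate whether an extracted link is a subpath?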

For example, I want to tell the spider to follow only links below www.google.com/policies/privacy/. The links extracted from the response object follow a number of different conventions, such as:

../../policies/privacy/
../../policies/privacy/example/collect-information.html
#infocollect
/intl/en_uk/policies/privacy/google_privacy_policy_en_uk.pdf
//myaccount.google.com/
https://support.google.com/policies/troubleshooter/2990837?hl=en-GB

I can't work out how to do this. I have used a simple find method on the string, but that does not seem robust or sensible to me.

import scrapy


class googleSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.co.uk"]
    start_urls = [
        "http://www.google.co.uk/intl/en/policies/privacy/"
    ]

    def parse(self, response):
        # Follow only hrefs that contain the target path (plain substring test).
        for href in response.xpath('//a/@href').extract():
            if href.find('/policies/privacy/') != -1:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        pass

Source

2016-06-22 Donal Mee

Can you share the Scrapy code you have tried so far?

Is the problem then how to build Requests from the links you get from .extract_links(response)? Check this scrapy shell example.

Sure. I can see that I have not been approaching the problem properly.

You can use a LinkExtractor for this. The default link extractor normalizes links, so relative hrefs come back as absolute URLs.

$ scrapy shell https://www.google.com/policies/privacy/ 
2016-06-22 18:03:19 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot) 
(...edited...) 
2016-06-22 18:03:20 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/policies/privacy/> (referer: None) 
(...edited...) 

>>> from scrapy.linkextractors import LinkExtractor 
>>> for l in LinkExtractor().extract_links(response): 
...  print(l.url) 
... 
https://www.google.com/ 
(...edited...) 
https://support.google.com/accounts/answer/32046?hl=en 
https://www.google.com/trends/ 
https://www.youtube.com/trendsmap 
https://privacy.google.com/?hl=en 
https://www.google.com/policies/technologies/location-data/ 
https://www.google.com/policies/technologies/wallet/ 
https://www.google.com/policies/technologies/voice/ 
https://www.google.com/safetycenter/families/start/ 
https://www.google.com/intl/en/about/ 
https://www.google.com/intl/en/policies/privacy/ 
https://www.google.com/intl/en/policies/terms/ 

>>> for l in LinkExtractor().extract_links(response): 
...  if response.url in l.url: 
...   print(l.url) 
... 
https://www.google.com/policies/privacy/ 
https://www.google.com/policies/privacy/frameworks/ 
https://www.google.com/policies/privacy/key-terms/ 
https://www.google.com/policies/privacy/partners/ 
https://www.google.com/policies/privacy/archive/ 
https://www.google.com/policies/privacy/example/more-relevant-search-results.html 
https://www.google.com/policies/privacy/example/connect-with-people.html 
https://www.google.com/policies/privacy/example/sharing-with-others.html 
https://www.google.com/policies/privacy/example/ads-youll-find-most-useful.html 
https://www.google.com/policies/privacy/example/the-people-who-matter-most.html 
https://www.google.com/policies/privacy/example/credit-card.html 
https://www.google.com/policies/privacy/example/collect-information.html 
https://www.google.com/policies/privacy/example/view-and-interact-with-our-ads.html 
https://www.google.com/policies/privacy/example/device-specific-information.html 
https://www.google.com/policies/privacy/example/device-identifiers.html 
https://www.google.com/policies/privacy/example/phone-number.html 
https://www.google.com/policies/privacy/example/may-collect-and-process-information.html 
https://www.google.com/policies/privacy/example/sensors.html 
https://www.google.com/policies/privacy/example/wifi-access-points-and-cell-towers.html 
https://www.google.com/policies/privacy/example/our-partners.html 
https://www.google.com/policies/privacy/example/advertising-services.html 
https://www.google.com/policies/privacy/example/linked-with-information-about-visits-to-multiple-sites.html 
https://www.google.com/policies/privacy/example/provide-services.html 
https://www.google.com/policies/privacy/example/maintain-services.html 
https://www.google.com/policies/privacy/example/protect-services.html 
https://www.google.com/policies/privacy/example/develop-new-ones.html 
https://www.google.com/policies/privacy/example/protect-google-and-our-users.html 
https://www.google.com/policies/privacy/example/limit-sharing-or-visibility-settings.html 
https://www.google.com/policies/privacy/example/improve-your-user-experience.html 
https://www.google.com/policies/privacy/example/combine-personal-information.html 
https://www.google.com/policies/privacy/example/to-make-it-easier-to-share.html 
https://www.google.com/policies/privacy/example/may-not-function-properly.html 
https://www.google.com/policies/privacy/example/sharing.html 
https://www.google.com/policies/privacy/example/removing-your-content.html 
https://www.google.com/policies/privacy/example/access-to-your-personal-information.html 
https://www.google.com/policies/privacy/example/legal-process.html 
https://www.google.com/policies/privacy/example/we-may-share.html 
https://www.google.com/policies/privacy/example/to-show-trends.html
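
The substring test in the session above works for this page, but it would also accept a URL that merely contains the base URL somewhere inside it. As a rough sketch (not part of Scrapy), a stricter check could resolve each href against the page URL and then compare host and path prefix. The helper name is_subpath is hypothetical, and the rule that the base path is treated as a directory is an assumption you may want to adjust.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2.7, as used in the question


def is_subpath(base_url, href):
    """Hypothetical helper: True if href, resolved against base_url,
    is on the same host and its path starts with base_url's path."""
    base = urlparse(base_url)
    # urljoin handles relative, root-relative, scheme-relative and absolute hrefs.
    target = urlparse(urljoin(base_url, href))
    # Assumption: treat the base path as a directory, so that
    # "/policies/privacy-old/" is not accepted as being under "/policies/privacy/".
    base_path = base.path if base.path.endswith('/') else base.path + '/'
    return target.netloc == base.netloc and target.path.startswith(base_path)


if __name__ == "__main__":
    base = "https://www.google.com/policies/privacy/"
    for href in ["../../policies/privacy/example/collect-information.html",
                 "//myaccount.google.com/",
                 "#infocollect"]:
        print(href, is_subpath(base, href))
```

Note that a fragment-only href such as #infocollect resolves to the page itself and so passes the check; Scrapy's duplicate filter would normally drop the repeated request, but you may prefer to skip such links explicitly.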

Source

2016-06-22 16:11:25

Hi Paul, thanks. That really solved my problem. I knew there had to be a built-in method for handling links. Many thanks.
