Running Scrapy from a script - Hangs

I'm trying to run Scrapy from a script, as discussed here. It suggested using this snippet, but when I run it, it hangs indefinitely. The snippet was written back in version .10. Is it still compatible with the current stable release?

Source

2011-06-27 ciferkey

This question and answer may be due for an update. Here is [the latest snippet from Scrapy](http://scrapy.readthedocs.org/en/0.16/topics/practices.html). It works, but for me the question then becomes: how do you stop the Twisted reactor and move on once the crawl has finished? – bahmait

from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from scrapy.http import Request
from scrapy.spider import BaseSpider  # base class for MySpider below

def handleSpiderIdle(spider):
    '''Handle spider idle event.''' # http://doc.scrapy.org/topics/signals.html#spider-idle
    print '\nSpider idle: %s. Restarting it... ' % spider.name
    for url in spider.start_urls: # reschedule start urls
        spider.crawler.engine.crawl(Request(url, dont_filter=True), spider)

mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': 'mybot.pipeline.validate.ValidateMyItem'} # global settings http://doc.scrapy.org/topics/settings.html 

settings.overrides.update(mySettings) 

crawlerProcess = CrawlerProcess(settings) 
crawlerProcess.install() 
crawlerProcess.configure() 

class MySpider(BaseSpider):
    start_urls = ['http://site_to_scrape']
    def parse(self, response):
        # 'item' is a placeholder: build your scraped item from the response here
        yield item

spider = MySpider() # create a spider ourselves 
crawlerProcess.queue.append_spider(spider) # add it to spiders pool 

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?) 

log.start() # depends on LOG_ENABLED 
print "Starting crawler." 
crawlerProcess.start() 
print "Crawler stopped."

UPDATE:

If you also need per-spider settings, see this example:

for spiderConfig in spiderConfigs: 
    spiderConfig = spiderConfig.copy() # a dictionary similar to the one with global settings above 
    spiderName = spiderConfig.pop('name') # name of the spider is in the configs - i can use the same spider in several instances - giving them different names 
    spiderModuleName = spiderConfig.pop('spiderClass') # module with the spider is in the settings 
    spiderModule = __import__(spiderModuleName, {}, {}, ['']) # import that module 
    SpiderClass = spiderModule.Spider # spider class is named 'Spider' 
    spider = SpiderClass(name = spiderName, **spiderConfig) # create the spider with given particular settings 
    crawlerProcess.queue.append_spider(spider) # add the spider to spider pool
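The import-and-instantiate step in the loop above can also be written with `importlib`, which reads a little more clearly than calling `__import__` with a dummy `fromlist`. This is only a sketch under the same assumptions as the loop (each spider module exposes a class named `Spider`, and the config dict carries `name` and `spiderClass` keys); the helper names `load_spider_class` and `build_spider` are made up for illustration.

```python
import importlib


def load_spider_class(module_name, class_name="Spider"):
    """Import module_name and return the class called class_name from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_spider(spider_config):
    """Create a spider instance from one per-spider config dict."""
    config = dict(spider_config)              # copy so the caller's dict is untouched
    name = config.pop("name")                 # instance name of the spider
    module_name = config.pop("spiderClass")   # dotted path of the spider module
    spider_class = load_spider_class(module_name)
    # Remaining keys are passed straight to the spider constructor.
    return spider_class(name=name, **config)
```

Each spider returned by `build_spider` could then be queued the same way as in the loop above.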

Example of settings in the file for spiders:

name = punderhere_com  
allowed_domains = plunderhere.com 
spiderClass = scraper.spiders.plunderhere_com 
start_urls = http://www.plunderhere.com/categories.php?
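The answer does not show how a file like this is turned into the `spiderConfigs` list. One possible way, assuming one `key = value` pair per line and one spider per file (both are assumptions, as is the helper name `parse_spider_config`), would be:

```python
def parse_spider_config(text):
    """Parse 'key = value' lines into a dict, skipping blank lines and '#' comments."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected 'key = value', got: %r" % raw_line)
        config[key.strip()] = value.strip()
    return config


EXAMPLE = """\
name = punderhere_com
allowed_domains = plunderhere.com
spiderClass = scraper.spiders.plunderhere_com
start_urls = http://www.plunderhere.com/categories.php?
"""

# One dict per spider; every value is still a plain string at this point.
spiderConfigs = [parse_spider_config(EXAMPLE)]
```

Values such as `start_urls` stay strings here, so the spider's constructor would have to turn them into lists itself if it expects lists.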

Source

2011-06-28 07:09:57 warvariuc

I got [this](https://gist.github.com/1051117) traceback. My Scrapy project is called scraper. Could that be the problem? – ciferkey

I think that is the problem. The code is taken from a real project, so you can remove the references to scraper. You just need some settings for the spiders. – warvariuc

After removing the references to scraper, how do I import my project's settings? – ciferkey
