Python Scraper - Socket error breaks the script if the target is 404'd

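I'm getting an error while building a web scraper that compiles data and outputs it in XLS format. When I test it against the list of domains I want to scrape, the program falters as soon as it receives a socket error. I'm hoping to find an "if" statement that would skip parsing a broken website and continue through my while loop. Any ideas?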

workingList = xlrd.open_workbook(listSelection) 
workingSheet = workingList.sheet_by_index(0) 
destinationList = xlwt.Workbook() 
destinationSheet = destinationList.add_sheet('Gathered') 
startX = 1 
startY = 0 
while startX != 21: 
    workingCell = workingSheet.cell(startX,startY).value 
    print '' 
    print '' 
    print '' 
    print workingCell 
    #Setup 
    preSite = 'http://www.'+workingCell 
    theSite = urlopen(preSite).read() 
    currentSite = BeautifulSoup(theSite) 
    destinationSheet.write(startX,0,workingCell)

And here is the error:

Traceback (most recent call last): 
    File "<pyshell#2>", line 1, in <module> 
    homeMenu() 
    File "C:\Python27\farming.py", line 31, in homeMenu 
    openList() 
    File "C:\Python27\farming.py", line 79, in openList 
    openList() 
    File "C:\Python27\farming.py", line 83, in openList 
    openList() 
    File "C:\Python27\farming.py", line 86, in openList 
    homeMenu() 
    File "C:\Python27\farming.py", line 34, in homeMenu 
    startScrape() 
    File "C:\Python27\farming.py", line 112, in startScrape 
    theSite = urlopen(preSite).read() 
    File "C:\Python27\lib\urllib.py", line 84, in urlopen 
    return opener.open(url) 
    File "C:\Python27\lib\urllib.py", line 205, in open 
    return getattr(self, name)(url) 
    File "C:\Python27\lib\urllib.py", line 342, in open_http 
    h.endheaders(data) 
    File "C:\Python27\lib\httplib.py", line 951, in endheaders 
    self._send_output(message_body) 
    File "C:\Python27\lib\httplib.py", line 811, in _send_output 
    self.send(msg) 
    File "C:\Python27\lib\httplib.py", line 773, in send 
    self.connect() 
    File "C:\Python27\lib\httplib.py", line 754, in connect 
    self.timeout, self.source_address) 
    File "C:\Python27\lib\socket.py", line 553, in create_connection 
    for res in getaddrinfo(host, port, 0, SOCK_STREAM): 
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

Source

2012-01-14 Kyle Hikalea

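Hmm, that looks like the error I get when my internet connection is down. An HTTP 404 error is what you get when you do have a connection but the URL you specified can't be found. A getaddrinfo failure, by contrast, means the host name could not be resolved at all, so no HTTP response was ever received.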
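The difference between the two failures can be made explicit in code. The following is only a minimal sketch, and it assumes Python 3, where the opener lives in urllib.request and, unlike the Python 2 urllib.urlopen used in this thread, raises urllib.error.HTTPError for a 404 instead of returning the response. The helper name describe_failure is hypothetical and not part of any library.

```python
import socket
import urllib.error


def describe_failure(exc):
    """Say which kind of failure an exception raised by urlopen represents.

    describe_failure is a hypothetical helper written for illustration.
    """
    # HTTPError has to be tested first because it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        # The server was reached and answered, e.g. with 404.
        return 'server answered with HTTP %d' % exc.code
    if isinstance(exc, urllib.error.URLError):
        # No HTTP response at all; exc.reason holds the underlying cause.
        if isinstance(exc.reason, socket.gaierror):
            return 'host name could not be resolved'
        return 'could not connect: %s' % exc.reason
    return 'unexpected error: %r' % exc
```

A caller would typically wrap urllib.request.urlopen(url) in try/except and pass the caught exception to this function to decide whether to log, retry, or skip the domain.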

There is no if statement that handles exceptions; you have to catch them. Here is a demonstration:

import urllib

def getconn(url):
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """
for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args

Output:

qwerty 
failed 
IOError 
[Errno 2] The system cannot find the file specified: 'qwerty' 
(2, 'The system cannot find the file specified') 

http://www.foo.bar.net 
failed 
IOError 
[Errno socket error] [Errno 11004] getaddrinfo failed 
('socket error', gaierror(11004, 'getaddrinfo failed')) 

http://www.google.com 
connected; HTTP response is 200 

http://www.google.com/nonesuch 
connected; HTTP response is 404

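Note that so far we have merely opened the connection; connection failures have to be "caught" with the try/except construct, as getconn does. What you need to do next is check the HTTP response code and decide whether there is anything worth retrieving with conn.read().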
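Putting the pieces together, the scraping loop can skip a dead domain and move on to the next one. This is a sketch rather than a drop-in fix for farming.py: it assumes Python 3 (urllib.request instead of the Python 2 urllib module shown above), and the names fetch_pages, opener and timeout are hypothetical. The opener parameter exists only so that the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_pages(domains, opener=urllib.request.urlopen, timeout=10):
    """Fetch 'http://www.<domain>' for each domain without aborting the loop.

    Returns (pages, skipped): pages maps domain -> response body,
    skipped maps domain -> a short reason why nothing was fetched.
    """
    pages, skipped = {}, {}
    for domain in domains:
        url = 'http://www.' + domain
        try:
            conn = opener(url, timeout=timeout)
        except urllib.error.HTTPError as exc:
            # Python 3 raises for 4xx/5xx responses such as 404.
            skipped[domain] = 'HTTP %d' % exc.code
            continue
        except OSError as exc:
            # URLError (DNS failure, refused connection) and socket
            # timeouts are all OSError subclasses in Python 3.
            skipped[domain] = str(exc)
            continue
        # An opener that returns error responses instead of raising
        # (as Python 2's urllib.urlopen does) is handled here.
        if conn.getcode() != 200:
            skipped[domain] = 'HTTP %d' % conn.getcode()
            continue
        pages[domain] = conn.read()
    return pages, skipped
```

In the question's loop, the bodies collected in pages would be handed to BeautifulSoup and written to the sheet, while the entries in skipped could be logged or written to a separate column so that failed domains are not silently lost.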

Source

2012-01-14 05:00:17

Attempting to connect to a domain that doesn't exist breaks the script and shuts it down.
