2012-05-30

"Extra data: line 2 column 1" error while using pycurl with a gzip stream

Background: I am trying to read a streaming API feed that returns data in JSON format and then store that data in a pymongo collection. The streaming API requires an "Accept-Encoding": "gzip" header.

What is happening: the code fails in json.loads while parsing the JSON objects, printing Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597) (see the error log below).

This does not happen every time; it occurs randomly.

My guess is that an odd JSON object occasionally arrives alongside the well-formed ones.

I have looked at how to use pycurl if requested data is sometimes gzipped, sometimes not? and Encoding error while deserializing a json object from Google, but so far neither has resolved this error.

Can anyone help me out?

Error log: Note: the raw dump of the JSON objects below was printed with repr(), which shows the raw representation of the string without resolving the CR/LF characters.


'{"id":"tag:search.twitter.com,2005:207958320747782146","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:493653150","link":"http://www.twitter.com/Deathnews_7_24","displayName":"Death News 7/24","postedTime":"2012-02-16T01:30:12.000Z","image":"http://a0.twimg.com/profile_images/1834408513/deathnewstwittersquare_normal.jpg","summary":"Crashes, Murders, Suicides, Accidents, Crime and Naturals Death News From All Around World","links":[{"href":"http://www.facebook.com/DeathNews724","rel":"me"}],"friendsCount":56,"followersCount":14,"listedCount":1,"statusesCount":1029,"twitterTimeZone":null,"utcOffset":null,"preferredUsername":"Deathnews_7_24","languages":["tr"]},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"web","link":"http://twitter.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","body":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","object":{"objectType":"note","id":"object:search.twitter.com,2005:207958320747782146","summary":"Kathi Kamen Goldmark, Writers\xe2\x80\x99 Catalyst, Dies at 63 http://t.co/WBsNlNtA","link":"http://twitter.com/Deathnews_7_24/statuses/207958320747782146","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nytimes.com/2012/05/30/boo\xe2\x80\xa6","indices":[52,72],"expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html","url":"http://t.co/WBsNlNtA"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: 
nytimes.com","tag":null}],"klout_score":11,"urls":[{"url":"http://t.co/WBsNlNtA","expanded_url":"http://www.nytimes.com/2012/05/30/books/kathi-kamen-goldmark-writers-catalyst-dies-at-63.html?_r=1"}]}}\r\n{"id":"tag:search.twitter.com,2005:03638785","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:178760897","link":"http://www.twitter.com/Mobanu","displayName":"Donald Ochs","postedTime":"2010-08-15T16:33:56.000Z","image":"http://a0.twimg.com/profile_images/1493224811/small_mobany_Logo_normal.jpg","summary":"","links":[{"href":"http://www.mobanuweightloss.com","rel":"me"}],"friendsCount":10272,"followersCount":9698,"listedCount":30,"statusesCount":725,"twitterTimeZone":"Mountain Time (US & Canada)","utcOffset":"-25200","preferredUsername":"Mobanu","languages":["en"],"location":{"objectType":"place","displayName":"Crested Butte, Colorado"}},"verb":"post","postedTime":"2012-05-30T22:15:02.000Z","generator":{"displayName":"twitterfeed","link":"http://twitterfeed.com"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/Mobanu/statuses/03638785","body":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... http://t.co/mTsQlNQO","object":{"objectType":"note","id":"object:search.twitter.com,2005:03638785","summary":"Mobanu: Can Exercise Be Bad for You?: Researchers have found evidence that some people who exercise do worse on ... 
http://t.co/mTsQlNQO","link":"http://twitter.com/Mobanu/statuses/03638785","postedTime":"2012-05-30T22:15:02.000Z"},"twitter_entities":{"urls":[{"display_url":"nyti.ms/KUmmMa","indices":[116,136],"expanded_url":"http://nyti.ms/KUmmMa","url":"http://t.co/mTsQlNQO"}],"hashtags":[],"user_mentions":[]},"gnip":{"language":{"value":"en"},"matching_rules":[{"value":"url_contains: nytimes.com","tag":null}],"klout_score":12,"urls":[{"url":"http://t.co/mTsQlNQO","expanded_url":"http://well.blogs.nytimes.com/2012/05/30/can-exercise-be-bad-for-you/?utm_medium=twitter&utm_source=twitterfeed"}]}}\r\n' 
json exception: Extra data: line 2 column 1 - line 4 column 1 (char 1891 - 5597) 
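The failure is easy to reproduce outside the stream: json.loads parses exactly one top-level JSON value, so a second object after the \r\n triggers the same "Extra data" error. A minimal sketch (the short payload is a stand-in for the long dump above):

```python
import json

# Two complete JSON objects joined by \r\n, like the raw dump above
payload = '{"id": 1}\r\n{"id": 2}'

try:
    json.loads(payload)
    error = None
except ValueError as exc:  # json.loads accepts only one top-level value
    error = str(exc)

print("json exception: %s" % error)
```

The reported character offset points at the start of the second object, which is exactly what the error log above shows.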

Header output:


HTTP/1.1 200 OK 

Content-Type: application/json; charset=UTF-8 

Vary: Accept-Encoding 

Date: Wed, 30 May 2012 22:14:48 UTC 

Connection: close 

Transfer-Encoding: chunked 

Content-Encoding: gzip 

get_stream.py:


#!/usr/bin/env python 
import sys 
import pycurl 
import json 
import pymongo 

STREAM_URL = "https://stream.test.com:443/accounts/publishers/twitter/streams/track/Dev.json" 
AUTH = "userid:passwd" 

DB_HOST = "127.0.0.1" 
DB_NAME = "stream_test" 

class StreamReader: 
    def __init__(self): 
        try: 
            self.count = 0 
            self.buff = "" 
            self.mongo = pymongo.Connection(DB_HOST) 
            self.db = self.mongo[DB_NAME] 
            self.raw_tweets = self.db["raw_tweets_gnip"] 
            self.conn = pycurl.Curl() 
            self.conn.setopt(pycurl.ENCODING, 'gzip') 
            self.conn.setopt(pycurl.URL, STREAM_URL) 
            self.conn.setopt(pycurl.USERPWD, AUTH) 
            self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive) 
            self.conn.setopt(pycurl.HEADERFUNCTION, self.header_rcvd) 
            while True: 
                self.conn.perform() 
        except Exception as ex: 
            print "error occurred : %s" % str(ex) 

    def header_rcvd(self, header_data): 
        print header_data 

    def on_receive(self, data): 
        temp_data = data 
        self.buff += data 
        if data.endswith("\r\n") and self.buff.strip(): 
            try: 
                tweet = json.loads(self.buff, encoding='UTF-8') 
                self.buff = "" 
                if tweet: 
                    try: 
                        self.raw_tweets.insert(tweet) 
                    except Exception as insert_ex: 
                        print "Error inserting tweet: %s" % str(insert_ex) 
                    self.count += 1 

                if self.count % 10 == 0: 
                    print "inserted " + str(self.count) + " tweets" 
            except Exception as json_ex: 
                print "json exception: %s" % str(json_ex) 
                print repr(temp_data) 


stream = StreamReader() 

Fixed code:


def on_receive(self, data): 
    self.buff += data 
    if data.endswith("\r\n") and self.buff.strip(): 
        # NEW: Split the buff at \r\n to get a list of JSON objects and iterate over them 
        json_objs = self.buff.split("\r\n") 
        self.buff = ""  # NEW: reset the buffer so parsed objects are not re-processed 
        for obj in json_objs: 
            if len(obj.strip()) > 0: 
                try: 
                    tweet = json.loads(obj, encoding='UTF-8') 
                except Exception as json_ex: 
                    print "JSON Exception occurred: %s" % str(json_ex) 
                    continue 
                self.raw_tweets.insert(tweet) 
                self.count += 1 
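As an alternative to splitting on \r\n, the standard library's json.JSONDecoder.raw_decode can walk a buffer and extract one complete object at a time, regardless of the delimiter. A sketch under that idea (iter_json_objects is an illustrative helper, not part of the original code):

```python
import json

def iter_json_objects(buff):
    """Yield each complete JSON object found in buff, skipping the
    \r\n delimiters between them (an alternative to str.split)."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(buff):
        # Skip the \r\n (and any other whitespace) between objects
        while idx < len(buff) and buff[idx] in " \t\r\n":
            idx += 1
        if idx >= len(buff):
            break
        # raw_decode returns the parsed object and the index just past it
        obj, idx = decoder.raw_decode(buff, idx)
        yield obj

tweets = list(iter_json_objects('{"id": 1}\r\n{"id": 2}\r\n'))
```

raw_decode also raises ValueError at the first malformed object, so a bad record can be skipped or logged without discarding the good ones before it.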

Thank you! I owe you a drink; you fixed my stress! – vgoklani

Answer


Paste the dumped string into jsbeautifier.

You will see that it is actually two JSON objects, not one, which json.loads cannot handle.

Since they are separated by \r\n, they should be easy to split apart.

However, the data argument passed to on_receive does not necessarily end with \r\n when it contains one: the delimiter can also sit in the middle of the string. So looking only at the end of each data chunk is not enough.
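A minimal sketch of that buffering idea: accumulate arbitrary chunks, emit only the complete \r\n-terminated objects, and carry any partial tail over to the next chunk (ChunkAssembler is an illustrative name, not from the original code):

```python
import json

class ChunkAssembler:
    """Accumulate stream chunks and emit only the complete
    \r\n-terminated JSON objects, keeping any partial tail buffered."""

    def __init__(self):
        self.buff = ""

    def feed(self, data):
        self.buff += data
        parts = self.buff.split("\r\n")
        # The last element is either "" (chunk ended on \r\n) or an
        # incomplete object that must wait for more data
        self.buff = parts.pop()
        return [json.loads(p) for p in parts if p.strip()]

asm = ChunkAssembler()
first = asm.feed('{"id": 1}\r\n{"id"')  # delimiter mid-chunk, tail kept
second = asm.feed(': 2}\r\n')           # tail completed by the next chunk
```

Here the first chunk yields only the one complete object; the fragment `{"id"` stays in the buffer and is parsed once the second chunk completes it.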


Thanks, that worked perfectly! I have added the new logic under "Fixed code" above for future reference. –
