Fastest way to read shortened URLs (https://t.co) in tweets

I am reading hundreds of tweets and following the shortened URLs they contain.

A simple outline of the code flow:

Thread(s):

def worker(tweets):
    for tweet in tweets:
        find shortened URLs in tweet
        result = process these URLs (read the response body)

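Main process:

Block until all threads are done
Collect the results

Right now, I use multithreading to cut the time to some extent. For example, for 100 tweets with shortened URLs, processing takes about 320 seconds. With multithreading, I can bring this down to about 24 seconds using 50 threads. Beyond that, going to 100 threads gives no further speed-up, which I can understand.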

The part of my code that takes most of the time is opening the shortened URL, resolving it to the actual URL, and then reading that page with Goose.

Actual code:

import re
import urllib2
import Queue

from goose import Goose  # assuming the python-goose package

def worker(result_queue, input_tweet_queue):
    # Keep pulling tweets until the input queue is empty.
    while True:
        try:
            item = input_tweet_queue.get(False)  # non-blocking get
        except Queue.Empty:
            break
        urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                          item['text'])
        for url in urls:
            try:
                res = urllib2.urlopen(url)
                actual_url = res.geturl()  # final URL after redirects
                if 'twitter.com' not in actual_url:
                    g = Goose()  # it is defined somewhere else in the real code
                    content_of_url = g.extract(actual_url)
                    result = process(content_of_url)  # process() is defined elsewhere
                    result_queue.put(result)
            except Exception:
                # Skip URLs that fail to open or parse.
                pass
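
The URL-matching step of the worker can be checked on its own, which helps separate regex problems from network slowness. Below is a minimal sketch in Python 3. The pattern is the one from the worker, with its second character class taken to be `[$-_@.&+]`; the helper name `find_urls` and the sample tweet text are illustrative and not part of the original code.

```python
import re

# URL pattern from the worker, compiled once so every call reuses it.
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def find_urls(text):
    """Return every URL-looking substring of a tweet's text, in order."""
    return URL_PATTERN.findall(text)

# Illustrative tweet text (hypothetical, not real data).
sample = "Reading https://t.co/abc123 and http://t.co/XyZ today"
print(find_urls(sample))
```

Because the pattern uses only non-capturing groups, `findall` returns the full matched URLs rather than fragments.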

Main process:

import threading
import Queue

result_queue = Queue.Queue()
input_tweet_queue = Queue.Queue()

# Place all tweets in the input queue
for item in tweets:
    input_tweet_queue.put(item)

# Create threads and start them
thread_count = 50
threads = []
for i in range(thread_count):
    t = threading.Thread(target=worker, args=(result_queue, input_tweet_queue))
    t.daemon = True
    t.start()
    threads.append(t)

# Block until all threads are done
for t in threads:
    t.join()

# Collect the results (a tweet may yield zero or several results)
final_result = []
while not result_queue.empty():
    final_result.append(result_queue.get())
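
For comparison, the same fan-out can be written with the Python 3 standard library's `concurrent.futures`, which takes care of the queue and thread bookkeeping. This is only a sketch under assumptions: `process_urls` is a hypothetical name, and `fetch` stands in for whatever does the resolve, Goose extraction and `process` steps for one URL. It does not make the network calls themselves any faster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_urls(urls, fetch, max_workers=50):
    """Run fetch(url) for every URL on a thread pool and collect what succeeds."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per URL; idle threads pick them up as they free up.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            try:
                results.append((futures[future], future.result()))
            except Exception:
                # A URL that fails to resolve or parse is skipped.
                pass
    return results
```

Leaving the `with` block waits for all submitted tasks, so no explicit join is needed before reading `results`.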

Current performance:

Tweets  Threads  Execution time (seconds)
100     1        340
100     5        66
100     10       44
100     50       24
100     100      23

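Most of the time is spent resolving these shortened URLs and reading the pages they point to.

Is there a better way to perform this task? Can I somehow incorporate non-blocking IO into it? Or should I simply use a non-blocking IO approach instead of multithreading?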
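
To make the non-blocking option concrete, here is one shape it could take with Python 3's `asyncio`. It is a sketch, not a drop-in replacement: `resolve_all` is a hypothetical name, and `resolve` is assumed to be a coroutine that performs the HTTP request with some asynchronous client, which is not supplied here. As it is called in the code above, `g.extract` is a synchronous call, so the Goose step would still need a thread or a separate stage.

```python
import asyncio

async def resolve_all(urls, resolve, limit=50):
    """Await resolve(url) for every URL with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore caps how many requests are outstanding at once.
        async with semaphore:
            try:
                return url, await resolve(url)
            except Exception:
                return url, None  # failed URLs are reported as None

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

It would be driven with `asyncio.run(resolve_all(urls, resolve))`, where `limit` plays the role that the thread count plays in the threaded version.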

Source

2017-08-22 utengr

-1

I would try the grequests module: https://github.com/kennethreitz/grequests

It is nicely documented and easy to use.
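
A minimal sketch of what that could look like for resolving the short URLs, assuming grequests is installed. The function name `resolve_with_grequests`, the pool size and the timeout are illustrative choices, and whether HEAD requests are enough for every target site is an assumption.

```python
def resolve_with_grequests(urls, pool_size=50):
    """Map each short URL to its final URL, following redirects concurrently."""
    # Imported here so the module still loads where grequests is not installed.
    import grequests

    # Build unsent HEAD requests; allow_redirects makes them follow redirect hops.
    pending = [grequests.head(u, allow_redirects=True, timeout=10) for u in urls]
    # map() sends them concurrently, at most pool_size at a time, and returns
    # responses in request order, with None for requests that failed.
    responses = grequests.map(pending, size=pool_size)
    return {u: r.url for u, r in zip(urls, responses) if r is not None}
```

This only covers the redirect-resolution step; reading the page content with Goose would still happen afterwards on the resolved URLs.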

Source

2017-08-22 11:58:22
