特定のウェブサイトでの掻き出しの問題

これは私の最初の質問です。 http://www.normattiva.it/特定のウェブサイトでの掻き出しの問題

私はこれ以下のコード（および同様の順列）を使用しています：あなたのよう

import requests, sys 

debug = {'verbose': sys.stderr} 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection':'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

r = requests.session() 
s = r.get(url, headers=user_agent) 
#print(s.text) 
print(s.url) 
print(s.headers) 
print(s.request.headers)

を私はウェブサイトから自動的にいくつかのイタリアの法律のテキストを（すなわちスクレープ）をダウンロードしようとしています

"caricaArticolo"というクエリを読み込もうとしているのを見ることができます。

ただし、出力は私の検索が無効であることを言っページ（「セッションが有効でないか、有効期限が切れ」）を

あるページが私はブラウザや負荷を使用していないことを認識しているようです"ブレークアウト" javascript関数。

<body onload="javascript:breakout();">

私は、このようなセレンとrobobrowserとして「ブラウザ」シミュレータPythonスクリプトを使用しようとしたが、結果は同じです。

10分を費やしてページ出力を見て助けてくれる人はいますか？

出典

2016-09-15 terzim

@terzinこのページにアクセスするには、最初に承認されたユーザーがいる必要があります。有効なセッションがありません。 –

同じコードを試してみましたが、目的の出力が得られました –

"beautifulsoup"を使ってみてください。 –

あなたがネットワークの下でドキュメントタブの下で、オープンのdevのツールを使用してページ上の任意のリンクをクリックすると：

をあなたが最初に我々は上をクリックするものである3つのリンク、第二リターンを見ることができますあなたが特定のの記事にジャンプすることを可能にするHTML、最後に記事のテキストが含まれています。ソースで

がfirstlinkから返され、次の2個ののiframeタグ見ることができます：

<div id="alberoTesto"> 
     <iframe 
      src="/atto/caricaAlberoArticoli?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="leftFrame" scrolling="auto" id="leftFrame" title="leftFrame" height="100%" style="width: 285px; float:left;" frameborder="0"> 
     </iframe> 

     <iframe 
      src="/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="mainFrame" id="mainFrame" title="mainFrame" height="100%" style="width: 800px; float:left;" scrolling="auto" frameborder="0"> 
     </iframe>

最初の記事、/caricaArticoloDefaultとIDメインフレームと、後者のためにあるの私たちが欲しいものです。

あなたはセッション対象とし、bs4を使用してページを解析することによってそれを行うことができますので、最初の要求からクッキーを使用する必要があります。

import requests, sys 
import os 
from urlparse import urljoin 
import io 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

with requests.session() as s: 
    s.headers.update(user_agent) 
    r = s.get("http://www.normattiva.it/") 
    soup = BeautifulSoup(r.content, "lxml") 
    # get all the links from the initial page 
    for a in soup.select("div.testo p a[href^=http]"): 
     soup = BeautifulSoup(s.get(a["href"]).content) 
     # The link to the text is in a iframe tag retuened from the previous get. 

     text_src_link = soup.select_one("#mainFrame")["src"] 

     # Pick something to make the names unique 
     with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f: 
      # The text is in pre tag that is in the div with the pre class 
      text = BeautifulSoup(s.get(urljoin("http://www.normattiva.it", text_src_link)).content, "html.parser")\ 
       .select_one("div.wrapper_pre pre").text 
      f.write(text)

最初のテキストファイルの抜粋：

   IL PRESIDENTE DELLA REPUBBLICA 
    Visti gli articoli 76, 87 e 117, secondo comma, lettera d), della 
Costituzione; 
    Vistala legge 28 novembre 2005, n. 246 e, in particolare, 
l'articolo 14: 
    comma 14, cosi' come sostituito dall'articolo 4, comma 1, lettera 
a), della legge 18 giugno 2009, n. 69, con il quale e' stata 
conferita al Governo la delega ad adottare, con le modalita' di cui 
all'articolo 20 della legge 15 marzo 1997, n. 59, decreti legislativi 
che individuano le disposizioni legislative statali, pubblicate 
anteriormente al 1° gennaio 1970, anche se modificate con 
provvedimenti successivi, delle quali si ritiene indispensabile la 
permanenza in vigore, secondo i principi e criteri direttivi fissati 
nello stesso comma 14, dalla lettera a) alla lettera h); 
    comma 15, con cui si stabilisce che i decreti legislativi di cui 
al citato comma 14, provvedono, altresi', alla semplificazione o al 
riassetto della materia che ne e' oggetto, nel rispetto dei principi 
e criteri direttivi di cui all'articolo 20 della legge 15 marzo 1997, 
n. 59, anche al fine di armonizzare le disposizioni mantenute in 
vigore con quelle pubblicate successivamente alla data del 1° gennaio 
1970; 
    comma 22, con cui si stabiliscono i termini per l'acquisizione del 
prescritto parere da parte della Commissione parlamentare per la 
semplificazione; 
    Visto il decreto legislativo 30 luglio 1999, n. 300, recante 
riforma dell'organizzazione del Governo, a norma dell'articolo 11 
della legge 15 marzo 1997, n. 59 e, in particolare, gli articoli da 
20 a 22;

出典

2016-09-15 13:13:19

ありがとう!!!!下記参照！ – terzim

すばらしい、素晴らしい、素晴らしいパドレック。できます。輸入品をクリアするためにちょっと編集しなければなりませんでしたが、うまく機能しています。どうもありがとう。私はちょうどpythonの可能性を発見しており、あなたはこの特定のタスクで私の旅をずっと簡単にしました。私はそれだけで解決していないでしょう。

import requests, sys 
import os 
from urllib.parse import urljoin 
from bs4 import BeautifulSoup 
import io 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

with requests.session() as s: 
    s.headers.update(user_agent) 
    r = s.get("http://www.normattiva.it/") 
    soup = BeautifulSoup(r.content, "lxml") 
    # get all the links from the initial page 
    for a in soup.select("div.testo p a[href^=http]"): 
     soup = BeautifulSoup(s.get(a["href"]).content) 
     # The link to the text is in a iframe tag retuened from the previous get. 

     text_src_link = soup.select_one("#mainFrame")["src"] 

     # Pick something to make the names unique 
     with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f: 
      # The text is in pre tag that is in the div with the pre class 
      text = BeautifulSoup(s.get(urljoin("http://www.normattiva.it", text_src_link)).content, "html.parser")\ 
       .select_one("div.wrapper_pre pre").text 
      f.write(text)

出典

2016-09-15 19:39:14 terzim

特定のウェブサイトでの掻き出しの問題

答えて

関連する問題