2016-09-19 2 views

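I'm trying to scrape poems from PoetryFoundation.org. In one of my test cases, when I pull the HTML for a particular poem, an extra </body> shows up before the poem actually ends. I can view the poem's source online, and (as expected) there is no </body> in the middle of the poem. I'm using Beautiful Soup on Python 3.5.1.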

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

poem_page = urlopen("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956") 
poem_soup = BeautifulSoup(poem_page.read(), "html5lib") 
print(poem_soup) 

I built the example around a specific URL so that others can try to reproduce the problem. I have tried this with html5lib and lxml as well as the default parser, html.parser.

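If you search the output for "in the poem", you will find the snippet of HTML below. It closes the entire HTML document with </body></html> in the middle of the poem and then continues with the rest of the document: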

in the poem</div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></body></html>. But when we met,<br/><div style="text-indent: -1em; padding-left: 1em;"><br/> 

I have looked at the source online, and this is what it should look like:

in the poem</em>. But when we met,<br></div><div style="text-indent: -1em; padding-left: 1em;"> 

I don't understand why, when I scrape the page, the whole HTML document gets closed in the middle of it.

That is probably caused by the odd markup around that point; it should probably be the other way around. – Xufox

Which versions of libxml etc. do you have installed? –

@Oregano, if you said you tried lxml and it didn't work, how does the answer you accepted work? –

Answers

When I tried to fetch the poem at your URL with html.parser, I ran into the same problem you did: the HTML was truncated at "in the poem".

import requests 
from bs4 import BeautifulSoup 

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956") 
poem_soup = BeautifulSoup(poem_page.text, "html.parser") 
poem_div = poem_soup.find('div', class_='poem') 
print(poem_div)

OUTPUT:

<div class="poem" data-view="ContentView"> 
<div style="text-indent: -1em; padding-left: 1em;">It seems a certain fear underlies everything. <br/></div><div style="text-indent: -1em; padding-left: 1em;">If I were to tell you something profound<br/></div><div style="text-indent: -1em; padding-left: 1em;"> it would be useless, as every single thing I know<br/></div><div style="text-indent: -1em; padding-left: 1em;"> is not timeless. I am particularly risk-averse.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">I choose someone else over me every time, <br/></div><div style="text-indent: -1em; padding-left: 1em;">as I'm sure they'll finish the task at hand, <br/></div><div style="text-indent: -1em; padding-left: 1em;">which is to say that whatever is in front of us<br/></div><div style="text-indent: -1em; padding-left: 1em;"> will get done if I'm not in charge of it.<br/></div><div style="text-indent: -1em; padding-left: 1em;"><br/></div><div style="text-indent: -1em; padding-left: 1em;">There is a limit to the number of times <br/></div><div style="text-indent: -1em; padding-left: 1em;">I can practice every single kind of mortification <br/></div><div style="text-indent: -1em; padding-left: 1em;">(of the flesh?). I can turn toward you and say <em>yes, <br/></em></div><div style="text-indent: -1em; padding-left: 1em;">it was you in the poem</div></div> 
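The truncated output closes <em> right before </div>, while the page source quoted in the question closes it after "in the poem". That suggests the <em> in the raw markup spans two <div> lines, which is misnested HTML that different parsers repair in different ways. Below is a minimal, standard-library-only sketch for spotting this kind of misnesting. The class name NestingChecker and the sample string are hypothetical: the sample is my reconstruction from the two snippets, not the actual page source, and whether this misnesting is really what trips the parsers is an assumption.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Hypothetical helper: records closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.problems = []   # (closing_tag, expected_tag, (line, column))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        expected = self.open_tags[-1] if self.open_tags else None
        if tag == expected:
            self.open_tags.pop()
            return
        # The closing tag does not match the innermost open element.
        self.problems.append((tag, expected, self.getpos()))
        # Recover by dropping the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]


# Assumed reconstruction of the markup around "in the poem" (not the real page source).
sample = (
    '<div class="poem">'
    '<div>I can turn toward you and say <em>yes, <br></div>'
    '<div>it was you in the poem</em>. But when we met,<br></div>'
    '</div>'
)

checker = NestingChecker()
checker.feed(sample)
checker.close()

for tag, expected, pos in checker.problems:
    print(f"</{tag}> closes while <{expected}> is innermost, at line {pos[0]}, column {pos[1]}")
```

On this sample the checker reports two problems: a </div> that arrives while <em> is still open, and the later </em> that arrives inside the next <div>. Running the same checker on the raw response text would show whether the real page has the same pattern.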

After switching the parser to lxml, however, everything works.

import requests 
from bs4 import BeautifulSoup 

poem_page = requests.get("https://www.poetryfoundation.org/poems-and-poets/poems/detail/57956") 
poem_soup = BeautifulSoup(poem_page.text, "lxml") 
poem_div = poem_soup.find('div', class_='poem') 
# print(poem_div)
for s in poem_div.find_all('div'):
    print(list(s.children)[0])

OUTPUT:

It seems a certain fear underlies everything. 
If I were to tell you something profound 
it would be useless, as every single thing I know 
is not timeless. I am particularly risk-averse. 
<br/> 
I choose someone else over me every time, 
as I'm sure they'll finish the task at hand, 
which is to say that whatever is in front of us 
will get done if I'm not in charge of it. 
<br/> 
There is a limit to the number of times 
I can practice every single kind of mortification 
(of the flesh?). I can turn toward you and say 
it was you in the poem. But when we met, 
<br/> 
you were actually wearing a shirt, and the poem 
wasn't about you or your indecipherable tattoo. 
The poem is always about me, but that one time 
I was in love with the memory of my twenties 
<br/> 
so I was, for a moment, in love with you 
because you remind me of an approaching 
subway brushing hair off my face with 
its hot breath. Darkness. And then light, 
<br/> 
the exact goldness of dawn fingering 
that brick wall out my bedroom window 
on Smith Street mornings when I'd wake 
next to godknowswho but always someone 
<br/> 
who wasn't a mistake, because what kind 
of mistakes are that twitchy and joyful 
even if they're woven with a particular 
thread of regret: the guy who used 
<br/> 
my toothbrush without asking, 
I walked to the end of a pier with him, 
would have walked off anywhere with him 
until one day we both landed in California 
<br/> 
when I was still young, and going West 
meant taking a laptop and some clothes 
in a hatchback and learning about produce. 
I can turn toward you, whoever you are, 
<br/> 
and say you are my lover simply because 
I say you are, and that is, I realize, 
a tautology, but this is my poem. I claim 
nothing other than what I write, and even that, 
<br/> 
I'd leave by the wayside, since the only thing 
to pack would be the candlesticks, and 
even those are burned through, thoroughly 
replaceable. Who am I kidding? I don't 
<br/> 
own anything worth packing into anything. 
We are cardboard boxes, you and I, stacked 
nowhere near each other and humming 
different tunes. It is too late to be writing this. 
<br/> 
I am writing this to tell you something less 
than neutral, which is to say I'm sorry. 
It was never you. It was always you: 
your unutterable name, this growl in my throat. 
<br/> 