How to fetch paragraphs from poorly structured HTML using Python?
I have this original HTML text:
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.<br> </li>
<li>Strong classroom management skills.<br> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br>
</p>
I tried something like this using Python:
soup = BeautifulSoup(html)
It returns new HTML text with two short paragraphs:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br/>
</p>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.
<br/> </li>
<li>Strong classroom management skills.
<br/> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br/> </p>
</body>
</html>
But that is not what I expected. Instead, I would like to get this HTML text:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.
</p>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>
</html>
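To get the HTML above, I think the best approach is to remove all HTML tags except <p> and </p> from the original HTML. For this purpose, I tried the following regular expression:
new_html = re.sub('<[^<]+?>', '', html)
Of course, this regular expression removes all HTML tags. So, how can I remove all HTML tags except <p> and </p>?
If somebody helps me write the regular expression, I will feed new_html to BeautifulSoup() and get the HTML I expect.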
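One possible way to express "every tag except <p> and </p>" is a negative lookahead placed right after the opening angle bracket. The sketch below is only an illustration under some assumptions: the names NON_P_TAG and strip_non_p_tags are hypothetical, the sample string is a shortened stand-in for the HTML in the question, and the pattern assumes no tag contains a literal ">" inside an attribute value. Regular expressions are generally fragile on real-world HTML, so treat this as a pre-processing step rather than a parser.

```python
import re

# Matches any tag except <p ...> and </p>.
# The negative lookahead (?!/?p\b) rejects "p" and "/p" right after "<",
# while the word boundary \b still lets <pre>, <param>, etc. be matched.
NON_P_TAG = re.compile(r'<(?!/?p\b)[^<>]+>', re.IGNORECASE)


def strip_non_p_tags(html):
    """Remove every tag except <p> and </p>, leaving the text in place."""
    return NON_P_TAG.sub('', html)


# Shortened, hypothetical stand-in for the HTML in the question.
sample = (
    'Intro text:<br>\n'
    '<ul>\n'
    '<li>First item </li>\n'
    '<li>Second item<br> </li>\n'
    '</ul>\n'
    '<p>Closing paragraph.\n'
    '<br>\n'
    '</p>'
)

new_html = strip_non_p_tags(sample)
print(new_html)
```

Note that the leading text still has no opening <p> after this step, so the result depends on whether the parser that receives new_html wraps bare text in a paragraph, as the output shown in the question suggests it does.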
Do you want to get the text? If so, 'soup.get_text()' should work. – styvane
No, I want to get a list of paragraphs. – user3601768
What about those 'li' tags? Do you want to replace them with just their text? – styvane