2016-04-22 13 views
0

How to fetch paragraphs from badly structured HTML using Python?

I have this original HTML text:

This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p> 

I am using Python and tried something like this:

soup = BeautifulSoup(html) 

It returns new HTML text with two short paragraphs:

<html> 

<body> 
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
     <br/> 
    </p> 
    <ul> 
     <li>AA Early Childhood Education, or related field. </li> 
     <li>2+ years experience in a licensed childcare facility </li> 
     <li>Ability to meet state requirements, including finger print clearance. </li> 
     <li>Excellent oral and written communication skills </li> 
     <li>Strong organization and time management skills. </li> 
     <li>Creativity in expanding children's learning through play. 
      <br/> </li> 
     <li>Strong classroom management skills. 
      <br/> </li> 
    </ul> 
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
     <br/> </p> 
</body> 

</html> 

But that is not what I expected. As a result, I would like to get this HTML text:

<html> 

<body> 
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
     AA Early Childhood Education, or related field. 
     2+ years experience in a licensed childcare facility 
     Ability to meet state requirements, including finger print clearance. 
     Excellent oral and written communication skills 
     Strong organization and time management skills. 
     Creativity in expanding children's learning through play. 
     Strong classroom management skills. 
    </p> 
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p> 
</body> 

</html> 

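To get the HTML above, I think the best approach is to remove all HTML tags except <p></p> from the original HTML. For this purpose, I tried the following regular expression:

new_html = re.sub('<[^<]+?>', '', html) 

Of course, this regular expression removes all HTML tags. So how can I remove all HTML tags except <p></p>?

Could somebody help me write the regular expression? I would then feed new_html to BeautifulSoup() and get the HTML I expect.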

+2

Do you want to get the text? If so, 'soup.get_text()' should work. – styvane

+0

No, I want to get a list of paragraphs. – user3601768

+1

What about those 'li' tags? Do you want to replace them with just their text? – styvane

Answers

1

Short answer

new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

Long answer

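Your original regular expression looks odd. I would have put [^>] instead of [^<]: what you want is "anything that is not the closing bracket".

Also, putting ? after + is odd.

+ means: "repeat one or more times".

? means: "repeat zero or one time".

Having both signs together is very strange.

Anyway, we can express the regular expression you need like this:

"open tag", then "anything that is neither 'p' nor '/p'", then "close tag"

Which is equivalent to:

"open tag", then either "a single char that is not 'p'", or "a single char that is not a slash, followed by one or more chars", or "a slash, then a single char that is not 'p'", or "a slash, then two or more chars", then "close tag"

Which is equivalent to:

< then ([^p] or [^>/][^>]+ or /[^p] or /[^>][^>]+) then >

This is what the regular expression above expresses.

Here is a quick test to type into the Python console: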

import re

re.sub(
    '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', 
    '', 
    'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff') 
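
Running a test like the one above should leave only the <p> and </p> tags in place. The sketch below checks that on a shortened sample string, and also shows one limitation worth knowing about: the pattern spares only the exact strings <p> and </p>, so an opening tag with attributes such as <p class="x"> is removed as well. The `strip_except_p` helper and its lookahead pattern are hypothetical, added here for illustration only, and assume that attributes on <p> should be kept.

```python
import re

# Pattern from the answer above: spares only the exact tags <p> and </p>.
ANSWER_PATTERN = r'<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>'

# Hypothetical alternative (an assumption, not from the answer): a negative
# lookahead skips any tag whose name is exactly "p", with or without attributes.
LOOKAHEAD_PATTERN = r'<(?!/?p\b)[^>]+>'


def strip_except_p(html, pattern=LOOKAHEAD_PATTERN):
    """Remove every tag matched by `pattern`, leaving the text in place."""
    return re.sub(pattern, '', html)


sample = 'aa <p> bb <a> cc <li> dd <pp> ee </p> gg </a> hh </li> ii </pp> jj'
print(re.sub(ANSWER_PATTERN, '', sample))
print(strip_except_p(sample))

with_attrs = '<p class="x">text<br></p>'
print(re.sub(ANSWER_PATTERN, '', with_attrs))  # the opening <p ...> is removed too
print(strip_except_p(with_attrs))              # the opening <p ...> is kept
```

Both patterns give the same result on the plain sample; they differ only once a <p> tag carries attributes.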
+1

'+?' means one or more, non-greedy. Without that '?', the closing tag would be captured too. You are right, though, that [^>] is better. –
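
The general difference between the greedy '+' and the non-greedy '+?' mentioned in this comment can be checked quickly in a Python console. The sample string below is made up for illustration, and '.' is used in place of '[^<]' to make the effect easy to see.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Non-greedy '+?' stops as soon as the rest of the pattern can match,
# so each tag is matched on its own.
lazy = re.findall(r'<.+?>', html)
print(lazy)

# Greedy '+' grabs as much as possible, so a single match spans from the
# first '<' to the last '>'.
greedy = re.findall(r'<.+>', html)
print(greedy)

# With the negated class '[^>]' the two forms behave the same, because
# '[^>]' can never run past the closing bracket.
same = re.findall(r'<[^>]+>', html) == re.findall(r'<[^>]+?>', html)
print(same)
```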

+0

Why should one use a regular expression instead of an HTML parser? – styvane

1

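This is a kind of manual document manipulation, but you can loop over the li elements and remove each one after appending its text to the first paragraph. Then remove the ul element as well: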

from bs4 import BeautifulSoup 


data = """ 
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>""" 

soup = BeautifulSoup(data, "lxml") 

p = soup.p 
for li in soup.find_all("li"): 
    p.append(li.get_text()) 
    li.extract() 

soup.find("ul").extract() 
print(soup.prettify()) 

This prints two paragraphs, as you planned to have:

<html> 
<body> 
    <p> 
    This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
    <br/> 
    AA Early Childhood Education, or related field. 
    2+ years experience in a licensed childcare facility 
    Ability to meet state requirements, including finger print clearance. 
    Excellent oral and written communication skills 
    Strong organization and time management skills. 
    Creativity in expanding children's learning through play. 
    Strong classroom management skills. 
    </p> 
    <p> 
    The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br/> 
    </p> 
</body> 
</html> 

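Note that there is an important difference in how lxml, html.parser and html5lib parse the input HTML you posted. html5lib and html.parser do not automatically create the first paragraph, which makes the code above specific to lxml.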
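
A minimal sketch of that difference, using only the built-in html.parser backend so that it does not depend on lxml being installed. The input string is a made-up, shortened stand-in with the same shape as the HTML in the question: loose text first, then one explicit <p>.

```python
from bs4 import BeautifulSoup

# Made-up input shaped like the question's HTML:
# loose text first, then one explicit <p>.
html = "Loose text <br><p>Explicit paragraph</p>"

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# html.parser leaves the loose text as a bare string, so only the
# explicit paragraph is found and soup.p points at it.
print(len(paragraphs))
print(soup.p.get_text())
```

With this parser, soup.p is the explicit paragraph rather than the leading text, so code that appends the li text to soup.p would add it to the wrong paragraph.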


A better approach would probably be to create a separate "soup" object. Sample:

from bs4 import BeautifulSoup 


data = """ 
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>""" 

soup = BeautifulSoup(data, "lxml") 

# create new soup 
new_soup = BeautifulSoup("<body></body>", "lxml") 
new_body = new_soup.body 

# create first paragraph 
first_p = new_soup.new_tag("p") 
first_p.append(soup.p.get_text()) 

for li in soup.find_all("li"): 
    first_p.append(li.get_text()) 

new_body.append(first_p) 

# create second paragraph 
second_p = soup.find_all("p")[-1] 
new_body.append(second_p) 

print(new_soup.prettify()) 

Prints:

<html> 
<body> 
    <p> 
    This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
    AA Early Childhood Education, or related field. 
    2+ years experience in a licensed childcare facility 
    Ability to meet state requirements, including finger print clearance. 
    Excellent oral and written communication skills 
    Strong organization and time management skills. 
    Creativity in expanding children's learning through play. 
    Strong classroom management skills. 
    </p> 
    <p> 
    The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br/> 
    </p> 
</body> 
</html> 
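
If a plain list of paragraph strings is enough (as the asker mentions in a comment), a parser-independent sketch is possible with only the standard library's html.parser module. This is a different technique from the BeautifulSoup code above, not a replacement for it: the `ParagraphCollector` class, its `SEPARATORS` set and the sample string are all hypothetical, and the sketch assumes that every <p> or </p> tag marks a paragraph boundary.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects text into paragraphs, splitting at <p> and </p>."""

    # Tags treated as word separators so that list items do not run together.
    SEPARATORS = {"br", "li", "ul", "ol"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._chunks = []

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()
        elif tag in self.SEPARATORS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in self.SEPARATORS:
            self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text that was never wrapped in <p>


sample = (
    "Intro text:<br><ul><li>First item.</li><li>Second item.<br></li></ul>"
    "<p>Closing paragraph.<br></p>"
)
collector = ParagraphCollector()
collector.feed(sample)
collector.close()
print(collector.paragraphs)
```

Because the loose leading text is flushed when the first <p> is seen, the result does not depend on whether a parser would have wrapped that text in a paragraph of its own.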
+1

@user3601768 Yeah, making a separate soup object is probably a better idea; I have updated the answer with a sample. Hope that helps. – alecxe
