Python HTML removal

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Source

2009-02-28 user29772

It may be overkill for your purposes, but if your strings contain more complicated or malformed HTML, give BeautifulSoup a try. Caveat: I don't think it's available for Python 3.0 yet. – bernie

You can use a regular expression to remove all the tags:

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> re.sub('<[^>]*>', '', s) 
'blah blah link'

Source

2009-02-28 22:43:17

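You can simplify the regex to '<.*?>' and get the same result, but this assumes properly formatted HTML. – UnkwnTech

Do you need to check for a quoted >, or is that not allowed? Can you have … or something? –

@Unkwntech: I prefer <[^>]*> over <.*?>, because the former doesn't need to keep backtracking to find the end of the tag. –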
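To make the comparison in these comments concrete, here is a minimal sketch that runs both patterns over three inputs. The `multiline` and `quoted_gt` strings are made-up samples, not from the question. Both patterns agree on the simple case, they differ when a tag spans a line break (`.` does not match a newline unless `re.DOTALL` is set), and both are fooled by a `>` inside a quoted attribute value.

```python
import re

NEGATED_CLASS = re.compile(r'<[^>]*>')  # consume everything up to the next '>'
LAZY_DOT = re.compile(r'<.*?>')         # lazy match; '.' skips newlines by default

simple = 'blah blah <a href="blah">link</a>'
multiline = '<a\nhref="blah">link</a>'   # assumed sample: tag split across lines
quoted_gt = '<a title="a > b">link</a>'  # assumed sample: '>' inside an attribute

for name, pattern in (('<[^>]*>', NEGATED_CLASS), ('<.*?>', LAZY_DOT)):
    for sample in (simple, multiline, quoted_gt):
        print(name, repr(pattern.sub('', sample)))
```

Neither pattern understands quoting, so the third sample leaves attribute debris behind with either one.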

Give Beautiful Soup a try. Throw away everything except the text.

Source

2009-02-28 22:52:16

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> q = re.compile(r'<.*?>', re.IGNORECASE) 
>>> re.sub(q, '', s) 
'blah blah link'

Source

2009-02-28 23:23:36 riza

If a regex solution hits a wall, try this super easy and reliable BeautifulSoup program.

from BeautifulSoup import BeautifulSoup 

html = "<a> Keep me </a>" 
soup = BeautifulSoup(html) 

text_parts = soup.findAll(text=True) 
text = ''.join(text_parts)

Source

2009-03-01 02:00:18 Triptych

BeautifulSoup hits the same wall too. http://stackoverflow.com/questions/598817/python-html-removal/600471#600471 – jfs
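If installing BeautifulSoup is not an option, the same "keep only the text nodes" idea can be sketched with the standard library's html.parser module. This is a different parser swapped in for illustration, not what the answer above uses, and the names `TextOnlyParser` and `strip_tags` are hypothetical. It assumes Python 3.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes and ignore every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return ''.join(parser.parts)


print(repr(strip_tags('<a> Keep me </a>')))
print(repr(strip_tags('blah blah <a href="blah">link</a>')))
```

Note that this sketch also keeps the contents of script and style elements, since they arrive as text data too; filtering those would need extra handling in handle_starttag and handle_endtag.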

There's also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml 
# Only allow <b>, <a>, <i>, <br>, and <p> tags 
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p")) 
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide. 
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you simply want to strip away all HTML, pass valid_tags=() to the first function.

You can find the documentation here.

Source

2009-03-01 14:45:46 MrTopf

html2text will do something like this.

Source

2009-03-01 18:38:03 RexE

html2text is great for producing nicely formatted, readable output without an extra step. If all the HTML strings you need to convert are as simple as the example, BeautifulSoup is the way to go. For more complex input, html2text does a great job of preserving the readable intent of the original. –

Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See: Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

A solution based on an HTML/XML parser might help in such cases, e.g. stripogram, suggested by @MrTopf.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a single root element so it parses as XML.
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
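For reference, a minimal sketch of the same approach using only the standard library's xml.etree, wrapped in a function. The names `strip_tags_xml` and `strip_tags_or_none` are hypothetical, and the extra sample strings are assumptions. Because this is an XML parser, a '>' inside a quoted attribute value is handled, but input that is not well-formed XML (an unclosed <br>, for instance) raises ParseError instead of being stripped.

```python
from xml.etree import ElementTree as etree  # standard library


def strip_tags_xml(fragment):
    """Wrap the fragment in a root element and join all of its text nodes."""
    root = etree.fromstring('<html>%s</html>' % fragment)
    return ''.join(root.itertext())


def strip_tags_or_none(fragment):
    """Same as above, but return None when the fragment is not well-formed XML."""
    try:
        return strip_tags_xml(fragment)
    except etree.ParseError:
        return None


print(strip_tags_xml('blah blah <a href="blah">link</a> END'))
print(strip_tags_xml('<a title="a > b">link</a>'))
print(strip_tags_or_none('a<br>b'))
```

Whether failing loudly on malformed markup is acceptable depends on where the HTML comes from; for scraped pages a tolerant HTML parser is usually the safer choice.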

Source

2009-03-01 20:42:41 jfs

I just wrote this. I need it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text. Store it, feed it to your pet canary.

import html2text

class TextFromHtml2Text:

    def __init__(self, url=''):
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Despite the name, self.url is a local file path here.
        with open(self.url) as f:
            self.html = f.read()

    def maytheswartzbewithyou(self):
        self.text = html2text.html2text(self.html)

Source

2012-06-29 17:41:43

You could also write this as 'import urllib, html2text [break] def get_text_from_html_url(url): [break] return html2text.html2text(urllib.urlopen(url).read())', which is shorter and cleaner. –

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)
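One limitation of the remove_html_markup function posted in this thread is that it toggles a single quote flag for both ' and ", so an attribute value such as title="it's" leaves it stuck in the quoted state and the rest of the string is dropped. The following is a sketch of one possible variant, not the author's version, that remembers which quote character opened the value. The sample inputs are assumptions.

```python
def remove_html_markup(s):
    in_tag = False
    quote_char = None  # quote character that opened the current attribute value
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the matching quote ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)


print(remove_html_markup('blah blah <a href="blah">link</a>'))
print(remove_html_markup('<a title="it\'s">link</a>'))
```

Like the original, this is a character-level state machine and knows nothing about comments, script blocks, or entities, so it is best suited to small, simple fragments.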

Source

2013-01-22 17:31:08 Medeiros
