urllib2, Google App Engine, and Unicode question

Hey all, I'm just learning Google App Engine, so I'm running into a bunch of problems. My current predicament is this. I have a database:

class Website(db.Model): 
    web_address = db.StringProperty() 
    company_name = db.StringProperty() 
    content = db.TextProperty() 
    div_section = db.StringProperty() 
    local_links = db.StringProperty() 
    absolute_links = db.BooleanProperty() 
    date_updated = db.DateTimeProperty()

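I'm having a problem with the content property.

I'm using db.TextProperty() because I need to store the content of a web page, which is more than 500 bytes.

The problem I'm running into is the format of what urllib2.readlines() returns. When I put it into the TextProperty(), it gets converted to ASCII. Some of the characters are > 128, and that throws a UnicodeDecodeError.

Is there an easy way to get around this? For the most part, I don't care about those characters...

My error is: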

Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 511, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/game-job-finder/1.346504560470727679/main.py", line 61, in get
    x.content = website_data_joined
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 542, in __set__
    value = self.validate(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2407, in validate
    value = self.data_type(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore_types.py", line 1006, in __new__
    return super(Text, cls).__new__(cls, arg, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2124: ordinal not in range(128)

Source

2010-11-27 shawn

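I would have thought that converting Unicode to ASCII would be "encoding", not "decoding". Are you sure it's not the other way around? –

Yep, I'm sure. – shawn

Can you add the snippet that calls readlines and puts the data into the datastore? – systempuntoout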

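It looks like the lines returned from readlines are not unicode strings but byte strings (i.e. instances of str that potentially contain non-ASCII characters). These bytes are the raw data received in the HTTP response body, and they represent different strings depending on the encoding used. To treat them as text they need to be "decoded" (bytes != characters).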
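To make the bytes-versus-characters point concrete, here is a minimal sketch. The sample bytes are made up for illustration (0xc2 0xa9 is the UTF-8 encoding of the copyright sign); they are not taken from the page in the question.

```python
# Hypothetical sample: raw bytes as they might arrive in an HTTP response body.
raw = b'Copyright \xc2\xa9 2010'

# Decoding as ASCII fails on the first byte >= 0x80, like the traceback above.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the right encoding turns the two bytes 0xc2 0xa9 into one character.
text = raw.decode('utf-8')

print(ascii_ok)   # False
print(len(raw))   # 17 (bytes)
print(len(text))  # 16 (characters)
```

The byte count and the character count differ, which is why the encoding has to be known before the data can be stored as text.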

If the encoding is UTF-8, this code should work fine:

import urllib2
from google.appengine.ext import db

f = urllib2.urlopen('http://www.google.com')
website = Website()
website.content = db.Text(f.read(), encoding = 'utf-8-sig') # 'sig' deals with BOM if present

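Note that the actual encoding varies from website to website (and sometimes even from page to page). The encoding used should be included in the Content-Type header of the HTTP response (see this question for how to get it). If it isn't, it may be included in a meta tag in the HTML header, in which case extracting it properly is much trickier: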

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
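As a rough sketch of both lookups: the helper names below are hypothetical, and the regular expressions are a best-effort shortcut rather than a real HTML parser. With urllib2 the header string would typically come from `f.info().getheader('Content-Type')`; here it is passed in as a plain string so the example runs without a network connection.

```python
import re

# Matches charset=utf-8, charset="utf-8", charset = 'utf-8', etc.
_CHARSET_RE = r'charset\s*=\s*["\']?([\w.:-]+)'

def charset_from_content_type(value):
    """Return the charset named in a Content-Type header value, or None."""
    if not value:
        return None
    match = re.search(_CHARSET_RE, value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html_bytes):
    """Best-effort scan of the first 1024 bytes for a <meta ... charset=...> tag."""
    # Encoding names are ASCII, so dropping non-ASCII bytes is safe for this scan.
    head = html_bytes[:1024].decode('ascii', 'ignore')
    match = re.search(r'<meta[^>]+' + _CHARSET_RE, head, re.IGNORECASE)
    return match.group(1).lower() if match else None

print(charset_from_content_type('text/html; charset=UTF-8'))  # utf-8
print(charset_from_meta(b'<head><meta charset="ISO-8859-1"></head>'))  # iso-8859-1
```

A sensible order is to trust the HTTP header first and only fall back to the meta tag when the header carries no charset.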

Note that there are sites that don't specify an encoding at all, or that specify the wrong one.
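One way to cope with a missing or wrong declaration is to try a few candidates in order and fall back to a lossy decode. This is only a sketch under assumptions: `decode_body` is a hypothetical helper, and the fallback order (declared encoding, then UTF-8, then UTF-8 with replacement characters) is a judgment call, not something App Engine or urllib2 prescribes.

```python
def decode_body(raw, declared=None):
    """Decode raw response bytes, trying the declared encoding first."""
    for encoding in (declared, 'utf-8-sig'):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding (bad bytes) or unknown encoding name:
            # move on to the next candidate.
            continue
    # Last resort: never raises; undecodable bytes become U+FFFD.
    return raw.decode('utf-8', 'replace')

# Site claims ASCII but actually sends UTF-8.
print(decode_body(b'caf\xc3\xa9', 'ascii') == u'caf\xe9')  # True
```

The result is always a unicode string, so it can be handed to db.Text without an encoding argument.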

Finally, if you really don't care about any characters other than ASCII, you can just ignore them and be done with it:

f = urllib2.urlopen('http://www.google.com')
website = Website()
content = unicode(f.read(), errors = 'ignore') # Ignore characters that cause errors
website.content = db.Text(content) # Don't need to specify an encoding since content is already a unicode string

Source

2010-11-28 05:51:52 Cameron
