Tokenizing text with dialogue into sentences using NLTK

I can tokenize non-dialogue text into sentences, but when I add quotation marks to the sentences, the NLTK tokenizer does not split them correctly. For example, this works as expected:

import nltk.data 
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 
text1 = 'Is this one sentence? This is separate. This is a third he said.' 
tokenizer.tokenize(text1)

This gives a list of three separate sentences:

['Is this one sentence?', 'This is separate.', 'This is a third he said.']

However, the same process does not work when I turn the text into dialogue:

text2 = '“Is this one sentence?” “This is separate.” “This is a third” he said.' 
tokenizer.tokenize(text2)

This returns the whole text as a single sentence:

['“Is this one sentence?” “This is separate.” “This is a third” he said.']

How can I get the NLTK tokenizer to work in this case?

Source

2017-09-30 jss367

The tokenizer does not know what to do with directional (curly) quotation marks. Replace them with regular ASCII double quotes. This example works:

>>> import re
>>> import nltk
>>> text3 = re.sub('[“”]', '"', text2)
>>> nltk.sent_tokenize(text3)
['"Is this one sentence?"', '"This is separate."', '"This is a third" he said.']

Source

2017-09-30 13:00:56 alexis
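
The quote replacement described in the answer could also be wrapped in a small reusable helper. The following is only a minimal sketch: `normalize_quotes` and `QUOTE_MAP` are hypothetical names that are not part of NLTK, and the handling of curly single quotes is an added assumption that goes beyond what the answer covers.

```python
# Hypothetical helper (not part of NLTK): map typographic quotes to ASCII
# so the text can be handed to a sentence tokenizer afterwards.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark (assumption: also normalized)
    "\u2019": "'",  # right single quotation mark (assumption: also normalized)
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


text2 = ("\u201cIs this one sentence?\u201d \u201cThis is separate.\u201d "
         "\u201cThis is a third\u201d he said.")
print(normalize_quotes(text2))
```

The normalized string can then be passed to `nltk.sent_tokenize`, as in the answer above. `str.translate` is used here instead of `re.sub` only because it handles several character mappings in one pass; either approach should behave the same for the double quotes in this example.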
