2017-03-05 7 views
0

I am working on a binary classification problem with text data. I want to classify the words of the text based on occurrences of features from a few well-defined word classes that I have chosen. For now I search the whole text for occurrences of each word class and increment the count of the matching class; these counts are then used to compute the frequency of each word class. How do I implement re.search() in my code? Here is the code:

import nltk 
import re 

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
    watched dinner ate bed day house tv early boring
    yesterday watching sit"""

    conversation = """know people think person tell feel friends
    talk new talking mean ask understand feelings care thinking
    friend relationship realize question answer saying"""

    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    conversation = nltk.word_tokenize(conversation)
    home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return (countHome, countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text)) 

The drawback of this is that a word in the text has to match a word class entry exactly in order to be classified into that class, so I now have the extra overhead of stemming every word in every word class. Therefore I am currently trying to use each word of the text as a regular expression and search for it in each word class. This throws an error:

line 362, in wordClassFeatures 
if re.search(conversation, word): 
    File "/root/anaconda3/lib/python3.6/re.py", line 182, in search 
    return _compile(pattern, flags).search(string) 
    File "/root/anaconda3/lib/python3.6/re.py", line 289, in _compile 
    p, loc = _cache[type(pattern), pattern, flags] 
TypeError: unhashable type: 'list' 

I know there is a big mistake in the syntax, but I could not find it on the net, since most of the re.search examples I could find are in this format:

re.search("thank|appreciate|advance", x)

Is there a way to implement this correctly?

+1

It should be 're.search(word, conversation)'.

+0

@Rawing Tried that. Got this error: line 362, in wordClassFeatures: re.search(word, conversation); File "/root/anaconda3/lib/python3.6/re.py", line 182, in search: return _compile(pattern, flags).search(string); TypeError: expected string or bytes-like object

+0

This question needs a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example, so that it is easier for us to help you.

Answer

0

I believe re.search is looking for a string (or buffer), not the list you are feeding it through the conversation variable.
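For instance, this minimal sketch (with a made-up three-word class) reproduces both errors seen in the question and shows that a plain string works:

import re

conversation = ["know", "people", "think"]   # a tokenized word class, i.e. a list

#re.search(conversation, "know")    # TypeError: unhashable type: 'list'
#re.search("know", conversation)    # TypeError: expected string or bytes-like object
print(re.search("know", " ".join(conversation)))   # works: both arguments are strings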

Also, when tokenizing you are keeping all of the special characters of the text, and these throw the search off. So first we need to strip the text of all special characters:

text = re.sub(r'\W+', ' ', text)  #strip text of all special characters
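As a quick illustration (with a made-up sentence, not the question's text), that substitution replaces every run of non-word characters with a single space, so no punctuation is left to throw off the search:

import re

sample = "it's still java, now!"
print(re.sub(r'\W+', ' ', sample))   # -> "it s still java now " (the trailing '!' becomes a space)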

Next, we leave the conversation and home variables untokenized (in string form):

#conversation = nltk.word_tokenize(conversation) 
#home = nltk.word_tokenize(home) 

which gives the desired answer:
(0.21301775147928995, 0.20118343195266272) 

Full code:

import nltk 
import re 

def wordClassFeatures(text):
    home = """woke home sleep today eat tired wake watch
    watched dinner ate bed day house tv early boring
    yesterday watching sit"""

    conversation = """know people think person tell feel friends
    talk new talking mean ask understand feelings care thinking
    friend relationship realize question answer saying"""

    text = re.sub(r'\W+', ' ', text)  #strip text of all special characters

    countHome = countConversation = 0

    totalWords = len(text.split())

    text = text.lower()
    text = nltk.word_tokenize(text)
    #conversation = nltk.word_tokenize(conversation)
    #home = nltk.word_tokenize(home)
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords
    countHome /= 1.0*totalWords

    return (countHome, countConversation)

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text)) 
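For reference, the "thank|appreciate|advance" pattern format mentioned in the question works here as well: one alternation built from the class words can be searched against each text word. A minimal sketch of that variant (the conversation_pattern name and the sample sentence are made up for illustration):

import re

conversation = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""

# Build a single alternation ("know|people|think|...") from the class words,
# so each text word needs only one re.search call.
conversation_pattern = re.compile("|".join(re.escape(w) for w in conversation.split()))

sample = "I was talking to my friends yesterday"
print([w for w in sample.lower().split() if conversation_pattern.search(w)])
# -> ['talking', 'friends']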