Removing stop words from a file

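I want to remove stop words from the Data column in my file. I have already filtered the rows down to the ones where the end user is speaking. However, the stop words are not filtered out by usertext.apply(lambda x: [word for word in x if word not in stop_words]). What am I doing wrong?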

import pandas as pd 
from stop_words import get_stop_words 
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1") 
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']] 
stop_words = get_stop_words('dutch') 
clean = usertext.apply(lambda x: [word for word in x if word not in stop_words]) 
print(clean)

Source

2017-03-08 DataNewB

To check whether it removes all the words, first try 'clean = usertext.apply(lambda x: [])'. –

Data [] chatid [] dtype: object ['aan', 'al', 'alles', 'als', 'altijd', 'andere', 'ben', 'bij', ..., 'hoe', 'hun', 'iemand', 'iets', 'ik', 'in', 'is', 'ja', 'je', ..., 'niet', 'nu', 'of', 'om', 'omdat', ...] This is the output of both – DataNewB

clean = usertext.apply(lambda x: x if x not in stop_words else '')

Source

2017-03-08 14:40:22 galaxyan

For efficiency, though, convert 'stop_words' to a 'set'. –

I get NameError: ("name 'word' is not defined", 'occurred at index Data') when I run it – DataNewB

@DataNewB sorry, it should be x – galaxyan

You can build a regex pattern from your stop words and call the vectorised str.replace to remove them:

In [124]: 
stop_words = ['a','not','the'] 
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in stop_words]) 
stop_words_pat 

Out[124]: 
'\\ba\\b|\\bnot\\b|\\bthe\\b' 

In [125]:  
df = pd.DataFrame({'text':['a to the b', 'the knot ace a']}) 
df['text'].str.replace(stop_words_pat, '', regex=True)  # regex=True is needed on pandas 2.0+, where the default is literal matching

Out[125]: 
0   to b 
1  knot ace 
Name: text, dtype: object

Here we use a list comprehension to build a pattern that surrounds each stop word with '\b', which is a word boundary, and then we OR all the words together using '|'.
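The pattern-building step above can be sketched with the standard re module alone. This is a minimal sketch rather than the answer's own code: the helper names build_stop_word_pattern and strip_stop_words are hypothetical, re.escape is added on the assumption that a stop word might contain regex metacharacters, and the whitespace cleanup is an extra step, since removing a word leaves its surrounding spaces behind.

```python
import re

def build_stop_word_pattern(stop_words):
    """Join the stop words into one alternation, each wrapped in word boundaries."""
    # re.escape guards against stop words that contain regex metacharacters.
    return '|'.join(r'\b' + re.escape(word) + r'\b' for word in stop_words)

def strip_stop_words(text, pattern):
    """Remove every match of the pattern, then collapse the leftover whitespace."""
    without_stops = re.sub(pattern, '', text)
    return ' '.join(without_stops.split())

# Same sample data as the session above.
pattern = build_stop_word_pattern(['a', 'not', 'the'])
cleaned = [strip_stop_words(text, pattern) for text in ['a to the b', 'the knot ace a']]
print(pattern)
print(cleaned)
```

Because of the word boundaries, 'knot' and 'ace' survive even though they contain 'not' and 'a'. With pandas, the same pattern string can be passed as df['text'].str.replace(pattern, '', regex=True).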

Source

2017-03-08 14:55:42 EdChum

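Two issues:

First, there is a module called stop_words, and later you create a variable named stop_words. This is bad form.

Second, you are passing .apply a lambda function that expects its x parameter to be a list, rather than a value within a list.

That is, instead of df.apply(sqrt) you are doing df.apply(lambda x: [sqrt(val) for val in x]).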
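To see why the original lambda has nothing to filter, it can help to look at what DataFrame.apply actually passes in. The sketch below uses a small hypothetical frame with the same column names as the question (the rows are invented, and record_call is a made-up helper). With the default axis=0 the function is called per column, and x is the entire column as a Series, so iterating over x yields whole cell values rather than single words.

```python
import pandas as pd

# Hypothetical stand-in for the question's usertext frame; the rows are invented.
usertext = pd.DataFrame({
    'Data': ['ik heb een vraag', 'dat is niet goed'],
    'chatid': [1, 2],
})

seen = []  # records (column name, type name of x) for each call made by apply

def record_call(x):
    seen.append((x.name, type(x).__name__))
    return x

usertext.apply(record_call)
print(seen)

# Iterating over a column yields whole cell values, never single words.
print(list(usertext['Data']))
```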

You should either do the list processing yourself:

clean = [x for x in usertext if x not in stop_words]

Or you should use apply with a function that takes one word at a time:

clean = usertext.apply(lambda x: x if x not in stop_words else '')

As @Jean-François Fabre suggested in a comment, you can get a speedup if your stop_words is a set rather than a list:

from stop_words import get_stop_words 

nl_stop_words = set(get_stop_words('dutch')) # NOTE: set 

usertext = ... 
clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
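The snippets above operate on one value at a time. If each cell of the Data column holds a whole sentence, which the question suggests but does not show, the text has to be split into words before the membership test. A minimal sketch under that assumption follows; the sample rows, the tiny stop word set (a stand-in for set(get_stop_words('dutch'))), and the helper name remove_stop_words are all hypothetical, and splitting on whitespace is a simplification that ignores punctuation.

```python
import pandas as pd

# Stand-in for set(get_stop_words('dutch')); only a few Dutch stop words.
nl_stop_words = {'ik', 'een', 'dat', 'is', 'niet', 'de'}

def remove_stop_words(text):
    """Drop stop words from one whitespace-separated string (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in nl_stop_words]
    return ' '.join(kept)

# Hypothetical sample rows in the shape of the question's data.
usertext = pd.DataFrame({
    'Data': ['Ik heb een vraag', 'dat is niet goed'],
    'chatid': [1, 2],
})

# Series.apply calls the function once per cell, i.e. once per sentence.
clean = usertext['Data'].apply(remove_stop_words)
print(clean.tolist())
```

Selecting the single Data column first means apply works cell by cell, and the set keeps each membership test cheap, as suggested in the comments.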

Source

2017-03-08 15:10:39
