文字列検索ライブラリの結果 - バグ、機能、またはコーディングエラー？

私はを使用しています。Aho-Corasick文字列検索アルゴリズムを実装したPythonライブラリは、1回のパスで指定された文字列のパターンを検索します。出力は、私が期待していものではありません。文字列検索ライブラリの結果 - バグ、機能、またはコーディングエラー？

In [4]: import ahocorasick 
In [5]: import collections 

In [6]: tree = ahocorasick.KeywordTree() 

In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first" 

In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"] 

In [9]: for w in words: 
    ...:  tree.add(w) 
    ...: 

In [10]: tree.make() 

In [13]: final = collections.defaultdict(int) 

In [15]: for match in tree.findall(ss, allow_overlaps=True): 
    ....:  final[ss[match[0]:match[1]]] += 1 
    ....: 

In [16]: final 
{ 'the first sentence': 3, 'the first sentence is': 2}

私は期待していた出力は、このた：

{ 
    'the first sentence': 3, 
    'the first sentence is': 2, 
    'first sentence': 3, 
    'first sentence is': 2 
}

私は何かが足りないのですか？私は大きな文字列でこれをやっているので、後処理は私の最初の選択ではありません。希望の出力を得る方法はありますか？

出典

2011-11-11 Legend

ahocorasickモジュールについてよくわかりませんが、その結果は疑わしいと思われます。 acoraモジュールがこれを示しています

import acora 
import collections 

ss = "this is the first sentence in this book " 
    "the first sentence is really the most interesting " 
    "the first sentence is always first" 

words = ["first sentence is", 
     "first sentence", 
     "the first sentence", 
     "the first sentence is"] 

tree = acora.AcoraBuilder(*words).build() 

for match in tree.findall(ss): 
    result[match] += 1

結果：

>>> result 
defaultdict(<type 'int'>, 
      {'the first sentence' : 3, 
      'first sentence'  : 3, 
      'first sentence is' : 2, 
      'the first sentence is': 2})

出典

2011-11-11 22:35:25 Lemur

+1ありがとうございます。これは私の希望する出力と一致します。万が一大きな文章にこれを使用した経験がありますか？パフォーマンスは賢明です。 – Legend

大規模なコーパスを使った直接の経験はありません。 PyPiのページには、GILが解放され、「高速な」CPythonの実装が含まれていますが、それ以外はわかりません。 – Lemur

ああ、[ここ]（http：//www.quora.com/What-are-the-best）から抜粋して[esmre]（http://code.google.com/p/esmre/） -open-source-high-performance-string-matching-libraries）を使用します。 – Lemur

私がAho-Corasickのアルゴリズムを理解し、それを実装した方法は、あなたの期待される出力に同意するでしょう。使用しているPythonライブラリが間違っているようです。あるいは、特定の位置から始まる最も長いマッチではなく、ある位置からすべてのマッチを開始するためのフラグがあります。

元の論文の例であるhttp://www.win.tue.nl/~watson/2R080/opdracht/p333-aho-corasick.pdfが理解を助けます。

出典

2011-11-11 21:59:11

1は、有益な答えをいただき、ありがとうございます。私はこれが事実かもしれないと恐れていた。その図書館はかなり使用されているように思われます。 – Legend

文字列検索ライブラリの結果 - バグ、機能、またはコーディングエラー？

答えて

関連する問題