文字列から部分文字列の組み合わせを生成する

私は、与えられた単語のすべての可能な音節の組み合わせを生成しようとしています。何が音節であるかを特定するプロセスはここでは重要ではありませんが、それは私に問題をもたらしているすべての組み合わせの生成です。私はこれがおそらく（他の方法は問題ありませんが）いくつかの行で再帰的に行うことが可能だと思いますが、それを動作させるのに問題があります。誰も助けることができますか？文字列から部分文字列の組み合わせを生成する

// how to test a syllable, just for the purpose of this example 
    bool IsSyllable(string possibleSyllable) 
    { 
     return Regex.IsMatch(possibleSyllable, "^(mis|und|un|der|er|stand)$"); 
    } 

    List<string> BreakIntoSyllables(string word) 
    { 
     // the code here is what I'm trying to write 
     // if 'word' is "misunderstand" , I'd like this to return 
     // => {"mis","und","er","stand"},{ "mis","un","der","stand"} 
     // and for any other combinations to be not included 
    }

出典

2011-10-06 mikel

あなたの音節をカタログ化するには、Trieを使用する必要があります。または、あなたはnaiverソリューションを使用することができます:-) – xanatos

はこれで起動してみてくださいません：

var word = "misunderstand"; 

Func<string, bool> isSyllable = 
    t => Regex.IsMatch(t, "^(mis|und|un|der|er|stand)$"); 

var query = 
    from i in Enumerable.Range(0, word.Length) 
    from l in Enumerable.Range(1, word.Length - i) 
    let part = word.Substring(i, l) 
    where isSyllable(part) 
    select part;

これが返されます。

misunderstand-results

が助けを少なくともで開始することにしていますか？

編集：私はこの問題について少しさらに考えANクエリのこのカップルを思い付いた：

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     let e = t.Substring(n) 
     from g in (new [] { new [] { e } }).Concat(splitter(e)) 
     select (new [] { s }).Concat(g).ToArray(); 

var query = 
    from split in (new [] { new [] { word } }).Concat(splitter(word)) 
    where split.All(part => isSyllable(part)) 
    select split;

今queryリターンこの：

misunderstanding-results2

、私を知ってみましょうそれは今それを釘付けにしている場合。

出典

2011-10-06 13:27:49 Enigmativity

ありがとう、これは有望そうですが、それは私に現在コンパイルエラーを与えています - "'System.Web.Collections.Generic.IEnumerable 'には 'StartWith'皮切りに'" 。 – mikel

@mikel - あ、ごめんなさい。 Reactive Framework拡張メソッドを使用しました。私は私のPCに戻ったときにそれを修正します。 – Enigmativity

@mikel - 標準の演算子を使用するようにクエリを変更しました。 – Enigmativity

通常、このタイプの問題は、Triesを使用して解決されます。私はHow to create a trie in c#（私はそれを書き直していることに注意してください）にトライの実装をベースにします。

var trie = new Trie(new[] { "un", "que", "stio", "na", "ble", "qu", "es", "ti", "onable", "o", "nable" }); 
//var trie = new Trie(new[] { "u", "n", "q", "u", "e", "s", "t", "i", "o", "n", "a", "b", "l", "e", "un", "qu", "es", "ti", "on", "ab", "le", "nq", "ue", "st", "io", "na", "bl", "unq", "ues", "tio", "nab", "nqu", "est", "ion", "abl", "que", "stio", "nab" }); 

var word = "unquestionable"; 

var parts = new List<List<string>>(); 

Split(word, 0, trie, trie.Root, new List<string>(), parts); 

// 

public static void Split(string word, int index, Trie trie, TrieNode node, List<string> currentParts, List<List<string>> parts) 
{ 
    // Found a syllable. We have to split: one way we take that syllable and continue from it (and it's done in this if). 
    // Another way we ignore this possible syllable and we continue searching for a longer word (done after the if) 
    if (node.IsTerminal) 
    { 
     // Add the syllable to the current list of syllables 
     currentParts.Add(node.Word); 

     // "covered" the word with syllables 
     if (index == word.Length) 
     { 
      // Here we make a copy of the parts of the word. This because the currentParts list is a "working" list and is modified every time. 
      parts.Add(new List<string>(currentParts)); 
     } 
     else 
     { 
      // There are remaining letters in the word. We restart the scan for more syllables, restarting from the root. 
      Split(word, index, trie, trie.Root, currentParts, parts); 
     } 

     // Remove the syllable from the current list of syllables 
     currentParts.RemoveAt(currentParts.Count - 1); 
    } 

    // We have covered all the word with letters. No more work to do in this subiteration 
    if (index == word.Length) 
    { 
     return; 
    } 

    // Here we try to find the edge corresponding to the current character 

    TrieNode nextNode; 

    if (!node.Edges.TryGetValue(word[index], out nextNode)) 
    { 
     return; 
    } 

    Split(word, index + 1, trie, nextNode, currentParts, parts); 
} 

public class Trie 
{ 
    public readonly TrieNode Root = new TrieNode(); 

    public Trie() 
    { 
    } 

    public Trie(IEnumerable<string> words) 
    { 
     this.AddRange(words); 
    } 

    public void Add(string word) 
    { 
     var currentNode = this.Root; 

     foreach (char ch in word) 
     { 
      TrieNode nextNode; 

      if (!currentNode.Edges.TryGetValue(ch, out nextNode)) 
      { 
       nextNode = new TrieNode(); 
       currentNode.Edges[ch] = nextNode; 
      } 

      currentNode = nextNode; 
     } 

     currentNode.Word = word; 
    } 

    public void AddRange(IEnumerable<string> words) 
    { 
     foreach (var word in words) 
     { 
      this.Add(word); 
     } 
    } 
} 

public class TrieNode 
{ 
    public readonly Dictionary<char, TrieNode> Edges = new Dictionary<char, TrieNode>(); 
    public string Word { get; set; } 

    public bool IsTerminal 
    { 
     get 
     { 
      return this.Word != null; 
     } 
    } 
}

wordに興味があります文字列、partsは（おそらくそれList<string[]>作ることがより正しいだろうが、それはそれを行うのは非常に簡単だ可能音節のリストのリストが含まれますです。その代わりparts.Add(new List<string>(currentParts));のparts.Add(currentParts.ToArray());を書いて、List<string[]>にすべてList<List<string>>を変更します。

私はそれが間違っている音節のではなく、ただちにそれらを後でポストフィルタリングを破棄するのでthasがtheretically速く彼よりもEnigmativity応答のバリエーションを追加します。あなたがそれを好きなら、あなたがすべき彼に+1を与えてください。なぜなら、彼の考えがなければ、このバリアン可能ではないでしょう。しかし、それはまだハックです。我々は、の長さを長さ1から単語の全ての可能な音節を、計算

 from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s)

：「右」溶液は

Func<string, bool> isSyllable = t => Regex.IsMatch(t, "^(un|que|stio|na|ble|qu|es|ti|onable|o|nable)$"); 

Func<string, IEnumerable<string[]>> splitter = null; 
splitter = 
    t => 
     (
     from n in Enumerable.Range(1, t.Length - 1) 
     let s = t.Substring(0, n) 
     where isSyllable(s) 
     let e = t.Substring(n) 
     let f = splitter(e) 
     from g in f 
     select (new[] { s }).Concat(g).ToArray() 
     ) 
     .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]); 

var parts = splitter(word).ToList();

:-)説明をトライ（複数可）を使用しています単語 - 1とそれが音節かどうかを確認します。我々は直接非音節を取り除いた。音節としての完全な単語は後でチェックされます。

 let e = t.Substring(n) 
     let f = splitter(e)

我々は、文字列

 from g in f 
     select (new[] { s }).Concat(g).ToArray()

そして、私たちは「現在の」音節で見つかった音節をチェーンの残りの部分の音節を検索します。多くの無駄な配列を作成していることに注意してください。結果としてIEnumerable<IEnumerable<string>>を受け取れば、ToArrayを取り除くことができます。

（私たちは

 from g in splitter(t.Substring(n)) 
     select (new[] { s }).Concat(g).ToArray()

のように、多くのletを削除、一緒に多くの行を書き換えることができますが、我々は明確にするためにそれを行うことはありません）

そして、我々が見つかりました音節と「現在」の音節を連結。ここで

 .Concat(isSyllable(t) ? new[] { new string[] { t } } : new string[0][]);

少し複雑なこのConcatを使用して、空の配列を作成していない少しのように、私たちは、クエリを再構築することができますが、それは次のようになり

で（我々はisSyllable(t) ? new[] { new string[] { t } }.Concat(oldLambdaFunction) : oldLambdaFunctionとして全体ラムダ関数を書き換えることができ）最終的に、単語全体が音節であれば、単語全体を音節として追加します。そうでなければ我々Concat空の配列は、（そう何Concat）

出典

2011-10-06 13:26:29 xanatos

あなたがデータセットの大きさがわかりませんが、これは単純に「これは音節ですか？ n =単語の中の文字の数（単語が意味を持たない場合は、大きなデータセットでは遅くなる可能性があることを意味します）の単語ごとに「音節検出」ルーチンを約0（n * n）回呼び出す必要があります。。これは、検出アルゴリズムのスケーラビリティを考慮していません。これは、さらに音節を追加するにつれて減速する可能性があります。。

私は、音節が何であるかを特定するためのあなたのプロセスが適切ではないことを知っていますが、それはそれが自動補完のように機能するように変更することができます。つまり、この点から可能な音節は、はるかにスケーラブルである。パフォーマンスが低下した場合はtrieに置き換えてください。

出典

2011-10-06 13:53:39

文字列から部分文字列の組み合わせを生成する

答えて

関連する問題