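Regex to match indented continuation lines

I need to match key = value pairs in arbitrary text, using the following rules.

  • The leading line has the following structure:
    • indent: the line starts with "two spaces or one tab", repeated at least once, e.g. (  |\t)+
    • the + character and one space
    • the word VAR or CONST
    • the key, the = character, and the value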


  + VAR somename = somevalue (indented with two spaces)
     + VAR name3 = indented by one \t 


/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/
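As a quick check of the leading-line rules, here is a minimal sketch that runs the pattern against a few lines. The sample lines are made up for illustration, and the indent unit is spelled `[ ]{2}` instead of two literal spaces only so that it stays visible; it is meant to be equivalent.

```perl
use strict;
use warnings;

# Leading-line pattern; [ ]{2} stands for the "two spaces" indent unit.
my $leading = qr/^([ ]{2}|\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;

# Hypothetical sample lines, made up for this sketch.
my @samples = (
    "  + VAR somename = somevalue",   # two-space indent
    "\t+ CONST limit = 10",           # one-tab indent
    "+ VAR some = not indented",      # no indent: should not match
    "  + VAR name5",                  # no '=': should not match
);

my @results;
for my $line (@samples) {
    if ($line =~ $leading) {
        push @results, "$2 $3 = $4";  # type, key, value
    }
    else {
        push @results, 'no match';
    }
}
print "$_\n" for @results;
```

Only the first two sample lines are expected to match; the other two fail on the indent and on the missing `=` respectively.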

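Now the problem: the syntax allows continuation lines. A line that follows such a leading line and starts with at least one indent sequence (  |\t) (i.e. two spaces or one tab) is treated as a continuation line, and its entire content (including the leading whitespace) should become part of the preceding key's value.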


  + VAR multi = 3 line value where the continuation lines
    are indented (starts with two spaces or one tab) 
    and NOT followed by the '+' 


/^(  |\t)+([^\+](.*))$/



while($text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm) {


(?=( |\t)+[^\+](.*)$)* matches null string many times in regex; marked by <-- HERE in m/^( |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=( |\t)+[^\+](.*)$)* <-- HERE)/ at so line 36. 
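Apart from the warning, one likely reason the attempt cannot capture continuation lines is that a lookahead only asserts what follows and never consumes it, so repeating it cannot extend the captured value. The sketch below, using a made-up two-line string and simplified patterns (not the full ones from the question), contrasts a lookahead with an ordinary non-capturing group.

```perl
use strict;
use warnings;

# Hypothetical input: one line followed by one indented continuation line.
my $text = "first line\n  second line\n";

# Lookahead: asserts that a continuation follows, but the capture stops at line 1.
my ($with_lookahead) = $text =~ /^(.*(?=\n[ ]{2}.*))/;

# Non-capturing group: actually consumes the continuation line(s).
my ($with_group) = $text =~ /^(.*(?:\n[ ]{2}.*)*)/;

print "lookahead: [$with_lookahead]\n";
print "group:     [$with_group]\n";
```

With the group, the newline and the continuation line end up inside the capture, which is the behaviour the value needs.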


    ^( |\t)+  # <- space ... :(




#!/usr/bin/env perl 
use 5.014; 
use warnings; 
use Data::Dumper; 

my $txt = do { local $/; <DATA> }; 

my @matches1 = parse_by_lines($txt // ''); 
mydump('BY LINES', @matches1); 

my @matches2 = parse_by_one_regex($txt // ''); 
mydump('REGEX', @matches2); 

sub parse_by_lines { # produces the wanted output
    my ($text) = @_;
    my @match;
    my $havekey = 0;
    for my $line (split "\n", $text) {
        if ($line =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/) {
            # leading line: start a new entry and remember that a key is open
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
            $havekey = 1;
        }
        elsif ($havekey && $line =~ m/^(  |\t)+([^\+](.*))$/) { # continuation line
            $match[-1]->{val} .= "\n$line"; # preserve the \n in the val
        }
        else {
            $havekey = 0; # any other line ends the current value
        }
    }
    return @match;
}

sub parse_by_one_regex { # not working
    my ($text) = @_;
    my @match;
    while ($text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

sub mydump {
    my ($label, @match) = @_;
    say "#### $label ####";
    for my $m (@match) {
        printf "%-6s: [%s]\n", $_, $m->{$_} for (qw(indent type key val));
        print "\n";
    }
}

__DATA__
some arbitrary text lines 
or empty lines 

    could be indented 
    and could contain any character 

    + VAR name1 = var indented by two spaces and the first nonspace character is '+' 
line of arbitrary text 
    + VAR name2 = var indented by 2x2 spaces 

    + VAR name3 = var indented by one \t 
    + VAR name4 = the next line with "name5" is not valid. missing the = character, should not be matched 
    + VAR name5 
    + CONST name6 = the type could be VAR or CONST 

    + VAR multi1 = multiline value where the continuation lines 
    are indented (starts with two spaces or one tab) and NOT followed by the '+' 

    + VAR multi1 = multiline value 

    + VAR multi1 = multiline value 
    indented ok too 

    + VAR single = this is single line 
    + because this line even if it is indented, the first nonspace character is '+' 

    + VAR multi2 = multiline 
    could be 
     any way 
    and any number of times 
    until the first non-indented line 

the following should NOT match 

+ VAR some = should not be matched, because the line isn't indented
 + VAR some = should not be matched, because the line isn't indented at least with TWO spaces or one tab
    + SOME name = value not matched because the SOME isn't VAR or CONST 

while ($text =~ /
      (?m)              # multiline match
      ^                 # at the start of the line
      ([ ]{2}|\t)+      # two spaces or tab - at least once
      \+                # the '+' character
      \s*               # followed by any number of spaces (e.g. "+VAR" or "+ VAR" are valid)
      (VAR|CONST)       # the VAR or CONST
      \s+               # followed by at least one space (e.g. "VAR_" should not match)
      (\w+)             # the keyword
      \s*=\s*           # the '=' surrounded (and consumed) by any number of spaces
      (                 # capture the whole value (as it is)
        .*              # any string up to end of line
        (?:             # followed by (non-capturing group)
          \R            # one line-break
          ^             # at the start of the line
          (?>[ ]{2,}|\t+) # atomic group - at least two spaces or at least one tab
          [^+]          # followed by any character but '+'
          .*            # any string up to the end of line
        )*              # any number of times (i.e. optionally)
      )                 # end of the value capture
    /xg) {
      push @match, { indent => $1, type => $2, key => $3, val => $4 };
}
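To try this single-regex approach in isolation, here is a self-contained sketch. The input text is a small made-up sample rather than the DATA section from the question, and the pattern is a condensed spelling of the commented one, so treat it as an illustration under those assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample input, made up for this sketch.
my $text = join "\n",
    'arbitrary text',
    '  + VAR name1 = single line',
    '  + CONST multi = first',
    '    second',
    "\tthird",
    '  + VAR name2 = after the multi-line value',
    '+ VAR bad = not indented',
    '';

my @match;
while ($text =~ /
    (?m)
    ^ ([ ]{2}|\t)+          # indent: two spaces or a tab, at least once
    \+ \s* (VAR|CONST) \s+  # '+' marker and the type
    (\w+) \s*=\s*           # key and '='
    (                       # value:
        .*                  #   rest of the leading line
        (?: \R ^ (?>[ ]{2,}|\t+) [^+] .* )*   # plus any continuation lines
    )
/xg) {
    push @match, { indent => $1, type => $2, key => $3, val => $4 };
}

printf "%s %s = [%s]\n", @{$_}{qw(type key val)} for @match;
```

The continuation lines, including their line breaks and leading whitespace, land inside the `val` of the entry they follow, while the non-indented `+ VAR bad` line is skipped.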



(?:              # Start of non-capturing group (a)
    \R           # One line-break
    ^            # Start of line
    (?>  +|\t+)  # At least two spaces or one tab character (possessively)
    [^+\s]       # Next character is neither `+` nor whitespace
    .*           # Up to end of line
)*               # Repeat it as much as possible - end of non-capturing group (a)


(?m)^(?:  +|\t+)\+ *(?:VAR|CONST) *\w+ *=.*(?:\R^(?>  +|\t+)[^+\s].*)*

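The important part is the last cluster. Note that while the x modifier is set, a literal space is not treated as a meaningful part of the regex unless you enclose it in a character class [ ] and use a quantifier to express how many times it should appear, e.g. [ ]{2,}.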

     [ ]{2,} 
    [ ]* 
    [ ]*\w+[ ]*=.* 
      [ ]{2,} 
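The effect of the x modifier on literal spaces can be seen in a minimal sketch like the following; the test string is made up for illustration.

```perl
use strict;
use warnings;

my $line = "    continuation";   # hypothetical line with four leading spaces

# Under /x the literal spaces in the pattern are ignored, so this is just /^\w/
# and fails, because the line starts with a space.
my $ignored = $line =~ /^  \w/x ? 1 : 0;

# Inside a character class the space stays significant, even under /x.
my $kept = $line =~ /^ [ ]{2,} \w/x ? 1 : 0;

print "literal spaces under /x: $ignored\n";
print "[ ]{2,} under /x:        $kept\n";
```

This is why the /x versions of the pattern spell the indent as `[ ]{2,}` rather than as two literal spaces.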

