2017-02-09 8 views
-1

私は生理学的データを扱っており、解析に必要なすべての生データを含む非常に大きな形式のXMLファイルがあります。我々はこれまでautohotkeyでregexスタイルのメソッドを使用してデータを抽出していましたが、パフォーマンスは悪いです。私はこれらのファイルから効率的にデータを解析または抽出する方法を探していた。 弱い構造化されたXMLファイルからのデータ抽出

XMLファイル構造が無効であるか、適切なsedコマンドをビルドして必要な方法でデータをまとめないと問題が発生しています。それはデータ構造全体に最小限のばらつきがあるので、ファイルからデータを効果的に引き出すために正規表現を使うことができると思います。私はこれまでgrep/sedやさまざまなXML解析ツールを使用しようとしてきました。ここで

Input: XML (poorly structured, 500-1500 mb) 

Desired Output: CSV 

VitalSigns: 

1 VitalSigns, Date, Time, Par, Value 
2 VitalSigns, Date, Time, Par, Value 
... 
end 

Waveforms: 

1 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480 
2 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480 
... 
end 

は、XMLファイルのサンプルです:ここで Dropbox link for sample file

 <VitalSigns Time="01/27/2017 10:33:04 PM"> 
      <VS> 
        <Par>PVC</Par> 
        <Value UOM="Bpm">0</Value> 
        <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh> 
      </VS> 
      <VS> 
        <Par>RESP</Par> 
        <Value UOM="BrMin">68</Value> 
        <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow> 
        <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh> 
      </VS> 
      <VS> 
        <Par>SPO2-%</Par> 
        <Value UOM="%" Q="2">100</Value> 
        <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow> 
        <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh> 
      </VS> 
     </VitalSigns> 
     <Waveforms Time="01/27/2017 10:33:04 PM"> 
      <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData> 
      <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData> 
      <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData> 
     </Waveforms> 
    <VitalSigns Time="01/27/2017 10:33:06 PM"> 
     <VS> 
       <Par>PVC</Par> 
       <Value UOM="Bpm">0</Value> 
       <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh> 
     </VS> 
     <VS> 
       <Par>RESP</Par> 
       <Value UOM="BrMin">55</Value> 
       <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow> 
       <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh> 
     </VS> 
     <VS> 
       <Par>SPO2-%</Par> 
       <Value UOM="%" Q="2">99</Value> 
       <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow> 
       <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh> 
     </VS> 
    </VitalSigns> 
<Waveforms Time="01/27/2017 10:33:06 PM"> 
      <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData> 
      <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData> 
      <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData> 
     </Waveforms> 

は理想的な出力です:

VitalSigns, 01/27/2017, 10:33:04 PM, PVC, 0  
VitalSigns, 01/27/2017, 10:33:04 PM, RESP, 68  
VitalSigns, 01/27/2017, 10:33:04 PM, SPO2, 100  
Waveforms, 01/27/2017, 10:33:04 PM, I, 347, 342, ..., end  
Waveforms, 01/27/2017, 10:33:04 PM, II, 347, 342, ..., end  
Waveforms, 01/27/2017, 10:33:04 PM, SPO2, 370, 741, ..., end  
VitalSigns, 01/27/2017, 10:33:06 PM, PVC, 0  
VitalSigns, 01/27/2017, 10:33:06 PM, RESP, 55 
VitalSigns, 01/27/2017, 10:33:06 PM, SPO2, 99 
... 
end of file 
+0

xmlの解析やxsltの使用に特別なツールを使用してみませんか? –

+1

@JoseRicardoBustosM。さて、この質問は、ファイルが有効なXMLではないことを示しています。そうすれば、ほとんどのツールがチョークします。 –

+1

欠落しているルート要素を除いて、サンプルはかなりよく見えます。最初にXMLの問題を修正してから、 'xmlstarlet'や' xsltproc'などの適切なツールを使用することをお勧めします。 –

答えて

0

他のコメンターが指摘したように、XMLツールが優れていますこの作業に適しています。しかし、ここで何をしたいやっGNU sedスクリプトがあり、同様sedで行うことができますsed -rn -f script.sed yourfile

# concatenate VitalSigns or Waveforms, the date, the time into the holdspace 
# on finding a Par, Value or WaveformData, further append the info to the holdspace 
/<VitalSigns|Waveforms/{ s/^[^<]*<([^ ]+) Time="([^ ]+) ([^"]+)".*$/\1, \2, \3, /; h; } 
/<Par>.*<\/Par>/   { s/^[^>]+>([^<]*).*$/\1, /; H; } 
/<Value.*<\/Value>/  { s/^[^>]+>([^<]*).*$/\1/; 
          H; 
          # now print Holdspace after removing newlines 
          x; 
          s/, \n/, /g; 
          p; 
          # now put only the first three columns back into Hold space 
          s/^(([^,]+,){3}).*$/\1/; 
          x; 
          } 

/<WaveformData/  { s/^.*WaveformData Channel="([^"]+)[^>]*>(.*)<\/WaveformData>.*$/\1, \2/; 
          H; 
          # now print Holdspace after removing newlines 
          x; 
          s/, \n/, /g; 
          p; 
          # now put only the first three columns back into Hold space 
          s/^(([^,]+,){3}).*$/\1/; 
          x; 
          } 

をscript.sedこのようにそれを使用してください。

ノート

  • 私は質問に投稿された例を使用していました。おそらくそれはxmlファイルの変更に対してあまり安定していません。
  • これは、秘密の構文が混乱し、ホールドとパターンスペースの間を行き来することが分かります。したがって、それはあまり拡張できません。
  • コメントにすでに提案されているように、XMLツールを使用する方が効果的です。
関連する問題