私は生理学的データを扱っており、解析に必要なすべての生データを含む非常に大きな形式のXMLファイルがあります。我々はこれまでautohotkeyでregexスタイルのメソッドを使用してデータを抽出していましたが、パフォーマンスは悪いです。私はこれらのファイルから効率的にデータを解析または抽出する方法を探していた。 弱い構造化されたXMLファイルからのデータ抽出
XMLファイル構造が無効であるか、適切なsedコマンドをビルドして必要な方法でデータをまとめないと問題が発生しています。それはデータ構造全体に最小限のばらつきがあるので、ファイルからデータを効果的に引き出すために正規表現を使うことができると思います。私はこれまでgrep/sedやさまざまなXML解析ツールを使用しようとしてきました。ここで
Input: XML (poorly structured, 500-1500 mb)
Desired Output: CSV
VitalSigns:
1 VitalSigns, Date, Time, Par, Value
2 VitalSigns, Date, Time, Par, Value
...
end
Waveforms:
1 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
2 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
...
end
は、XMLファイルのサンプルです:ここで Dropbox link for sample file
<VitalSigns Time="01/27/2017 10:33:04 PM">
<VS>
<Par>PVC</Par>
<Value UOM="Bpm">0</Value>
<AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
</VS>
<VS>
<Par>RESP</Par>
<Value UOM="BrMin">68</Value>
<AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
<AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
</VS>
<VS>
<Par>SPO2-%</Par>
<Value UOM="%" Q="2">100</Value>
<AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
<AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
</VS>
</VitalSigns>
<Waveforms Time="01/27/2017 10:33:04 PM">
<WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
<WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
<WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
</Waveforms>
<VitalSigns Time="01/27/2017 10:33:06 PM">
<VS>
<Par>PVC</Par>
<Value UOM="Bpm">0</Value>
<AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
</VS>
<VS>
<Par>RESP</Par>
<Value UOM="BrMin">55</Value>
<AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
<AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
</VS>
<VS>
<Par>SPO2-%</Par>
<Value UOM="%" Q="2">99</Value>
<AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
<AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
</VS>
</VitalSigns>
<Waveforms Time="01/27/2017 10:33:06 PM">
<WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
<WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
<WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
</Waveforms>
は理想的な出力です:
VitalSigns, 01/27/2017, 10:33:04 PM, PVC, 0
VitalSigns, 01/27/2017, 10:33:04 PM, RESP, 68
VitalSigns, 01/27/2017, 10:33:04 PM, SPO2, 100
Waveforms, 01/27/2017, 10:33:04 PM, I, 347, 342, ..., end
Waveforms, 01/27/2017, 10:33:04 PM, II, 347, 342, ..., end
Waveforms, 01/27/2017, 10:33:04 PM, SPO2, 370, 741, ..., end
VitalSigns, 01/27/2017, 10:33:06 PM, PVC, 0
VitalSigns, 01/27/2017, 10:33:06 PM, RESP, 55
VitalSigns, 01/27/2017, 10:33:06 PM, SPO2, 99
...
end of file
xmlの解析やxsltの使用に特別なツールを使用してみませんか? –
@JoseRicardoBustosM。さて、この質問は、ファイルが有効なXMLではないことを示しています。そうすれば、ほとんどのツールがチョークします。 –
欠落しているルート要素を除いて、サンプルはかなりよく見えます。最初にXMLの問題を修正してから、 'xmlstarlet'や' xsltproc'などの適切なツールを使用することをお勧めします。 –