2017-03-04 14 views
1

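Parsing an XML file in R using rentrez

I am not an XML expert, and I am having trouble parsing an XML file with rentrez. As output, I am trying to get the authors and their affiliations for each pmid (the article ID in the PubMed database). I have code that works fine except when an author has multiple affiliations. In that case the columns first_names, last_names, and affiliation end up with different lengths, and an error is returned. I really do not have the XML-parsing expertise to handle this. Strictly speaking, I expect a result like the following: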

pmid   first_names last_names    affiliation 
27869504  Luca   Villa   Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy 
27869504  Luca   Villa   Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France 
27869504  Tarik Emre  Şener   Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France 
27869504  Tarik Emre  Şener   Department of Urology, Marmara University School of Medicine, Istanbul, Turkey 

The structure of my sample XML file, as returned by entrez_fetch, is as follows:

<?xml version="1.0"?> 
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2017//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd"> 
<PubmedArticleSet> 
    <PubmedArticle> 
    <MedlineCitation Status="In-Data-Review" Owner="NLM"> 
    <PMID Version="1">27869504</PMID> 
    <DateCreated> 
    <Year>2016</Year> 
    <Month>11</Month> 
     <Day>21</Day> 
    </DateCreated> 
    <DateRevised> 
    <Year>2017</Year> 
    <Month>01</Month> 
    <Day>06</Day> 
    </DateRevised> 
    <Article PubModel="Print-Electronic"> 
    <Journal> 
     <ISSN IssnType="Electronic">1557-900X</ISSN> 
     <JournalIssue CitedMedium="Internet"> 
     <Volume>31</Volume> 
     <Issue>1</Issue> 
     <PubDate> 
      <Year>2017</Year> 
      <Month>Jan</Month> 
     </PubDate> 
     </JournalIssue> 
     <Title>Journal of endourology</Title> 
     <ISOAbbreviation>J. Endourol.</ISOAbbreviation> 
    </Journal> 
    <ArticleTitle>Initial Content Validation Results of a New Simulation Model for Flexible Ureteroscopy: The Key-Box.</ArticleTitle> 
    <Pagination> 
     <MedlinePgn>72-77</MedlinePgn> 
    </Pagination> 
    <ELocationID EIdType="doi" ValidYN="Y">10.1089/end.2016.0677</ELocationID> 
    <Abstract> 
     <AbstractText Label="PURPOSE" NlmCategory="OBJECTIVE">We sought to test the content validity of a new training model for flexible ureteroscopy: the Key-Box.</AbstractText> 
     <AbstractText Label="MATERIAL AND METHODS" NlmCategory="METHODS">Sixteen medical students were randomized to undergo a 10-day training consisting of performing 10 different exercises aimed at learning specific movements with the flexible ureteroscope, and how to catch and release stones with a nitinol basket using the Key-Box (n&#x2009;=&#x2009;8 students in the training group, n&#x2009;=&#x2009;8 students in the nontraining control group). Subsequently, an expert endourologist (O.T.) blindly assessed skills acquired by the whole cohort of students through two exercises on ureteroscope manipulation and one exercise on stone capture selected among those used for the training. A performance scale (1-5) assessing different steps of the procedure was used to evaluate each student. Time to complete the exercises was measured. Mann-Whitney Rank Sum test was used for comparisons between the two groups.</AbstractText> 
     <AbstractText Label="RESULTS" NlmCategory="RESULTS">Mean scores obtained by trained students were significantly higher compared with those obtained by nontrained students (all p&#x2009;&lt;&#x2009;0.001). All trained students were able to complete the two exercises on ureteroscope manipulation within 3 minutes, whereas two students (25%) were not able to finish the exercise on stone capture. Conversely, four (50%) and six (75%) nontrained students were not able to finish one out of the two exercises on ureteroscope manipulation and the exercise on stone capture, respectively. The mean time to complete the three exercises was 76.3, 69.9, and 107 and 172.5, 137.9, and 168 seconds in the trained and nontrained groups, respectively (all p&#x2009;&lt;&#x2009;0.001).</AbstractText> 
     <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">The K-Box(&#xAE;) seems to be a valid easy-to-use training model for initiating novel endoscopists to flexible ureteroscopy.</AbstractText> 
    </Abstract> 
    <AuthorList CompleteYN="Y"> 
     <Author ValidYN="Y"> 
     <LastName>Villa</LastName> 
     <ForeName>Luca</ForeName> 
     <Initials>L</Initials> 
     <AffiliationInfo> 
      <Affiliation>1 Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>&#x15E;ener</LastName> 
     <ForeName>Tarik Emre</ForeName> 
     <Initials>TE</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>3 Department of Urology, Marmara University School of Medicine , Istanbul, Turkey .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Somani</LastName> 
     <ForeName>Bhaskar K</ForeName> 
     <Initials>BK</Initials> 
     <AffiliationInfo> 
      <Affiliation>4 Department of Urology, University Hospital Southampton NHS Trust , Southampton, United Kingdom .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Cloutier</LastName> 
     <ForeName>Jonathan</ForeName> 
     <Initials>J</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>5 Department of Urology, University Hospital Centre of Quebec City , Quebec, Canada .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Buttic&#xE8;</LastName> 
     <ForeName>Salvatore</ForeName> 
     <Initials>S</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>6 Department of Urology, University of Messina , Messina, Italy .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Marson</LastName> 
     <ForeName>Francesco</ForeName> 
     <Initials>F</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>7 Department of Urology, Citt&#xE0; della Salute e della Scienza, Turin, Italy .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Doizi</LastName> 
     <ForeName>Steeve</ForeName> 
     <Initials>S</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Proietti</LastName> 
     <ForeName>Silvia</ForeName> 
     <Initials>S</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     <AffiliationInfo> 
      <Affiliation>8 Department of Urology, IRCCS San Raffaele Scientific Institute , Ville Turro Division, Milan, Italy .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
     <Author ValidYN="Y"> 
     <LastName>Traxer</LastName> 
     <ForeName>Olivier</ForeName> 
     <Initials>O</Initials> 
     <AffiliationInfo> 
      <Affiliation>2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France .</Affiliation> 
     </AffiliationInfo> 
     </Author> 
    </AuthorList> 
    <Language>eng</Language> 
    <PublicationTypeList> 
     <PublicationType UI="D016428">Journal Article</PublicationType> 
    </PublicationTypeList> 
    <ArticleDate DateType="Electronic"> 
     <Year>2016</Year> 
     <Month>12</Month> 
     <Day>16</Day> 
    </ArticleDate> 
    </Article> 
    <MedlineJournalInfo> 
    <Country>United States</Country> 
    <MedlineTA>J Endourol</MedlineTA> 
    <NlmUniqueID>8807503</NlmUniqueID> 
    <ISSNLinking>0892-7790</ISSNLinking> 
    </MedlineJournalInfo> 
    <KeywordList Owner="NOTNLM"> 
    <Keyword MajorTopicYN="N">flexible ureteroscopy</Keyword> 
    <Keyword MajorTopicYN="N">learning curve</Keyword> 
    <Keyword MajorTopicYN="N">training model</Keyword> 
    <Keyword MajorTopicYN="N">ureteroscopy curriculum</Keyword> 
    </KeywordList> 
</MedlineCitation> 
<PubmedData> 
    <History> 
    <PubMedPubDate PubStatus="pubmed"> 
     <Year>2016</Year> 
     <Month>11</Month> 
     <Day>22</Day> 
     <Hour>6</Hour> 
     <Minute>0</Minute> 
    </PubMedPubDate> 
    <PubMedPubDate PubStatus="medline"> 
     <Year>2016</Year> 
     <Month>11</Month> 
     <Day>22</Day> 
     <Hour>6</Hour> 
     <Minute>0</Minute> 
    </PubMedPubDate> 
    <PubMedPubDate PubStatus="entrez"> 
     <Year>2016</Year> 
     <Month>11</Month> 
     <Day>22</Day> 
     <Hour>6</Hour> 
     <Minute>0</Minute> 
    </PubMedPubDate> 
    </History> 
    <PublicationStatus>ppublish</PublicationStatus> 
    <ArticleIdList> 
    <ArticleId IdType="pubmed">27869504</ArticleId> 
    <ArticleId IdType="doi">10.1089/end.2016.0677</ArticleId> 
    </ArticleIdList> 
</PubmedData> 
</PubmedArticle> 
</PubmedArticleSet> 

This is the code I have been using. It works well except when an author of an article in the PubMed database has multiple affiliations:

library(rentrez) 
library(XML) 

pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope Simulation Model", 
          retmax = 10) 
SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", 
          parsed=TRUE) 

xmlGetValue <- function(x, node){ 
    a <- xpathSApply(x, node, xmlValue) 
    if(length(a) == 0) {a <- NA} else {a} 
} 

parse_paper <- function(paper){ 
    pmid <- xmlGetValue(paper, ".//ArticleId[@IdType='pubmed']") 
    first_names <- xmlGetValue(paper, ".//Author/ForeName") 
    last_names <- xmlGetValue(paper, ".//Author/LastName") 
    affiliation <- xmlGetValue(paper, ".//AffiliationInfo/Affiliation") 
    data.frame(pmid=pmid, first_names=first_names, last_names=last_names, 
     affiliation=affiliation) 
} 

parse_multiple_papers <- function(papers){ 
    res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper) 
    do.call(rbind.data.frame, res) 
} 

test_df <- parse_multiple_papers(SearchResults) 
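
The length mismatch described above can be reproduced without any XML at all. The following is a minimal sketch in base R, using made-up affiliation strings and only two of the authors: with two names and three affiliations, `data.frame` refuses to build the table, and repeating each name once per affiliation is one way to line the columns up. The per-author counts in `n_aff` are an assumption written by hand here; in real code they would come from the parsed record.

```r
# Two authors, but three affiliations in total (Villa has two, Somani has one)
first_names <- c("Luca", "Bhaskar K")
last_names  <- c("Villa", "Somani")
affiliation <- c("Affiliation A", "Affiliation B", "Affiliation C")

# Lengths 2 and 3 cannot be recycled into a rectangle, so this is an error
result <- tryCatch(
  data.frame(first_names = first_names, affiliation = affiliation),
  error = function(e) e
)
print(conditionMessage(result))

# Hypothetical per-author affiliation counts, written by hand for this sketch
n_aff <- c(2, 1)

# Repeat each author once per affiliation so every column has length 3
fixed <- data.frame(
  first_names = rep(first_names, times = n_aff),
  last_names  = rep(last_names, times = n_aff),
  affiliation = affiliation,
  stringsAsFactors = FALSE
)
print(fixed)
```

Note that when the longer length happens to be an exact multiple of the shorter one, `data.frame` recycles silently instead of failing, which would pair names with the wrong affiliations.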

Thank you very much.

You can do this using xml2 and purrr, as follows:

library(xml2) 
library(purrr) 

# `doc` is the XML text returned by entrez_fetch(..., rettype = "xml", parsed = FALSE)
doc <- read_xml(doc) 
# XML element names are case-sensitive, so match the PubMed tags exactly
scope <- doc %>% xml_find_all("//Author") 
scope %>% map_df(~data.frame(
    first_names = xml_find_first(.x, "./ForeName") %>% xml_text, 
    last_names = xml_find_first(.x, "./LastName") %>% xml_text, 
    affiliation = xml_find_all(.x, ".//Affiliation") %>% xml_text, 
    stringsAsFactors = FALSE 
)) 
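
The code above does not return the pmid column that the question asks for. Here is a minimal, self-contained sketch of one way to add it with xml2, working article by article so that each author row carries the PMID of its own article. The inline XML is a made-up, trimmed-down record that only mirrors the structure shown in the question, and the names `xml_string` and `parse_article` are illustrative, not part of any package.

```r
library(xml2)

# Made-up miniature record that mirrors the PubMed layout shown in the question
xml_string <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">27869504</PMID>
      <Article>
        <AuthorList>
          <Author>
            <LastName>Villa</LastName><ForeName>Luca</ForeName>
            <AffiliationInfo><Affiliation>Affiliation A</Affiliation></AffiliationInfo>
            <AffiliationInfo><Affiliation>Affiliation B</Affiliation></AffiliationInfo>
          </Author>
          <Author>
            <LastName>Doizi</LastName><ForeName>Steeve</ForeName>
            <AffiliationInfo><Affiliation>Affiliation B</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(xml_string)

# Return one row per (author, affiliation) pair for a single article
parse_article <- function(article) {
  pmid <- xml_text(xml_find_first(article, "./MedlineCitation/PMID"))
  authors <- xml_find_all(article, ".//AuthorList/Author")
  rows <- lapply(authors, function(a) {
    aff <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
    # Keep authors that have no affiliation as a single NA row
    if (length(aff) == 0) aff <- NA_character_
    data.frame(
      pmid = pmid,
      first_names = xml_text(xml_find_first(a, "./ForeName")),
      last_names = xml_text(xml_find_first(a, "./LastName")),
      affiliation = aff,  # length-1 columns are recycled to match
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

articles <- xml_find_all(doc, "//PubmedArticle")
result <- do.call(rbind, lapply(articles, parse_article))
print(result)
```

Building the data frame inside each author node is what keeps the columns aligned: the single name values are recycled against that author's own affiliations rather than against all affiliations in the article.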

Answers

2

I also raised this question myself, so I am putting that code here as well:

# Extract one author's name and affiliations; multiple affiliations are
# collapsed into a single "; "-separated string so each author stays one row
parse_author <- function(author){ 
    fn <- xmlValue(author[["ForeName"]]) 
    ln <- xmlValue(author[["LastName"]]) 
    aff <- paste(xpathApply(author, "AffiliationInfo/Affiliation", xmlValue), collapse="; ") 
    list(forname=fn, lastname=ln, affiliation=aff) 
} 

# Build one data frame per paper and attach its PMID
parse_paper <- function(paper){ 
    author_info <- xpathApply(paper, ".//AuthorList/Author", parse_author) 
    res <- do.call(rbind.data.frame, author_info) 
    res$pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue) 
    res 
} 

# Apply parse_paper to every article in the set and stack the results
parse_multiple_papers <- function(papers){ 
    res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper) 
    do.call(rbind.data.frame, res) 
} 

head(parse_multiple_papers(SearchResults)) 
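
The code above keeps one row per author by joining affiliations with "; ". If one row per affiliation is wanted instead, as in the expected table in the question, a variation along the following lines may work. This is only a sketch with the XML package: the inline record is made up, and `parse_author_long` and `parse_paper_long` are hypothetical names chosen for illustration.

```r
library(XML)

# Made-up miniature record that mirrors the PubMed layout shown in the question
xml_string <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <AuthorList>
          <Author>
            <LastName>Villa</LastName><ForeName>Luca</ForeName>
            <AffiliationInfo><Affiliation>Affiliation A</Affiliation></AffiliationInfo>
            <AffiliationInfo><Affiliation>Affiliation B</Affiliation></AffiliationInfo>
          </Author>
          <Author>
            <LastName>Doizi</LastName><ForeName>Steeve</ForeName>
            <AffiliationInfo><Affiliation>Affiliation B</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList><ArticleId IdType="pubmed">27869504</ArticleId></ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_string, asText = TRUE)

# One row per affiliation; the name columns are recycled by data.frame()
parse_author_long <- function(author) {
  aff <- xpathSApply(author, "./AffiliationInfo/Affiliation", xmlValue)
  # Authors without any affiliation still get one row
  if (length(aff) == 0) aff <- NA_character_
  data.frame(
    forname = xmlValue(author[["ForeName"]]),
    lastname = xmlValue(author[["LastName"]]),
    affiliation = aff,
    stringsAsFactors = FALSE
  )
}

# Stack the author rows of one paper and attach its PMID
parse_paper_long <- function(paper) {
  author_rows <- xpathApply(paper, ".//AuthorList/Author", parse_author_long)
  res <- do.call(rbind, author_rows)
  res$pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  res
}

papers <- xpathApply(doc, "/PubmedArticleSet/PubmedArticle", parse_paper_long)
long <- do.call(rbind, papers)
print(long)
```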

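This solution was suggested in an issue at rentrez's repository, where more details are given.

For comparison, the xml2 and purrr code gives you one row per author and affiliation: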

first_names last_names                        affiliation 
1   Luca  Villa 1 Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy . 
2   Luca  Villa   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
3 Tarik Emre  Şener   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
4 Tarik Emre  Şener      3 Department of Urology, Marmara University School of Medicine , Istanbul, Turkey . 
5 Bhaskar K  Somani  4 Department of Urology, University Hospital Southampton NHS Trust , Southampton, United Kingdom . 
6  Jonathan Cloutier   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
7  Jonathan Cloutier     5 Department of Urology, University Hospital Centre of Quebec City , Quebec, Canada . 
8 Salvatore Butticè   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
9 Salvatore Butticè          6 Department of Urology, University of Messina , Messina, Italy . 
10 Francesco  Marson   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
11 Francesco  Marson        7 Department of Urology, Città della Salute e della Scienza, Turin, Italy . 
12  Steeve  Doizi   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
13  Silvia Proietti   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
14  Silvia Proietti 8 Department of Urology, IRCCS San Raffaele Scientific Institute , Ville Turro Division, Milan, Italy . 
15  Olivier  Traxer   2 Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France . 
+1

Nice answer. Since this is a question about 'rentrez': to use 'xml2::read_xml' on PubMed records, run 'entrez_fetch' with 'parsed = FALSE' (the default).
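
A sketch of what this comment describes, assuming the rentrez and xml2 packages are installed and a network connection is available when the function is called. The helper name `fetch_pubmed_xml` is hypothetical. With `parsed = FALSE`, `entrez_fetch` returns the record as a character string, which `read_xml` can then parse.

```r
# Hypothetical helper: fetch PubMed records as raw XML text and parse with xml2
fetch_pubmed_xml <- function(ids) {
  raw_xml <- rentrez::entrez_fetch(
    db = "pubmed",
    id = ids,
    rettype = "xml",
    parsed = FALSE  # return a character string rather than an XML-package object
  )
  xml2::read_xml(raw_xml)
}

# Usage (requires network access, so it is not run here):
# doc <- fetch_pubmed_xml("27869504")
# xml2::xml_find_all(doc, "//Author")
```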
