2016-09-12 4 views

How can I get an element with a relative XPath?

I have an XML file. After parsing it with lxml as an etree, I can get all of its tags like this:

root = tree.getroot()
for e in root.iter():
    print(e.tag)

The output looks like this:

'{http://www.w3.org/1999/xhtml}html' 
'{http://www.w3.org/1999/xhtml}head' 
'{http://www.w3.org/1999/xhtml}meta' 
'{http://www.w3.org/1999/xhtml}link' 
'{http://www.w3.org/1999/xhtml}meta' 
'{http://www.w3.org/1999/xhtml}meta' 
'{http://www.w3.org/1999/xhtml}meta' 
'{http://www.w3.org/1999/xhtml}script' 
'{http://www.w3.org/1999/xhtml}body' 
'{http://www.w3.org/1999/xhtml}section' 
'{http://www.w3.org/1999/xhtml}h1' 
'{http://www.w3.org/1999/xhtml}p' 
'{http://www.w3.org/1999/xhtml}em' 
'{http://www.w3.org/1999/xhtml}section' 
'{http://www.w3.org/1999/xhtml}h1' 
'{http://www.w3.org/1999/xhtml}p' 
'{http://www.w3.org/1999/xhtml}a' 
'{http://www.w3.org/1999/xhtml}p' 
'{http://www.w3.org/1999/xhtml}p' 

I want to get some elements by a relative path using python/lxml/bs4. For example, to get the first p element in the second section, I would specify this relative path: /section[2]/p[1]

However, I cannot even get all the sections with the following code, which returns None:

xhtml = "{http://www.w3.org/1999/xhtml}"
section = xhtml + "section"
root.find(section)

Edit: I want to get a <p> element. Here is part of the original file:

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-model href="grammar/rash.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?> 
<html xmlns="http://www.w3.org/1999/xhtml" prefix="schema: http://schema.org/ prism: http://prismstandard.org/namespaces/basic/2.0/"> 
    <head> 
     <meta charset="UTF-8"/> 
     <meta name="viewport" content="width=device-width, initial-scale=1"/> 
     <link rel="stylesheet" href="css/bootstrap.min.css"/> 
     <link rel="stylesheet" href="css/rash.css"/> 
     <script src="js/jquery.min.js"><![CDATA[ ]]></script> 
     <script src="js/bootstrap.min.js"><![CDATA[ ]]></script> 
     <script src="js/rash.js"><![CDATA[ ]]></script> 
     <title>It ROCS! -- The RASH Online Conversion Service</title> 
     <meta about="#affiliation-1" property="schema:name" content="Department of Computer Science and Engineering, University of Bologna, Italy"/> 
     <meta about="#affiliation-2" property="schema:name" content="Oxford e-Research Centre, University of Oxford, UK"/> 
     <meta about="#affiliation-3" property="schema:name" content="Knowledge Media Institute, Open University, UK"/> 
     <meta property="prism:keyword" content="HTML-based format"/> 
     <meta property="prism:keyword" content="Scholarly HTML"/> 
     <meta property="prism:keyword" content="RASH"/> 
    <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><![CDATA[ ]]></script></head> 
    <body> 
     <section role="doc-abstract"> 
     <h1>Abstract</h1> 
     <p>In this poster paper we introduce the <em>RASH Online Conversion Service</em>, i.e., a Web application that allows the conversion of ODT documents into RASH, a HTML-based markup language for writing scholarly articles, and from RASH into LaTeX. This tool allows authors with no experience in HTML to easily produce HTML-based papers and supports the publishing process by generating also a LaTeX version according to the Springer LNCS and ACM ICPS layouts.</p> 
     </section> 
     <section> 
     <h1>Introduction</h1> 
     <p>The use of HTML as format for writing scholarly papers and submitting them to scholarly venues is a very popular, discussed and trendy topic within the scholarly domain. This is demonstrated by the existence of several posts within technical mailing lists of the Web community<a href="#ftn0"> </a>, by the birth of W3C community groups on such topic<a href="#ftn3"> </a>, by the development of HTML-based formats for scholarly articles<a href="#ftn4"> </a>, and by the increasing number of events that are experimenting with HTML-based formats for submissions, such as the SAVE-SD<a href="#ftn5"> </a> and LDOW<a href="#ftn6"> </a> workshops at WWW 2016, and the Extended Semantic Web Conference<a href="#ftn7"> </a>.</p> 
     <p>In order to foster a wider adoption of these formats, frameworks for HTML-based papers should support the needs of all the actors involved in the production, delivery and fruition of scholarly articles, with particular regards to authors and publishers. Hence, this solution calls for a number of requirements that go well beyond those used on the Web. </p> 
     <p>First of all, it is vital to support authors with a variety of tools to provide for an easy transition to the new format. To this end, authors should be allowed to keep using well-known current word processors rather than adopting HTML and/or pure text editors. We thus need to support the conversion from the main word processor formats (e.g., ODT and OOXML) to HTML formats, in particular when authors use only basic features, such as standard styles for paragraphs and tables. In addition, authors should be given the option to focus on the content and let appropriate tools handle the presentation layer after the conversion into the HTML-based format.</p> 

In this example, I want the <p> element that starts with this sentence: "The use of HTML as format for writing scholarly papers..."

Please show the input document and also the exact expected output. Thanks.

@MathiasMüller I have added the input document. – sheshkovsky

Please add a usable snippet of the XML.

Answer

BeautifulSoup does not support XPath expressions, but lxml, which you mentioned, does.

You can find elements with an XPath like this:

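from lxml import etree

# etree.parse() expects a filename or file-like object in html_content;
# for markup held in a string, use etree.fromstring(html_content, htmlparser).
htmlparser = etree.HTMLParser()
tree = etree.parse(html_content, htmlparser)
# xpathselector is an XPath expression string; xpath() returns a list of matches.
tree.xpath(xpathselector)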
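
If the file is parsed as XML rather than with the HTML parser, the XHTML namespace is kept on every tag, so the XPath has to name it. The following is a minimal sketch under that assumption, not a definitive solution: the shortened sample document and the prefix `x` are made up for illustration, and only the namespace URI comes from the question.

```python
from lxml import etree

# A cut-down stand-in for the question's document (assumed, not the full file).
XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Demo</title></head>
  <body>
    <section role="doc-abstract">
      <h1>Abstract</h1>
      <p>In this poster paper we introduce ...</p>
    </section>
    <section>
      <h1>Introduction</h1>
      <p>The use of HTML as format for writing scholarly papers ...</p>
      <p>In order to foster a wider adoption ...</p>
    </section>
  </body>
</html>"""

# The prefix "x" is an arbitrary local name; only the URI has to match.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = etree.fromstring(XHTML)

# XPath relative to the root element: second <section>, first <p> inside it.
matches = root.xpath("x:body/x:section[2]/x:p[1]", namespaces=NS)
first_p = matches[0]
print(first_p.text)

# A bare tag name in find()/findall() only matches direct children,
# so a descendant search needs the ".//" prefix.
sections = root.findall(".//x:section", namespaces=NS)
print(len(sections))
```

This also suggests why `root.find(section)` in the question returns None: the `section` elements are children of `body`, not of the root `html` element, and a bare tag name passed to `find()` matches direct children only.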