Can regular expressions be used with pdfquery?

Is there a way to use regular expressions to detect text inside a PDF (using pdfquery or another tool)?

I know we can do this:

import pdfquery

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
# Locate the "Cash" label, then read the text in the box just below it
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' %
              (left_corner, bottom_corner - 30,
               left_corner + 150, bottom_corner)).text()
print(cash)
# 179,000.00

But we need something like this:

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
# Wished-for syntax: a hypothetical :regex() selector
label = pdf.pq(r'LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
cash = str(label.attr('x0'))
print(cash)
# 179,000.00

Source

2015-10-13 Dayvid Oliveira

This is not exactly a regex lookup, but it works to format/filter the possible extractions:

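import re
import pdfquery

def regex_function(pattern, text):
    # Return the first capture group if the pattern matches, else None
    re_obj = re.search(pattern, text)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()

pattern = r''  # placeholder: put your regex here, with one capture group
pdf.extract([
    ('with_parent', 'LTPage[pageid=1]'),
    ('with_formatter', 'text'),
    # a per-item formatter receives the matched PyQuery object, so pass its text
    ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
     lambda match: regex_function(pattern, match.text()))
])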
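
The helper can be tried on its own, without a PDF, to check that a pattern captures what you expect. The following is a minimal sketch: the sample string and the year pattern are assumptions made for illustration, not text read from the actual form.

```python
import re

def regex_function(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    re_obj = re.search(pattern, text)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None

# Hypothetical line text; the real PDF line may look different.
sample = "Form 1040A (2007)"

# Hypothetical pattern: capture four digits inside the parentheses.
year = regex_function(r"Form 1040A \((\d{4})\)", sample)
print(year)
```

Note that the helper returns None both when nothing matches and when the pattern has no capture group, so the pattern passed to it should always contain at least one group.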

I haven't tested the following, but it might also work:

def some_regex_function_feature():
    # here you could use some regex.
    # pyquery exposes the current element as `this` inside filter callbacks
    return float(this.get('width', 0)) * float(this.get('height', 0)) > 40000

pdf.pq('LTPage[page_index="1"] *').filter(some_regex_function_feature)
# [<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
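
A regex-based predicate for such a filter could look roughly like the sketch below. It uses the amount pattern from the question and runs on plain strings that stand in for element text; `looks_like_amount` and the sample list are hypothetical names introduced here. Inside a real filter callback you would presumably apply the same check to the element's text (for example `this.text`), which is an assumption I have not verified against pdfquery.

```python
import re

# Amount pattern taken from the question, e.g. "179,000.00"
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

def looks_like_amount(text):
    """True if the whole (stripped) string is a formatted amount."""
    return AMOUNT_RE.fullmatch(text.strip()) is not None

# Hypothetical stand-ins for the text of several PDF lines.
lines = ["Cash", "179,000.00 ", "Form 1040A (2007)", "1234"]

amounts = [t.strip() for t in lines if looks_like_amount(t)]
print(amounts)
```

Using `fullmatch` rather than `search` keeps lines that merely contain digits, such as the form title, from being reported as amounts.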

Source

2015-10-13 21:19:33
