2017-09-18 3 views
1

Extract paragraphs containing specific words between two identical delimiters

My text file contains paragraphs like this:

summary 

A result oriented and dedicated professional with three years’ experience in Software Development. A proactive individual with a logical approach to challenges, performs effectively even within a highly pressurised working environment. 

summary 

Oct 28th, 2010 – Till date Cognizant Technology Solutions  


Project #1 

Title   Wealth Passport – R7.3 
Client     Northern Trust 
Operating System Windows XP 
Technologies  J2EE, JSP, Struts, Oracle, PL/SQL 
Team Size  3 
Role   Team Member 
Period     22nd Aug’ 2013 - Till Date  
Project Description 
Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities. 
Role and Responsibilities 
Handled dockets in SGY and PMM applications. 
Done root cause analysis to existing issues in a short span of time. 
Designed and developed enhancements in PMM application. 
Preparing Unit Test cases for the developed Java modules and executing them. 


Project #2 
Title   PFS Development – WP Filecabinet and R7.2 
Client     Northern Trust 
Operating System Windows XP 
Technologies  J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO 
Team Size  1 
Role   Team Member – JavaEE Developer 
Period     18th June’ 2013 – 21st Aug’ 2013 
Project Description 
PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0 
Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files. Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system. 

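Now I want to extract only the summary section that contains words such as "Project" and "Teamsize", without extracting the other summary section. I tried the code below, but it extracts the contents of both summary sections: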

import re
import os

with open('9.txt', encoding='latin-1') as infile, open('d.txt', 'w', encoding='latin-1') as outfile:
    copy = False
    for line in infile:
        if line.strip() == 'summary':
            re.compile('\r\nproject*\r\n')
            copy = True
        elif line.strip() == "summary":
            copy = False
        elif copy:
            outfile.write(line)

# fh = open("d.txt", 'r')
# contents = fh.read()
# len(contents)

I expect the text file saved as d.txt to contain the following:

summary 

    Oct 28th, 2010 – Till date Cognizant Technology Solutions  


    Project #1 

    Title   Wealth Passport – R7.3 
    Client     Northern Trust 
    Operating System Windows XP 
    Technologies  J2EE, JSP, Struts, Oracle, PL/SQL 
    Team Size  3 
    Role   Team Member 
    Period     22nd Aug’ 2013 - Till Date  
    Project Description 
    Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities. 
    Role and Responsibilities 
    Handled dockets in SGY and PMM applications. 
    Done root cause analysis to existing issues in a short span of time. 
    Designed and developed enhancements in PMM application. 
    Preparing Unit Test cases for the developed Java modules and executing them. 


    Project #2 
    Title   PFS Development – WP Filecabinet and R7.2 
    Client     Northern Trust 
    Operating System Windows XP 
    Technologies  J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO 
    Team Size  1 
    Role   Team Member – JavaEE Developer 
    Period     18th June’ 2013 – 21st Aug’ 2013 
    Project Description 
    PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0 
    Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files. Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system. 
+0

Can you control the format of the text file? If so, it would be much easier to use a declared file format such as 'json', 'txt' or 'csv' (to name a few).

+0

What is the expected output in 'd.txt'?

+0

The summary section that contains the word "Project".

Answers

0

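To toggle a boolean, you can simply set it to its own negation. For example:

a = True 
a = not a 
# a is now False 

To extract every summary section that contains all of the words of interest: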

split_on = 'summary\n\n' 
must_contain = ['Project', 'Team Size'] 

with open('9.txt') as f_input, open('d.txt', 'w') as f_output:
    for part in f_input.read().split(split_on):
        if all(text in part for text in must_contain):
            f_output.write(split_on + part)
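The split above relies on the delimiter appearing exactly as 'summary\n\n'. If the delimiter lines carry trailing spaces, or the keywords are spelled with different spacing ("Team Size" versus "Teamsize"), a line-based variant may be more forgiving. The following is only a sketch under those assumptions: the function names `split_sections`, `normalise` and `sections_containing`, and the in-memory `sample` list, are hypothetical and not part of the original script.

```python
def split_sections(lines, delimiter="summary"):
    """Group lines into sections; a new section starts at each delimiter line."""
    sections, current = [], None
    for line in lines:
        if line.strip().lower() == delimiter:
            current = []              # start collecting a new section
            sections.append(current)
        elif current is not None:     # ignore anything before the first delimiter
            current.append(line)
    return sections


def normalise(text):
    """Lower-case and drop all whitespace so 'Team Size' matches 'teamsize'."""
    return "".join(text.lower().split())


def sections_containing(lines, required, delimiter="summary"):
    """Return the sections whose text contains every required word."""
    wanted = [normalise(word) for word in required]
    result = []
    for section in split_sections(lines, delimiter):
        haystack = normalise("".join(section))
        if all(word in haystack for word in wanted):
            result.append(section)
    return result


# Hypothetical sample standing in for the contents of 9.txt
sample = [
    "summary \n",
    "\n",
    "A result oriented and dedicated professional.\n",
    "\n",
    "summary \n",
    "\n",
    "Project #1 \n",
    "Team Size  3 \n",
]
matches = sections_containing(sample, ["Project", "Teamsize"])
for section in matches:
    print("".join(section), end="")
```

Because whitespace is removed before matching, a keyword can also match across a word boundary; if that is too loose for your files, compare on whole words instead.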
+0

I have a lot of random files from which I need to extract a section by checking for specific words. Not all of the files look like the one above.

+0

I want to extract any section of a random text file that contains {project, teamsize, }

+0

I have updated the script to filter all sections that contain a list of required words.

0

The second conditional statement never runs, because it has the same condition as the first one. This means that after the first instance of summary, copy will always be True.

if line.strip() == 'summary': 
    re.compile('\r\nproject*\r\n') 
    copy = True 
elif line.strip() == "summary": 
    copy =False 

What I would suggest is having a single statement that picks up the 'summary' tag (I assume these tags are meant to mark the start and end of a block) and toggles copy:

if line.strip() == 'summary': 
    copy = not copy 
elif copy: 
    outfile.write(line) 
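To see what the toggle does in practice, here is a small self-contained sketch that runs the same logic over an in-memory list instead of a file. The function name `copy_between_markers` and the `sample` lines are hypothetical, chosen only for illustration.

```python
def copy_between_markers(lines, marker="summary"):
    """Flip a flag at every marker line and collect the lines seen while it is on."""
    copy = False
    copied = []
    for line in lines:
        if line.strip() == marker:
            copy = not copy        # each marker switches copying on or off
        elif copy:
            copied.append(line)
    return copied


# Hypothetical input with three marker lines
sample = [
    "summary \n",
    "About me\n",
    "summary \n",
    "Project #1\n",
    "summary \n",
    "Hobbies\n",
]
copied = copy_between_markers(sample)
print(copied)
```

Note that a toggle copies only every other block (here "About me" and "Hobbies", but not "Project #1"), so whether it picks up the section you want depends on how the markers are laid out in the file. A keyword check on each block is still needed if the goal is to select sections by content.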