Remove duplicates from a .txt file and create a new .txt file

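I have data (approx. 5,800 lines) that I would like to filter, because some lines occur in duplicate, the only difference being that the timestamp is exactly two hours later. Lines that are the later version of a duplicate (e.g., the first line in the attached example) should be left out. All other lines should stay and be written to a new .txt file. I can't seem to get this one solved.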

deleted

Example:

1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check # should go 
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check # should stay 
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check # should stay 
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check # should go 
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check # should stay 
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check # should go 
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check # should stay 
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check # should stay

I wrote some code in Python, but the output is a .txt file containing only a single line of text. Can you help? See the code below.

import os

def filter_file():
    with open("output.txt", "w") as output:
        #open the input file from a specified directory
        directory = os.path.normpath("C:/Users/sande_000/Documents/Python files")
        for subdir, dirs, files in os.walk(directory):
            for file in files:
                if file.startswith("input"):
                    input_file=open(os.path.join(subdir, file))
                    #iterate over each line of the file
                    for line in input_file:
                        machine = line[0:7]    #stores machine number
                        date = line[8:18]    #stores date stamp
                        time_1 = int(line[19:21])  #stores hour stamp
                        time_2 = int(line[22:24])  #stores minutes stamp
                        time_3 = int(line[25:27])  #stores second stamp
                        #check current line with other lines for duplicates by iterating over each line of the file
                        for otherline in input_file:
                            compare_machine = otherline[0:7]
                            compare_date = otherline[8:18]
                            compare_time_1 = int(otherline[19:21])+2
                            compare_time_2 = int(otherline[22:24])
                            compare_time_3 = int(otherline[25:27])
                            #check whether machine number & date/hour+2/minutes/seconds stamp are similar.
                            #If yes, write 'deleted' to output.txt and stop comparing lines.
                            #If no, continue with comparing next line.
                            if compare_machine == machine and compare_date == date and compare_time_1 == time_1 and compare_time_2 == time_2 and compare_time_3 == time_3:
                                output.write("deleted"+"\n")
                                break
                            else:
                                continue
                            #If no overlap between one line with any other line from the file, write that line to output.txt since it is no duplicate.
                            output.write(line)

                    input_file.close()

if __name__ == "__main__":
    filter_file()

Source

2017-01-12 Sander

Surely '14:12:40' should stay and '16:12:40' should go, right? You have it the other way around. – Tagc

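There is no logic in the text file. In one case the earlier line should go, in another the earlier line should stay. Also, the log file is not sorted by date. Does the order matter? –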

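That was indeed a mistake; the text file should be logical now. To be completely clear: the earlier line should always stay. It is true that the log file is not sorted by date, but nothing can be done about that. – Sander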

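I believe the code below will work. It will not work if your records vary in the three smallest time components (milliseconds, microseconds, nanoseconds), given that datetime does not support resolution finer than microseconds. For your example, that makes no difference.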
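To illustrate the resolution caveat, here is a minimal sketch. The fractional-second timestamps are made up for illustration (the log in the question only has whole seconds), and `FMT` is just a name chosen here. `datetime.strptime` accepts at most six fractional digits through `%f`, so a nanosecond-precision stamp is rejected rather than silently truncated:

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S.%f'

# Hypothetical stamp with microseconds: parses fine.
micro = datetime.strptime('2016-07-19 16:11:56.123456', FMT)
print(micro.microsecond)

# Hypothetical stamp with nanoseconds: %f matches one to six digits,
# so the trailing "789" is left over and strptime raises ValueError.
try:
    datetime.strptime('2016-07-19 16:11:56.123456789', FMT)
    nano_ok = True
except ValueError:
    nano_ok = False
print(nano_ok)
```

If the real log ever carried finer-grained stamps, they would have to be trimmed or compared as strings before this approach could be used.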

[email protected] /cygdrive/c/Temp 
$ tree 
. 
├── input_1.txt 
└── input_2.txt

input_1.txt:

1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check 
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check 
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check 
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check

input_2.txt:

1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check 
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check 
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check 
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check

Tested on an input directory with the structure shown above, using the following code:

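import os
from datetime import datetime, timedelta

INPUT_DIR = r'C:\Temp'  # raw string, so the backslash is not read as an escape
OUTPUT_FILE = 'output.txt'


def parse_data(data):
    """Yield (line, machine, timestamp) for every non-blank line."""
    for line in data.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines instead of crashing
        date = datetime.strptime(' '.join(fields[1:3]), '%Y-%m-%d %H:%M:%S')
        yield line, fields[0], date


def filter_duplicates(data):
    """Yield lines that do not repeat a line stamped exactly 2 hours earlier."""
    duplicate_offset = timedelta(hours=2)

    parsed_data = list(parse_data(data))
    # A set gives constant-time lookups. Including the machine ID keeps
    # timestamps of different machines from matching each other.
    seen = {(machine, date) for _, machine, date in parsed_data}

    for line, machine, date in parsed_data:
        if (machine, date - duplicate_offset) not in seen:
            yield line


def get_input_data_from_dir(directory):
    """Concatenate the contents of every file whose name starts with 'input'."""
    data = ''
    for sub_dir, _, files in os.walk(directory):
        for file in files:
            if file.startswith('input'):
                with open(os.path.join(sub_dir, file)) as f:
                    data += f.read() + '\n'

    return data


if __name__ == '__main__':
    data = get_input_data_from_dir(INPUT_DIR)
    with open(OUTPUT_FILE, 'w') as f_out:
        content = '\n'.join(filter_duplicates(data))
        f_out.write(content)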

output.txt after running:

1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check 
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check 
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check 
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check 
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check

Your expected output, copied below for convenience:

1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check # should go 
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check # should stay 
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check # should stay 
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check # should go 
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check # should stay 
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check # should go 
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check # should stay 
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check # should stay
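The rule can also be checked directly against the sample from the question. The following is only a sketch under stated assumptions: `SAMPLE`, `split_record` and `drop_later_duplicates` are names invented here, and the comparison is a stricter variant that requires everything except the timestamp to match, since the question says the timestamp is the only difference between duplicates.

```python
from datetime import datetime, timedelta

# Sample lines from the question, without the "# should go/stay" remarks.
SAMPLE = """\
1_3_IMM 2016-07-19 16:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 14:12:40 00:00:33 2 Sensor Check
1_3_IMM 2016-07-19 14:11:56 00:00:40 2 Sensor Check
1_3_IMM 2016-07-19 16:12:40 00:00:33 2 Sensor Check
1_4_IMM 2016-07-19 17:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:23:25 00:00:20 2 Sensor Check
1_4_IMM 2016-07-19 19:15:24 00:02:21 2 Sensor Check
1_4_IMM 2016-07-19 19:25:13 00:02:13 2 Sensor Check
"""


def split_record(line):
    """Return (timestamp, all remaining fields) for one log line."""
    fields = line.split()
    stamp = datetime.strptime(' '.join(fields[1:3]), '%Y-%m-%d %H:%M:%S')
    rest = (fields[0], *fields[3:])
    return stamp, rest


def drop_later_duplicates(lines, offset=timedelta(hours=2)):
    """Keep a line unless an otherwise identical line exists `offset` earlier."""
    records = [(line, *split_record(line)) for line in lines if line.strip()]
    seen = {(stamp, rest) for _, stamp, rest in records}
    return [line for line, stamp, rest in records
            if (stamp - offset, rest) not in seen]


kept = drop_later_duplicates(SAMPLE.splitlines())
print('\n'.join(kept))
```

Because the lookup goes through a set, the order of the lines in the log does not matter, which suits a file that is not sorted by date.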

Source

2017-01-12 15:32:14 Tagc

When I saw your post I was 80% done writing almost the same Python code. (You can save one level of indentation by checking 'not file.startswith("input")'.) Upvoted. One issue, though: the author asked to write "deleted" when a duplicate is seen. – hansaplast

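@hansaplast "the author asked to write "deleted" when a duplicate is seen". The OP wrote *"Lines that are the later version of a duplicate (e.g., the first line in the attached example) should be left out."* – Tagc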

@hansaplast Ah, I looked at the body of the question but not at the comments in his code. I'll try to update my answer. – Tagc

I think this short code should do it. It has two consecutive loops instead of nested loops, which improves performance.

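import os
from datetime import datetime, timedelta

# directory taken from the question; adjust as needed
directory = os.path.normpath("C:/Users/sande_000/Documents/Python files")

with open("output.txt", "w") as output:
    for subdir, dirs, files in os.walk(directory):
        for file in files:
            if not file.startswith("input"):
                continue

            # read the file once so that both loops below see every line
            # (a file object cannot be iterated twice without rewinding)
            with open(os.path.join(subdir, file)) as input_file:
                lines = [line for line in input_file if line.strip()]

            entries = set()

            # build up entries
            for line in lines:
                machine = line[0:7]    # stores machine number
                date = datetime.strptime(line[8:27], '%Y-%m-%d %H:%M:%S')

                entries.add((machine, date))

            # check entries
            for line in lines:
                machine = line[0:7]    # stores machine number
                date = datetime.strptime(line[8:27], '%Y-%m-%d %H:%M:%S') - timedelta(hours=2)

                if (machine, date) in entries:
                    output.write("deleted\n")
                else:
                    output.write(line)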

Source

2017-01-12 15:26:41

It looks clean, but after many attempts it does not write anything to output.txt. Just an empty file... – Sander

You may need to call output.flush() after writing. –

Unfortunately, no effect. – Sander
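One plausible explanation for an empty output file in a two-pass approach (an assumption, not something confirmed in this thread) is that a file object is an iterator: once one `for` loop has consumed it, a second loop over the same object yields nothing unless the file is rewound with `seek(0)` or the lines are kept in a list. A minimal sketch, using `io.StringIO` as a stand-in for a file opened with `open()`:

```python
import io

# io.StringIO behaves like an open text file for iteration and seek().
log = io.StringIO("line 1\nline 2\nline 3\n")

first_pass = [line for line in log]
second_pass = [line for line in log]   # iterator is already at end of file

log.seek(0)                            # rewind to the beginning
after_rewind = [line for line in log]

print(len(first_pass), len(second_pass), len(after_rewind))
```

Flushing does not help in that situation, because the loop that writes never runs in the first place.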
