Finding overlaps/ranges between two datasets using pandas or enumerate

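I am trying to perform an interval range operation on two files with the following conditions: check whether chrom is equal, then check whether the start and end in my coordinates file fall within or equal the start and end in the gene_annotation file (if the strand is "+", start and end are, for example, 10-20; if it is "-", they are 20-10). If there is a match, print the coordinates together with the gene_id and gene_name from the gene annotation file.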

chrom start end strand 
    1 4247322 4247912 - 
    1 4427449 4432604 + 
    1 4763414 4764404 - 
    1 4764597 4767606 - 
    1 4764597 4766491 - 
    1 4766882 4767606 - 
    1 4767729 4772649 - 
    1 4767729 4768829 - 
    1 4767729 4775654 - 
    1 4772382 4772649 - 
    1 4772814 4774032 - 
    1 4772814 4774159 - 
    1 4772814 4775654 - 
    1 4772814 4774032 + 
    1 4774186 4775654 - 
    1 4774186 4775654 
    1 4774186 4775699 -

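coordinates_file (shown above; for illustration purposes, only the head of the annotation file is given below)

Number of rows in the annotation file: ~50,000. Number of rows in the coordinates file: ~200,000.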

gene_annotationfile

chrom  start  end    gene_id gene_name strand 
17 71223692 71274336 ENSMUSG00000085299  Gm16627  - 
17 18186448 18211184 ENSMUSG00000067978 Vmn2r-ps113  + 
11 84645863 84684319 ENSMUSG00000020530  Ggnbp2  - 
7 51097639 51106551 ENSMUSG00000074155   Klk5  + 
13 31711037 31712238 ENSMUSG00000087276  Gm11378  +

Desired output

chrom, start, end,strand, gene_id, gene_name 
1  4427432 4432686 + ENSMUSG0001 abcd

Another issue is that in some cases a coordinate can map to more than one gene_id. In that case I want to write:

chrom, start, end,strand, gene_id, gene_name 
1  4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh

My code so far:

import csv 

with open('coordinates.txt', 'r') as source: 
     coordinates = list(csv.reader(source, delimiter="\t")) 

with open('/gene_annotations.txt', 'rU') as source: 
     #if i do not use 'rU' i get this error Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? 
     annotations = list(csv.reader(source, delimiter="\t")) 

for index,line in enumerate(coordinates): 
    for index2, line2 in enumerate(annotations): 
        if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] : 
            print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4]) 
            break

The error I get:

---> 15   if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] : 
16    print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4]) 
17    break 

TypeError: list indices must be integers, not list

Is pandas a good approach for this?

Source

2017-05-12 novicebioinforesearcher

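I assume coordinates is a list of lists, like [[1,2], [3,4]]. The loop

for index,line in enumerate(coordinates):

iterates over coordinates, yielding the position of each row as index and the row itself as line. In the condition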

if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :

The error message means that you are using a list (line) as an index here. You probably meant to use the row index instead:

if coordinates[index][0] == annotations[index2][0] and coordinates[index][1] <= annotations[index2][1] and annotations[index2][2] >= coordinates[index][2] :

Better still, just use the rows directly:

if line[0] == line2[0] and line[1] <= line2[1] and line2[2] >= line[2] :
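Putting the pieces together, here is a minimal sketch of the corrected loop. It assumes the column order shown in the question, and it goes one step beyond the fix above: csv.reader returns every field as a string, so start and end are converted to int before comparing (as strings, "9" sorts after "10"). The function name find_matches and the sample rows are made up for illustration, and the comparison itself is copied unchanged from the question, so adjust it if you need a different notion of overlap. With real files, any header row would have to be skipped first, since int("start") raises a ValueError.

```python
def find_matches(coordinates, annotations):
    """Return (chrom, start, end, gene_id, gene_name) for the first hit of each row."""
    matches = []
    for line in coordinates:
        for line2 in annotations:
            # csv.reader yields strings, so convert before comparing numerically.
            if (line[0] == line2[0]
                    and int(line[1]) <= int(line2[1])
                    and int(line2[2]) >= int(line[2])):
                matches.append((line[0], line[1], line[2], line2[3], line2[4]))
                break  # keep only the first hit, as in the question's code
    return matches


# Made-up rows in the question's column order (not real annotations).
coordinates = [["1", "100", "200", "+"], ["2", "50", "60", "-"]]
annotations = [["1", "150", "250", "GENE_A", "abcd", "+"],
               ["2", "10", "40", "GENE_B", "efgh", "-"]]

for row in find_matches(coordinates, annotations):
    print("\t".join(row))
```

The print call is written in the function form so the sketch also runs on Python 3; the question's code uses the Python 2 print statement.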

See https://docs.python.org/2.7/reference/compound_stmts.html?highlight=for_stmt#grammar-token-for_stmt
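As for whether pandas is a good fit: one possible approach, sketched here under assumptions, is to merge the two tables on chrom and then filter the resulting pairs by position. The helper name annotate, the sample frames, and the containment rule (gene start <= coordinate start and coordinate end <= gene end) are my assumptions rather than anything confirmed by the question. Strand is carried along but not used for matching, and reversed "-" coordinates are not handled. The comma-joined gene_id and gene_name columns follow the multi-match output the question asks for.

```python
import pandas as pd


def annotate(coords, genes):
    """Attach comma-joined gene_id/gene_name to each coordinate row inside a gene."""
    # Pair every coordinate row with every annotation on the same chromosome.
    pairs = coords.merge(genes, on="chrom", suffixes=("", "_gene"))
    # Assumed rule: the coordinate interval lies inside the gene interval.
    inside = (pairs["start_gene"] <= pairs["start"]) & (pairs["end"] <= pairs["end_gene"])
    hits = pairs[inside]
    # Collapse several genes per coordinate row into comma-separated strings.
    return (hits.groupby(["chrom", "start", "end", "strand"], as_index=False)
                .agg({"gene_id": ",".join, "gene_name": ",".join}))


# Made-up sample data using the question's column names.
coords = pd.DataFrame({"chrom": ["1", "1"], "start": [100, 500],
                       "end": [200, 600], "strand": ["+", "-"]})
genes = pd.DataFrame({"chrom": ["1", "1", "1"], "start": [50, 90, 700],
                      "end": [300, 250, 800],
                      "gene_id": ["ID1", "ID2", "ID3"],
                      "gene_name": ["abcd", "efgh", "ijkl"],
                      "strand": ["+", "+", "-"]})

result = annotate(coords, genes)
print(result)
```

Merging on chrom alone builds every coordinate/gene pair per chromosome, which may be memory-hungry at roughly 200,000 by 50,000 rows; processing one chromosome at a time is one way to keep that in check.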

Source

2017-05-12 14:44:30 tobbes76
