2016-05-15 6 views
-1

I need to print all of the raw text of this HTML page, printing each line of the HTML while keeping the correct format.

Each line has this format:

ENSG00000001461'&nbsp';'&nbsp';'&nbsp';'&nbsp';ENST00000432012'&nbsp';'&nbsp';'&nbsp';'&nbsp';NIPAL3'&nbsp';'&nbsp';'&nbsp';'&nbsp';5'&nbsp';'&nbsp';'&nbsp';'&nbsp';1'&nbsp';'&nbsp';'&nbsp';'&nbsp';Forward'&nbsp';'&nbsp';'&nbsp';'&nbsp';NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]<'br/'> 

I would like output like this:

ENSG00000001461 ENST00000432012 NIPAL3 5 1 Forward NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233] 

But the output I get is:

ENSG00000001461 

This is my code:

import urllib 
from bs4 import BeautifulSoup 
species = ['HomoSapiens', 'MusMusculus', 'DrosophilaMelanogaster','CaenorhabditisElegans'] 
rna_target = ['mRNA', 'lincRNA', 'lncRNA'] 
db = ['MB21E78v2', 'MB19E65v2', 'MB16E62v1'] 

species_input = input("Selezionare Specie: ") 
target_input = input("Selezionare tipo di RNA: ") 
db_input = input("Selezionare DataBase: ") 
check = 0 

for i in range(len(species)): 
    if species_input == species[i]: 
     for j in range(len(rna_target)): 
      if target_input == rna_target[j]: 
       for k in range(len(db)): 
        if db_input == db[k]: 
         check = 1 
if check == 1: 
    print("Dati Inseriti Correttamente!") 
else: 
    print("Error: Dati inseriti in modo errato!") 
    exit() 

url = urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/OptionController?>" +"species=" + species_input + "&type=" + target_input + "&version=" +db_input) 
print(url.geturl()) 

identifier = [] 
seq_input = input("Digitare ID miRNA: ") 
seq = "" 
seq = seq_input.split() 
print(seq) 

for i in range(len(seq)): 
    identifier.append(seq[i] + "%20") 
s = "" 
string = s.join(identifier) 

url_tab = urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/InputController?>"+"identifier=" string+"&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&"+"version=" + db_input + "&species=" + species_input + "&type=" + target_input) 
print(url_tab.geturl()) 

download = urllib.request.urlopen(" 
<http://cm.jefferson.edu/rna22/Precomputed/InputController?>download=ALL"+"&ident=" + string+"&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" +"version=" + db_input + "&species=" + species_input + "&type=" + target_input) 
down_string = download.geturl() 
print(down_string) 
soup = BeautifulSoup(download, "html5lib") 
for match in soup.findAll('br'): 
    match.unwrap() 
s2 = soup 
s1 = s2.body.extract() 
print(s1.prettify(formatter=lambda s: s.strip(u'xa0'))) 
+1

Where is the code? – Selcuk

+0

Output from what? –

+0

Please include a [mcve] in the question. Where is the Python? – jonrsharpe

Answers

1

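The source has no concept of lines. The text you want is one single line, separated by br tags, so you need to replace each br tag with a newline. That will give you: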

import requests 
from bs4 import BeautifulSoup 

r = requests.get("https://cm.jefferson.edu/rna22/Precomputed/InputController?download=ALL&ident=hsa_miR_107%20hsa_miR_5011_5p%20hsa_miR_326&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&version=MB21E78v2&species=HomoSapiens&type=mRNA") 

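soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly; bs4 warns when it has to guess
for b in soup.find_all("br"):
    b.replace_with("\n")  # turn each <br> tag into a real line break
print(soup.text)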

ENSG00000001461    ENST00000432012    NIPAL3    5    1    Forward    NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233] 
ENSG00000001631    ENST00000340022    KRIT1    5    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573] 
ENSG00000001631    ENST00000394503    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573] 
ENSG00000001631    ENST00000394505    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573] 
ENSG00000001631    ENST00000394507    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573] 
ENSG00000001631    ENST00000412043    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573] 
ENSG00000002834    ENST00000318008    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513] 
ENSG00000002834    ENST00000433206    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513] 
ENSG00000002834    ENST00000435347    LASP1    5    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513] 
ENSG00000005381    ENST00000225275    MPO    5    17    Reverse    myeloperoxidase [Source:HGNC Symbol;Acc:HGNC:7218] 
ENSG00000005889    ENST00000539115    ZFX    4    23 X    Forward    zinc finger protein, X-linked [Source:HGNC Symbol;Acc:HGNC:12869] 
ENSG00000006432    ENST00000554752    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861] 
ENSG00000006432    ENST00000611979    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861] 
ENSG00000007216    ENST00000314669    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917] 
ENSG00000007216    ENST00000444914    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917] 

And a whole lot more.

If you need to parse the source yourself, you can do the same thing: replace the br tags with newlines and then just pull out the text.
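As a rough illustration of the same idea without a network request, the sketch below uses only the standard library (html and re) instead of BeautifulSoup: it splits the markup on br tags, decodes the entities, and then splits each row on runs of non-breaking spaces. The SAMPLE string and the parse_rows helper are hypothetical, made up for this example from the row format shown in the question.

```python
import html
import re

# Hypothetical sample in the format shown in the question: fields are
# separated by runs of &nbsp; and every row ends with <br/>.
SAMPLE = (
    "ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012&nbsp;&nbsp;&nbsp;&nbsp;"
    "NIPAL3&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;"
    "Forward&nbsp;&nbsp;&nbsp;&nbsp;NIPA-like domain containing 3 "
    "[Source:HGNC Symbol;Acc:HGNC:25233]<br/>"
    "ENSG00000005381&nbsp;&nbsp;&nbsp;&nbsp;ENST00000225275&nbsp;&nbsp;&nbsp;&nbsp;"
    "MPO&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;17&nbsp;&nbsp;&nbsp;&nbsp;"
    "Reverse&nbsp;&nbsp;&nbsp;&nbsp;myeloperoxidase "
    "[Source:HGNC Symbol;Acc:HGNC:7218]<br/>"
)

BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)  # <br>, <br/> or <br />
NBSP_RUN = re.compile("\xa0+")                    # one or more decoded &nbsp;


def parse_rows(markup):
    """Return one list of fields per <br>-terminated row."""
    rows = []
    for chunk in BR_TAG.split(markup):
        # html.unescape turns "&nbsp;" into "\xa0"; strip() also removes "\xa0".
        text = html.unescape(chunk).strip()
        if text:  # skip the empty piece after the final <br/>
            rows.append(NBSP_RUN.split(text))
    return rows


rows = parse_rows(SAMPLE)
for fields in rows:
    print("\t".join(fields))
```

Splitting into fields rather than printing the text directly makes it easy to write the rows out as tab-separated values afterwards, if that is the end goal.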

+0

It works! Thank you very much!! – Federico93

-1

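I tested your code and replaced my previous answer.

Your code seems to work correctly once the following errors are fixed:

  • add a + between "identifier=" and string
  • remove the EOL (the line break inside the URL string) on line 42
  • remove the < from the URL

Some of the lines of output I get are shown below: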

    ENSG00000272325    ENST00000607016    NUDT3    4    6    Reverse    nudix (nucleoside diphosphate linked moiety X)-type motif 3 [Source:HGNC Symbol;Acc:HGNC:8050] 
    ENSG00000272980    ENST00000400926    CCR6    5    6    Forward    chemokine (C-C motif) receptor 6 [Source:HGNC Symbol;Acc:HGNC:1607] 
    ENSG00000274211    ENST00000612932    SOCS7    8    17    Forward    suppressor of cytokine signaling 7 [Source:HGNC Symbol;Acc:HGNC:29846] 
    ENSG00000274588    ENST00000611977    DGKK    4    23 X    Reverse    diacylglycerol kinase, kappa [Source:HGNC Symbol;Acc:HGNC:32395] 
    ENSG00000275004    ENST00000613655    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022] 
    ENSG00000275004    ENST00000619852    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022] 
    ENSG00000275832    ENST00000622683    ARHGAP23    6    17    Forward    Rho GTPase activating protein 23 [Source:HGNC Symbol;Acc:HGNC:29293] 
    ENSG00000277258    ENST00000616199    PCGF2    3    17    Reverse    polycomb group ring finger 2 [Source:HGNC Symbol;Acc:HGNC:12929] 
    ENSG00000278871    ENST00000623344    KDM5D    8    24 Y    Reverse    lysine (K)-specific demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115] 
    ENSG00000279096    ENST00000622918    AL356289.1    11    1    Forward    HCG1780467 {ECO:0000313|EMBL:EAX06861.1}; PRO0529 {ECO:0000313|EMBL:AAF16687.1} [Source:UniProtKB/TrEMBL;Acc:Q9UI23] 
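All three fixes listed above concern the hand-concatenated URL. One way to avoid that class of mistake is to let urllib.parse.urlencode assemble the query string. This is only a sketch: the build_download_url helper and the BASE constant are names invented here, and the parameter names and values are copied from the URLs already shown in this thread, not from any documentation of the service.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the URLs in the question; assumed to be unchanged.
BASE = "https://cm.jefferson.edu/rna22/Precomputed/InputController"


def build_download_url(identifiers, version, species, rna_type):
    """Build the download URL from its parts (hypothetical helper)."""
    params = {
        "download": "ALL",
        "ident": " ".join(identifiers),  # spaces are encoded as %20 below
        "minBasePairs": 12,
        "maxFoldingEnergy": -12,
        "minSumHits": 1,
        "maxProb": ".1",
        "version": version,
        "species": species,
        "type": rna_type,
    }
    # quote_via=quote encodes a space as %20 rather than the default "+".
    return BASE + "?" + urlencode(params, quote_via=quote)


url = build_download_url(
    ["hsa_miR_107", "hsa_miR_5011_5p", "hsa_miR_326"],
    version="MB21E78v2",
    species="HomoSapiens",
    rna_type="mRNA",
)
print(url)
```

Because the pieces are passed as a dictionary, there is no string literal to break across lines and no + to forget between the parts.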
    
+0

This code produces this error: print(myfile.replace("&nbsp;", "\t").replace("<br/>", "\n")) TypeError: a bytes-like object is required, not 'str' – Federico93

+0

Thank you. I have edited my answer. – VePl
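For context on the TypeError reported in the comment above: in Python 3, urlopen(...).read() returns bytes, and bytes.replace only accepts bytes arguments. A minimal sketch of the failure and one possible fix follows. The raw value is a made-up stand-in for the downloaded page, and UTF-8 is an assumed encoding.

```python
import re

# Hypothetical stand-in for what urlopen(...).read() returns: bytes, not str.
raw = b"ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012<br/>"

# Passing str arguments to bytes.replace raises the TypeError from the comment.
try:
    raw.replace("&nbsp;", "\t")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# One fix: decode to str first (UTF-8 is an assumption about the page),
# then do the text replacements.
text = raw.decode("utf-8")
cleaned = re.sub(r"(?:&nbsp;)+", "\t", text).replace("<br/>", "\n")
print(cleaned, end="")
```

The alternative is to stay in bytes and pass bytes literals, for example raw.replace(b"<br/>", b"\n"), but decoding once up front usually keeps the rest of the code simpler.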
