2016-07-06 8 views
1
How to read an irregular text file as a Pandas DataFrame

My file contains text in the following format:
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .
<string1> <string2> "some text as a paragraph" .

string1 and string2 contain no whitespace, and each is followed by a single space. The text inside the double quotes also contains single spaces.

In this case I cannot use pd.read_csv() with sep=" " directly, because the paragraph gets split into an irregular number of columns.
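A minimal sketch of the concern, assuming a naive split on single spaces (the two sample lines are made up for illustration). Note that pandas' own parser treats quoted fields specially, so this only shows what a plain whitespace split would do:

```python
# Hypothetical sample lines in the <string1> <string2> "paragraph" . format.
lines = [
    '<s1> <p1> "one two" .',
    '<s2> <p2> "one two three four" .',
]

# Splitting on every space ignores the quotes, so the number of
# fields depends on how many words the paragraph contains.
field_counts = [len(line.split(" ")) for line in lines]
print(field_counts)
```

The differing counts are what makes the rows "irregular" when quotes are not honored.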

Is there a way to parse a file like this as a dataframe? Maybe something using regex.

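pd.read_csv(file_name, sep=" ") works with the first four lines of data below, but the same code gives an irregular read on the second sample that follows them.

I know I could use rdflib to read this as input, but I'm only using pandas here for very basic add/replace operations. Thank you.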

<http://dbpedia.org/resource/Animalia_(book)> <http://www.w3.org/2000/01/rdf-schema#comment> "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold. A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket."@en . 
<http://dbpedia.org/resource/Assistive_technology> <http://www.w3.org/2000/01/rdf-schema#comment> "Assistive technology is an umbrella term that includes assistive, adaptive, and rehabilitative devices for people with disabilities and also includes the process used in selecting, locating, and using them. Assistive technology promotes greater independence by enabling people to perform tasks that they were formerly unable to accomplish, or had great difficulty accomplishing, by providing enhancements to, or changing methods of interacting with, the technology needed to accomplish such tasks."@en . 
<http://dbpedia.org/resource/A> <http://www.w3.org/2000/01/rdf-schema#comment> "A (named a /ˈeɪ/, plural aes) is the 1st letter and the first vowel in the ISO basic Latin alphabet. It is similar to the Ancient Greek letter alpha, from which it derives. The upper-case version consists of the two slanting sides of a triangle, crossed in the middle by a horizontal bar. The lower-case version can be written in two forms: the double-storey a and single-storey ɑ. The latter is commonly used in handwriting and fonts based on it, especially fonts intended to be read by children."@en . 
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/2000/01/rdf-schema#comment> "Aristotle (/ˈærɪˌstɒtəl/; Greek: Ἀριστοτέλης [aristotélɛːs], Aristotélēs; 384 – 322 BC) was a Greek philosopher and scientist born in the Macedonian city of Stagira, Chalkidice, on the northern periphery of Classical Greece. His father, Nicomachus, died when Aristotle was a child, whereafter Proxenus of Atarneus became his guardian. At eighteen, he joined Plato's Academy in Athens and remained there until the age of thirty-seven (c. 347 BC)."@en . 

The following sample, by contrast, gives an irregular read:

<http://dbpedia.org/resource/Big_Sounds_of_the_Drags> <http://www.w3.org/2000/01/rdf-schema#comment> "Big Sounds of the Drags is the second album by electronic music producer Junkie XL.\"Check Your Basic Groove\" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en . 
<http://dbpedia.org/resource/Sydney_Roosters_Juniors> <http://www.w3.org/2000/01/rdf-schema#comment> "The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en . 
<http://dbpedia.org/resource/A_Shot_at_Glory> <http://www.w3.org/2000/01/rdf-schema#comment> "A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en . 
<http://dbpedia.org/resource/Kumar_Ponnambalam> <http://www.w3.org/2000/01/rdf-schema#comment> "Kumar Ponnambalam (August 12, 1940 – January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en . 
<http://dbpedia.org/resource/Amalia_Mendoza> <http://www.w3.org/2000/01/rdf-schema#comment> "Amalia Mendoza García (10 July 1923 – 11 June 2001), nicknamed La Tariácuri, was a Mexican singer and actress. \"Échame a mi la culpa\" and \"Amarga navidad\" were some of her greatest hits."@en . 
+1

Can you provide a real sample of the input? Thanks. – alecxe

+0

I've added sample text. Thank you. –

+0

Do '<' and '>' actually appear in the real text? Thanks. – alecxe

Answers

1

read_csv() with a backslash as the escape character actually works for me on both of your data samples:

df = pd.read_csv("input.txt", sep=" ", header=None, escapechar="\\").iloc[:, :-1] 
print(df) 

The column slicing is there to avoid the last column, which contains only dots.
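To see why the escape character matters, here is a small sketch that uses the standard-library csv module as a stand-in for pd.read_csv (an assumption made so the example runs without pandas; pandas' parser may differ in details). The sample line is invented:

```python
import csv

# Hypothetical line with backslash-escaped quotes inside the literal.
line = '<s> <p> "A \\"B\\" c"@en .'

# Without an escape character, the first inner quote ends the quoted
# field early and the rest of the paragraph splits on spaces.
broken = next(csv.reader([line], delimiter=" "))

# With escapechar, \" is read as a literal quote inside the field.
fixed = next(csv.reader([line], delimiter=" ", escapechar="\\"))

print(len(broken), broken)
print(len(fixed), fixed)
```

With the escape character set, the row has a stable number of fields, and the final field holds only the terminating dot, which is what the column slice discards.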

+0

I'm sorry for not being clear earlier. I can actually read the input above just fine; the problem starts with another file in the same format. I will edit my question. Sorry. –

+0

@HardikMalhotra No problem, please show a sample of the problematic case as well. Thanks. – alecxe

+0

Thanks for your patience, I've added the problematic case. Thank you –

0

This also works on the sample you labeled as giving irregular reads.

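import re 
import pandas as pd 

# <subject> <predicate> "literal"; the greedy last group runs to the final
# double quote on the line, so escaped quotes inside the literal are kept.
pattern = re.compile(r'^<([^>]*)> <([^>]*)> "(.*)"') 

rows = [] 
# encoding='utf-8' is an assumption; adjust it to match your file.
with open('input.txt', 'r', encoding='utf-8') as f: 
    for line in f: 
        m = pattern.match(line) 
        if m is None:  # skip blank or malformed lines instead of raising 
            continue 
        rows.append(m.groups()) 

df = pd.DataFrame(rows, columns=['col1', 'col2', 'col3']) 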
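The regex approach can be checked on a single in-memory line before running it over a whole file. The sketch below is an illustration only: the names TRIPLE and parse_line, the example.org URLs, and the optional language-tag group are assumptions, not part of the answer above.

```python
import re

# Hypothetical pattern: the two bracketed fields cannot contain '>', and the
# greedy literal group runs to the last double quote on the line.
TRIPLE = re.compile(r'^<([^>]*)> <([^>]*)> "(.*)"(?:@([\w-]+))? \.\s*$')

def parse_line(line):
    """Return (subject, predicate, literal, lang) or None if no match."""
    m = TRIPLE.match(line)
    return m.groups() if m else None

line = '<http://example.org/a> <http://example.org/p> "He said \\"hi\\" twice."@en . \n'
row = parse_line(line)
print(row)
print(parse_line("\n"))  # blank lines do not match
```

Returning None for non-matching lines lets the caller decide whether to skip or report them.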
+0

That looks good. I'll try it out in a little while and post an update. Thanks. Also, could you look into why I'm getting different results for two similar files? Thanks –

0

Am I missing something? Why not just use split()?

# -*- coding: utf-8 -*- 
sample = """\ 
<string1> <string2> "some text as a paragraph" . 
<string1> <string2> "some text as a paragraph" . 
<string1> <string2> "some text as a paragraph" . 
<string1> <string2> "some text as a paragraph" .""".splitlines() 

sample = """\ 
<http://dbpedia.org/resource/Big_Sounds_of_the_Drags> <http://www.w3.org/2000/01/rdf-schema#comment> "Big Sounds of the Drags is the second album by electronic music producer Junkie XL.\"Check Your Basic Groove\" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en . 
<http://dbpedia.org/resource/Sydney_Roosters_Juniors> <http://www.w3.org/2000/01/rdf-schema#comment> "The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en . 
<http://dbpedia.org/resource/A_Shot_at_Glory> <http://www.w3.org/2000/01/rdf-schema#comment> "A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en . 
<http://dbpedia.org/resource/Kumar_Ponnambalam> <http://www.w3.org/2000/01/rdf-schema#comment> "Kumar Ponnambalam (August 12, 1940 – January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en . 
<http://dbpedia.org/resource/Amalia_Mendoza> <http://www.w3.org/2000/01/rdf-schema#comment> "Amalia Mendoza García (10 July 1923 – 11 June 2001), nicknamed La Tariácuri, was a Mexican singer and actress. \"Échame a mi la culpa\" and \"Amarga navidad\" were some of her greatest hits."@en .""".splitlines() 

data = [s.split(None,2) for s in sample] 

for d in data: 
    print(d) 

gives:

['<string1>', '<string2>', '"some text as a paragraph" .'] 
['<string1>', '<string2>', '"some text as a paragraph" .'] 
['<string1>', '<string2>', '"some text as a paragraph" .'] 
['<string1>', '<string2>', '"some text as a paragraph" .'] 

['<http://dbpedia.org/resource/Big_Sounds_of_the_Drags>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Big Sounds of the Drags is the second album by electronic music producer Junkie XL."Check Your Basic Groove" has an unusual introduction. This portion begins with the sounds of various farm animals (cows for example), then more layers of sound effects are added (including a supercar) until the song segues to the music."@en .'] 
['<http://dbpedia.org/resource/Sydney_Roosters_Juniors>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"The Sydney Roosters Juniors is officially known as the Eastern Suburbs District Junior Rugby League. It is an affiliation of junior clubs in the Eastern Suburbs area, covering the Woollahra and Waverley local government areas (LGAs), the northern parts of the Randwick LGA and also the eastern areas of the City of Sydney LGA."@en .'] 
['<http://dbpedia.org/resource/A_Shot_at_Glory>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"A Shot at Glory is a film by Michael Corrente produced in 1999 and released in 2001, starring Robert Duvall and the Scottish football player Ally McCoist. It had limited commercial and critical success. The film features the fictional Scottish football club Kilnockie, as they attempt to reach their first Scottish Cup Final. The final game is against Rangers."@en .'] 
['<http://dbpedia.org/resource/Kumar_Ponnambalam>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Kumar Ponnambalam (August 12, 1940 \x96 January 5, 2000) was a prominent defence lawyer and a controversial minority Tamil nationalist politician from Sri Lanka. He was shot dead by unknown gunmen immediately after a suspected LTTE suicide bomb attack against the then president Chandrika Kumaratunga."@en .'] 
['<http://dbpedia.org/resource/Amalia_Mendoza>', '<http://www.w3.org/2000/01/rdf-schema#comment>', '"Amalia Mendoza Garc\xeda (10 July 1923 \x96 11 June 2001), nicknamed La Tari\xe1curi, was a Mexican singer and actress. "\xc9chame a mi la culpa" and "Amarga navidad" were some of her greatest hits."@en .'] 
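As the output shows, the third element still carries the trailing " ." terminator, and the first two keep their angle brackets. A minimal sketch of one way to tidy that up, assuming every line ends with the dot terminator (split_triple is a hypothetical helper name):

```python
def split_triple(line):
    """Split one line into (subject, predicate, literal)."""
    subject, predicate, rest = line.split(None, 2)
    literal = rest.rstrip()
    if literal.endswith("."):
        literal = literal[:-1].rstrip()  # drop the statement terminator
    # Remove the surrounding angle brackets from the first two fields.
    return subject.strip("<>"), predicate.strip("<>"), literal

row = split_triple('<string1> <string2> "some text as a paragraph" . ')
print(row)
```

The literal keeps its quotes and any language tag such as @en, so nothing from the paragraph itself is lost.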

To load the data from your input file, use:

with open('big_honking_file.dat') as sample: 
    data = [s.split(None,2) for s in sample] 

(This reads the input file one line at a time, so you never end up with two copies of the entire dataset in memory.)

To make simple row operations on this list easier, check out littletable (https://pypi.python.org/pypi/littletable/) - it is probably more lightweight than using pandas.

from littletable import Table 
data = Table() 
with open('sample.txt') as sample: 
    data.insert_many(s.split(None,2) for s in sample) 
+0

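Hi, yes, split() does the trick, but this file is actually very large, about 1.5 GB. I'm using pandas for simple row operations. Reading the whole file into sample and then putting it into a dataframe seems to use twice the memory for the same task. Please correct me if I'm wrong. Thanks! –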
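One possible way to address the memory concern is to parse the file in fixed-size batches, so that only one batch of rows is held at a time. This is a sketch under assumptions: iter_batches is a hypothetical helper, and an in-memory io.StringIO stands in for the real file. Each batch could then be turned into a DataFrame and processed before the next one is read.

```python
import io
from itertools import islice

def iter_batches(lines, batch_size):
    """Yield lists of at most batch_size parsed rows."""
    # The generator expression parses lines lazily, skipping blank ones.
    rows = (line.split(None, 2) for line in lines if line.strip())
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for open('big_honking_file.dat').
fake_file = io.StringIO(
    '<s1> <p> "first paragraph" .\n'
    '<s2> <p> "second paragraph" .\n'
    '<s3> <p> "third paragraph" .\n'
)
batch_sizes = [len(batch) for batch in iter_batches(fake_file, batch_size=2)]
print(batch_sizes)
```

Peak memory then depends on batch_size rather than on the size of the file.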
