Serializing StanfordNLP output with Protobuf and reading it in Python

I would like to output StanfordNLP results in protobuf (since the files are much smaller) and then read the results back in Python. How should I do that?

I followed the linked instructions to serialize the results with ProtobufAnnotationSerializer, like this:

java -cp "stanford-corenlp-full-2015-12-09/*" \
edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit \
-file input.txt \
-outputFormat serialized \
-outputSerializer \
edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

Then I used protoc to compile the CoreNLP.proto that ships with the StanfordNLP source code into a Python module, like this:

protoc --python_out=. CoreNLP.proto

Then, in Python, I read the file back like this:

import CoreNLP_pb2 
doc = CoreNLP_pb2.Document() 
doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())

Parsing fails with the following error message:

--------------------------------------------------------------------------- 
DecodeError        Traceback (most recent call last) 
<ipython-input-213-d8eaeb9c2048> in <module>() 
     1 doc = CoreNLP_pb2.Document() 
----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read()) 

/usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized) 
    183  """ 
    184  self.Clear() 
--> 185  self.MergeFromString(serialized) 
    186 
    187 def SerializeToString(self): 

/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized) 
    1092   # The only reason _InternalParse would return early is if it 
    1093   # encountered an end-group tag. 
-> 1094   raise message_mod.DecodeError('Unexpected end-group tag.') 
    1095  except (IndexError, TypeError): 
    1096  # Now ord(buf[p:p+1]) == ord('') gets TypeError. 

DecodeError: Unexpected end-group tag.

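UPDATE:

I asked the author of the serializer, Gabor Angeli, and got an answer. The protobuf object is written to the file with writeDelimitedTo at the linked line. After changing that call to writeTo, the output file can be read in Python.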
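
To make the difference between the two calls concrete, here is a minimal pure-Python sketch of length-delimited framing: each message is preceded by its byte length encoded as a base-128 varint, which is why a parser expecting a bare message chokes on the first bytes. The function names are hypothetical and written only for illustration (they are not part of the protobuf library), the payloads are arbitrary byte strings standing in for serialized messages, and Python 3 is assumed.

```python
def encode_varint(value):
    """Encode a non-negative int as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at data[pos]; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def write_delimited(payloads):
    """Concatenate payloads, each prefixed with its varint-encoded length."""
    return b"".join(encode_varint(len(p)) + p for p in payloads)


def split_delimited(data):
    """Split a length-delimited byte stream back into its payloads."""
    frames = []
    pos = 0
    while pos < len(data):
        size, pos = decode_varint(data, pos)
        frames.append(data[pos:pos + size])
        pos += size
    return frames
```

A stream produced this way starts with a length prefix rather than a field tag, so it has to be split into frames before each frame is handed to a message parser.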

2016-09-11 shaoyl85

Which version of protoc are you running? 'protoc --version' – sberry

@sberry: It prints "libprotoc 3.0.0". – shaoyl85

That could also turn out to be a problem (I don't know which version was used to generate the .java files), but what my answer covers comes first. – sberry

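It looks like this question has come up again, so I thought I'd write a proper answer. The root of the problem is that the proto is written using Java's writeDelimitedTo method, which Google has not implemented for Python. A workaround is to read the proto file with the following function (it assumes the file is not gzipped; if it is, you can replace f.read() with appropriate code to decompress the file):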

# Note: _DecodeVarint is a private helper of the protobuf package.
from google.protobuf.internal.decoder import _DecodeVarint
import CoreNLP_pb2

def readCoreNLPProtoFile(protoFile):
    """Read every length-delimited Document from protoFile into a list."""
    protos = []
    with open(protoFile, 'rb') as f:
        # -- Read the file --
        data = f.read()
        # -- Parse the file --
        # In Java, there's a parseDelimitedFrom() method that makes this easier
        pos = 0
        while (pos < len(data)):
            # (read the length prefix, then the proto that follows it)
            (size, pos) = _DecodeVarint(data, pos)
            proto = CoreNLP_pb2.Document()
            proto.ParseFromString(data[pos:(pos+size)])
            pos += size
            # (add the proto to the list; or, `yield proto`)
            protos.append(proto)
    return protos

The CoreNLP_pb2 file is compiled from the CoreNLP.proto file in the repo, using the command:

protoc --python_out /path/to/output/ /path/to/CoreNLP.proto

Note that at the time of writing (version 3.7.0), the format is proto2, not proto3.
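
The answer leaves the gzipped case to the reader. One possible way to handle it is sketched below, assuming Python 3 and using only the standard library. The helper name read_maybe_gzipped is hypothetical, and checking for the two-byte gzip magic number is a heuristic rather than anything the CoreNLP tooling prescribes.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzipped(path):
    """Return the bytes stored at `path`, decompressing them if they are gzip data."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] == GZIP_MAGIC:
        # The file is compressed: hand back the decompressed payload instead.
        return gzip.decompress(data)
    return data
```

The returned bytes could then take the place of `data = f.read()` in readCoreNLPProtoFile, so the same parsing loop works for both compressed and uncompressed files.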

2016-12-04 22:12:55
