What is a good way to cut a string that contains 2-byte characters in Java?

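I am writing a method that builds fixed-length messages for an interface between two other systems.

Each item in the message must be sent with an agreed length in bytes. If an item is longer than its agreed length, it is truncated to that length.

The messages contain 2-byte characters, so if the cut falls in the middle of a character, that character ends up broken.

To find the correct byte position, I scan from the start of the string up to the cut length. If the message is very long, performance will be poor.

I could not find a better way, so I am asking for help here. I apologize that the code is complicated and verbose. The whole project is here.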

package thecodinglog.string; 

public class StringHelper { 

public static String substrb2(String str, Number beginByte) { 
    return substrb2(str, beginByte, null, null, null); 
} 

public static String substrb2(String str, Number beginByte, Number byteLength) { 
    return substrb2(str, beginByte, byteLength, null, null); 
} 

/** 
* Returns the substring of the String. 
* It returns a string as specified length and byte position. 
* You can pad characters left or right when there is a specified length. 
* It distinguishes between 1 byte character and 2 byte character and returns it exactly as specified byte length. 
* If the start position or the specified length causes a 2-byte character to be truncated in the middle, 
* it will be converted to Space. 
* You can specify either left or right padding. 
* 
* If beginByte is 0, it is changed to 1 and processed. 
* If beginByte is less than 0, the string is searched for from right to left. 
* If beginByte or byteLength is a real number, the decimal point is discarded. 
* If you do not specify a length, returns everything from the starting position to the right-end string. 
* 
* Examples: 
* <blockquote><pre> 
*  StringHelper.substrb2("a好호b", 1, 10, null, "|") returns "a好호b||||" 
*  StringHelper.substrb2("ab한글", 4, 2) returns " " 
*  StringHelper.substrb2("한a글", -3, 2) returns "a " 
*  StringHelper.substrb2("abcde한글이han gul다ykd", 7) returns " 글이han gul다ykd" 
* </pre></blockquote> 
* 
* @param str a string to substring 
* @param beginByte the beginning byte 
* @param byteLength length of bytes 
* @param leftPadding a character for padding. It must be 1 byte character. 
* @param rightPadding a character for padding. It must be 1 byte character. 
* @return a substring 
*/ 
public static String substrb2(String str, Number beginByte, Number byteLength, String leftPadding, String rightPadding) { 
    if (str == null || str.equals("")) { 
     throw new IllegalArgumentException("The source string can not be an empty string or null."); 
    } 

    if (leftPadding != null && rightPadding != null) { 
     throw new IllegalArgumentException("Left padding, right padding Either of two must be null."); 
    } 

    if (leftPadding != null) { 
     if (leftPadding.length() != 1) { 
      throw new IllegalArgumentException("The length of the padding string must be one."); 
     } 
     if (getByteLengthOfChar(leftPadding.charAt(0)) != 1) { 
      throw new IllegalArgumentException("The padding string must be 1 Byte character."); 
     } 
    } 

    if (rightPadding != null) { 
     if (rightPadding.length() != 1) { 
      throw new IllegalArgumentException("The length of the padding string must be one."); 
     } 
     if (getByteLengthOfChar(rightPadding.charAt(0)) != 1) { 
      throw new IllegalArgumentException("The padding string must be 1 Byte character."); 
     } 
    } 

    int beginPosition = beginByte.intValue(); 
    if (beginPosition == 0) beginPosition = 1; 

    int length; 
    if (byteLength != null) { 
     length = byteLength.intValue(); 
     if (length < 0) { 
      return null; 
     } 
    } else { 
     length = -1; 
    } 

    if (length == 0) 
     return null; 

    boolean beginHalf = false; 
    int accByte = 0; 
    int startIndex = -1; 

    if (beginPosition >= 0) { 
     for (int i = 0; i < str.length(); i++) { 
      if (beginPosition - 1 == accByte) { 
       startIndex = i; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } else if (beginPosition == accByte) { 
       beginHalf = true; 
       startIndex = i; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } else if (accByte + 2 == beginPosition && i == str.length() - 1) { 
       beginHalf = true; 
       accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
       break; 
      } 
      accByte = accByte + getByteLengthOfChar(str.charAt(i)); 
     } 
    } else { 
     beginPosition = beginPosition * -1; 
     if(length > beginPosition){ 
      length = beginPosition; 
     } 

     for (int i = str.length() - 1; i >= 0; i--) { 

      accByte = accByte + getByteLengthOfChar(str.charAt(i)); 

      if (i == str.length() - 1) { 
       if (getByteLengthOfChar(str.charAt(i)) == 1) { 
        if (beginPosition == accByte) { 
         startIndex = i; 
         break; 
        } 
       } else { 
        if (beginPosition == accByte) { 
         if (length > 1) { 
          startIndex = i; 
          break; 
         } else { 
          beginHalf = true; 
          break; 
         } 
        }else if(beginPosition == accByte - 1){ 
         if(length == 1){ 
          beginHalf = true; 
          break; 
         } 
        } 
       } 
      } else { 
       if (getByteLengthOfChar(str.charAt(i)) == 1) { 
        if (beginPosition == accByte) { 
         startIndex = i; 
         break; 
        } 
       } else { 
        if (beginPosition == accByte) { 
         if (length > 1) { 
          startIndex = i; 
          break; 
         } else { 
          beginHalf = true; 
          break; 
         } 

        } else if(beginPosition == accByte - 1) { 
         if(length > 1){ 
          startIndex = i + 1; 
         } 
         beginHalf = true; 
         break; 

        } 
       } 

      } 
     } 
    } 


    if (accByte < beginPosition) { 
     throw new IndexOutOfBoundsException("The start position is larger than the length of the original string."); 
    } 


    StringBuilder stringBuilder = new StringBuilder(); 
    int accSubstrLength = 0; 

    if (beginHalf) { 
     stringBuilder.append(" "); 
     accSubstrLength++; 
    } 


    if (byteLength == null) { 
     stringBuilder.append(str.substring(startIndex)); 
     return new String(stringBuilder); 
    } 


    for (int i = startIndex; i < str.length() && startIndex >= 0; i++) { 
     accSubstrLength = accSubstrLength + getByteLengthOfChar(str.charAt(i)); 
     if (accSubstrLength == length) { 
      stringBuilder.append(str.charAt(i)); 
      break; 
     } else if (accSubstrLength - 1 == length) { 
       stringBuilder.append(" "); 
      break; 
     } else if (accSubstrLength - 1 > length) { 

      break; 
     } 
     stringBuilder.append(str.charAt(i)); 
    } 

    if (leftPadding != null) { 
     int diffLength = byteLength.intValue() - accSubstrLength; 
     StringBuilder padding = new StringBuilder(); 
     for (int i = 0; i < diffLength; i++) { 
      padding.append(leftPadding); 
     } 
     stringBuilder.insert(0, padding); 
    } 

    if (rightPadding != null) { 
     int diffLength = byteLength.intValue() - accSubstrLength; 
     StringBuilder padding = new StringBuilder(); 
     for (int i = 0; i < diffLength; i++) { 
      padding.append(rightPadding); 
     } 
     stringBuilder.append(padding); 
    } 


    return new String(stringBuilder); 
} 

private static int getByteLengthOfChar(char c) { 
    if ((int) c < 128) { 
     return 1; 
    } else { 
     return 2; 
    } 
} 
}

I also tried the new code below. I expected "글", but got "畸邦".

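import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

String testData = "한글이가득";

Charset charset = Charset.forName("EUC-KR");
ByteBuffer byteBuffer = charset.encode(testData);

// Take bytes 1..4 of the encoded string: the cut starts inside the first character
byte[] newone = Arrays.copyOfRange(byteBuffer.array(), 1, 5);

// Replace anything the decoder cannot decode with a space
CharsetDecoder charsetDecoder = charset.newDecoder()
     .replaceWith(" ")
     .onMalformedInput(CodingErrorAction.REPLACE)
     .onUnmappableCharacter(CodingErrorAction.REPLACE);

CharBuffer charBuffer = charsetDecoder.decode(ByteBuffer.wrap(newone));

System.out.println(charBuffer.toString());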

I think the start index has to be at a position where decoding can begin correctly, but I do not see a way to tell the method what I want.

Example added:

index | 0 1  | 2 3  | 4 5  | 6 7  | 8 9
----- | ---- | ---- | ---- | ---- | ----
char  | 한   | 글   | 이   | 가   | 득
hex   | c7d1 | b1db | c0cc | b0a1 | b5e6

Suppose the start index is 1 and the length is 4 bytes. The extracted hex codes are then:

index | 0 1  | 2 3  | 4 5  | 6 7  | 8 9
----- | ---- | ---- | ---- | ---- | ----
char  | 한   | 글   | 이   | 가   | 득
hex   | c7d1 | b1db | c0cc | b0a1 | b5e6
sub   |   d1 | b1db | c0   |      |

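When the decoder decodes d1b1dbc0, it goes wrong: it treats d1b1 as one character and dbc0 as another. This may vary by character set, but that is what happens in this case. Unless the decoder knows the byte boundaries of the original characters, it cannot tell where a character starts, so it decodes the bytes into the wrong characters.

I think the key to this approach is finding a way to tell the decoder the start position (in bytes) of each original character.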
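One way to give the cut that boundary information is to never hand raw bytes to a decoder at all, and instead walk the original string while tracking each character's byte offset. The following is only a sketch under assumptions: the class and method names (`ByteWindowSketch`, `substringByBytes`) are hypothetical, `beginByte` is 0-based as in the table above, the charset is assumed to be stateless (true for EUC-KR), and a character that is only partly inside the window is replaced by one space per overlapping byte.

```java
import java.nio.charset.Charset;

final class ByteWindowSketch {

    // Hypothetical helper: returns the characters covering encoded bytes
    // [beginByte, beginByte + byteLength). A character that is only partly
    // inside that window becomes one space per overlapping byte.
    static String substringByBytes(String str, Charset charset, int beginByte, int byteLength) {
        int endByte = beginByte + byteLength;
        StringBuilder out = new StringBuilder();
        int offset = 0; // byte offset of the current character in the encoded form
        for (int i = 0; i < str.length() && offset < endByte; ) {
            int codePoint = str.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            // Assumes a stateless charset, so one character can be encoded on its own
            int charEnd = offset + ch.getBytes(charset).length;
            if (offset >= beginByte && charEnd <= endByte) {
                out.append(ch); // fully inside the window
            } else if (charEnd > beginByte) {
                // Straddles the start or the end of the window
                int overlap = Math.min(charEnd, endByte) - Math.max(offset, beginByte);
                for (int k = 0; k < overlap; k++) {
                    out.append(' ');
                }
            }
            offset = charEnd;
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        // Bytes 1..4 of "한글이가득": half of 한, all of 글, half of 이
        System.out.println("[" + substringByBytes("한글이가득", eucKr, 1, 4) + "]"); // prints [ 글 ]
    }
}
```

This is still a linear scan from the start of the string, so it does not remove the performance concern, but it is a single pass and stops as soon as the window has been filled.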

Source

2017-08-08 JeongjinKim

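You do realize that a char is 2 bytes in Java? – Rodney

That is a lot of code to ask people to read... Please see how to create a [mcve]. –

Your whole question could be rephrased as: "How do I find the longest truncation of a string whose byte representation fits within a given length?" If so, I would use a 'CharsetEncoder', add 'char' by 'char', and wait until the result overflows (or see the 'encodeLoop' method). – GPI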

It is easier to convert the whole string to a byte[] and cut the array. Then try to convert the array piece back to a String. If the conversion fails, skip the last byte of the piece.
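A minimal sketch of that idea, not code from the answer itself: the names `CutThenDecodeSketch` and `truncateToBytes` are hypothetical, and the decoder is set to `CodingErrorAction.REPORT` so that a dangling half character raises a `CharacterCodingException` instead of being silently replaced. It only repairs the cut at the end of the piece; a piece that starts in the middle of a character can still be misread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class CutThenDecodeSketch {

    // Hypothetical helper: longest prefix of str whose encoding fits in maxBytes.
    static String truncateToBytes(String str, Charset charset, int maxBytes) {
        byte[] all = str.getBytes(charset);
        int length = Math.min(maxBytes, all.length);
        while (length > 0) {
            try {
                // REPORT makes the decoder throw on an incomplete trailing character
                return charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(all, 0, length))
                        .toString();
            } catch (CharacterCodingException e) {
                length--; // drop the last byte of the piece and try again
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        // 9 bytes would split 득 in half, so only the first 8 bytes are kept
        System.out.println(truncateToBytes("한글이가득", eucKr, 9)); // prints 한글이가
    }
}
```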

Source

2017-08-08 14:25:06 StanislavL

I already tried that, but there is a problem when the position falls in the middle of a 2-byte character. – JeongjinKim

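There are NIO methods for this. With CharsetEncoder#encode you can encode a String (or rather a CharBuffer, but the conversion is trivial) into a byte array (actually a ByteBuffer) in such a way that as many whole characters as possible are converted, stopping either when the input is fully processed or just before the output would overflow.

From the Javadoc: "CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer."

Following your edit, here is an example using the string 한글이가득 with the EUC-KR encoding (I am still not sure what you want to achieve, so this is my best guess).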

First, here is the byte representation of each character:

Char | 한 | 글 | 이 | 가 | 득 
---- | ---- | ---- | ---- | ---- | ---- 
hex | c7d1 | b1db | c0cc | b0a1 | b5e6

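So the whole string needs 10 bytes to write.

Now let's say we have a message length of 9 bytes. That lets us send 한글이가 (8 bytes), which is 0xc7d1b1dbc0ccb0a1, but there is no room to send 득 (0xb5e6 needs 2 bytes and only 1 remains), so the rest of the buffer is left blank.

In practice: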

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

String testData = "한글이가득";
Charset charset = Charset.forName("EUC-KR");
CharsetEncoder encoder = charset.newEncoder();
// We create a 9-byte buffer
ByteBuffer limitedSizeOutput = ByteBuffer.allocate(9);
// We encode
CoderResult coderResult = encoder.encode(CharBuffer.wrap(testData.toCharArray()), limitedSizeOutput, true);
// The encoder tells us that it could not fit all the chars in 9 bytes
System.out.println(coderResult); // prints OVERFLOW
// We can check that it encoded 8 bytes out of the 10 that compose the original string data
limitedSizeOutput.flip();
System.out.println(limitedSizeOutput.limit()); // prints 8
// We can see that these are in effect 한글이가 by reading the buffer
System.out.println(charset.newDecoder().decode(limitedSizeOutput).toString());
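The same encoder behavior can be wrapped into a helper that produces a complete fixed-length field, with the unused tail left as blanks. This is a sketch under assumptions rather than part of the answer: the names `FixedLengthFieldSketch` and `toFixedLengthField` are hypothetical, the pad byte is an ASCII space (which assumes an ASCII-compatible charset such as EUC-KR), and unmappable characters are replaced rather than reported.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

final class FixedLengthFieldSketch {

    // Hypothetical helper: encodes as many whole characters as fit into
    // fieldBytes and leaves the remaining bytes as ASCII spaces.
    static byte[] toFixedLengthField(String value, Charset charset, int fieldBytes) {
        byte[] field = new byte[fieldBytes];
        Arrays.fill(field, (byte) ' '); // pre-fill with the pad byte
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The encoder writes straight into the array and stops with OVERFLOW
        // before any character that does not fit completely
        encoder.encode(CharBuffer.wrap(value), ByteBuffer.wrap(field), true);
        return field;
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        byte[] field = toFixedLengthField("한글이가득", eucKr, 9);
        System.out.println(field.length); // prints 9
        System.out.println("[" + new String(field, eucKr) + "]"); // prints [한글이가 ]
    }
}
```

Like the example above, this only covers a field that starts on a character boundary.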

Source

2017-08-08 14:36:49 GPI

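Thank you for the reply. I tried using the NIO methods, but I could not get the result I wanted. See the code above. – JeongjinKim

@JeongjinKim I edited my answer for clarity. If this does not work for you, then I have probably misunderstood your intent. Could you clarify as well? – GPI

Actually, cutting from the beginning of the string (index 0) is no problem. The complication arises when I want to cut from the middle of the string. I will edit the question with more detail. – JeongjinKim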
