How do I truncate a Java string so that it fits in a given number of bytes once it is UTF-8 encoded?

Here is a simple loop that counts how large the UTF-8 representation will be and truncates as soon as the limit would be exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0; // UTF-8 bytes accounted for so far
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

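This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence rather than two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string that fits. If your implementation ignores surrogate pairs, the truncated string may be shorter than it needs to be.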
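The byte counts that the loop relies on can be checked directly against the JDK encoder. The following is a minimal sketch; the class name Utf8Lengths and the helper utf8Length are made up for illustration and are not part of the answer's code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes the JDK's UTF-8 encoder produces for s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // U+0061, prints 1
        System.out.println(utf8Length("\u0080"));       // prints 2
        System.out.println(utf8Length("\u0800"));       // prints 3
        // U+1D11E is stored as two Java chars (one surrogate pair)
        // but is encoded as a single 4-byte sequence, not 3 + 3 bytes.
        System.out.println(utf8Length("\uD834\uDD1E")); // prints 4
    }
}
```

The last line is the case that matters for truncation: two chars in the string account for four bytes in total, which is why the loop adds 4 and skips the second char.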

I haven't tested this code extensively, but here are some preliminary tests:

import java.nio.charset.Charset;

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    // 1-byte characters only
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    // a 2-byte character in the middle
    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    // a 3-byte character in the middle
    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}
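Since the tests above cover only a few hand-picked strings, one way to gain more confidence is to compare the loop against a slow but obvious reference for every byte limit. This is only a sketch under stated assumptions: the names TruncateCheck, bruteForce and agree are invented here, the input is assumed to contain only well-formed surrogate pairs, and the loop is repeated in condensed form so that the class compiles on its own.

```java
import java.nio.charset.StandardCharsets;

public class TruncateCheck {
    // Condensed copy of the loop from the answer, repeated so this class is self-contained.
    static String truncateWhenUTF8(String s, int maxBytes) {
        int b = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int skip = 0;
            int more;
            if (c <= 0x007f) { more = 1; }
            else if (c <= 0x07ff) { more = 2; }
            else if (c <= 0xd7ff) { more = 3; }
            else if (c <= 0xdfff) { more = 4; skip = 1; }
            else { more = 3; }
            if (b + more > maxBytes) {
                return s.substring(0, i);
            }
            b += more;
            i += skip;
        }
        return s;
    }

    // Hypothetical reference: encode every prefix that ends on a code point
    // boundary and keep the longest one whose UTF-8 form still fits.
    static String bruteForce(String s, int maxBytes) {
        String best = "";
        int i = 0;
        while (true) {
            String prefix = s.substring(0, i);
            if (prefix.getBytes(StandardCharsets.UTF_8).length > maxBytes) {
                break;
            }
            best = prefix;
            if (i >= s.length()) {
                break;
            }
            i += Character.charCount(s.codePointAt(i)); // step over a whole code point
        }
        return best;
    }

    // True if both methods return the same string for every limit from 0 to one past the full length.
    static boolean agree(String s) {
        int total = s.getBytes(StandardCharsets.UTF_8).length;
        for (int max = 0; max <= total + 1; max++) {
            if (!truncateWhenUTF8(s, max).equals(bruteForce(s, max))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "abcd", "a\u0080b", "a\u0800b", "\uD834\uDD1E", "a\u0800b\uD834\uDD1Ec" };
        for (String s : samples) {
            System.out.println(agree(s) ? "OK" : "MISMATCH");
        }
    }
}
```

The reference re-encodes every prefix, so it is far too slow for production use; its only purpose is to be simple enough to trust when checking the counting loop.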

Updated: modified the code example so that it now handles surrogate pairs.

