Fixing garbled Chinese text when reading txt files on iOS


0x00 Preface

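I have recently been working on an outsourced project for my company: a novel-reading app that, besides plain-text novels, also offers audiobooks, study material, ASMR sleep aids, and other features. This post covers how to fix garbled text when reading txt novels.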

TL;DR: skip straight to the solution

0x01 What I ran into

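On Windows, txt files are usually encoded as GBK, while on macOS they are usually UTF-8. So on iOS, reading such a file directly as UTF-8 does not return the expected data; content comes back nil and error is set: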

NSError *error;
NSString *content = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
NSLog(@"UTF-8, error = %@", error);
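To see why the UTF-8 read fails, here is a minimal sketch that works on raw bytes instead of a file. The helper names SampleGBKData and DecodeData are my own, not part of any API; the four bytes are the GBK encoding of "中文".

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: the GBK byte sequence for the two characters "中文".
static NSData *SampleGBKData(void) {
    const unsigned char bytes[] = {0xD6, 0xD0, 0xCE, 0xC4};
    return [NSData dataWithBytes:bytes length:sizeof(bytes)];
}

// Hypothetical helper: decodes data with the given encoding.
// Returns nil when the bytes are not valid in that encoding.
static NSString *DecodeData(NSData *data, NSStringEncoding encoding) {
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```

Calling DecodeData(SampleGBKData(), NSUTF8StringEncoding) gives nil, because 0xD6 0xD0 is not a valid UTF-8 sequence, while passing a GB encoding obtained from CFStringConvertEncodingToNSStringEncoding decodes the same bytes.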

Since UTF-8 does not work, we switch to the matching encoding:

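// Fall back to Chinese encodings when UTF-8 decoding fails.
if (!content) {
    error = nil;
    // 0x80000632 is the NSStringEncoding value for kCFStringEncodingGB_18030_2000.
    content = [NSString stringWithContentsOfURL:url encoding:0x80000632 error:&error];
    NSLog(@"GB 18030 (0x632), error = %@", error);
}
if (!content) {
    error = nil;
    // 0x80000631 is the NSStringEncoding value for kCFStringEncodingGBK_95.
    content = [NSString stringWithContentsOfURL:url encoding:0x80000631 error:&error];
    NSLog(@"GBK (0x631), error = %@", error);
}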

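With this, GBK-encoded files can "generally" be read correctly. Why "generally", and why the quotes? Because in some cases neither of the two attempts above can read a GBK-encoded txt file, even though an editor such as VS Code on a Mac displays it correctly when set to GBK.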
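The two raw values 0x80000632 and 0x80000631 are hard to read. As far as I can tell, on Apple platforms they are simply the CoreFoundation constants kCFStringEncodingGB_18030_2000 (0x0632) and kCFStringEncodingGBK_95 (0x0631) converted to NSStringEncoding. A small sketch, with function names that are my own, derives them instead of hard-coding them:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: NSStringEncoding for GB 18030, derived from the CF constant.
static NSStringEncoding GB18030Encoding(void) {
    return CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
}

// Hypothetical helper: NSStringEncoding for GBK (the Windows 95 variant).
static NSStringEncoding GBK95Encoding(void) {
    return CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95);
}
```

Using the named constants makes it obvious that the first attempt is really GB 18030 and only the second one is GBK.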

I have run into two kinds of read failure. Let's start with the first.

0x02 The first kind of read failure

Start with GBK, because this one took me a while to solve.

The txt file displays correctly on a Mac with GBK, and the server parses the novel's chapters correctly with GBK, yet on iOS it fails.

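The cause is probably that the file's encoding is somewhat unusual. When I hit this problem, I first tried several of the other encodings the system provides: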

if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_2312_80);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"GB 2312, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingHZ_GB_2312);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"HZ GB 2312, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"GB 18030, error = %@", error);
}

None of them worked, so I turned to Google, but the results were all nearly identical, copied from one another, and none of them helped.

I also tried searching in English, but found nothing that solved it either.

Since none of that helped, I had to work it out myself.

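I figured that since the system provides so many encodings, one of them might be able to decode the file. So I went to look at the header that defines them, CFStringEncodingExt.h, whose main content is as follows: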

typedef CF_ENUM(CFIndex, CFStringEncodings) {
/* kCFStringEncodingMacRoman = 0L, defined in CoreFoundation/CFString.h */
kCFStringEncodingMacJapanese = 1,
kCFStringEncodingMacChineseTrad = 2,
kCFStringEncodingMacKorean = 3,
kCFStringEncodingMacArabic = 4,
kCFStringEncodingMacHebrew = 5,
kCFStringEncodingMacGreek = 6,
kCFStringEncodingMacCyrillic = 7,
kCFStringEncodingMacDevanagari = 9,
kCFStringEncodingMacGurmukhi = 10,
kCFStringEncodingMacGujarati = 11,
kCFStringEncodingMacOriya = 12,
kCFStringEncodingMacBengali = 13,
kCFStringEncodingMacTamil = 14,
kCFStringEncodingMacTelugu = 15,
kCFStringEncodingMacKannada = 16,
kCFStringEncodingMacMalayalam = 17,
kCFStringEncodingMacSinhalese = 18,
kCFStringEncodingMacBurmese = 19,
kCFStringEncodingMacKhmer = 20,
kCFStringEncodingMacThai = 21,
kCFStringEncodingMacLaotian = 22,
kCFStringEncodingMacGeorgian = 23,
kCFStringEncodingMacArmenian = 24,
kCFStringEncodingMacChineseSimp = 25,
kCFStringEncodingMacTibetan = 26,
kCFStringEncodingMacMongolian = 27,
kCFStringEncodingMacEthiopic = 28,
kCFStringEncodingMacCentralEurRoman = 29,
kCFStringEncodingMacVietnamese = 30,
kCFStringEncodingMacExtArabic = 31,
/* The following use script code 0, smRoman */
kCFStringEncodingMacSymbol = 33,
kCFStringEncodingMacDingbats = 34,
kCFStringEncodingMacTurkish = 35,
kCFStringEncodingMacCroatian = 36,
kCFStringEncodingMacIcelandic = 37,
kCFStringEncodingMacRomanian = 38,
kCFStringEncodingMacCeltic = 39,
kCFStringEncodingMacGaelic = 40,
/* The following use script code 4, smArabic */
kCFStringEncodingMacFarsi = 0x8C, /* Like MacArabic but uses Farsi digits */
/* The following use script code 7, smCyrillic */
kCFStringEncodingMacUkrainian = 0x98,
/* The following use script code 32, smUnimplemented */
kCFStringEncodingMacInuit = 0xEC,
kCFStringEncodingMacVT100 = 0xFC, /* VT100/102 font from Comm Toolbox: Latin-1 repertoire + box drawing etc */
/* Special Mac OS encodings*/
kCFStringEncodingMacHFS = 0xFF, /* Meta-value, should never appear in a table */

/* Unicode & ISO UCS encodings begin at 0x100 */
/* We don't use Unicode variations defined in TextEncoding; use the ones in CFString.h, instead. */

/* ISO 8-bit and 7-bit encodings begin at 0x200 */
/* kCFStringEncodingISOLatin1 = 0x0201, defined in CoreFoundation/CFString.h */
kCFStringEncodingISOLatin2 = 0x0202, /* ISO 8859-2 */
kCFStringEncodingISOLatin3 = 0x0203, /* ISO 8859-3 */
kCFStringEncodingISOLatin4 = 0x0204, /* ISO 8859-4 */
kCFStringEncodingISOLatinCyrillic = 0x0205, /* ISO 8859-5 */
kCFStringEncodingISOLatinArabic = 0x0206, /* ISO 8859-6, =ASMO 708, =DOS CP 708 */
kCFStringEncodingISOLatinGreek = 0x0207, /* ISO 8859-7 */
kCFStringEncodingISOLatinHebrew = 0x0208, /* ISO 8859-8 */
kCFStringEncodingISOLatin5 = 0x0209, /* ISO 8859-9 */
kCFStringEncodingISOLatin6 = 0x020A, /* ISO 8859-10 */
kCFStringEncodingISOLatinThai = 0x020B, /* ISO 8859-11 */
kCFStringEncodingISOLatin7 = 0x020D, /* ISO 8859-13 */
kCFStringEncodingISOLatin8 = 0x020E, /* ISO 8859-14 */
kCFStringEncodingISOLatin9 = 0x020F, /* ISO 8859-15 */
kCFStringEncodingISOLatin10 = 0x0210, /* ISO 8859-16 */

/* MS-DOS & Windows encodings begin at 0x400 */
kCFStringEncodingDOSLatinUS = 0x0400, /* code page 437 */
kCFStringEncodingDOSGreek = 0x0405, /* code page 737 (formerly code page 437G) */
kCFStringEncodingDOSBalticRim = 0x0406, /* code page 775 */
kCFStringEncodingDOSLatin1 = 0x0410, /* code page 850, "Multilingual" */
kCFStringEncodingDOSGreek1 = 0x0411, /* code page 851 */
kCFStringEncodingDOSLatin2 = 0x0412, /* code page 852, Slavic */
kCFStringEncodingDOSCyrillic = 0x0413, /* code page 855, IBM Cyrillic */
kCFStringEncodingDOSTurkish = 0x0414, /* code page 857, IBM Turkish */
kCFStringEncodingDOSPortuguese = 0x0415, /* code page 860 */
kCFStringEncodingDOSIcelandic = 0x0416, /* code page 861 */
kCFStringEncodingDOSHebrew = 0x0417, /* code page 862 */
kCFStringEncodingDOSCanadianFrench = 0x0418, /* code page 863 */
kCFStringEncodingDOSArabic = 0x0419, /* code page 864 */
kCFStringEncodingDOSNordic = 0x041A, /* code page 865 */
kCFStringEncodingDOSRussian = 0x041B, /* code page 866 */
kCFStringEncodingDOSGreek2 = 0x041C, /* code page 869, IBM Modern Greek */
kCFStringEncodingDOSThai = 0x041D, /* code page 874, also for Windows */
kCFStringEncodingDOSJapanese = 0x0420, /* code page 932, also for Windows */
kCFStringEncodingDOSChineseSimplif = 0x0421, /* code page 936, also for Windows */
kCFStringEncodingDOSKorean = 0x0422, /* code page 949, also for Windows; Unified Hangul Code */
kCFStringEncodingDOSChineseTrad = 0x0423, /* code page 950, also for Windows */
/* kCFStringEncodingWindowsLatin1 = 0x0500, defined in CoreFoundation/CFString.h */
kCFStringEncodingWindowsLatin2 = 0x0501, /* code page 1250, Central Europe */
kCFStringEncodingWindowsCyrillic = 0x0502, /* code page 1251, Slavic Cyrillic */
kCFStringEncodingWindowsGreek = 0x0503, /* code page 1253 */
kCFStringEncodingWindowsLatin5 = 0x0504, /* code page 1254, Turkish */
kCFStringEncodingWindowsHebrew = 0x0505, /* code page 1255 */
kCFStringEncodingWindowsArabic = 0x0506, /* code page 1256 */
kCFStringEncodingWindowsBalticRim = 0x0507, /* code page 1257 */
kCFStringEncodingWindowsVietnamese = 0x0508, /* code page 1258 */
kCFStringEncodingWindowsKoreanJohab = 0x0510, /* code page 1361, for Windows NT */

/* Various national standards begin at 0x600 */
/* kCFStringEncodingASCII = 0x0600, defined in CoreFoundation/CFString.h */
kCFStringEncodingANSEL = 0x0601, /* ANSEL (ANSI Z39.47) */
kCFStringEncodingJIS_X0201_76 = 0x0620,
kCFStringEncodingJIS_X0208_83 = 0x0621,
kCFStringEncodingJIS_X0208_90 = 0x0622,
kCFStringEncodingJIS_X0212_90 = 0x0623,
kCFStringEncodingJIS_C6226_78 = 0x0624,
kCFStringEncodingShiftJIS_X0213 API_AVAILABLE(macos(10.5), ios(2.0), watchos(2.0), tvos(9.0)) = 0x0628, /* Shift-JIS format encoding of JIS X0213 planes 1 and 2*/
kCFStringEncodingShiftJIS_X0213_MenKuTen = 0x0629, /* JIS X0213 in plane-row-column notation */
kCFStringEncodingGB_2312_80 = 0x0630,
kCFStringEncodingGBK_95 = 0x0631, /* annex to GB 13000-93; for Windows 95 */
kCFStringEncodingGB_18030_2000 = 0x0632,
kCFStringEncodingKSC_5601_87 = 0x0640, /* same as KSC 5601-92 without Johab annex */
kCFStringEncodingKSC_5601_92_Johab = 0x0641, /* KSC 5601-92 Johab annex */
kCFStringEncodingCNS_11643_92_P1 = 0x0651, /* CNS 11643-1992 plane 1 */
kCFStringEncodingCNS_11643_92_P2 = 0x0652, /* CNS 11643-1992 plane 2 */
kCFStringEncodingCNS_11643_92_P3 = 0x0653, /* CNS 11643-1992 plane 3 (was plane 14 in 1986 version) */

/* ISO 2022 collections begin at 0x800 */
kCFStringEncodingISO_2022_JP = 0x0820,
kCFStringEncodingISO_2022_JP_2 = 0x0821,
kCFStringEncodingISO_2022_JP_1 = 0x0822, /* RFC 2237*/
kCFStringEncodingISO_2022_JP_3 = 0x0823, /* JIS X0213*/
kCFStringEncodingISO_2022_CN = 0x0830,
kCFStringEncodingISO_2022_CN_EXT = 0x0831,
kCFStringEncodingISO_2022_KR = 0x0840,

/* EUC collections begin at 0x900 */
kCFStringEncodingEUC_JP = 0x0920, /* ISO 646, 1-byte katakana, JIS 208, JIS 212 */
kCFStringEncodingEUC_CN = 0x0930, /* ISO 646, GB 2312-80 */
kCFStringEncodingEUC_TW = 0x0931, /* ISO 646, CNS 11643-1992 Planes 1-16 */
kCFStringEncodingEUC_KR = 0x0940, /* ISO 646, KS C 5601-1987 */

/* Misc standards begin at 0xA00 */
kCFStringEncodingShiftJIS = 0x0A01, /* plain Shift-JIS */
kCFStringEncodingKOI8_R = 0x0A02, /* Russian internet standard */
kCFStringEncodingBig5 = 0x0A03, /* Big-5 (has variants) */
kCFStringEncodingMacRomanLatin1 = 0x0A04, /* Mac OS Roman permuted to align with ISO Latin-1 */
kCFStringEncodingHZ_GB_2312 = 0x0A05, /* HZ (RFC 1842, for Chinese mail & news) */
kCFStringEncodingBig5_HKSCS_1999 = 0x0A06, /* Big-5 with Hong Kong special char set supplement*/
kCFStringEncodingVISCII = 0x0A07, /* RFC 1456, Vietnamese */
kCFStringEncodingKOI8_U = 0x0A08, /* RFC 2319, Ukrainian */
kCFStringEncodingBig5_E = 0x0A09, /* Taiwan Big-5E standard */

/* Other platform encodings*/
/* kCFStringEncodingNextStepLatin = 0x0B01, defined in CoreFoundation/CFString.h */
kCFStringEncodingNextStepJapanese = 0x0B02, /* NextStep Japanese encoding */

/* EBCDIC & IBM host encodings begin at 0xC00 */
kCFStringEncodingEBCDIC_US = 0x0C01, /* basic EBCDIC-US */
kCFStringEncodingEBCDIC_CP037 = 0x0C02, /* code page 037, extended EBCDIC (Latin-1 set) for US,Canada... */

kCFStringEncodingUTF7 API_AVAILABLE(macos(10.6), ios(4.0), watchos(2.0), tvos(9.0)) = 0x04000100, /* kTextEncodingUnicodeDefault + kUnicodeUTF7Format RFC2152 */
kCFStringEncodingUTF7_IMAP API_AVAILABLE(macos(10.6), ios(4.0), watchos(2.0), tvos(9.0)) = 0x0A10, /* UTF-7 (IMAP folder variant) RFC3501 */

/* Deprecated constants */
kCFStringEncodingShiftJIS_X0213_00 = 0x0628 /* Shift-JIS format encoding of JIS X0213 planes 1 and 2 (DEPRECATED) */
};

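I read through them one by one and tried every one that looked like it might work, such as kCFStringEncodingMacChineseSimp, but that still failed. Then I tried kCFStringEncodingDOSChineseSimplif (code page 936 according to the header comment), and to my delight it read the Chinese text correctly: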

if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSChineseSimplif);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"DOS Chinese Simplif, error = %@", error);
}

0x03 The second kind of read failure

GBK is not the only encoding that fails on some txt files; the same goes for UTF-8.

A txt-reading routine should try a fairly complete list of encodings: if one fails, try the next, and if that fails, try yet another.

I have run into UTF-16 LE files, so I simply added every encoding that is commonly used. Below are all the encodings I currently try (more may be added later):

NSError *error = nil;
NSString *content = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:&error];
NSLog(@"UTF-8, error = %@", error);
// Fall back to Chinese encodings when UTF-8 decoding fails.
if (!content) {
    error = nil;
    // 0x80000632 is the NSStringEncoding value for kCFStringEncodingGB_18030_2000.
    content = [NSString stringWithContentsOfURL:url encoding:0x80000632 error:&error];
    NSLog(@"GB 18030 (0x632), error = %@", error);
}
if (!content) {
    error = nil;
    // 0x80000631 is the NSStringEncoding value for kCFStringEncodingGBK_95.
    content = [NSString stringWithContentsOfURL:url encoding:0x80000631 error:&error];
    NSLog(@"GBK (0x631), error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_2312_80);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"GB 2312, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingHZ_GB_2312);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"HZ GB 2312, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingMacChineseSimp);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"Mac Chinese Simp, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSChineseSimplif);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"DOS Chinese Simplif, error = %@", error);
}
if (!content) {
    error = nil;
    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    content = [NSString stringWithContentsOfURL:url encoding:enc error:&error];
    NSLog(@"GB 18030, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF16StringEncoding error:&error];
    NSLog(@"UTF-16, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF16LittleEndianStringEncoding error:&error];
    NSLog(@"UTF-16-LE, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF16BigEndianStringEncoding error:&error];
    NSLog(@"UTF-16-BE, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF32StringEncoding error:&error];
    NSLog(@"UTF-32, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF32LittleEndianStringEncoding error:&error];
    NSLog(@"UTF-32-LE, error = %@", error);
}
if (!content) {
    error = nil;
    content = [NSString stringWithContentsOfURL:url encoding:NSUTF32BigEndianStringEncoding error:&error];
    NSLog(@"UTF-32-BE, error = %@", error);
}
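The same try-one-then-the-next idea can also be written as a loop over a list of candidate encodings. The following is only a sketch under my own assumptions: DecodeTextData is a hypothetical helper, it takes the file's bytes as NSData (so the file is read once rather than once per attempt), and the candidate order mirrors the chain above with the repeated GB 18030 attempt dropped.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: tries each candidate encoding in order and returns the
// first successful decode, or nil if none of them accepts the bytes.
// If usedEncoding is not NULL, it receives the encoding that succeeded.
static NSString *DecodeTextData(NSData *data, NSStringEncoding *usedEncoding) {
    NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000),
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGBK_95),
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_2312_80),
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingHZ_GB_2312),
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingMacChineseSimp),
        CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSChineseSimplif),
        NSUTF16StringEncoding,
        NSUTF16LittleEndianStringEncoding,
        NSUTF16BigEndianStringEncoding,
        NSUTF32StringEncoding,
        NSUTF32LittleEndianStringEncoding,
        NSUTF32BigEndianStringEncoding,
    };
    size_t count = sizeof(candidates) / sizeof(candidates[0]);
    for (size_t i = 0; i < count; i++) {
        // initWithData:encoding: returns nil when the bytes are invalid for the encoding.
        NSString *text = [[NSString alloc] initWithData:data encoding:candidates[i]];
        if (text != nil) {
            if (usedEncoding != NULL) {
                *usedEncoding = candidates[i];
            }
            return text;
        }
    }
    return nil;
}
```

A caller could load the bytes with [NSData dataWithContentsOfURL:url] and pass them in. Adding another encoding later then means adding one line to the array rather than another if block.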

With this, basically all txt files can be read correctly.
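For Unicode files such as the UTF-16 LE one mentioned above, a byte order mark (BOM) at the start of the file, when present, identifies the encoding directly, so it can be checked before falling back to trial and error. This is a sketch of my own, not something the project above does; EncodingFromBOM is a hypothetical helper, and files without a BOM still need the fallback chain.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the encoding indicated by a leading byte order
// mark, or 0 when the data does not start with a known BOM.
static NSStringEncoding EncodingFromBOM(NSData *data) {
    const unsigned char *b = data.bytes;
    NSUInteger n = data.length;
    // Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        return NSUTF32LittleEndianStringEncoding;
    }
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        return NSUTF32BigEndianStringEncoding;
    }
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        return NSUTF8StringEncoding;
    }
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        return NSUTF16LittleEndianStringEncoding;
    }
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        return NSUTF16BigEndianStringEncoding;
    }
    return 0;
}
```

One ambiguity remains: a UTF-16 LE file whose first character after the BOM is U+0000 looks like UTF-32 LE to this check. After decoding with an explicit little- or big-endian encoding, the caller may also want to strip a leading U+FEFF from the resulting string.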