Sometimes when reading in text from Japanese (or other) PDFs, the character encoding can’t display every character.

落\n\n札\n\n落 札 者 等 の 公 示\n\n次のとおり落札者等について公示します。\n\n平成 31
年１月４日\n\n［掲載順序］\n\n①品目分類番号 ②調達件名及び数量 \ue7d3調達方法 \ue7d4契約方式
\ue7d5落札決定日（随意契約の場合\nは契約日） \ue7d6落札者（随意契約の場合は契約者）の氏名及び住所
\ue7d7落札価格（随意契約の場合\nは契約価格） \ue7d8入札公告日又は公示日 \ue7d9随意契約の場合はその理由
\ue7da指名業者名（指名競争\n入札の場合） \ue7bc落札方式 \ue7bd予定価格\n\n〇支出負担行為担当官 内閣府大臣官房会計担当参事官 横内
憲二 （東京都千代田区永田町１\ue61c\n\n６\ue61c１）\n\n◎調達機関番号 007

If you look at the PDF file, the text uses an encoding called UniJIS-UCS2-H. I hadn’t heard of it, but according to Adobe it is related to “ISO 10646-1:1993, UCS-2 encoding”. Maybe it has something to do with Shift-JIS, but decoding the characters never seems to work.

So let’s learn to fix it.

Converting PDFs to text

This is Kenji’s code! I just borrowed it.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path): # path to the PDF file
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True # setting this to True gives cleaner text extraction
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos = set()
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # retstr already holds the text of every page processed so far,
    # so read it once after the loop instead of appending it per page
    # (appending inside the loop would repeat the earlier pages)
    text = retstr.getvalue()

    device.close()
    retstr.close()

    return text

A function to replace ‘bad’ characters with the ‘good’ version

This function will replace bad characters with the good version. For example, the character with code \ue7d1 cannot be displayed correctly (it shows up as a blank or an empty box), but it’s really ①! The bad character \ue7d2 looks the same, but it’s actually ②.

The bad characters \ue7d1 and \ue7d2 look exactly the same on screen, but they are secretly different because they have different codes:

character	code	bad display
`①`	`\ue7d1`	``
`②`	`\ue7d2`	``
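A quick way to confirm that two look-alike characters really are different is to ask Python for their code points. This is a minimal sketch using the two codes from the table above; the `describe` helper is a name made up for this illustration.

```python
def describe(char):
    # Return the decimal code point and a \uXXXX-style escape for one character
    code = ord(char)
    return code, f"\\u{code:04x}"

# Both characters may render as the same blank or box, but their codes differ
print(describe('\ue7d1'))  # (59345, '\\ue7d1')
print(describe('\ue7d2'))  # (59346, '\\ue7d2')
```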

If we know the codes, we can fix this problem. In the function below, I replace two of the bad characters with good versions.

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

Let’s test it out.

bad_text = "\ue7d1Go to the store\n\ue7d2Buy bread and milk"
good_text = fix_characters(bad_text)
good_text

'①Go to the store\n②Buy bread and milk'

Okay, cool, works well!
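Once the replacement table gets long, `str.translate` can do the same job in a single pass over the text. This is only a sketch of an alternative, not the function used in the rest of this post, and `fix_characters_translate` is a hypothetical name.

```python
# Map each bad character to its good version (the same two entries as above)
REPLACEMENTS = str.maketrans({
    '\ue7d1': '①',
    '\ue7d2': '②',
})

def fix_characters_translate(text):
    # translate() swaps every mapped character in one pass over the string
    return text.translate(REPLACEMENTS)

print(fix_characters_translate('\ue7d1Go to the store\n\ue7d2Buy bread and milk'))
```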

Finding out the codes

But if every character looks the same, how do we know what code it is? It isn’t easy, but it’s possible!

First, we convert a PDF to text like we would normally. We’ll also replace known bad characters with their good versions. Maybe some of the characters left are bad, right?

from io import StringIO

text = convert_pdf_to_txt("20190104c000010134.pdf")
text = fix_characters(text)

# Let's see the first 1000 characters
text[:1000]

'落\n\n札\n\n落 札 者 等 の 公 示\n\n次のとおり落札者等について公示します。\n\n平成 31 年１月４日\n\n［掲載順序］\n\n①品目分類番号 ②調達件名及び数量 \ue7d3調達方法 \ue7d4契約方式 \ue7d5落札決定日（随意契約の場合\nは契約日） \ue7d6落札者（随意契約の場合は契約者）の氏名及び住所 \ue7d7落札価格（随意契約の場合\nは契約価格） \ue7d8入札公告日又は公示日 \ue7d9随意契約の場合はその理由 \ue7da指名業者名（指名競争\n入札の場合） \ue7bc落札方式 \ue7bd予定価格\n\n〇支出負担行為担当官 内閣府大臣官房会計担当参事官 横内 憲二 （東京都千代田区永田町１\ue61c\n\n６\ue61c１）\n\n◎調達機関番号 007 ◎所在地番号 13\n\n①73 ②平成30年度国際広報キャンペーンテーマに係る広報の実施業務 広報テーマ：「我が国の\n平成30年度の戦略的メッセージの理解促進・浸透」（海外ＴＶ媒体・イベントを活用した発信事業）\n一式 \ue7d3購入等 \ue7d4随意 \ue7d530.11.30 \ue7d6㈱電通（東京都港区東新橋１\ue61c８\ue61c１） \ue7d792\uea75614\uea75320\n円 \ue7d9ｂ「排他的権利の保護」\n①73 ②平成30年度国際広報キャンペーンテーマに係る広報の実施業務 広報テーマ：「我が国の\n平成30年度の戦略的メッセージの理解促進・浸透」（国内英字媒体・オウンドメディアを活用した\n発信事業）一式 \ue7d3購入等 \ue7d4随意 \ue7d530.11.30 \ue7d6㈱電通（東京都港区東新橋１\ue61c８\ue61c１） \ue7d7\n8\n\uea75\n223\uea75271円 \ue7d9ｂ「排他的権利の保護」\n\n〇支出負担行為担当官 内閣府経済社会総合研究所次長 市川 正樹 （東京都千代田区永田町１\ue61c\n\n６\ue61c１）\n\n◎調達機関番号 007 ◎所在地番号 13\n\n①71、27 ②内閣府経済社会総合研究所システム運用管理業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d5\n30. 9. 7 \ue7d6富士通株式会社（東京都港区東新橋１\ue61c５\ue61c２） \ue7d7188\uea75179\uea75200円 \ue7d830. 6. 4 \ue7bc\n総合評価\n①72 ②民間企業投資・除却調査の実査業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d530. 9.19 \ue7d6株式会社\nサーベイリサーチセンター（東京都荒川区西日暮里２\ue61c40\ue61c10） \ue7d741\uea75526\uea75000円 \ue7d830. 7.24\n\ue7bc最低価格\n①72 ②企業行動に関するアンケート調査（平成30年度）業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d5\n30. 9.28 \ue7d6株式会社サーベイリサーチセンター（東京都荒川区'

Now we make a dataframe of every single character on the page, and also its code.

import pandas as pd

# I want to see up to 500 rows
pd.set_option("display.max_rows", 500)

# Convert into a list of unique characters
# '田町1-6-1)' becomes something like ['田', '町', '1', '-', '6', ')'] (a set has no fixed order)
uniques = list(set(text))

# Make a dataframe of every character on the page
# List the 'code point' - the number that represents the character
# For example, 私 is 31169 - https://unicodemap.org/details/0x79C1/index.html
# The code point can be in decimal (0-9) or hex (0-9 + a-f)
codepoints = pd.DataFrame({
    'character': uniques,
    'decimal_code': [ord(char) for char in uniques],
    'hex_code': [hex(ord(char)).replace("0x", "\\u") for char in uniques]
})
codepoints

	character	decimal_code	hex_code
0	本	26412	\u672c
1	会	20250	\u4f1a
2	成	25104	\u6210
3	総	32207	\u7dcf
4		59349	\ue7d5
5	ッ	12483	\u30c3
6	也	20063	\u4e5f
7	ブ	12502	\u30d6
8	埼	22524	\u57fc
9	高	39640	\u9ad8
10	閣	38307	\u95a3
11	サ	12469	\u30b5
12	テ	12486	\u30c6
13	横	27178	\u6a2a
14	府	24220	\u5e9c
15	ケ	12465	\u30b1
16	番	30058	\u756a
17	告	21578	\u544a
18	・	12539	\u30fb
19	ま	12414	\u307e
20	解	35299	\u89e3
21	８	65304	\uff18
22	長	38263	\u9577
23	円	20870	\u5186
24	ア	12450	\u30a2
25	す	12377	\u3059
26	株	26666	\u682a
27	学	23398	\u5b66
28	察	23519	\u5bdf
29	税	31246	\u7a0e
30	ジ	12472	\u30b8
31	隆	38534	\u9686
32	設	35373	\u8a2d
33	カ	12459	\u30ab
34	用	29992	\u7528
35	葉	33865	\u8449
36		59351	\ue7d7
37	シ	12471	\u30b7
38	:	58	\u3a
39	担	25285	\u62c5
40	造	36896	\u9020
41	宮	23470	\u5bae
42	進	36914	\u9032
43	警	35686	\u8b66
44	司	21496	\u53f8
45	」	12301	\u300d
46	霞	38686	\u971e
47	り	12426	\u308a
48	ベ	12505	\u30d9
49	除	38500	\u9664
50	木	26408	\u6728
51	）	65289	\uff09
52	研	30740	\u7814
53	口	21475	\u53e3
54	光	20809	\u5149
55	約	32004	\u7d04
56	（	65288	\uff08
57	)	41	\u29
58	数	25968	\u6570
59	定	23450	\u5b9a
60	レ	12524	\u30ec
61	我	25105	\u6211
62	1	49	\u31
63	二	20108	\u4e8c
64	㈱	12849	\u3231
65	屋	23627	\u5c4b
66	載	36617	\u8f09
67	チ	12481	\u30c1
68		59354	\ue7da
69	田	30000	\u7530
70	枝	26525	\u679d
71	「	12300	\u300c
72	グ	12464	\u30b0
73		12	\uc
74	憲	25010	\u61b2
75	低	20302	\u4f4e
76	等	31561	\u7b49
77	た	12383	\u305f
78	皇	30343	\u7687
79	契	22865	\u5951
80	く	12367	\u304f
81	省	30465	\u7701
82	決	27770	\u6c7a
83	次	27425	\u6b21
84		32	\u20
85	正	27491	\u6b63
86	１	65297	\uff11
87	ガ	12460	\u30ac
88	式	24335	\u5f0f
89	落	33853	\u843d
90	い	12356	\u3044
91	運	36939	\u904b
92	広	24195	\u5e83
93	委	22996	\u59d4
94	的	30340	\u7684
95	そ	12381	\u305d
96	を	12434	\u3092
97	３	65299	\uff13
98	際	38555	\u969b
99	船	33337	\u8239
100	箕	31637	\u7b95
101	ィ	12451	\u30a3
102		59324	\ue7bc
103	博	21338	\u535a
104	係	20418	\u4fc2
105	タ	12479	\u30bf
106	入	20837	\u5165
107	氏	27663	\u6c0f
108	媒	23186	\u5a92
109	居	23621	\u5c45
110	第	31532	\u7b2c
111	〇	12295	\u3007
112	済	28168	\u6e08
113	。	12290	\u3002
114	浸	28024	\u6d78
115	方	26041	\u65b9
116	及	21450	\u53ca
117	価	20385	\u4fa1
118	為	28858	\u70ba
119	藤	34276	\u85e4
120	つ	12388	\u3064
121	エ	12456	\u30a8
122	ピ	12500	\u30d4
123		60021	\uea75
124	②	9313	\u2461
125	央	22830	\u592e
126	申	30003	\u7533
127	量	37327	\u91cf
128	ミ	12511	\u30df
129	.	46	\u2e
130	ク	12463	\u30af
131	ラ	12521	\u30e9
132	で	12391	\u3067
133	富	23500	\u5bcc
134	玉	29577	\u7389
135	家	23478	\u5bb6
136	北	21271	\u5317
137	春	26149	\u6625
138	民	27665	\u6c11
139	促	20419	\u4fc3
140	達	36948	\u9054
141	微	24494	\u5fae
142	電	38651	\u96fb
143	場	22580	\u5834
144	が	12364	\u304c
145	月	26376	\u6708
146	保	20445	\u4fdd
147	西	35199	\u897f
148	銀	37504	\u9280
149	Ｇ	65319	\uff27
150	企	20225	\u4f01
151	代	20195	\u4ee3
152	主	20027	\u4e3b
153	び	12403	\u3073
154	町	30010	\u753a
155		58908	\ue61c
156		59347	\ue7d3
157	外	22806	\u5916
158	か	12363	\u304b
159	4	52	\u34
160	鈴	37428	\u9234
161	科	31185	\u79d1
162	借	20511	\u501f
163	度	24230	\u5ea6
164	前	21069	\u524d
165	動	21205	\u52d5
166	茶	33590	\u8336
167	ば	12400	\u3070
168	管	31649	\u7ba1
169	地	22320	\u5730
170	コ	12467	\u30b3
171	営	21942	\u55b6
172	て	12390	\u3066
173	デ	12487	\u30c7
174	内	20869	\u5185
175	鶴	40372	\u9db4
176		59325	\ue7bd
177	、	12289	\u3001
178	品	21697	\u54c1
179	争	20105	\u4e89
180	行	34892	\u884c
181		59533	\ue88d
182	意	24847	\u610f
183	は	12399	\u306f
184		59531	\ue88b
185	ャ	12515	\u30e3
186	曜	26332	\u66dc
187	権	27177	\u6a29
188	指	25351	\u6307
189	相	30456	\u76f8
190	ヌ	12492	\u30cc
191	里	37324	\u91cc
192	使	20351	\u4f7f
193	3	51	\u33
194	購	36092	\u8cfc
195	柏	26575	\u67cf
196		59348	\ue7d4
197	し	12375	\u3057
198	京	20140	\u4eac
199	座	24231	\u5ea7
200	平	24179	\u5e73
201	事	20107	\u4e8b
202	者	32773	\u8005
203	◎	9678	\u25ce
204	5	53	\u35
205	Ｔ	65332	\uff34
206	庁	24193	\u5e81
207	備	20633	\u5099
208	d	100	\u64
209	阪	38442	\u962a
210	若	33509	\u82e5
211	山	23665	\u5c71
212	人	20154	\u4eba
213	件	20214	\u4ef6
214	都	37117	\u90fd
215	政	25919	\u653f
216	実	23455	\u5b9f
217	ソ	12477	\u30bd
218	掲	25522	\u63b2
219	日	26085	\u65e5
220	曽	26365	\u66fd
221	丁	19969	\u4e01
222	法	27861	\u6cd5
223	却	21364	\u5374
224	間	38291	\u9593
225	利	21033	\u5229
226	部	37096	\u90e8
227	放	25918	\u653e
228	託	35351	\u8a17
229	所	25152	\u6240
230	Ｖ	65334	\uff36
231	質	36074	\u8cea
232	／	65295	\uff0f
233	フ	12501	\u30d5
234	信	20449	\u4fe1
235	\n	10	\ua
236	樹	27193	\u6a39
237	上	19978	\u4e0a
238	ン	12531	\u30f3
239	住	20303	\u4f4f
240	合	21512	\u5408
241	体	20307	\u4f53
242	活	27963	\u6d3b
243	ほ	12411	\u307b
244	お	12362	\u304a
245	ペ	12506	\u30da
246	城	22478	\u57ce
247	市	24066	\u5e02
248	ュ	12517	\u30e5
249	社	31038	\u793e
250	示	31034	\u793a
251	i	105	\u69
252	(	40	\u28
253	資	36039	\u8cc7
254	ｂ	65346	\uff42
255	報	22577	\u5831
256	面	38754	\u9762
257	マ	12510	\u30de
258	般	33324	\u822c
259	0	48	\u30
260	年	24180	\u5e74
261	港	28207	\u6e2f
262	札	26413	\u672d
263	出	20986	\u51fa
264	液	28082	\u6db2
265	ウ	12454	\u30a6
266	公	20844	\u516c
267	官	23448	\u5b98
268	メ	12513	\u30e1
269	薬	34220	\u85ac
270	関	38306	\u95a2
271	格	26684	\u683c
272	①	9312	\u2460
273	負	36000	\u8ca0
274	透	36879	\u900f
275	務	21209	\u52d9
276	由	30001	\u7531
277		59352	\ue7d8
278	国	22269	\u56fd
279	析	26512	\u6790
280	房	25151	\u623f
281	発	30330	\u767a
282	化	21270	\u5316
283	一	19968	\u4e00
284	経	32076	\u7d4c
285	橋	27211	\u6a4b
286	確	30906	\u78ba
287	護	35703	\u8b77
288	査	26619	\u67fb
289	競	31478	\u7af6
290	評	35413	\u8a55
291	類	39006	\u985e
292	永	27704	\u6c38
293	５	65301	\uff15
294	予	20104	\u4e88
295	ー	12540	\u30fc
296	英	33521	\u82f1
297	究	31350	\u7a76
298	神	31070	\u795e
299	型	22411	\u578b
300	排	25490	\u6392
301	：	65306	\uff1a
302	江	27743	\u6c5f
303	7	55	\u37
304	2	50	\u32
305	プ	12503	\u30d7
306		59350	\ue7d6
307	ビ	12499	\u30d3
308	随	38543	\u968f
309	坊	22346	\u574a
310	王	29579	\u738b
311	戦	25126	\u6226
312	課	35506	\u8ab2
313	中	20013	\u4e2d
314	ト	12488	\u30c8
315	大	22823	\u5927
316	の	12398	\u306e
317	最	26368	\u6700
318	金	37329	\u91d1
319	９	65305	\uff19
320		59534	\ue88e
321	序	24207	\u5e8f
322	c	99	\u63
323	＆	65286	\uff06
324	調	35519	\u8abf
325	る	12427	\u308b
326	号	21495	\u53f7
327	他	20182	\u4ed6
328	６	65302	\uff16
329	［	65339	\uff3b
330	理	29702	\u7406
331	施	26045	\u65bd
332	6	54	\u36
333	セ	12475	\u30bb
334	パ	12497	\u30d1
335	荒	33618	\u8352
336	手	25163	\u624b
337	略	30053	\u7565
338	げ	12370	\u3052
339	順	38918	\u9806
340	字	23383	\u5b57
341	ム	12512	\u30e0
342	呂	21570	\u5442
343	9	57	\u39
344	送	36865	\u9001
345	ナ	12490	\u30ca
346	４	65300	\uff14
347	］	65341	\uff3d
348	顕	38997	\u9855
349	ロ	12525	\u30ed
350	談	35527	\u8ac7
351	オ	12458	\u30aa
352	東	26481	\u6771
353	通	36890	\u901a
354	名	21517	\u540d
355	士	22763	\u58eb
356	暮	26286	\u66ae
357	哲	21746	\u54f2
358	分	20998	\u5206
359	ス	12473	\u30b9
360	又	21448	\u53c8
361	千	21315	\u5343
362	イ	12452	\u30a4
363	局	23616	\u5c40
364		59353	\ue7d9
365	川	24029	\u5ddd
366	機	27231	\u6a5f
367	目	30446	\u76ee
368	8	56	\u38
369	岩	23721	\u5ca9
370	に	12395	\u306b
371	業	26989	\u696d
372	ド	12489	\u30c9
373	投	25237	\u6295
374	新	26032	\u65b0
375	計	35336	\u8a08
376	支	25903	\u652f
377	ル	12523	\u30eb
378	キ	12461	\u30ad
379	と	12392	\u3068
380	２	65298	\uff12
381	気	27671	\u6c17
382	参	21442	\u53c2
383	県	30476	\u770c
384	恭	24685	\u606d
385	在	22312	\u5728
386	区	21306	\u533a
387	け	12369	\u3051
388	海	28023	\u6d77
389	当	24403	\u5f53
390	臣	33251	\u81e3
391	リ	12522	\u30ea
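One thing to watch for: short codes in the table such as \u3a or \ua are not valid Python escapes, because `\u` must be followed by exactly four hex digits. If you want codes you can paste straight into a string, zero-padding is one option. This is a sketch with a made-up helper name, `to_escape`, and it only covers characters up to U+FFFF (anything higher needs the eight-digit `\U` form).

```python
def to_escape(char):
    # Pad to four hex digits so the result is a valid \uXXXX escape
    return f"\\u{ord(char):04x}"

print(to_escape(':'))       # \u003a
print(to_escape('\ue7d9'))  # \ue7d9
```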

You might recognize the hex_code column from our replacement function from earlier:

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

When we do our replacing, we look for the hex code and swap in the good version.

Finding the bad ones

Okay, we have all of our characters and all of their codes, but how do we find the bad ones? I only see a couple in there!

I think only the codes above 50,000 are bad, so let’s look at only those.

codepoints[codepoints.decimal_code > 50000]

	character	decimal_code	hex_code
4		59349	\ue7d5
21	８	65304	\uff18
36		59351	\ue7d7
51	）	65289	\uff09
56	（	65288	\uff08
68		59354	\ue7da
86	１	65297	\uff11
97	３	65299	\uff13
102		59324	\ue7bc
123		60021	\uea75
149	Ｇ	65319	\uff27
155		58908	\ue61c
156		59347	\ue7d3
176		59325	\ue7bd
181		59533	\ue88d
184		59531	\ue88b
196		59348	\ue7d4
205	Ｔ	65332	\uff34
230	Ｖ	65334	\uff36
232	／	65295	\uff0f
254	ｂ	65346	\uff42
277		59352	\ue7d8
293	５	65301	\uff15
301	：	65306	\uff1a
306		59350	\ue7d6
319	９	65305	\uff19
320		59534	\ue88e
323	＆	65286	\uff06
328	６	65302	\uff16
329	［	65339	\uff3b
346	４	65300	\uff14
347	］	65341	\uff3d
364		59353	\ue7d9
380	２	65298	\uff12

Okay, so now we have a list of some codes that are bad. We just need to learn to convert them to ‘good’ characters.
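The filter on codes above 50,000 also returns fullwidth characters such as ８ and （, which display fine. The codes that show up blank in this list all sit in Unicode’s Private Use Area (U+E000 to U+F8FF), so a narrower check is possible. Here is a sketch, assuming the bad characters in your PDF are also private-use characters; `find_private_use` is a hypothetical helper.

```python
import unicodedata

def find_private_use(text):
    # Unicode category 'Co' means "private use": no standard character is assigned
    found = {ch for ch in text if unicodedata.category(ch) == 'Co'}
    return sorted(found)

sample = '①品目 \ue7d3調達方法 \ue7d4契約方式 １\ue61c６'
print([f"\\u{ord(ch):04x}" for ch in find_private_use(sample)])
```

The same test should also work on the dataframe, for example `codepoints[codepoints.character.map(unicodedata.category) == 'Co']`.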

Finding what the ‘good’ version of the character is

Okay, number 364 is bad, the second-to-last one on the list. It shows up as a blank (or an empty box), with decimal code 59353 and hex code \ue7d9.

Let’s open our PDF and search for that character. We can’t type it, so we will copy and paste it.

When we search, it does not display correctly in the search box, but ⑨ is highlighted!
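If the search box is awkward to use, it can help to print a little of the text around each occurrence of the unknown character, so you know which part of the page to look at. This is a sketch; the `show_context` name and the `width` parameter are made up for this illustration.

```python
def show_context(text, char, width=8):
    # Collect a short snippet around every occurrence of char
    snippets = []
    start = text.find(char)
    while start != -1:
        left = max(0, start - width)
        snippets.append(text[left:start + width + 1])
        start = text.find(char, start + 1)
    return snippets

sample = '入札公告日又は公示日 \ue7d9随意契約の場合はその理由'
for snippet in show_context(sample, '\ue7d9'):
    # repr() keeps the unknown character visible as an escape like \ue7d9
    print(repr(snippet))
```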

Now we will edit the fix_characters function to let it know that \ue7d9 is ⑨.

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
        '\ue7d9': '⑨'
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

We will repeat this for every missing character. It won’t be fun, but eventually we will have a function that can fix all of our bad characters!
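The three mappings found so far (\ue7d1 to ①, \ue7d2 to ②, \ue7d9 to ⑨) are consistent with the circled numbers sitting at consecutive codes. If that pattern holds for this PDF, part of the table could be generated instead of typed. That is an assumption, not something confirmed here, so every generated entry should still be checked against the PDF. A sketch, where the range ① to ⑩ and the name `circled_number_replacements` are illustrative:

```python
def circled_number_replacements(first_bad='\ue7d1', first_good='①', count=10):
    # ASSUMPTION: the bad codes and the good characters both run consecutively
    return {
        chr(ord(first_bad) + i): chr(ord(first_good) + i)
        for i in range(count)
    }

replacements = circled_number_replacements()
print(replacements['\ue7d9'])  # ⑨, matching the character found by hand
```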