Sometimes when extracting text from Japanese (or other) PDFs, the character encoding can't represent every character, and you end up with output like this:

落\n\n札\n\n落 札 者 等 の 公 示\n\n次のとおり落札者等について公示します。\n\n平成 31
年1月4日\n\n[掲載順序]\n\n①品目分類番号 ②調達件名及び数量 \ue7d3調達方法 \ue7d4契約方式
\ue7d5落札決定日(随意契約の場合\nは契約日) \ue7d6落札者(随意契約の場合は契約者)の氏名及び住所
\ue7d7落札価格(随意契約の場合\nは契約価格) \ue7d8入札公告日又は公示日 \ue7d9随意契約の場合はその理由
\ue7da指名業者名(指名競争\n入札の場合) \ue7bc落札方式 \ue7bd予定価格\n\n〇支出負担行為担当官 内閣府大臣官房会計担当参事官 横内
憲二 (東京都千代田区永田町1\ue61c\n\n6\ue61c1)\n\n◎調達機関番号 007

If you look at the PDF file, the text uses an encoding called UniJIS-UCS2-H. I hadn't heard of it, but according to Adobe it's related to "ISO 10646-1:1993, UCS-2 encoding." Maybe it has something to do with Shift-JIS, but decoding the characters never seems to work.

So let’s learn to fix it.

Converting PDFs to text

This is Kenji’s code! I just borrowed it.

from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def convert_pdf_to_txt(path): # takes the path to a PDF file
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True # True extracts the text much more cleanly
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after every page has been processed.
    # (Calling retstr.getvalue() inside the loop would return the whole
    # buffer each time and duplicate earlier pages.)
    fstr = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()

    return fstr

A function to replace ‘bad’ characters with the ‘good’ version

This function will replace bad characters with the good version. For example, the character \ue7d1 cannot be displayed correctly, but it's really ①! The bad character renders as a blank or placeholder glyph, but it's actually a completely different code point.

The bad characters \ue7d1 and \ue7d2 look exactly the same when displayed (as nothing, or as a placeholder box), but they are secretly different because they have different codes:

character code  good version
\ue7d1    59345 ①
\ue7d2    59346 ②
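You can see for yourself that two identical-looking (or identically invisible) characters can be different by comparing their code points with Python's built-in ord(). This is a quick sketch, not part of the original notebook:

```python
# Two characters can render the same way while being completely
# different code points under the hood.
bad = '\ue7d1'   # a Private Use Area character the PDF's font maps to ①
good = '①'       # the real CIRCLED DIGIT ONE

print(bad == good)       # False - they are different characters
print(hex(ord(bad)))     # 0xe7d1
print(hex(ord(good)))    # 0x2460
```

Because the two strings compare unequal, a plain text search for ① will never find the bad version - which is exactly why we need the replacement function.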

If we know the codes, we can fix this problem. In the function below, I replace two of the bad characters with good versions.

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

Let’s test it out.

bad_text = "\ue7d1Go to the store\n\ue7d2Buy bread and milk"
good_text = fix_characters(bad_text)
good_text
'①Go to the store\n②Buy bread and milk'

Okay, cool, works well!

Finding out the codes

But if every character looks the same, how do we know what code it is? It isn’t easy, but it’s possible!

First, we convert a PDF to text like we would normally. We'll also replace known bad characters with their good versions. Any characters left over that still won't display are probably bad ones we haven't mapped yet.

from io import StringIO

text = convert_pdf_to_txt("20190104c000010134.pdf")
text = fix_characters(text)

# Let's see the first 1000 characters
text[:1000]
'落\n\n札\n\n落 札 者 等 の 公 示\n\n次のとおり落札者等について公示します。\n\n平成 31 年1月4日\n\n[掲載順序]\n\n①品目分類番号 ②調達件名及び数量 \ue7d3調達方法 \ue7d4契約方式 \ue7d5落札決定日(随意契約の場合\nは契約日) \ue7d6落札者(随意契約の場合は契約者)の氏名及び住所 \ue7d7落札価格(随意契約の場合\nは契約価格) \ue7d8入札公告日又は公示日 \ue7d9随意契約の場合はその理由 \ue7da指名業者名(指名競争\n入札の場合) \ue7bc落札方式 \ue7bd予定価格\n\n〇支出負担行為担当官 内閣府大臣官房会計担当参事官 横内 憲二 (東京都千代田区永田町1\ue61c\n\n6\ue61c1)\n\n◎調達機関番号 007 ◎所在地番号 13\n\n①73 ②平成30年度国際広報キャンペーンテーマに係る広報の実施業務 広報テーマ:「我が国の\n平成30年度の戦略的メッセージの理解促進・浸透」(海外TV媒体・イベントを活用した発信事業)\n一式 \ue7d3購入等 \ue7d4随意 \ue7d530.11.30 \ue7d6㈱電通(東京都港区東新橋1\ue61c8\ue61c1) \ue7d792\uea75614\uea75320\n円 \ue7d9b「排他的権利の保護」\n①73 ②平成30年度国際広報キャンペーンテーマに係る広報の実施業務 広報テーマ:「我が国の\n平成30年度の戦略的メッセージの理解促進・浸透」(国内英字媒体・オウンドメディアを活用した\n発信事業)一式 \ue7d3購入等 \ue7d4随意 \ue7d530.11.30 \ue7d6㈱電通(東京都港区東新橋1\ue61c8\ue61c1) \ue7d7\n8\n\uea75\n223\uea75271円 \ue7d9b「排他的権利の保護」\n\n〇支出負担行為担当官 内閣府経済社会総合研究所次長 市川 正樹 (東京都千代田区永田町1\ue61c\n\n6\ue61c1)\n\n◎調達機関番号 007 ◎所在地番号 13\n\n①71、27 ②内閣府経済社会総合研究所システム運用管理業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d5\n30. 9. 7 \ue7d6富士通株式会社(東京都港区東新橋1\ue61c5\ue61c2) \ue7d7188\uea75179\uea75200円 \ue7d830. 6. 4 \ue7bc\n総合評価\n①72 ②民間企業投資・除却調査の実査業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d530. 9.19 \ue7d6株式会社\nサーベイリサーチセンター(東京都荒川区西日暮里2\ue61c40\ue61c10) \ue7d741\uea75526\uea75000円 \ue7d830. 7.24\n\ue7bc最低価格\n①72 ②企業行動に関するアンケート調査(平成30年度)業務 一式 \ue7d3購入等 \ue7d4一般 \ue7d5\n30. 9.28 \ue7d6株式会社サーベイリサーチセンター(東京都荒川区'

Now we make a dataframe of every single character on the page, and also its code.

import pandas as pd

# I want to see up to 500 rows
pd.set_option("display.max_rows", 500)

# Convert into a list of unique characters
# e.g. '田町1-6-1)' becomes ['田', '町', '1', '-', '6', ')']
# (duplicates are dropped, and the order isn't guaranteed)
uniques = list(set(text))

# Make a dataframe of every character on the page
# List the 'code point' - the number that represents the character
# For example, 私 is 31169 - https://unicodemap.org/details/0x79C1/index.html
# The code point can be in decimal (0-9) or hex (0-9 + a-f)
codepoints = pd.DataFrame({
    'character': uniques,
    'decimal_code': [ord(char) for char in uniques],
    'hex_code': [hex(ord(char)).replace("0x", "\\u") for char in uniques]
})
codepoints
character decimal_code hex_code
0 26412 \u672c
1 20250 \u4f1a
2 25104 \u6210
3 32207 \u7dcf
4 59349 \ue7d5
5 12483 \u30c3
6 20063 \u4e5f
7 12502 \u30d6
8 22524 \u57fc
9 39640 \u9ad8
10 38307 \u95a3
11 12469 \u30b5
12 12486 \u30c6
13 27178 \u6a2a
14 24220 \u5e9c
15 12465 \u30b1
16 30058 \u756a
17 21578 \u544a
18 12539 \u30fb
19 12414 \u307e
20 35299 \u89e3
21 65304 \uff18
22 38263 \u9577
23 20870 \u5186
24 12450 \u30a2
25 12377 \u3059
26 26666 \u682a
27 23398 \u5b66
28 23519 \u5bdf
29 31246 \u7a0e
30 12472 \u30b8
31 38534 \u9686
32 35373 \u8a2d
33 12459 \u30ab
34 29992 \u7528
35 33865 \u8449
36 59351 \ue7d7
37 12471 \u30b7
38 : 58 \u3a
39 25285 \u62c5
40 36896 \u9020
41 23470 \u5bae
42 36914 \u9032
43 35686 \u8b66
44 21496 \u53f8
45 12301 \u300d
46 38686 \u971e
47 12426 \u308a
48 12505 \u30d9
49 38500 \u9664
50 26408 \u6728
51 65289 \uff09
52 30740 \u7814
53 21475 \u53e3
54 20809 \u5149
55 32004 \u7d04
56 65288 \uff08
57 ) 41 \u29
58 25968 \u6570
59 23450 \u5b9a
60 12524 \u30ec
61 25105 \u6211
62 1 49 \u31
63 20108 \u4e8c
64 12849 \u3231
65 23627 \u5c4b
66 36617 \u8f09
67 12481 \u30c1
68 59354 \ue7da
69 30000 \u7530
70 26525 \u679d
71 12300 \u300c
72 12464 \u30b0
73 12 \uc
74 25010 \u61b2
75 20302 \u4f4e
76 31561 \u7b49
77 12383 \u305f
78 30343 \u7687
79 22865 \u5951
80 12367 \u304f
81 30465 \u7701
82 27770 \u6c7a
83 27425 \u6b21
84 32 \u20
85 27491 \u6b63
86 65297 \uff11
87 12460 \u30ac
88 24335 \u5f0f
89 33853 \u843d
90 12356 \u3044
91 36939 \u904b
92 24195 \u5e83
93 22996 \u59d4
94 30340 \u7684
95 12381 \u305d
96 12434 \u3092
97 65299 \uff13
98 38555 \u969b
99 33337 \u8239
100 31637 \u7b95
101 12451 \u30a3
102 59324 \ue7bc
103 21338 \u535a
104 20418 \u4fc2
105 12479 \u30bf
106 20837 \u5165
107 27663 \u6c0f
108 23186 \u5a92
109 23621 \u5c45
110 31532 \u7b2c
111 12295 \u3007
112 28168 \u6e08
113 12290 \u3002
114 28024 \u6d78
115 26041 \u65b9
116 21450 \u53ca
117 20385 \u4fa1
118 28858 \u70ba
119 34276 \u85e4
120 12388 \u3064
121 12456 \u30a8
122 12500 \u30d4
123 60021 \uea75
124 9313 \u2461
125 22830 \u592e
126 30003 \u7533
127 37327 \u91cf
128 12511 \u30df
129 . 46 \u2e
130 12463 \u30af
131 12521 \u30e9
132 12391 \u3067
133 23500 \u5bcc
134 29577 \u7389
135 23478 \u5bb6
136 21271 \u5317
137 26149 \u6625
138 27665 \u6c11
139 20419 \u4fc3
140 36948 \u9054
141 24494 \u5fae
142 38651 \u96fb
143 22580 \u5834
144 12364 \u304c
145 26376 \u6708
146 20445 \u4fdd
147 西 35199 \u897f
148 37504 \u9280
149 65319 \uff27
150 20225 \u4f01
151 20195 \u4ee3
152 20027 \u4e3b
153 12403 \u3073
154 30010 \u753a
155 58908 \ue61c
156 59347 \ue7d3
157 22806 \u5916
158 12363 \u304b
159 4 52 \u34
160 37428 \u9234
161 31185 \u79d1
162 20511 \u501f
163 24230 \u5ea6
164 21069 \u524d
165 21205 \u52d5
166 33590 \u8336
167 12400 \u3070
168 31649 \u7ba1
169 22320 \u5730
170 12467 \u30b3
171 21942 \u55b6
172 12390 \u3066
173 12487 \u30c7
174 20869 \u5185
175 40372 \u9db4
176 59325 \ue7bd
177 12289 \u3001
178 21697 \u54c1
179 20105 \u4e89
180 34892 \u884c
181 59533 \ue88d
182 24847 \u610f
183 12399 \u306f
184 59531 \ue88b
185 12515 \u30e3
186 26332 \u66dc
187 27177 \u6a29
188 25351 \u6307
189 30456 \u76f8
190 12492 \u30cc
191 37324 \u91cc
192 使 20351 \u4f7f
193 3 51 \u33
194 36092 \u8cfc
195 26575 \u67cf
196 59348 \ue7d4
197 12375 \u3057
198 20140 \u4eac
199 24231 \u5ea7
200 24179 \u5e73
201 20107 \u4e8b
202 32773 \u8005
203 9678 \u25ce
204 5 53 \u35
205 65332 \uff34
206 24193 \u5e81
207 20633 \u5099
208 d 100 \u64
209 38442 \u962a
210 33509 \u82e5
211 23665 \u5c71
212 20154 \u4eba
213 20214 \u4ef6
214 37117 \u90fd
215 25919 \u653f
216 23455 \u5b9f
217 12477 \u30bd
218 25522 \u63b2
219 26085 \u65e5
220 26365 \u66fd
221 19969 \u4e01
222 27861 \u6cd5
223 21364 \u5374
224 38291 \u9593
225 21033 \u5229
226 37096 \u90e8
227 25918 \u653e
228 35351 \u8a17
229 25152 \u6240
230 65334 \uff36
231 36074 \u8cea
232 65295 \uff0f
233 12501 \u30d5
234 20449 \u4fe1
235 \n 10 \ua
236 27193 \u6a39
237 19978 \u4e0a
238 12531 \u30f3
239 20303 \u4f4f
240 21512 \u5408
241 20307 \u4f53
242 27963 \u6d3b
243 12411 \u307b
244 12362 \u304a
245 12506 \u30da
246 22478 \u57ce
247 24066 \u5e02
248 12517 \u30e5
249 31038 \u793e
250 31034 \u793a
251 i 105 \u69
252 ( 40 \u28
253 36039 \u8cc7
254 65346 \uff42
255 22577 \u5831
256 38754 \u9762
257 12510 \u30de
258 33324 \u822c
259 0 48 \u30
260 24180 \u5e74
261 28207 \u6e2f
262 26413 \u672d
263 20986 \u51fa
264 28082 \u6db2
265 12454 \u30a6
266 20844 \u516c
267 23448 \u5b98
268 12513 \u30e1
269 34220 \u85ac
270 38306 \u95a2
271 26684 \u683c
272 9312 \u2460
273 36000 \u8ca0
274 36879 \u900f
275 21209 \u52d9
276 30001 \u7531
277 59352 \ue7d8
278 22269 \u56fd
279 26512 \u6790
280 25151 \u623f
281 30330 \u767a
282 21270 \u5316
283 19968 \u4e00
284 32076 \u7d4c
285 27211 \u6a4b
286 30906 \u78ba
287 35703 \u8b77
288 26619 \u67fb
289 31478 \u7af6
290 35413 \u8a55
291 39006 \u985e
292 27704 \u6c38
293 65301 \uff15
294 20104 \u4e88
295 12540 \u30fc
296 33521 \u82f1
297 31350 \u7a76
298 31070 \u795e
299 22411 \u578b
300 25490 \u6392
301 65306 \uff1a
302 27743 \u6c5f
303 7 55 \u37
304 2 50 \u32
305 12503 \u30d7
306 59350 \ue7d6
307 12499 \u30d3
308 38543 \u968f
309 22346 \u574a
310 29579 \u738b
311 25126 \u6226
312 35506 \u8ab2
313 20013 \u4e2d
314 12488 \u30c8
315 22823 \u5927
316 12398 \u306e
317 26368 \u6700
318 37329 \u91d1
319 65305 \uff19
320 59534 \ue88e
321 24207 \u5e8f
322 c 99 \u63
323 65286 \uff06
324 調 35519 \u8abf
325 12427 \u308b
326 21495 \u53f7
327 20182 \u4ed6
328 65302 \uff16
329 65339 \uff3b
330 29702 \u7406
331 26045 \u65bd
332 6 54 \u36
333 12475 \u30bb
334 12497 \u30d1
335 33618 \u8352
336 25163 \u624b
337 30053 \u7565
338 12370 \u3052
339 38918 \u9806
340 23383 \u5b57
341 12512 \u30e0
342 21570 \u5442
343 9 57 \u39
344 36865 \u9001
345 12490 \u30ca
346 65300 \uff14
347 65341 \uff3d
348 38997 \u9855
349 12525 \u30ed
350 35527 \u8ac7
351 12458 \u30aa
352 26481 \u6771
353 36890 \u901a
354 21517 \u540d
355 22763 \u58eb
356 26286 \u66ae
357 21746 \u54f2
358 20998 \u5206
359 12473 \u30b9
360 21448 \u53c8
361 21315 \u5343
362 12452 \u30a4
363 23616 \u5c40
364 59353 \ue7d9
365 24029 \u5ddd
366 27231 \u6a5f
367 30446 \u76ee
368 8 56 \u38
369 23721 \u5ca9
370 12395 \u306b
371 26989 \u696d
372 12489 \u30c9
373 25237 \u6295
374 26032 \u65b0
375 35336 \u8a08
376 25903 \u652f
377 12523 \u30eb
378 12461 \u30ad
379 12392 \u3068
380 65298 \uff12
381 27671 \u6c17
382 21442 \u53c2
383 30476 \u770c
384 24685 \u606d
385 22312 \u5728
386 21306 \u533a
387 12369 \u3051
388 28023 \u6d77
389 24403 \u5f53
390 33251 \u81e3
391 12522 \u30ea

You might recognize the hex_code column from our earlier replacement function:

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

When we do our replacing, we look for the hex code to replace with the good version.
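The hex_code column maps directly onto a Python \u escape sequence: write the four hex digits after \u inside a string literal and you have the key for the replacements dictionary. A quick sanity check (using one of the codes from the table above):

```python
# A hex code like \ue7d5 from the table becomes the string literal '\ue7d5'
bad = '\ue7d5'
print(ord(bad))        # 59349 - matches the decimal_code column
print(hex(ord(bad)))   # 0xe7d5 - matches the hex_code column
```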

Finding the bad ones

Okay, we have all of our characters and all of their codes, but how do we find the bad ones? I only see a couple in there!

I think only the codes above 50,000 are bad, so let's look at only those.

codepoints[codepoints.decimal_code > 50000]
character decimal_code hex_code
4 59349 \ue7d5
21 ８ 65304 \uff18
36 59351 \ue7d7
51 ） 65289 \uff09
56 （ 65288 \uff08
68 59354 \ue7da
86 １ 65297 \uff11
97 ３ 65299 \uff13
102 59324 \ue7bc
123 60021 \uea75
149 Ｇ 65319 \uff27
155 58908 \ue61c
156 59347 \ue7d3
176 59325 \ue7bd
181 59533 \ue88d
184 59531 \ue88b
196 59348 \ue7d4
205 Ｔ 65332 \uff34
230 Ｖ 65334 \uff36
232 ／ 65295 \uff0f
254 ｂ 65346 \uff42
277 59352 \ue7d8
293 ５ 65301 \uff15
301 ： 65306 \uff1a
306 59350 \ue7d6
319 ９ 65305 \uff19
320 59534 \ue88e
323 ＆ 65286 \uff06
328 ６ 65302 \uff16
329 ［ 65339 \uff3b
346 ４ 65300 \uff14
347 ］ 65341 \uff3d
364 59353 \ue7d9
380 ２ 65298 \uff12

Okay, so now we have a list of some codes that are bad. We just need to learn to convert them to ‘good’ characters.
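One caveat: not every code above 50,000 is actually bad. The fullwidth digits and punctuation in the list (\uff11, \uff1a, and friends) display just fine; the truly unmapped glyphs all live in Unicode's Private Use Area, U+E000 through U+F8FF. If you want a tighter filter than "above 50,000," you could test for that range directly - a sketch, not from the original notebook:

```python
import unicodedata

def is_private_use(char):
    # Private Use Area characters have the Unicode category 'Co'
    return unicodedata.category(char) == 'Co'

print(is_private_use('\ue7d9'))  # True - an unmapped glyph
print(is_private_use('９'))      # False - fullwidth nine is a real character
```

You could use this with the dataframe, too: codepoints[codepoints.character.apply(is_private_use)] would keep only the genuinely bad rows.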

Finding what the ‘good’ version of the character is

Okay, number 364 is bad - it's the second-to-last one on the list. It prints as an invisible placeholder, with decimal code 59353 and hex code \ue7d9.

Let's open our PDF and search for that character - we can't type it, so we'll cut and paste it from our extracted text.

When we search, the character does not display correctly in the search box, but the matching spot in the PDF - a ⑨ - is highlighted!

Now we will edit the fix_characters function to let it know that \ue7d9 is really ⑨.

def fix_characters(text):
    replacements = {
        '\ue7d1': '①',
        '\ue7d2': '②',
        '\ue7d9': '⑨'
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

We will repeat this for every missing character. It won’t be fun, but eventually we will have a function that can fix all of our bad characters!
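To make the repetition a little less painful, a small helper can list every Private Use Area character still left in the text after each round of fixes, so we always know which codes remain to be mapped. (This is a sketch - remaining_bad_characters is my name for it, not something from the original notebook.)

```python
def remaining_bad_characters(text):
    # Collect every distinct Private Use Area character still in the text
    return sorted({c for c in text if 0xE000 <= ord(c) <= 0xF8FF})

# Anything this prints still needs an entry in the replacements dictionary
sample = '①73 ②調達件名 \ue7d3購入等 \ue7d4一般'
print(remaining_bad_characters(sample))  # ['\ue7d3', '\ue7d4']
```

Run it on the full converted text after each edit to fix_characters, and stop once it returns an empty list.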