Encoding problems with requests and BeautifulSoup
Sometimes when you use Python's requests library on pages in Chinese, Japanese, or Korean, your text comes back looking like garbage! More specifically, something like this:
Ö±ÊôÌØÉè»ú¹¹\xa0\n\xa0¹ú×Êί\n\xa0¹
úÎñÔºÖ±Êô»ú¹¹\xa0\n\xa0º£¹Ø×ÜÊð\xa0
\xa0¹úË°×ܾÖ\xa0\xa0Êг¡¼à¹Ü×ܾÖ
\n\xa0½ðÈÚ¼à¹Ü×ܾÖ\xa0\xa0Ö¤¼à»á
\xa0¹ãµç×ܾÖ\xa0\n\xa0ÌåÓý×ܾÖ\xa0
It can also happen in French, German, Spanish... any language with characters that aren't simple A-Z (it just isn't always as obvious in those cases).
The quick fix
The problem is probably an undefined encoding or character set. You can usually fix this by allowing requests to guess the character encoding.
import requests
from bs4 import BeautifulSoup

response = requests.get('http://district.ce.cn/zt/rwk/rw/sbj/index_3.shtml')
# This line fixes it: use the encoding guessed from the page's bytes
response.encoding = response.apparent_encoding
doc = BeautifulSoup(response.text)
doc.text[:300]
'\n\n\n\n\n\n\n\n省部级任免_中国经济网\n\n\n\n\n\n\n\n\n\n\n\n\n 广告载入中...\n\n\n\n\n\n\n\n\n人事盘点\n反腐・问责\n省级任免\n中直机构人物库\n部委人物库\n地方党政领导人物库\n人事动态\n\n\n\n\n\n首页\xa0>\xa0人物库\xa0>\xa0省部级任免\n\n\n·张佰成任锡林郭勒盟委书记 么永波不再担任[2023/07/12]\n·徐向国任黑龙江省政府党组成员(图|简历)[2023/07/12]\n·河北省委常委张成中任省政府党组副书记[2023/07/09]\n·陈新武任重庆市委秘书长、市委办公厅主任[2023/07/08]\n·张晓强任甘肃省委常委、兰州市委书记 朱天舒不再担任[2023/07/08]\n\n\n·重庆'
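If you'd rather not override the encoding on every request, one option is to apply the guess only when the server didn't declare a charset. This is a minimal sketch, not something requests provides: `fix_encoding` is a hypothetical helper name, and it assumes the object you pass in has `headers`, `encoding`, and `apparent_encoding` attributes the way a requests response does.

```python
def fix_encoding(response):
    """Fall back to the guessed encoding only if no charset was declared.

    Hypothetical helper: expects an object with .headers, .encoding and
    .apparent_encoding, such as the response returned by requests.get().
    """
    content_type = response.headers.get('Content-Type', '')
    if 'charset=' not in content_type.lower():
        # No declared charset, so trust the guess made from the raw bytes
        response.encoding = response.apparent_encoding
    return response
```

You would call it as `response = fix_encoding(requests.get(url))` before reading `response.text`.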
The longer explanation
This is a problem with encoding. When a file shows up, it's just a big long list of numbers. An encoding is the set of rules for converting those numbers into text. Here are some examples of how the same numbers are converted to characters in different encodings:
number | UTF-8 | Latin-1 | ISO 8859-7 | ISO 8859-14
---|---|---|---|---
69 | E | E | E | E
240 | ð | ð | π | ṫ
202 | Ê | Ê | Ċ |
Notice how they aren't always the same! Most encodings use the same numbers for normal boring un-accented Latin (English) characters, but they quickly diverge when it gets to accents or more complex character systems. If you don't know the right encoding for a page, some of the characters will probably be mixed up.
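You can check this kind of divergence yourself with Python's built-in codecs. The sketch below decodes a single byte value under two encodings; `decode_number` is just an illustrative helper name, and the two encodings were picked as examples.

```python
def decode_number(number, encoding):
    """Interpret a single byte value (0-255) as text in the given encoding."""
    return bytes([number]).decode(encoding)

for number in (69, 240):
    print(number, decode_number(number, 'latin-1'), decode_number(number, 'iso8859_7'))
# 69 E E
# 240 ð π
```

The plain letter comes out the same either way, while the higher number turns into a completely different character.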
For example, let's look at this page in Chinese.
It looks fine on the web, but when you use requests and BeautifulSoup, everything falls apart!
import requests
from bs4 import BeautifulSoup
url = 'http://district.ce.cn/zt/rwk/rw/sbj/index_3.shtml'
response = requests.get(url)
doc = BeautifulSoup(response.text)
doc.text
'\n\n\n\n\n\n\n\nÊ¡²¿¼¶ÈÎÃâ_Öйú¾\xad¼ÃÍø\n\n\n\n\n\n\n\n\n\n\n\n\n ¹ã¸æÔØÈëÖÐ...\n\n\n\n\n\n\n\n\nÈËÊÂÅ̵ã\n·´¸¯¡¤ÎÊÔð\nÊ¡¼¶ÈÎÃâ\nÖÐÖ±»ú¹¹ÈËÎï¿â\n²¿Î¯ÈËÎï¿â\nµØ·½µ³ÕþÁìµ¼ÈËÎï¿â\nÈËʶ¯Ì¬\n\n\n\n\n\nÊ×Ò³\xa0>\xa0ÈËÎï¿â\xa0>\xa0Ê¡²¿¼¶ÈÎÃâ\n\n\n·ÕÅ°Û³ÉÈÎÎýÁÖ¹ùÀÕÃËίÊé¼Ç ôÓÀ²¨²»ÔÙµ£ÈÎ[2023/07/12]\n·ÐìÏò¹úÈκÚÁú½\xadÊ¡Õþ¸®µ³×é³ÉÔ±(ͼ|¼òÀú)[2023/07/12]\n·ºÓ±±Ê¡Î¯³£Î¯ÕųÉÖÐÈÎÊ¡Õþ¸®µ³×鸱Êé¼Ç[2023/07/09]\n·³ÂÐÂÎäÈÎÖØÇìÊÐίÃØÊ鳤¡¢ÊÐί°ì¹«ÌüÖ÷ÈÎ[2023/07/08]\n·ÕÅÏþÇ¿ÈθÊËàʡί³£Î¯¡¢À¼ÖÝÊÐίÊé¼Ç ÖìÌìÊæ²»ÔÙµ£ÈÎ[2023/07/08]\n\n\n·ÖØÇìÊÐί³£Î¯ÂÞÝþÒÑÈÎÁ½½\xadÐÂÇøµ³¹¤Î¯Êé¼Ç[2023/07/07]\n·Öܺ£±øµ±Ñ¡³¤É³ÊÐÊг¤[2023/07/06]\n·³ÂÐÂÎäÈÎÖØÇìÊÐί³£Î¯ ´ËÇ°µ£Èκþ±±Ê¡Î¯³£Î¯[2023/07/06]\n·ÖìÌìÊæÈÎÄþÏÄ»Ø×å×ÔÖÎÇøµ³Î¯³£Î¯¡¢Õþ·¨Î¯Êé¼Ç[2023/07/05]\n·ÍõÖ¾¾üÈιúÎñÔº¸±ÃØÊ鳤 ´ËÇ°µ£ÈκÚÁú½\xadʡί¸±Êé¼Ç[2023/07/04]\n\n\n·µËÐÞÃ÷ÈÎ×î¸ßÈËÃñ·¨Ôºµ³×鸱Êé¼Ç ´ËÇ°µ£Èν\xadËÕʡί¸±Êé¼Ç[2023/07/04]\n·ÕųÉÖÐÈκӱ±Ê¡Î¯³£Î¯ ´ËÇ°µ£ÈÎÁÉÄþʡί³£Î¯[2023/07/03]\n·ÀîÕþÈÎÉϺ£ÊÐί³£Î¯¡¢ÊÐίÃØÊ鳤[2023/07/03]\n·Ê©Ð¡ÁÕÈÎËÄ´¨Ê¡Î¯¸±Êé¼Ç[2023/07/03]\n·ÍõÇïʵÈμªÁÖʡί³£Î¯¡¢×éÖ¯²¿²¿³¤ ´ËÇ°µ£ÈκÚÁú½\xadÊ¡¸±Ê¡³¤[2023/07/02]\n\n\n·ãÆÏ£¾üµ±Ñ¡º£ÄÏÊ¡×ܹ¤»áÖ÷ϯ[2023/07/02]\n·ÍõêÍÈÎס½¨²¿µ³×é³ÉÔ± ´ËÇ°µ£Èν\xadËÕÊ¡¸±Ê¡³¤(ͼ|¼òÀú)[2023/06/27]\n·ÍõÖ¾ÖÒÈι«°²²¿¸±²¿³¤ ´ËÇ°µ£Èι㶫ʡ¸±Ê¡³¤¡¢Ê¡¹«°²ÌüÌü³¤[2023/06/26]\n·ÂÀÓñÓ¡ÈΰÄÃÅÖÐÁª°ì¸±Ö÷ÈÎ ´ËÇ°µ£Èι㶫ʡ¸±Ê¡³¤¡¢Ö麣ÊÐίÊé¼Ç[2023/06/26]\n·ÎâÐãÕÂÈÎÈËÉ粿¸±²¿³¤ ´ËÇ°µ£ÈÎÄþÏÄ»Ø×å×ÔÖÎÇøÕþ¸®¸±Ö÷ϯ[2023/06/22]\n\n\n·¹ù·¼ÈÎÉú̬»·¾³²¿µ³×é³ÉÔ± ´ËÇ°µ£ÈÎÉϺ£ÊÐί³£Î¯¡¢¸±Êг¤[2023/06/20]\n·½¯Ì챦ÈÎÁÉÄþʡί³£Î¯¡¢×éÖ¯²¿²¿³¤ ´ËÇ°µ£Èιú¼ÒÏç´åÕñÐ˾ָ±¾Ö³¤[2023/06/19]\n·¹ùÓÀº½Èι㶫ʡί³£Î¯¡¢¹ãÖÝÊÐίÊé¼Ç 
ÁÖ¿ËÇì²»ÔÙµ£ÈÎ[2023/06/16]\n·Éòµ¤ÑôÈκ£ÄÏʡί¸±Êé¼Ç[2023/06/15]\n·Ò¶Å£Æ½µ±Ñ¡Î÷°²ÊÐÊг¤[2023/06/03]\n\n\n·Öܺ£±øÈγ¤É³ÊдúÊг¤[2023/06/03]\n·ÍõÎÀ¶«²»ÔÙµ£ÈÎÇຣʡ¸±Ê¡³¤Ö°Îñ[2023/06/03]\n·ÄÚÃɹÅ×ÔÖÎÇøÕþÐ\xad¸±Ö÷ϯÑî\x84¼æÈνÌÓýÌüÌü³¤[2023/06/02]\n·ÍõÁÖ»¢ÈÎÇຣʡ¸±Ê¡³¤[2023/06/02]\n··½¹â»ª´ÇÈ¥ÉÂÎ÷Ê¡ÈË´ó³£Î¯»á¸±Ö÷ÈÎÖ°Îñ[2023/06/01]\n\n\n\n\n\n\n\n\n\n²¿Î¯ÈËÎï¿â\n\nÖÐÖ±ÈËÎï¿â\n\nµØ·½ÈËÎï¿â\n\n\n\n\n\n\n\xa0¹úÎñÔº×é³É²¿ÃÅ\xa0\n\xa0Íâ½»²¿\xa0\xa0¹ú·À²¿\xa0\xa0·¢¸Äί\xa0\xa0½ÌÓý²¿\xa0\n\xa0¿Æ¼¼²¿\xa0\xa0¹¤ÐŲ¿\xa0\xa0¹ú¼ÒÃñί\xa0\xa0¹«°²²¿\n\xa0¹ú°²²¿\xa0\xa0ÃñÕþ²¿\xa0\xa0˾·¨²¿\xa0\xa0²ÆÕþ²¿\xa0\n\xa0ÈËÉ粿\xa0\xa0×ÔÈ»×ÊÔ´²¿\xa0\xa0Éú̬»·¾³²¿\xa0\n\xa0ס½¨²¿\xa0\xa0½»Í¨²¿\xa0\xa0Ë®Àû²¿\xa0\n\xa0ũҵũ´å²¿\xa0\xa0ÉÌÎñ²¿\xa0\xa0ÎÄ»¯ºÍÂÃÓβ¿\n\xa0ÎÀ½¡Î¯\xa0\xa0ÍËÒÛ¾üÈËÊÂÎñ²¿\xa0\n\xa0Ó¦¼±¹ÜÀí²¿\xa0\xa0ÑëÐÐ\xa0\xa0Éó¼ÆÊð\xa0\n\xa0¹úÎñÔºÖ±ÊôÌØÉè»ú¹¹\xa0\n\xa0¹ú×Êί\n\xa0¹úÎñÔºÖ±Êô»ú¹¹\xa0\n\xa0º£¹Ø×ÜÊð\xa0\xa0¹úË°×ܾÖ\xa0\xa0Êг¡¼à¹Ü×ܾÖ\xa0\n\xa0½ðÈÚ¼à¹Ü×ܾÖ\xa0\xa0Ö¤¼à»á\xa0\xa0¹ãµç×ܾÖ\xa0\n\xa0ÌåÓý×ܾÖ\xa0\xa0¹ú¼ÒÐŷþÖ\n\xa0¹ú¼Òͳ¼Æ¾Ö\xa0\xa0¹ú¼Ò֪ʶ²úȨ¾Ö\n\xa0¹ú¼Ê·¢Õ¹ºÏ×÷Êð\xa0\xa0¹ú¼ÒÒ½ÁƱ£ÕϾÖ\n\xa0¹úÎñÔº²ÎÊÂÊÒ\xa0\xa0»ú¹ØÊÂÎñ¹ÜÀí¾Ö\n\xa0¹úÎñÔº°ìÊ»ú¹¹\xa0\n\xa0¹úÎñÔºÑо¿ÊÒ\n\xa0¹úÎñÔºÖ±ÊôÊÂÒµµ¥Î»\xa0\n\xa0лªÉç\xa0\xa0Öйú¿ÆѧԺ\xa0\xa0ÖйúÉç¿ÆÔº\xa0\n\xa0Öйú¹¤³ÌÔº\xa0\xa0¹úÎñÔº·¢Õ¹Ñо¿ÖÐÐÄ\n\xa0ÖÐÑë¹ã²¥µçÊÓ×Ų̈\xa0\xa0ÖйúÆøÏó¾Ö\xa0\n\xa0²¿Î¯¹ÜÀí¾Ö\xa0\n\xa0Á¸Ê³ºÍÎï×Ê´¢±¸¾Ö\xa0\xa0¹ú¼ÒÄÜÔ´¾Ö\n\xa0¹ú¼ÒÊý¾Ý¾Ö\xa0\xa0¹ú¼Ò¹ú·À¿Æ¹¤¾Ö\n\xa0¹ú¼ÒÑ̲ÝרÂô¾Ö\xa0\xa0¹ú¼ÒÒÆÃñ¹ÜÀí¾Ö\n\xa0¹ú¼ÒÁÖÒµºÍ²ÝÔ\xad¾Ö\xa0\xa0¹ú¼ÒÌú·¾Ö\n\xa0ÖйúÃñÓú½¿Õ¾Ö\xa0\xa0¹ú¼ÒÓÊÕþ¾Ö\n\xa0¹ú¼ÒÎÄÎï¾Ö\xa0\xa0¹ú¼ÒÖÐÒ½Ò©¹ÜÀí¾Ö\n\xa0¼²²¡Ô¤·À¿ØÖƾÖ\xa0\xa0¹ú¼Ò¿óɽ°²¼à¾Ö\n\xa0¹ú¼ÒÏû·À¾ÈÔ®¾Ö\xa0\xa0¹ú¼ÒÍâ»ã¹ÜÀí¾Ö\n\xa0¹ú¼ÒÒ©Æ·¼à¶½¹ÜÀí¾Ö\n\n\n\n\n\n\n\n\n\xa0µ³ÖÐÑë¸÷²¿ÃÅ\xa0\n\xa0ÖÐÑë¼Íί¡¢¹ú¼Ò¼àί\xa0\n\xa0ÖÐÑë°ì¹«Ìü\xa0\n\xa0ÖÐÑë×éÖ¯²¿\xa0\n\xa0ÖÐÑëÐû´«²¿\xa0\n\xa0ÖÐÑëͳս²¿\xa0\n\xa0ÖÐÁª²¿\xa0\n\xa0ÖÐÑëÕþ·¨Î¯\xa0\n\xa0ÖÐÑëÕþ²ßÑо¿ÊÒ\xa0\n\xa0ÖÐÑë¹ú°²°ì\xa0\n\xa0ÖÐÑëÍøÐÅ°ì(¹ú¼ÒÍøÐÅ°ì)\xa0\n\xa0ÖÐÑë¾üÃñÈںϰì\xa0\n\xa0ÖÐÑę̈°ì(¹ų́°ì)\xa0\n\xa0ÖÐÑë²Æ°ì\xa0\n\xa0ÖÐÑëÍâ°ì\xa0\n\xa
0ÖÐÑë±à°ì\xa0\n\xa0ÖÐÑëºÍ¹ú¼Ò»ú¹Ø¹¤Î¯\xa0\n\xa0µ³ÖÐÑëÖ±ÊôÊÂÒµµ¥Î»\xa0\n\xa0ÖÐÑ뵳У(¹ú¼ÒÐÐÕþѧԺ)\xa0\n\xa0ÖÐÑ뵳ʷºÍÎÄÏ×Ñо¿Ôº\xa0\n\xa0ÈËÃñÈÕ±¨Éç\xa0\n\xa0ÇóÊÇÔÓÖ¾Éç\n\xa0¹âÃ÷ÈÕ±¨Éç\xa0\n\xa0ÖйúÆÖ¶«¸É²¿Ñ§Ôº\xa0\n\xa0Öйú¾®¸Ôɽ¸É²¿Ñ§Ôº\xa0\n\xa0ÖйúÑÓ°²¸É²¿Ñ§Ôº\xa0\n\xa0ÖÐÑëÉç»áÖ÷ÒåѧԺ\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
Why this happens
This happens because of a disagreement about which encoding is the right one. Usually web requests come back with a header that says what the right encoding is.
For example, if we download the NYT homepage, we can see that it uses an encoding called utf-8.
response = requests.get("https://nytimes.com")
response.headers['Content-Type']
'text/html; charset=utf-8'
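The website we are scraping, though, doesn't have a character encoding listed in its Content-Type header. When a text response has no charset in that header, requests falls back to ISO-8859-1 (Latin-1), which is where the garbled characters come from.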
response = requests.get("http://district.ce.cn/zt/rwk/rw/sbj/index_3.shtml")
response.headers['Content-Type']
'text/html'
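To pull just the charset out of a header value like the two above, the standard library's email parser can be borrowed. This is a sketch, not how requests does it internally, and `charset_from_content_type` is a hypothetical helper name.

```python
from email.message import Message

def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    message = Message()
    message['Content-Type'] = content_type
    return message.get_content_charset()

print(charset_from_content_type('text/html; charset=utf-8'))  # utf-8
print(charset_from_content_type('text/html'))                 # None
```

A `None` here is the signal that the server left the encoding up to whoever is reading the response.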
The browser can display things correctly because instead of relying only on the response header, it reads the encoding that's listed in the HTML itself.
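A rough version of that lookup can be sketched with a regular expression over the raw bytes. Everything here is illustrative: `charset_from_html` is a hypothetical helper, the sample HTML is made up rather than copied from the page, and real browsers follow a much more careful sniffing procedure.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def charset_from_html(raw_bytes):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = META_CHARSET.search(raw_bytes[:2048])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()

# Made-up sample, not the real page's markup
sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head></html>'
print(charset_from_html(sample))  # gb2312
```

With requests you would pass `response.content` (the undecoded bytes) to a helper like this, then set `response.encoding` to whatever it returns.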