OCR then table extraction
We're more or less following the instructions from https://jonathansoma.com/everything/pdfs/selecting-your-tool/, except very much ignoring https://jonathansoma.com/everything/pdfs/selecting-your-tool/#ocr-with-tabular-data
Step One: OCR¶
Since the text isn't "real" text, just an image of text, we have to start by OCRing it.
Installation¶
We'll start by installing OCRmyPDF, which means we first need to install it with brew then install the Python library that connects to it. It's going to install a lot of stuff with homebrew, so maybe run it from the command line instead of here.
Find more details about it here
!brew install ocrmypdf
%pip install --quiet ocrmypdf
[notice] A new release of pip is available: 23.0.1 -> 24.1 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
Usage¶
This creates a new PDF called OCRhelp_GW-ocr.pdf
, which we'll use in step two.
import ocrmypdf
# The page already has some text on it, and the error msg suggests adding force_ocr
ocrmypdf.ocr('OCRhelp_GW.pdf', 'OCRhelp_GW-ocr.pdf', force_ocr=True)
Output()
Output()
Output()
Output()
Output()
Output()
Output()
The output file size is 2.12× larger than the input file. Possible reasons for this include: --force-ocr was issued, causing transcoding. PDF/A conversion was enabled. (Try `--output-type pdf`.)
<ExitCode.ok: 0>
Step Two: Table extraction¶
pdfplumber¶
Let's first try with JSVine's pdfplumber. It has a nice easy way to detect tables.
%pip install --quiet pdfplumber
import pdfplumber
pdf = pdfplumber.open("OCRhelp_GW-ocr.pdf")
page = pdf.pages[0]
page.extract_table()
Looks like it could't find any tables! Maybe we could tweak the ocrmypdf
settings, but I don't know how they work.
Instead we're going to use Camelot because I love it, even though it requires an old version of PyPDF. We need to make sure we install the dependencies first.
!brew install --quiet ghostscript tcl-tk
!pip install --quiet "camelot-py[cv]" "pypdf2<3"
[notice] A new release of pip is available: 23.0.1 -> 24.1 [notice] To update, run: pip install --upgrade pip
But then it will break but don't worry I made a fix
!mkdir -p ~/lib
!ln -s "$(brew --prefix gs)/lib/libgs.dylib" ~/lib
Okay now we can use it!
import camelot
tables = camelot.read_pdf('OCRhelp_GW-ocr.pdf')
tables[0].df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CATEGORIZATION OF BLOCKS/ MANDALS/ TALUKAS IN ... | ||||||||||||
1 | S.No. States / Union Total No. of Safe Semi-Cr... | Territories Assessed Nos. % Nos. % Nos. % Nos.... | |||||||||||
2 | |||||||||||||
3 | States | ||||||||||||
4 | 1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6 | ||||||||||||
5 | 2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0 | ||||||||||||
6 | 3 |Assam 28 28 100 0 0 0 0 0 0 0 0 | ||||||||||||
7 | 4 |Bihar 534 432 81 72 13 18 s 12 2 0 0 | ||||||||||||
8 | 5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0 | ||||||||||||
9 | 6 |Dethi 34 3 9 7 21 2 6 22 65 0 0 | ||||||||||||
10 | 7 |Goa 12 12 100 0 0 0 0 0 0 0 0 | ||||||||||||
11 | 8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5 | ||||||||||||
12 | 9 |Haryana 128 26 20 21 16 3 2 78 61 0 0 | ||||||||||||
13 | 10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0 | ||||||||||||
14 | 11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0 | ||||||||||||
15 | 12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0 | ||||||||||||
16 | 13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0 | ||||||||||||
17 | 14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0 | ||||||||||||
18 | 15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0 | ||||||||||||
19 | 16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0 | ||||||||||||
20 | 17 |Manipur 9 9 100 0 0 0 0 0 0 0 0 | ||||||||||||
21 | 18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0 | ||||||||||||
22 | 19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0 | ||||||||||||
23 | 20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0 | ||||||||||||
24 | 21 |Odisha 314 303 96 5 2 0 0 0 0 6 2 | ||||||||||||
25 | 22 |Punjab 138 22 16 5 4 2 1 109 79 0 0 | ||||||||||||
26 | 23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1 | ||||||||||||
27 | 24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0 | ||||||||||||
28 | 25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 ... | ||||||||||||
29 | 26 |Telangana 584 278 48 169 29 67 11 70 12 0 0 | ||||||||||||
30 | 27 |Tripura 59 59 100 0 0 0 0 0 0 0 0 | ||||||||||||
31 | 28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 1... | ||||||||||||
32 | 29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0 | ||||||||||||
33 | 30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0 | ||||||||||||
34 | Total States 6828 4265 62 968 14 312 5 1185 17... | ||||||||||||
35 | Union Territories | ||||||||||||
36 | 1 |Andaman & Nicobar 36 35 97 0 0 0 0 0 0 1 3 | ||||||||||||
37 | 2 |Chandigarh 1 0 0 1 100 0 0 0 0 0 0 | ||||||||||||
38 | 3 |Dadra & Nagar Have} 1 1 100 0 0 0 0 0 0 0 0 | ||||||||||||
39 | 4 |Daman & Diu 2 1 50 0 0 1 50 0 0 0 0 | ||||||||||||
40 | 5 |Lakshdweep 9 6 67 3 33 0 0 0 0 0 0 | ||||||||||||
41 | 6 |Puducherry 4 2 50 0 0 0 0 1 25 1 25 | ||||||||||||
42 | Total UTs 53 45 85 4 8 1 2 1 2 2 4 | ||||||||||||
43 | Grand Total 6881 4310 63 972 14 313 5 1186 17 ... | ||||||||||||
44 | |Note | ||||||||||||
45 | |Blocks- Bihar, Chattisgarh, Haryana, Jharkhan... | ||||||||||||
46 | |Taluks -Karnataka, Goa, Gujarat, Maharashtra | ||||||||||||
47 | Mandals - Andhra Pradesh, Telangana a So a\nDi... | ||||||||||||
48 | Region - Puduche\n|UT - Chandigarh, Dadar & Na... | ||||||||||||
49 | Tehsil-NCT Delhi | ||||||||||||
50 | “Uttar Pradesh: There are total 820 block and ... | ||||||||||||
51 | **The Ground Water resources assessment as on ... |
Okay, awful, horrible situation! The problem is that when we OCR'd, I think it just slammed all of the characters into the same text element, which makes them impossible (?) to separate into columns.
Doing it semi-manually¶
On the other hand, they're correctly in rows, so maybe we can just split them apart if we get the text out?
from pdfminer.high_level import extract_text
text = extract_text("OCRhelp_GW-ocr.pdf")
print(text)
Dynamic Groundwater Resources Assessment of India — 2017 S.No. States / Union Total No. of Safe Semi-Critical Critical Over-Exploited Saline Territories Assessed Nos. % Nos. % Nos. % Nos. % Nos. % CATEGORIZATION OF BLOCKS/ MANDALS/ TALUKAS IN INDIA (2017) States 1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6 2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0 3 |Assam 28 28 100 0 0 0 0 0 0 0 0 4 |Bihar 534 432 81 72 13 18 s 12 2 0 0 5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0 6 |Dethi 34 3 9 7 21 2 6 22 65 0 0 7 |Goa 12 12 100 0 0 0 0 0 0 0 0 8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5 9 |Haryana 128 26 20 21 16 3 2 78 61 0 0 10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0 11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0 12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0 13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0 14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0 15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0 16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0 17 |Manipur 9 9 100 0 0 0 0 0 0 0 0 18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0 19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0 20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0 21 |Odisha 314 303 96 5 2 0 0 0 0 6 2 22 |Punjab 138 22 16 5 4 2 1 109 79 0 0 23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1 24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0 25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 35 3 26 |Telangana 584 278 48 169 29 67 11 70 12 0 0 27 |Tripura 59 59 100 0 0 0 0 0 0 0 0 28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 11 0 0 29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0 30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0 Total States 6828 4265 62 968 14 312 5 1185 17 98 1 Union Territories 1 |Andaman & Nicobar 36 35 97 0 0 0 0 0 0 1 3 2 |Chandigarh 1 0 0 1 100 0 0 0 0 0 0 3 |Dadra & Nagar Have} 1 1 100 0 0 0 0 0 0 0 0 4 |Daman & Diu 2 1 50 0 0 1 50 0 0 0 0 5 |Lakshdweep 9 6 67 3 33 0 0 0 0 0 0 6 |Puducherry 4 2 50 0 0 0 0 1 25 1 25 Total UTs 53 45 85 4 8 1 2 1 2 2 4 Grand Total 6881 4310 63 972 14 313 5 1186 17 100 1 |Note |Blocks- Bihar, Chattisgarh, Haryana, Jharkhand, Kerala, M.P., Manipur, Mizoram, Orissa, Punjab, Rajasthan, Tripura, Uttar Pradesh, |Taluks -Karnataka, Goa, Gujarat, Maharashtra Mandals - Andhra Pradesh, Telangana a So a Districts/V alley- Arunachal Pradesh, Assam, Himachal Pradesh, Jammu & Kashmir, Meghalaya, Mizoram, Nagaland lIslands - Lakshdweep, Andaman & Nicobar Islands Firka-Tamil Nadu Region - Puduche |UT - Chandigarh, Dadar & Nagar Haveli, Daman & Diu Tehsil-NCT Delhi “Uttar Pradesh: There are total 820 block and 10 Cities **The Ground Water resources assessment as on 2013 has been considered for the state of West Bengal 113
# Let's just take the rows we want
rows = text.split("\n")[10:69]
rows
['1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6', '', '2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0', '', '3 |Assam 28 28 100 0 0 0 0 0 0 0 0', '', '4 |Bihar 534 432 81 72 13 18 s 12 2 0 0', '', '5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0', '', '6 |Dethi 34 3 9 7 21 2 6 22 65 0 0', '', '7 |Goa 12 12 100 0 0 0 0 0 0 0 0', '', '8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5', '', '9 |Haryana 128 26 20 21 16 3 2 78 61 0 0', '', '10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0', '', '11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0', '', '12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0', '', '13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0', '', '14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0', '', '15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0', '', '16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0', '', '17 |Manipur 9 9 100 0 0 0 0 0 0 0 0', '', '18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0', '', '19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0', '', '20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0', '', '21 |Odisha 314 303 96 5 2 0 0 0 0 6 2', '', '22 |Punjab 138 22 16 5 4 2 1 109 79 0 0', '', '23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1', '', '24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0', '', '25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 35 3', '', '26 |Telangana 584 278 48 169 29 67 11 70 12 0 0', '', '27 |Tripura 59 59 100 0 0 0 0 0 0 0 0', '', '28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 11 0 0', '', '29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0', '', '30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0']
# Remove empty rows
rows = [row for row in rows if row != ""]
rows
['1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6', '2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0', '3 |Assam 28 28 100 0 0 0 0 0 0 0 0', '4 |Bihar 534 432 81 72 13 18 s 12 2 0 0', '5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0', '6 |Dethi 34 3 9 7 21 2 6 22 65 0 0', '7 |Goa 12 12 100 0 0 0 0 0 0 0 0', '8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5', '9 |Haryana 128 26 20 21 16 3 2 78 61 0 0', '10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0', '11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0', '12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0', '13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0', '14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0', '15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0', '16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0', '17 |Manipur 9 9 100 0 0 0 0 0 0 0 0', '18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0', '19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0', '20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0', '21 |Odisha 314 303 96 5 2 0 0 0 0 6 2', '22 |Punjab 138 22 16 5 4 2 1 109 79 0 0', '23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1', '24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0', '25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 35 3', '26 |Telangana 584 278 48 169 29 67 11 70 12 0 0', '27 |Tripura 59 59 100 0 0 0 0 0 0 0 0', '28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 11 0 0', '29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0', '30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0']
import pandas as pd
df = pd.DataFrame({
'details': rows
})
df
details | |
---|---|
0 | 1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6 |
1 | 2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0 |
2 | 3 |Assam 28 28 100 0 0 0 0 0 0 0 0 |
3 | 4 |Bihar 534 432 81 72 13 18 s 12 2 0 0 |
4 | 5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0 |
5 | 6 |Dethi 34 3 9 7 21 2 6 22 65 0 0 |
6 | 7 |Goa 12 12 100 0 0 0 0 0 0 0 0 |
7 | 8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5 |
8 | 9 |Haryana 128 26 20 21 16 3 2 78 61 0 0 |
9 | 10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0 |
10 | 11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0 |
11 | 12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0 |
12 | 13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0 |
13 | 14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0 |
14 | 15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0 |
15 | 16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0 |
16 | 17 |Manipur 9 9 100 0 0 0 0 0 0 0 0 |
17 | 18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0 |
18 | 19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0 |
19 | 20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0 |
20 | 21 |Odisha 314 303 96 5 2 0 0 0 0 6 2 |
21 | 22 |Punjab 138 22 16 5 4 2 1 109 79 0 0 |
22 | 23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1 |
23 | 24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0 |
24 | 25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 ... |
25 | 26 |Telangana 584 278 48 169 29 67 11 70 12 0 0 |
26 | 27 |Tripura 59 59 100 0 0 0 0 0 0 0 0 |
27 | 28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 1... |
28 | 29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0 |
29 | 30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0 |
# "Find me something that starts with | and ends with a number, but make it as short as possible"
df['state'] = df.details.str.extract("\|(.*?)\d")
df
details | state | |
---|---|---|
0 | 1 |Andhra Pradesh 670 501 75 60 9 24 4 45 7 40 6 | Andhra Pradesh |
1 | 2 |Arunachal Pradesh 11 11 100 0 0 0 0 0 0 0 0 | Arunachal Pradesh |
2 | 3 |Assam 28 28 100 0 0 0 0 0 0 0 0 | Assam |
3 | 4 |Bihar 534 432 81 72 13 18 s 12 2 0 0 | Bihar |
4 | 5 |Chattisgarh 146 122 84 22 15 2 1 0 0 0 0 | Chattisgarh |
5 | 6 |Dethi 34 3 9 7 21 2 6 22 65 0 0 | Dethi |
6 | 7 |Goa 12 12 100 0 0 0 0 0 0 0 0 | Goa |
7 | 8 |Gujarat 248 194 78 11 4 5 2 25 10 13 5 | Gujarat |
8 | 9 |Haryana 128 26 20 21 16 3 2 78 61 0 0 | Haryana |
9 | 10 |Himachal Pradesh 8 3 38 1 13 0 0 4 50 0 0 | Himachal Pradesh |
10 | 11. |Jammu & Kashmir 22 22 100 0 0 0 0 0 0 0 0 | Jammu & Kashmir |
11 | 12 |Jharkhand 260 245 94 10 4 2 1 3 1 0 0 | Jharkhand |
12 | 13 |Karnataka 176 97 55 26 15 8 § 45 26 0 0 | Karnataka |
13 | 14 +|Kerala 152 119 78 30 20 2 1 1 1 0 0 | Kerala |
14 | 15 |Madhya Pradesh 313 240 77 44 14 7 2 22 7 0 0 | Madhya Pradesh |
15 | 16 |Maharashtra 353 271 77 61 17 9 3 11 3 1 0 | Maharashtra |
16 | 17 |Manipur 9 9 100 0 0 0 0 0 0 0 0 | Manipur |
17 | 18 |Meghalaya 11 11 100 0 0 0 0 0 0 0 0 | Meghalaya |
18 | 19 |Mizoram 26 26 100 0 0 0 0 0 0 0 0 | Mizoram |
19 | 20 |Nagaland 11 11 100 0 0 0 0 0 0 0 0 | Nagaland |
20 | 21 |Odisha 314 303 96 5 2 0 0 0 0 6 2 | Odisha |
21 | 22 |Punjab 138 22 16 5 4 2 1 109 79 0 0 | Punjab |
22 | 23 |Rajasthan 295 45 15 29 10 33 11 185 63 3 1 | Rajasthan |
23 | 24 |Sikkim 4 4 100 0 0 0 0 0 0 0 0 | Sikkim |
24 | 25 |Tamil Nadu 1166 427 37 163 14 79 7 462 40 ... | Tamil Nadu |
25 | 26 |Telangana 584 278 48 169 29 67 11 70 12 0 0 | Telangana |
26 | 27 |Tripura 59 59 100 0 0 0 0 0 0 0 0 | Tripura |
27 | 28 |Uttar Pradesh* 830 540 65 151 18 48 6 91 1... | Uttar Pradesh* |
28 | 29 |Uttarakhand 18 13 72 5 28 0 0 0 0 0 0 | Uttarakhand |
29 | 30 |West Bengal “ 268 191 71 76 28 1 0 0 0 0 0 | West Bengal “ |
# You could do this to join, but it looks a little complicated??
pd.DataFrame(df.details.str.split(" ").str[-11:].tolist())
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 670 | 501 | 75 | 60 | 9 | 24 | 4 | 45 | 7 | 40 | 6 |
1 | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 28 | 28 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 534 | 432 | 81 | 72 | 13 | 18 | s | 12 | 2 | 0 | 0 |
4 | 146 | 122 | 84 | 22 | 15 | 2 | 1 | 0 | 0 | 0 | 0 |
5 | 34 | 3 | 9 | 7 | 21 | 2 | 6 | 22 | 65 | 0 | 0 |
6 | 12 | 12 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 248 | 194 | 78 | 11 | 4 | 5 | 2 | 25 | 10 | 13 | 5 |
8 | 128 | 26 | 20 | 21 | 16 | 3 | 2 | 78 | 61 | 0 | 0 |
9 | 8 | 3 | 38 | 1 | 13 | 0 | 0 | 4 | 50 | 0 | 0 |
10 | 22 | 22 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 260 | 245 | 94 | 10 | 4 | 2 | 1 | 3 | 1 | 0 | 0 |
12 | 176 | 97 | 55 | 26 | 15 | 8 | § | 45 | 26 | 0 | 0 |
13 | 152 | 119 | 78 | 30 | 20 | 2 | 1 | 1 | 1 | 0 | 0 |
14 | 313 | 240 | 77 | 44 | 14 | 7 | 2 | 22 | 7 | 0 | 0 |
15 | 353 | 271 | 77 | 61 | 17 | 9 | 3 | 11 | 3 | 1 | 0 |
16 | 9 | 9 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 26 | 26 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 314 | 303 | 96 | 5 | 2 | 0 | 0 | 0 | 0 | 6 | 2 |
21 | 138 | 22 | 16 | 5 | 4 | 2 | 1 | 109 | 79 | 0 | 0 |
22 | 295 | 45 | 15 | 29 | 10 | 33 | 11 | 185 | 63 | 3 | 1 |
23 | 4 | 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
24 | 1166 | 427 | 37 | 163 | 14 | 79 | 7 | 462 | 40 | 35 | 3 |
25 | 584 | 278 | 48 | 169 | 29 | 67 | 11 | 70 | 12 | 0 | 0 |
26 | 59 | 59 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 830 | 540 | 65 | 151 | 18 | 48 | 6 | 91 | 11 | 0 | 0 |
28 | 18 | 13 | 72 | 5 | 28 | 0 | 0 | 0 | 0 | 0 | 0 |
29 | 268 | 191 | 71 | 76 | 28 | 1 | 0 | 0 | 0 | 0 | 0 |
# Here's a regex to get a space, then some stuff - \s+(.+) - repeated many many times
cleaned = df.details.str.extract("\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)\s+(.+)")
cleaned
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |Andhra Pradesh | 670 | 501 | 75 | 60 | 9 | 24 | 4 | 45 | 7 | 40 | 6 |
1 | |Arunachal Pradesh | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | |Assam | 28 | 28 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | |Bihar | 534 | 432 | 81 | 72 | 13 | 18 | s | 12 | 2 | 0 | 0 |
4 | |Chattisgarh | 146 | 122 | 84 | 22 | 15 | 2 | 1 | 0 | 0 | 0 | 0 |
5 | |Dethi | 34 | 3 | 9 | 7 | 21 | 2 | 6 | 22 | 65 | 0 | 0 |
6 | |Goa | 12 | 12 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | |Gujarat | 248 | 194 | 78 | 11 | 4 | 5 | 2 | 25 | 10 | 13 | 5 |
8 | |Haryana | 128 | 26 | 20 | 21 | 16 | 3 | 2 | 78 | 61 | 0 | 0 |
9 | |Himachal Pradesh | 8 | 3 | 38 | 1 | 13 | 0 | 0 | 4 | 50 | 0 | 0 |
10 | |Jammu & Kashmir | 22 | 22 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | |Jharkhand | 260 | 245 | 94 | 10 | 4 | 2 | 1 | 3 | 1 | 0 | 0 |
12 | |Karnataka | 176 | 97 | 55 | 26 | 15 | 8 | § | 45 | 26 | 0 | 0 |
13 | +|Kerala | 152 | 119 | 78 | 30 | 20 | 2 | 1 | 1 | 1 | 0 | 0 |
14 | |Madhya Pradesh | 313 | 240 | 77 | 44 | 14 | 7 | 2 | 22 | 7 | 0 | 0 |
15 | |Maharashtra | 353 | 271 | 77 | 61 | 17 | 9 | 3 | 11 | 3 | 1 | 0 |
16 | |Manipur | 9 | 9 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | |Meghalaya | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | |Mizoram | 26 | 26 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | |Nagaland | 11 | 11 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | |Odisha | 314 | 303 | 96 | 5 | 2 | 0 | 0 | 0 | 0 | 6 | 2 |
21 | |Punjab | 138 | 22 | 16 | 5 | 4 | 2 | 1 | 109 | 79 | 0 | 0 |
22 | |Rajasthan | 295 | 45 | 15 | 29 | 10 | 33 | 11 | 185 | 63 | 3 | 1 |
23 | |Sikkim | 4 | 4 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
24 | |Tamil Nadu | 1166 | 427 | 37 | 163 | 14 | 79 | 7 | 462 | 40 | 35 | 3 |
25 | |Telangana | 584 | 278 | 48 | 169 | 29 | 67 | 11 | 70 | 12 | 0 | 0 |
26 | |Tripura | 59 | 59 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | |Uttar Pradesh* | 830 | 540 | 65 | 151 | 18 | 48 | 6 | 91 | 11 | 0 | 0 |
28 | |Uttarakhand | 18 | 13 | 72 | 5 | 28 | 0 | 0 | 0 | 0 | 0 | 0 |
29 | |West Bengal “ | 268 | 191 | 71 | 76 | 28 | 1 | 0 | 0 | 0 | 0 | 0 |
Even if you clean up the state names: look at Bihar or Karnataka in column 7! Can you really trust anything in here???
In conclusion¶
Between setup, failure 1, more install/setup, failure 2, attempt 3 with many regular expressions, etc, we now maybe have something that works. While it would have taken maybe five minutes to type it out!
(and ten minutes to type it out again to see if we made any mistakes the first time).