Using Python & BeautifulSoup to scrape a Wikipedia table

Well, it was only a couple of weeks ago that I set myself a challenge to complete the Python course on Codecademy and I did it – I completed the Python track and it was fantastic! I was given the opportunity to put my newly found Python skills into action this week as I needed to scrape some data from a Wikipedia page – I have a table of addresses and need to compare the county given in the list with the one it really should be. This page on Wikipedia contains the data I need: for each Postcode District there’s a Postal County that I can use as a comparison – formatted in an HTML table like this:

[Screenshot: the postcode districts table on Wikipedia]

Normally, I’d just copy & paste the table into Excel for use later on BUT it’s not as easy as that (oh no!), as there can be multiple Postcode Districts within a row, which is slightly annoying! To be of any use to me, I need the data to be formatted so that there is a row for each Postcode District, like so (I don’t necessarily need the Postcode Area & Town but I’ll keep them anyway – I don’t like throwing away data!):

Postcode Area | Postcode District | Post Town | Former Postal County
AB            | AB10              | ABERDEEN  | Aberdeenshire
AB            | AB11              | ABERDEEN  | Aberdeenshire
AB            | AB13              | ABERDEEN  | Aberdeenshire
AB            | AB15              | ABERDEEN  | Aberdeenshire

And so I thought this would be the perfect project for me to undertake in Python and to familiarise myself with friend-of-the-screen-scrapers, BeautifulSoup. I won’t jabber on too much about BeautifulSoup as I’m not fully up to speed on it myself yet, but from reading around the subject I gather it’s a great way to grab elements from web pages for further processing.

Step One: Wikipedia doesn’t like you…

Wikipedia doesn’t like this code:

from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
print soup
#urllib2.HTTPError: HTTP Error 403: Forbidden

Wikipedia only allows access to recognised user agents in order to stop bots retrieving bulk content. I am not a bot – I just want to practise my Python – so to get around this you just need to add a User-Agent header to the request (thanks to Stack Overflow for coming to the rescue).
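
The fixed request is shown in the next step. For anyone following along on Python 3, where urllib2 has been folded into urllib.request, roughly the same trick would look something like this – a sketch, not the code from this post:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
# Supplying a browser-like User-Agent stops Wikipedia returning 403 Forbidden
req = Request(wiki, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
soup = BeautifulSoup(page, "html.parser")
print(soup.title)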

Step Two: Hunt the table

If you look at the code behind the Wikipedia article, you’ll see that there are multiple tables but only one (thankfully the one we want) uses the “wikitable sortable” class – this is great, as we can ask BeautifulSoup for the table with that class and know that we will only get the one we want.

from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""
table = soup.find("table", { "class" : "wikitable sortable" })
print table

Output looks like this:

[Screenshot: the HTML of the table printed to the console]

Great! This means that we just have the HTML table stored in our variable. Now, it’s just a case of iterating through the rows and columns…easy…*ahem*

Step Three: For your iteration pleasure

We need to do the iteration in two stages – the first stage is to iterate through each row (tr element) and assign each cell (td element) in that row to a variable. At this stage, we will grab everything in the Postcode Districts column and store it in a list for further iteration later. To do this, I used the following code:

for row in table.findAll("tr"):
	cells = row.findAll("td")
	#For each "tr", assign each "td" to a variable.
	if len(cells) == 4:
		area = cells[0].find(text=True)
		district = cells[1].findAll(text=True)
		town = cells[2].find(text=True)
		county = cells[3].find(text=True)

BeautifulSoup’s .findAll method returns a list, so row.findAll("td") gives us a list containing four elements, one for each of the columns in the table. This means they can be accessed via the cells[n].find(text=True) syntax. You’ll notice that I’ve used .findAll for the Postcode Districts column – this is because I want a list of the items within that cell for iteration purposes later!

After this code executes, I have a value for the area, a list of districts, a town and a county. Now for the second part of my iteration:

	#district can be a list of several strings, so we want to iterate through each string first...
	for x in range(len(district)):
		#For each list, split the string
		postcode_list = district[x].split(",")
		#For each item in the split list...
		for i in range(len(postcode_list)):
			#Check it's a postcode and not other text
			if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
				#Strip out the "\n" that seems to be at the start of some postcodes
				write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
				print write_to_file

I found that, instead of district being a list containing a single comma-separated string of postcodes, in some cases it was a list of several strings (oh joy!). I was expecting it to look like this:

[u'AB10, AB11, AB12, AB15, AB16, \nAB21, AB22, AB23, AB24, AB25, \nAB99, non-geo'] *

*Ignore the \n characters and the non-geo text – we’ll deal with them later!

I got this…

[u'AB10, AB11, AB12, AB15, AB16,', u'\nAB21, AB22, AB23, AB24, AB25,', u'\nAB99', u'non-geo']

And so I needed an additional layer of iteration: one over the strings in the list, and then another over the postcodes within each string. Simple.
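
(My guess is that this happens because the Wikipedia cell breaks its districts across lines with <br /> tags, so findAll(text=True) returns one string per text node rather than a single merged string. A tiny sketch with a made-up cell shows the behaviour:)

from bs4 import BeautifulSoup

# Hypothetical cell markup, assuming the districts are separated by <br /> tags
cell = BeautifulSoup('<td>AB10, AB11,<br />\nAB21, AB22</td>', 'html.parser')
print cell.findAll(text=True)
# [u'AB10, AB11,', u'\nAB21, AB22'] - one string per text node, not one merged string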

For each item in the list, the .split(",") string method allowed me to split the comma-separated postcodes into a list that could be iterated over. For each item in that list, we just check whether it’s a postcode (a check on string length sufficed nicely this time!) and then build up our output string. To deal with the \n that was prepended to some of the postcodes, I just left-stripped the string to remove the \n characters and hey presto, it worked!
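
As a standalone illustration of that splitting and filtering step, using the example list from above rather than the live page:

district = [u'AB10, AB11, AB12, AB15, AB16,',
            u'\nAB21, AB22, AB23, AB24, AB25,',
            u'\nAB99', u'non-geo']

for chunk in district:
	for item in chunk.split(","):
		#Same length check as above: long enough to be a district,
		#short enough to rule out empty strings and labels like "non-geo"
		if len(item) > 2 and len(item) <= 5:
			print item.strip()
#Prints AB10, AB11, AB12, AB15, AB16, AB21, AB22, AB23, AB24, AB25, AB99 (one per line)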

I flushed the output to a CSV file as well as to the screen and it worked beautifully!

Here is the full code:

from bs4 import BeautifulSoup
import urllib2

wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""

table = soup.find("table", { "class" : "wikitable sortable" })

f = open('output.csv', 'w')

for row in table.findAll("tr"):
	cells = row.findAll("td")
	#For each "tr", assign each "td" to a variable.
	if len(cells) == 4:
		area = cells[0].find(text=True)
		district = cells[1].findAll(text=True)
		town = cells[2].find(text=True)
		county = cells[3].find(text=True)

	#district can be a list of several strings, so we want to iterate through each string first...
	for x in range(len(district)):
		#For each list, split the string
		postcode_list = district[x].split(",")
		#For each item in the split list...
		for i in range(len(postcode_list)):
			#Check it's a postcode and not other text
			if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
				#Strip out the "\n" that seems to be at the start of some postcodes
				write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
				print write_to_file
				f.write(write_to_file)

f.close()
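
One thing worth noting: because the output line is built by plain string concatenation, a comma inside any field would break the CSV. If that ever became a problem, the standard-library csv module handles the quoting for you – a rough sketch (not the code above) with some sample rows:

import csv

# Sample rows in the same shape the scraper produces (hypothetical data)
rows = [("AB", "AB10", "ABERDEEN", "Aberdeenshire"),
        ("AB", "AB11", "ABERDEEN", "Aberdeenshire")]

f = open('output.csv', 'wb')  #binary mode for the csv module on Python 2
writer = csv.writer(f)
writer.writerow(["Postcode Area", "Postcode District", "Post Town", "Former Postal County"])
writer.writerows(rows)
f.close()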

Disclaimer(ish)

This code has no additional error checking or handling and was merely written to solve a small problem I had, and to put into practice everything I’d learned so far. It also only works for this particular table on the Wikipedia page – although it could be adapted for use on other tables. But it was great fun to put the learning into action and work on a real-life problem. Here’s to more exercises like this!


9 comments

  1. […] Using Python & BeautifulSoup to scrape a Wikipedia table. […]

  2. Thank you for the great Article. 🙂 I will be implementing a similar system with a slightly different twist. You’ve also inspired me to take the CodeAcademy lessons as well! Cheers!

  3. cool, was having some issues but this has helped set me in a better direction

  4. very useful article! thanks

  5. I tried it with this page and it stalls about half way through…
    http://en.wikipedia.org/wiki/List_of_refrigerants
    Any ideas?

    1. I found a solution. Apparently, bs4 crashes on an HTML error. I don’t know why, but using html.parser fixes it.
      soup = BeautifulSoup(text, 'html.parser')

  6. sparshithsampath: Great article, thanks!

  7. thank you very much
