Scraping the Election Results
I've been a bit quieter than I'd have liked this week, partly because I've been working, and more because I've been following the UK election. There are upcoming posts on some of the interesting technical features used, as well as my two cents on the present situation (pending the situation actually going anywhere). In the meantime, I've been trying to get hold of the results. Sure, I know it's a hung parliament, but there's a world of numbers which I can find, albeit very slowly, showing the exact number of votes cast.
In light of this, I spent some of yesterday evening dismantling the BBC's election website, and refamiliarising myself with the DOM API implementation in PHP. I accept that both Python and Perl would probably be better choices for this sort of task, but I've had very little experience with either.
So, after having a poke around, I managed to get a list of the BBC's constituency codes (off the "find your constituency" page, surprisingly :P). Armed with this, and some fun knowledge, I wrote the following script, which loads each page in turn, gets the data section, and stores the votes for each party in an associative array [1], with the party name as the key. This is then passed back to the main code as a nice little object.
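As a minimal sketch of that parsing step: the markup, the class names, the placeholder party names and the function name `parseVotes` below are all made up for illustration, and the real BBC pages are laid out differently. The idea is just to load the HTML with `DOMDocument` and collect vote counts into an associative array keyed by party.

```php
<?php
// Hypothetical fragment standing in for one constituency page;
// the real BBC markup is different.
$html = <<<HTML
<html><body>
<table id="full-detail">
  <tr><td class="party">Party A</td><td class="votes">1,200</td></tr>
  <tr><td class="party">Party B</td><td class="votes">950</td></tr>
</table>
</body></html>
HTML;

function parseVotes(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $votes = array();
    foreach ($xpath->query('//table[@id="full-detail"]//tr') as $row) {
        $party = $xpath->query('td[@class="party"]', $row)->item(0);
        $count = $xpath->query('td[@class="votes"]', $row)->item(0);
        if ($party !== null && $count !== null) {
            // Strip thousands separators so the value is a plain integer.
            $votes[trim($party->textContent)] =
                (int) str_replace(',', '', $count->textContent);
        }
    }
    return $votes;
}

$votes = parseVotes($html);
print_r($votes);
```

Using XPath here rather than chains of `childNodes->item(n)` is a design choice for the sketch, not what the script in this post does; it tends to survive small markup changes a little better.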
Once this is done for each constituency, the keys for each are mapped into another associative array, which forms the mapping between a party and a column in the output table. This is used to make each column contain only values for that party.
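To make the column mapping concrete, here is a small worked sketch with made-up data. The party names, vote counts and the function name `buildColumnMap` are placeholders for illustration only; it assumes, as described above, that the first two columns are reserved for the constituency name and region.

```php
<?php
// Made-up vote data for two constituencies (party names are placeholders).
$results = array(
    array('Party A' => 1200, 'Party B' => 950),
    array('Party B' => 400, 'Party C' => 2100),
);

// Give each party the next free column the first time it is seen.
// Columns 0 and 1 are reserved for the constituency name and region.
function buildColumnMap(array $results): array
{
    $map = array();
    foreach ($results as $votes) {
        foreach (array_keys($votes) as $party) {
            if (!isset($map[$party])) {
                $map[$party] = count($map) + 2;
            }
        }
    }
    return $map;
}

$map = buildColumnMap($results);
// $map is array('Party A' => 2, 'Party B' => 3, 'Party C' => 4)

// Lay the second constituency out against the shared columns.
$row = array_fill(0, count($map) + 2, '');
foreach ($results[1] as $party => $count) {
    $row[$map[$party]] = $count;
}
// $row is array('', '', '', 400, 2100)
```

Party A did not stand in the second constituency, so its column stays empty there, which is the whole point of the shared map.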
Finally, the table is generated and shown to the user, after about 30 minutes, because of the delays I put in so as not to appear to be attacking the BBC server (yeah, I know, they'd barely notice 650 requests, but, all the same...).
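For a rough idea of the pacing, 650 pages in about 30 minutes works out to a little under 3 seconds per page (1800 / 650 ≈ 2.8), fetch time included. A minimal sketch of a throttled loop might look like the following; the function name `fetchAllPolitely`, the placeholder URLs and the injected fetcher are assumptions for illustration, not the loop actually used here.

```php
<?php
// Fetch each URL in turn, pausing between requests.
// $fetch is injected so the loop can be tried without touching a real server.
function fetchAllPolitely(array $urls, callable $fetch, float $delaySeconds): array
{
    $pages = array();
    foreach (array_values($urls) as $i => $url) {
        if ($i > 0) {
            // usleep() takes microseconds.
            usleep((int) round($delaySeconds * 1000000));
        }
        $pages[$url] = $fetch($url);
    }
    return $pages;
}

// Placeholder URLs and a stub fetcher; swap in 'file_get_contents' for real use.
$urls = array('http://example.invalid/a', 'http://example.invalid/b');
$pages = fetchAllPolitely($urls, function ($url) {
    return 'page for ' . $url;
}, 0.01);
```

Passing the fetcher in as a callable keeps the delay logic separate from the network code, so the pacing can be checked with a stub before pointing it at a live site.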
I'll be posting the data up in a few minutes (trying to convince Blogger to actually upload it), so there's no need to try this at home :D
EDIT: I'm not going to be posting the result table, as it's just too damn huge. The data can be found here
$doc->loadHTML($data);
// Get the name of the region
$element = $doc->getElementById("crumb");
$return->name = $element->childNodes->item(1)->lastChild->previousSibling->textContent;
$return->zone = $element->childNodes->item(1)->lastChild->previousSibling->previousSibling->previousSibling->textContent;
$return->votes = array();
// Get the votes for each party
$element = $doc->getElementById("full-detail");
$candidates = $element->childNodes->item(3)->childNodes->item(1)->childNodes;
foreach ($candidates as $cand) {
$return->votes[$cand->childNodes->item(2)->textContent] = $cand->childNodes->item(4)->textContent;
}
return $return;
}
function mapColumns($data) {
$return = array();
foreach ($data as $datum) {
foreach (array_keys($datum->votes) as $party) {
if (!isset($return[$party])) {
$return[$party] = count($return) + 2; // Two offset for name + zone
}
}
}
return $return;
}
function makeRow($votedata, $columnmap) {
$return = "\r\n<tr>";
// Columns 0 and 1 hold the constituency name and region
$row = array($votedata->name, $votedata->zone);
foreach (array_keys($votedata->votes) as $party) {
$row[$columnmap[$party]] = $votedata->votes[$party];
}
// Emit one cell per column, leaving a blank cell where a party has no votes
for ($i = 0; $i <= count($columnmap) + 1; $i++) {
if (isset($row[$i])) {
$return .= "\r\n\t<td>" . $row[$i] . '</td>';
} else {
$return .= "\r\n\t<td>&nbsp;</td>";
}
}
return $return."\r\n</tr>";
}
?>
Results Table
Constituency Name
BBC Region
" . $party . '';
}
?>
[1] Basically, a key-value list, or an array with trainer wheels on.