Quick web scraper in Python

This was a quick little weekend project that I hope will grow into a larger one soon.

I wrote a scraper that cycles through and downloads all the voter registration statistics on the Oregon Secretary of State’s website. These are the total numbers of voters registered with each party, broken down by county and by state legislative district.
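For anyone curious what the "cycle through and download" part can look like, here is a minimal sketch using only the standard library. It is not my exact script: the names `PdfLinkParser`, `find_pdf_links`, and `download_all` are made up for illustration, and it assumes the reports are linked as plain `<a href="...pdf">` tags from a single index page, which may not match how the real site is laid out.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Check only the path so query strings don't hide the extension.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            # Resolve relative links against the page they came from.
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return absolute URLs for all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


def download_all(index_url, dest_dir="pdfs"):
    """Fetch an index page and save every linked PDF into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(exist_ok=True)
    with urlopen(index_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in find_pdf_links(html, index_url):
        target = dest / Path(urlparse(url).path).name
        if target.exists():
            continue  # skip files already downloaded on an earlier run
        with urlopen(url) as resp:
            target.write_bytes(resp.read())
```

Calling `download_all()` with the address of the index page would drop the PDFs into a local `pdfs` folder. Skipping files that already exist makes it cheap to re-run when new reports are posted.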

The problem is twofold. The tables are all formatted slightly differently (the Oregon SOS at one point changed which parties they report, so the columns don’t match up), and they’re all safely locked away in PDFs, some of which are simply scans of printed reports.

That means the next step of this project is some OCR, followed by formatting and cleaning the text to fit in a database. I’m looking into ways to do that in Python as well, but we’ll see.
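One way the cleaning step might handle the mismatched party columns is to store one row per (report, area, party) instead of one column per party, so a report that adds or drops a party still fits the same table. This is only a sketch of that idea, assuming the OCR output has already been split into a header row and data rows. The function names, the table layout, and every number in the usage example are invented for illustration.

```python
import sqlite3


def to_long_rows(report_date, header, rows):
    """Turn one wide table into (date, area, party, voters) tuples.

    header is like ["County", "Democrat", ...]; each row has the area
    name first, followed by one count per party column.
    """
    parties = header[1:]
    out = []
    for row in rows:
        area, counts = row[0], row[1:]
        for party, raw in zip(parties, counts):
            # Strip thousands separators such as "1,200" before converting.
            voters = int(raw.replace(",", ""))
            out.append((report_date, area.strip(), party.strip(), voters))
    return out


def load(conn, long_rows):
    """Create the table if needed and insert the long-format rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registration "
        "(report_date TEXT, area TEXT, party TEXT, voters INTEGER)"
    )
    conn.executemany("INSERT INTO registration VALUES (?, ?, ?, ?)", long_rows)
    conn.commit()


# Usage with made-up figures: two reports whose party columns differ.
conn = sqlite3.connect(":memory:")
load(conn, to_long_rows("2001-01", ["County", "Democrat", "Republican"],
                        [["County A", "1,200", "900"]]))
load(conn, to_long_rows("2002-01", ["County", "Democrat", "Republican", "Other"],
                        [["County A", "1,250", "910", "40"]]))
```

With the data in this shape, a query like `SELECT report_date, voters FROM registration WHERE party = 'Democrat'` works across both reports even though their original columns did not line up.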
