Tagged: databases

Tracking Oregon newspaper endorsements

Another quick post for a quick project.

Putting together my ballot for the November election, I looked up a few Oregon newspapers to check on their endorsements on certain issues. I don’t put a whole lot of stock in these (the main value is a reality check on smaller ballot measures to make sure they’re not full of it), but having worked at a paper and knowing some of the smart, thoughtful people who write these endorsements, I always like to see who went for what.

Of course that ended up in a spreadsheet, which I saw as a great opportunity to practice using Twitter Bootstrap and Tabletop.js. Thus Oregon Newspaper Endorsements was born.

I know this has zero real-world value for anyone, but it was a good exercise for me to learn some new tools and keep up on jquery, using data I had lying around basically. It was a lot of fun, with the added bonus of having the election (and Oregon’s early vote by mail) give me a tangible deadline.

 

A bit of baseball geekery

A few days ago my editor came over to relay some baseball trivia one of our copy editors had found: What players have had 200 or more home runs with multiple franchises?

The question became an issue because wherever it came from got it wrong. The full list is Mark McGwire, Ken Griffey Jr., Manny Ramirez, Rafael Palmeiro and Jimmie Foxx. The originating agency decided Foxx didn’t count for some reason.

This moved quickly from 200 with two teams to ‘Hey? who has 100 with three teams? 50 with four teams?’

I am a proud owner of the book “Baseball Hacks.” It’s a how-to guide to just about anything you can do with baseball stats, basically starting with teaching basic MySQL and becoming more complicated from there. It’s where I got my first taste of not only database work but R (which I’m still trying to wrap my head around) and basic probability.

The idea is simple enough. Count each players home runs per team and see who hits certain milestones. Simple to people isn’t always simple to machines.

I had an English class in high school where the teacher told us to find the simplest, most mundane thing we could and write out the instructions on how to do it so an alien could understand it. The idea was to get us to show more in our writing instead of just telling. He didn’t throw the ball, he gripped the seams, cocked his arm … etc.

She took the instructions and acted them out just to drive home how inexact we all are. Someone wrote instructions on how to make a peanut butter sandwich and she buttered everything but the bread when the instructions read “spread peanut butter.”

I’m reminded of this when I try to right code. It’s the same basic idea. You’re trying to explain to something that has no concept of what you want how to get it for you.

I started simple (that word again …). I totaled up all home runs for each player. I know what that list looks like. Barry Bonds is on top, followed by Hank Aaron and then Babe Ruth. Easy enough:

select playerid, sum(hr) as hr
from batting
group by playerid
order by hr desc;

OK. So now to split it by teams:

select playerid, teamid, sum(hr) as hr
from batting
group by playerid, teamid
order by hr desc;

But here is what that gets me:

ruthba01    NYA    659
bondsba01    SFN    586
schmimi01    PHI    548

Where’s Aaron? Well, he played for a bunch of different teams, as the Braves moved around during his time there. So I need to search by franchises, a field not included in the general batting table in the baseball databank. For that I need to bring in the teams and teamsFranchises tables:

select b.playerid, f.franchid, sum(b.hr) as hr
from batting b join teams t join teamsfranchises f
where b.teamid = t.teamid and t.franchid = f.franchid
group by b.playerid, f.franchid
order by hr desc;

This is where peanut butter goes everywhere. SQLYog freezes up, my computer shivers and dies, but not before spitting out a result that shows Sammy Sose with 75,346 home runs with the Cubs.

I forgot that there is a separate team record for every year in the team table, so the join with batting multiplied every season in baseball by itself. OK, try again:

select b.playerid, f.franchid, sum(b.hr) as hr
from batting b join teams t join teamsfranchises f
where b.teamid = t.teamid and b.yearid = t.yearid and t.franchid = f.franchid
group by b.playerid, f.franchid
order by hr desc;
aaronha01    ATL    733
ruthba01    NYY    659
mayswi01    SFG    646

So while I have a list of home runs hit by franchises (also great for looking up team records of any sort, so that goes into my list of saved queries …) I can’t count on it, s0 I need to make it a subquery and select off of that. While I’m at it I add in a join on the master table to get player names, as the playerid field is a little cumbersome:

select player, nameFirst, nameLast, count(player) as times from (
select b.playerid as player, m.nameFirst as nameFirst, m.nameLast as nameLast, f.franchid, sum(b.hr) as hr from
batting b join teams t join teamsfranchises f join master m
where b.teamid = t.teamid and b.yearid = t.yearid and t.franchid = f.franchid and b.playerid = m.playerid
group by b.playerid, f.franchid) a
where hr > 200
group by player
order by times desc;

So there we are. It works. It may not be the best way to get it done (if you have any improvements let me know. I’d love to add a way to which teams they played for) but I was able to put it together fairly quickly and it got the job done.

Speaking of which, here are some the results from playing with the query for a while:

200 home runs with four teams

foxxji01    Jimmie    Foxx    2
palmera01    Rafael    Palmeiro    2
ramirma02    Manny    Ramirez    2
mcgwima01    Mark    McGwire    2
griffke02    Ken    Griffey    2

150 home runs with 3 teams

rodrial01    Alex    Rodriguez    3

100 home runs with three teams

evansda01    Darrell    Evans    3
jacksre01    Reggie    Jackson    3
rodrial01    Alex    Rodriguez    3

75 home runs with four teams

kingmda01    Dave    Kingman    4
mcgrifr01    Fred    McGriff    4