A few days ago my editor came over to relay some baseball trivia one of our copy editors had found: What players have had 200 or more home runs with multiple franchises?
The question became an issue because wherever it came from got it wrong. The full list is Mark McGwire, Ken Griffey Jr., Manny Ramirez, Rafael Palmeiro and Jimmie Foxx. The originating agency decided Foxx didn’t count for some reason.
This moved quickly from 200 with two teams to ‘Hey? who has 100 with three teams? 50 with four teams?’
I am a proud owner of the book “Baseball Hacks.” It’s a how-to guide to just about anything you can do with baseball stats, basically starting with teaching basic MySQL and becoming more complicated from there. It’s where I got my first taste of not only database work but R (which I’m still trying to wrap my head around) and basic probability.
The idea is simple enough. Count each players home runs per team and see who hits certain milestones. Simple to people isn’t always simple to machines.
I had an English class in high school where the teacher told us to find the simplest, most mundane thing we could and write out the instructions on how to do it so an alien could understand it. The idea was to get us to show more in our writing instead of just telling. He didn’t throw the ball, he gripped the seams, cocked his arm … etc.
She took the instructions and acted them out just to drive home how inexact we all are. Someone wrote instructions on how to make a peanut butter sandwich and she buttered everything but the bread when the instructions read “spread peanut butter.”
I’m reminded of this when I try to right code. It’s the same basic idea. You’re trying to explain to something that has no concept of what you want how to get it for you.
I started simple (that word again …). I totaled up all home runs for each player. I know what that list looks like. Barry Bonds is on top, followed by Hank Aaron and then Babe Ruth. Easy enough:
select playerid, sum(hr) as hr from batting group by playerid order by hr desc;
OK. So now to split it by teams:
select playerid, teamid, sum(hr) as hr from batting group by playerid, teamid order by hr desc;
But here is what that gets me:
ruthba01 NYA 659 bondsba01 SFN 586 schmimi01 PHI 548
Where’s Aaron? Well, he played for a bunch of different teams, as the Braves moved around during his time there. So I need to search by franchises, a field not included in the general batting table in the baseball databank. For that I need to bring in the teams and teamsFranchises tables:
select b.playerid, f.franchid, sum(b.hr) as hr from batting b join teams t join teamsfranchises f where b.teamid = t.teamid and t.franchid = f.franchid group by b.playerid, f.franchid order by hr desc;
This is where peanut butter goes everywhere. SQLYog freezes up, my computer shivers and dies, but not before spitting out a result that shows Sammy Sose with 75,346 home runs with the Cubs.
I forgot that there is a separate team record for every year in the team table, so the join with batting multiplied every season in baseball by itself. OK, try again:
select b.playerid, f.franchid, sum(b.hr) as hr from batting b join teams t join teamsfranchises f where b.teamid = t.teamid and b.yearid = t.yearid and t.franchid = f.franchid group by b.playerid, f.franchid order by hr desc;
aaronha01 ATL 733 ruthba01 NYY 659 mayswi01 SFG 646
So while I have a list of home runs hit by franchises (also great for looking up team records of any sort, so that goes into my list of saved queries …) I can’t count on it, s0 I need to make it a subquery and select off of that. While I’m at it I add in a join on the master table to get player names, as the playerid field is a little cumbersome:
select player, nameFirst, nameLast, count(player) as times from ( select b.playerid as player, m.nameFirst as nameFirst, m.nameLast as nameLast, f.franchid, sum(b.hr) as hr from batting b join teams t join teamsfranchises f join master m where b.teamid = t.teamid and b.yearid = t.yearid and t.franchid = f.franchid and b.playerid = m.playerid group by b.playerid, f.franchid) a where hr > 200 group by player order by times desc;
So there we are. It works. It may not be the best way to get it done (if you have any improvements let me know. I’d love to add a way to which teams they played for) but I was able to put it together fairly quickly and it got the job done.
Speaking of which, here are some the results from playing with the query for a while:
200 home runs with four teams
foxxji01 Jimmie Foxx 2 palmera01 Rafael Palmeiro 2 ramirma02 Manny Ramirez 2 mcgwima01 Mark McGwire 2 griffke02 Ken Griffey 2
150 home runs with 3 teams
rodrial01 Alex Rodriguez 3
100 home runs with three teams
evansda01 Darrell Evans 3 jacksre01 Reggie Jackson 3 rodrial01 Alex Rodriguez 3
75 home runs with four teams
kingmda01 Dave Kingman 4 mcgrifr01 Fred McGriff 4