Archive for How-To

Building a Retrosheet Database for the 2016 Season, Part 2

Hi, folks. Below you will find Part 2 of our video series involving building a Retrosheet database. If you haven’t, make sure to check out Part 1 before digging into this.

*The video explains this, but you’ll need to re-download the files from our GitHub page. I found a couple of small errors there. No harm done, just make sure you have the updated files before continuing. The video explains how to do this.

Enjoy! Let me know in the comments or on Twitter if you have any questions or problems.


Building a Retrosheet Database for the 2016 Season, Part 1

Baseball season is almost upon us. Soon, people will flood to ballparks in cities all over our great nation in search of entertainment and meaning, while baseball bloggers will continue their search for relevance and the mysterious Full Time Gig. If you fall into the latter camp (or if you just like having this kind of data handy), then it’s time to get your Retrosheet database installed/updated.

For those not in the know, Retrosheet is a magnificent project that essentially looks to turn box scores into computer records. And they’ve done a great job of it. They have all box scores from games since 1914, and play-by-play data since around 1940. What we’ll want to do is convert their records into an easily-searchable database that we can query for fun and profit.

Below is a video walking you through how to get your machine set up. We won’t actually be loading the data yet — that will come in Part 2 — but we’ll make sure your computer is prepped and has all the files and utilities is needs.

If you already installed a Retrosheet database using our instructions from last year, most of this won’t apply to you, but feel free to follow along. You’ll certainly need the links to the new packages that are now up on our GitHub page, but most of what you’ll need is in Part 2.

(Mac people: as I mentioned in the video, your instructions are coming)

Links mentioned in the video:

TechGraphs GitHub: https://github.com/techgraphs/2016Ret…

MySQL Server: https://dev.mysql.com/downloads/mysql/

Wget: http://gnuwin32.sourceforge.net/packa…

7-Zip: http://www.7-zip.org/

SQLyog: https://github.com/webyog/sqlyog-comm…


How To Use R For Sports Stats: Visualizing Projections

If you’re reading TechGraphs right now, there’s a good chance you’re prepping for fantasy baseball, and if you’re doing that, there’s a good chance you’re making use of projection systems like Steamer or ZiPS. In this post, we’ll explore some basic tools that might help you look at these projections in a new way — and brush up on those R skills that you probably haven’t touched since last fall.

(From a skills perspective, this post will assume that you’ve previously read through the “How To Use R For Sports Stats” series. Even if you haven’t, the insights below will hopefully still be worth your while. I’d also be terribly remiss if I didn’t point you towards Bill Petti’s recent THT post unveiling his baseballr R package.)

We’ll use Steamer projections for this post, though the methods we’ll look at can be used with ZiPS, FG Depth Charts, or, for that matter, actual by-season data. Download Steamer’s 2016 batting projections from FanGraphs, rename the file to “steamer16.csv”, and load it up in R. We’ll remove players projected for fewer than 100 AB to clean up the data a bit:

steamer = read.csv("steamer16.csv")
steamer = subset(steamer, PA > 100)

Visualizing Tiers

As fantasy baseball managers, we all have an innate ability to estimate a player’s value from their stats, judging how good a 30/10/.285 player is vs. a 15/15/.280. We get pretty good at this if we want to do well in our leagues — but we can still develop blind spots in our assessments, or hold on to an outdated idea of quality as MLB trends change. (For example, the average AVG in MLB has dropped from the high .260s 10 years ago to the low .250s today; if you’re still thinking a .255 hitter is below average, you might want to reconsider.)

The point, then, is that if you’re getting a sense of how good a player will be by looking at their projections, it can be helpful to step back and recalibrate your thinking from time to time by looking at the broader trends in an image or two.

Let’s look at Steamer’s projections for stolen bases, for example. We’ll draw on what we learned back in part 2 to make a quick-and-hasty histogram counting the number of MLB players who are projected for different SB totals:

hist(steamer$SB, breaks = 30)

Basic histograph of SB projections

Most of these players are projected for fewer than 10 SB. This is sort of interesting, but their huge counts are keeping us from seeing the trends on the right side. Let’s zoom in a bit:

hist(subset(steamer, SB > 10)$SB, breaks=30)

Histograph of SB projections for > 10 SB

Even among this crowd of speedsters, it’s uncommon to see someone projected for more than 20 SB, and incredibly rare to have more than 30.

You probably didn’t need to be reminded that the two folks on the far right (spoiler alert: Billy Hamilton and Dee Gordon) would stand out, though it’s useful to see just how distant they are from everyone else. But if you were thinking that players like Jarrod Dyson (35 projected SB) or Billy Burns (32) are solid, but not elite, on the basepaths, it may be time to reassess. (Did I mention that SB totals in MLB dropped 25% between 2011 and 2015?)

If you’re the kind of person who prefers boxplots instead, R’s got just the thing:

boxplot(steamer$SB)

Boxplot of SB projectionsThis makes it as plain as possible that any player projected for more than about 15 SB is, quite literally, a statistical outlier.

20/20 Vision

The same idea goes for getting a grasp on multi-category players. Most of us are looking for players who can bring in both HR and SB, but how many of those are really available? Let’s do a quick 2D plot:

plot(steamer$SB, steamer$HR)

Basic plot of HR vs. SB projectionsThis isn’t bad, but unfortunately it doesn’t give us a good sense of how many players fall into each category, since there’s only one dot for all of the 5 HR/3 SB players, one dot for all the 2 HR/4 SB players, etc. A quick workaround for this is the jitter() command, which moves the points around by tiny increments to get rid of some of the overlap:

plot(jitter(steamer$SB), jitter(steamer$HR))

And, for good measure, let’s add a grid on top:

grid()

Your plot should now look something like (but not exactly like) this:

Detailed plot of HR vs. SB projections

From the chart, we can see that it’s not impossible to find players projected for 30/10 or 10/30, but it looks like there’s only one 20/20 guy in Steamer’s projections:

subset(steamer, (SB >= 20 & HR >= 20))

            Name   Team  PA  AB   H X2B X3B HR  R RBI BB  SO HBP SB CS X.1   AVG   OBP   SLG   OPS
40 Carlos Correa Astros 636 571 157  33   3 22 80  82 54 110   4 20 11  NA 0.275 0.339 0.458 0.797

Of course. As if being a 21-year-old SS with plus average wasn’t enough.

Fun With Subsets

Let’s close this out by doing a bit more with subset() — possibly one of R’s most useful tools for our purposes because it’s just so much quicker and more customizable than online tools or Excel.

Say you want to find the prospective “five-category players”; you may have a sense of who some of the candidates are, but you might be surprised by what the numbers actually suggest. How many players, for example, are projected to do better than 10/80/80/10/.275?

subset(steamer, (HR > 10 & SB > 10 & R > 80 & RBI > 80 
 & AVG > .275))

               Name         Team  PA  AB   H X2B X3B HR   R RBI  BB  SO HBP SB CS X.1   AVG   OBP   SLG   OPS
1        Mike Trout       Angels 647 542 166  32   5 36 104 104  90 138   8 15  6  NA 0.307 0.410 0.585 0.995
5  Paul Goldschmidt Diamondbacks 652 543 158  36   2 30  93  93 100 142   3 14  7  NA 0.290 0.401 0.531 0.931
7  Andrew McCutchen      Pirates 653 554 165  34   3 23  88  87  84 123   9 12  6  NA 0.297 0.395 0.496 0.891
25    Manny Machado      Orioles 663 597 170  35   2 27  91  87  53  99   4 14  8  NA 0.285 0.345 0.484 0.829

Fewer than you may expect–which could well make them all the more valuable.

Conclusion

Projections, of course, are just projections, and you shouldn’t take one set — or even a combination of sets — to be a true predictor of what will happen this season. But if you typically look up projections player-by-player, or if you’re disinclined to take in a huge wall of stats at a single glance, looking at the broader trends in individual visualizations can help keep you on the right track as you prep for this fantasy season.

Here*, as always, is the code used for this post. If you have anything else you’d like to see us do with R as the new season comes near — or any suggestions with what you’ve used R for — let us know in the comments!

*download the ZIP and extract the R file.

How to Quickly Search Your Favorite Baseball (and Other Sports) Sites with Launchy

If one were brave enough to scour the Internet, one could find a myriad of articles explaining how to optimize their workflow. Our workflow is super screwed up, it seems, and only optimization will help us become the well-oiled, hyper-productive sacks of meat we were always meant to be. You can eat better, you can spend your money more wisely, you can take a more efficient way to work, and you can certainly manage your time better — again, according to the Internet. My friends and loved ones point out my shortcoming often enough that I don’t worry myself too much with my workflow. I have, however, found a few tips and/or tricks which make the act of sitting at the computer a little less terrible. One such tip is using the application Launchy to help me perform searches faster. And by tweaking the program a little bit, you can make Launchy get you your desired content from your favorite sports sites without using your mouse to dig around for those little search boxes.

Launchy can actually do a whole lot more than what I’m going to explain here. It markets itself as a “keystroke launcher,” which basically means it’s a little applet that helps you perform tasks with just a few keystrokes. You can open other programs, search for files, or play music through Launchy without ever having to touch your mouse. If you’re a Mac user, you’ve probably heard of a similar program called Alfred. Alfred has plenty of its own perks, but we’ll be focusing on Launchy since it does what we need, is free, and is cross-platform (Windows, Mac, Linux).

Download and install Launchy. By default, it should open by itself. If it doesn’t hold down the Alt key and press the Space bar.

Now that Launchy is open, let’s test some things out. Type in Google and hit tab. Then type techgraphs and hit enter. A new browser tab should open with a Google search. Now open Launchy again (Alt+Space) and simply type techgraphs.com and hit enter. It should bring you right to our home page. You can even do math in Launchy. Open it and type 4+2 and hit Enter. You should get a result. This is just a sliver of what Launchy can do, but now you know the general mechanics of the program. Now that we have that, we can start customizing.

The first thing you will want to look at is the keyboard shortcut for displaying Launchy. If you click the little gear icon you’ll be presented with the Launchy settings page. On the left-hand side, you’ll see a section to select the hotkey for Launchy. By default (as we’ve seen), it’s set to Alt+Space (on Windows), but you may want to change it to something that feels better to your fingers. It’s not necessary, but if you do feel like changing it, now you know where to do that.

(Apologies for the slightly-off screenshots. I can’t get my screen-cap software to pick up the Launch pop-up window.)

launchysettings

That same settings window is where we’re going to add our custom sports searches. At the top of the window, click Plugins, then click Weby from the list of plugins on the left.

webysettings

In this list you’ll see all the sites that Launchy can help you search, including some of the ones we tested earlier. These are very handy to have, but we want to add our own sites, which Launchy allows us to do. But first, we need the proper syntax. Launchy needs to know the search URL that we want in order to work properly. Let’s track one down together, like old friends. Click Cancel for now and close out of Launchy by pressing ESC.

Let’s start with FanGraphs, which my boss tells me is a fantastic baseball site. We’ll want to use Launchy to quickly search on a player, so we’ll need to start with the proper search URL. Go to FanGraphs and do a search for “Stanton.” Don’t click anything, just type “Stanton” in the search box and press Enter. In the next page, note the URL in the address bar. It should say http://www.fangraphs.com/players.aspx?lastname=stanton. Notice the last bit. The string “stanton” is passed along with the search URL. The address up to the equal sign is what we want: http://www.fangraphs.com/players.aspx?lastname=. Copy that and go back to your Launchy settings.

Back in the Weby plugin window, we’ll want to click the + sign. This will create a new entry. The name field is up to you, make it something you can remember. I use the boring but explanitory “fangraphs” (no quotes). In the address field, paste the URL you copied, and add %1 to the end (the %1 just means that we want Launchy to ask for a variable). The whole thing should look like http://www.fangraphs.com/players.aspx?lastname=%1. Press the + button again to save. Now press the OK button, which should bring you back to the main Launchy window.

webysettings2

Let’s test. In the Launchy box, type fangraphs (or the name you chose) and hit Tab. You should see the name, followed by a right-facing arrow. Now type Stanton and hit Enter. A new web browser tab should open up and bring you to the same page we saw when we grabbed the search URL. Pretty cool, right? Now, open Launchy again. Type fangraphs and hit Tab, but this time enter Giancarlo Stanton after the arrow and hit Enter. Boom, you’re brought right to the corresponding page. Pretty cool, yeah?

This only works with unique names, so searching for Alex Gonzalez by full name will still bring you to a search page where you’ll have to choose the right person.

This will work for all kinds of sites, so long as you know the search URL. Here are a few examples:

  • Baseball-Reference: http://www.baseball-reference.com/search/search.fcgi?results=&search=%1 (You can also search for specfic teams by using syntax like 2015 Royals)
  • Baseball Prospectus: http://www.baseballprospectus.com/player_search.php?search_name=%1
  • Pro-Football-Reference: http://www.pro-football-reference.com/search/search.fcgi?search=%1
  • ESPN: http://espn.go.com/search/results?q=%1

You can plug all of those into Launchy to create your own custom searches. Of course, feel free to search out your own. Just search for something on your favorite site and pay attention to the URL of the corresponding page. Copy out everything that isn’t your search term, and you should be able to plop that right into Launchy.

This is a great tool for writers, fans, or even fantasy owners during a draft. It might not save you hours off your workflow, but it will at least limit the amount of time you need to spend at your computer — an exercise that is probably killing you. Happy searching!

(Header image via Leo Leung)

TechGraphs’ Monthly Streaming Device List — October 2015

Much like Futurama’s Lrrr (ruler of the planet Omicron Persei 8) in the episode When Aliens Attack, many are getting upset at their current television situation. At this point it isn’t just the millennial crowd versus traditional TV package customers. Certain companies who have helped bring cord cutting to the forefront of people’s minds are now splintering.

Recently Amazon, creators of the Fire Stick and Fire TV, announced they will no longer sell Google’s Chromecast or Apple’s AppleTV. Via Bloomberg, Amazon has not offered new listings for said devices nor will the online retailer continue sales of the Google or Apple products after October 29th. As the linked article notes, it is curious to see Roku among other stream devices, unaffected by these events. Both Google and Apple still have various options for purchasing in stores or online, but seeing the the streaming options battle each other rather than the likes of Comcast, Charter and AT&T et al is disappointing.

For now, the table below is as up-to-date as possible, though again, any Amazon price listed for Chromecast or AppleTV is a deal with an October 29 expiration date. I’ve also separated the table to separate apps based on cost, determined by whether they are a monthly recurring cost, one-time payment, free, or needing a cable subscription.

Streaming Devices
Box/Dongle Opening Price Point Recuring Cost Apps Notable Free Apps One-time Cost Apps Cable TV Subscription Required
Amazon Fire Stick $39.00 HBO Now, Hulu, Netflix, SlingTV, GaiamTV Twitch.tv, YouTube MLBtv, NBA League Pass, Amazon Video WatchESPN, Fox Sports Go
Amazon Fire TV $99.00 HBO Now, Hulu, Netflix, SlingTV, GaiamTV Twitch.tv, YouTube MLBtv, NBA League Pass, Amazon Prime WatchESPN, Fox Sports Go
AppleTV (Revision A) $73.28* HBO Now, Hulu, Netflix, SlingTV, GaiamTV CBS Sports, 120 Sports, Twitch.tv, YouTube NFL Game Pass, MLBtv, NBA League Pass, NHL Gamecenter, MLS Live, Amazon Prime, Fox Soccer 2Go HBO Go, Watch ESPN, Fox Sports Go, USA Now
Google Chromecast $35.00 HBO Now, Hulu, Netflix, SlingTV, GaiamTV 120 Sports, Twitch.tv, YouTube NFL Game Pass, HBO Go, NBA League Pass, NHL Gamecenter, Fox Sports Go, Fox Soccer 2Go, MLS Live HBO Go, Watch ESPN, Fox Sports Go
Google Nexus Player $49.99 HBO Now, Hulu, Netflix, SlingTV, GaiamTV 120 Sports, Twitch.tv, YouTube NFL Game Pass, HBO Go, NBA League Pass, NHL Gamecenter, Fox Sports Go, Fox Soccer 2Go, MLS Live HBO Go, Watch ESPN, Fox Sports Go
Microsoft Xbox One $349.99 HBO Now, Hulu, Netflix, SlingTV, GaiamTV 120 Sports, Twitch.tv, YouTube MLBtv, NFL Game Pass, NBA League Pass, NHL Gamecenter, ESL, Major League Gaming HBO Go, Watch ESPN, Fox Sports Go
Nvidia Shield $199.99 HBO Now, Hulu, Netflix, SlingTV Twitch.tv, YouTube MLBtv Fox Sports Go, Watch ESPN
Roku Stick $46.99* NFL Now, HBO Now, Hulu, Netflix, SlingTV, UFC TV Twitch.tv, YouTube, Vevo, PBS NFL Game Pass, MLBtv, MLS Live, NBA Game Pass, NHL Gamecenter NBC Sports Live Extra, Watch ESPN
Roku 3 Player $97.95* NFL Now, HBO Now, Hulu, Netflix, SlingTV, UFC TV Twitch.tv, YouTube, Vevo, PBS NFL Game Pass, MLBtv, MLS Live, NBA Game Pass, NHL Gamecenter NBC Sports Live Extra, Watch ESPN
Slingbox M1 $129.99* N/A N/A N/A Steam any currently subscribed channel
Sony Playstation 4 $399.00* HBO Now, Hulu, Netflix, SlingTV, GaiamTV YouTube, PlayStation F.C. MLBtv, NBA League Pass, NFL Sunday Ticket, NHL Gamecenter, Amazon Video HBO Go

*Limited to Amazon Prime members


How to Stream the 2015 MLB Playoffs

It’s early October, which means it’s time to start making some mental notes for the approaching decorative gourd season while enjoying some playoff baseball. You’re on your own for the former, but we can help with the latter. The MLB playoffs start on Tuesday, beginning with the American League Wild Card game. Here’s how to watch:

  • Traditional television: For you squares (like me!) who still want to make use of that giant box around which your living rooms are oriented, all of these games should be available to you. In general, look for the AL games on Fox and the NL games on TBS (somebody maybe tell Ted Turner the Braves didn’t quite make it this year), although you’ll need ESPN, MLB Network, and Fox Sports 1 (now being rebranded as simply FS1) for some of the early action. Here’s a helpful schedule.
  • MLB.com Gameday: This old standby makes you feel both seriously plugged-in, by virtue of its integrated PITCHf/x and StatCast data, video clips, and social media streams, and seriously devoid of a human experience, by virtue of the fact that, say, Prince Fielder and Don Kelly do not really have the same body types. Free for anyone with a computer and an internet connection. Compare MLB At Bat app for mobile users.
  • MLB.tv: MLB’s premium, web-based service brought you live streams of all out-of-market television broadcasts during the regular season for a non-negligible fee. Due to the exclusivity of MLB’s playoff television broadcast rights agreements, however, you should not expect this service to be available in the postseason, with the exception of Gameday Audio, which the fine print indicates will remain available to MLB.tv subscribers. Instead, things shift to what the league calls Postseason.tv, which will allow subscribers to view “live alternate video feeds,” but not the main television broadcast feed. This will set you back $9.99 for the entire postseason, and does require you authenticate with you TV provider. But if you already pay for TV, the next two options might be your best bet.
  • TBS.com: It is not exactly clear what TBS will be offering this year in terms of playoff baseball streaming, but, in 2014, live streams were available to cable subscribers on TBS.com and through the Watch TBS mobile app, giving us every reason to believe the same will be available in 2015. TBS is also available to Sling TV subscribers.
  • Fox Sports Go: It’s the same story for Fox, which does not appear to have announced the scope of its playoff streaming, but we assume that, at a minimum, live streams will be available to cable subscribers through the Fox Sports Go app and at FoxSportsGo.com, as they were for the 2014 postseason.
  • Terrestrial radio: ESPN Radio will have every game. Find the dial location of your local ESPN Radio affiliate here. ESPN Radio also provides iOS and Android apps.
  • Satellite radio: SiriusXM. A subscription is required, but they almost always are offering free trials.

Enjoy cold weather baseball!

(Header image via Keith Allison)

Streamable Introduces Brand New Extension

We here at TechGraphs are pretty big fans of Streamable — one of the latest, and in my opinion best ways of sharing media clips, and they’ve just given us a reason to like them even more. Yesterday the company unrolled a new feature for Google Chrome users, an extension that allows users to clip video (and audio if so desired) directly from YouTube as well as Twitch.tv, and a number of smaller sites.

Being able to grab media files directly from Twitch.tv via a Chrome extension isn’t a particularly new option as Oddshot.tv has specialized in that area for months now, however support for web sites other than the esports-streaming site is a welcome sight. Unfortunately traditional sports sites haven’t caught on just yet — which should not surprise — however I’m looking forward to being able to grab a plate appearance or pitch sequence, or perhaps an incredible soccer cross and goal in near real-time.

The maximum length of the clip is 60 seconds, which is mostly enough time to capture whatever in-game moment you’re attempting to save. For the walk-through, I’ve grabbed a clip of my beloved A’s winning their 20th consecutive game, courtesy of MLB’s YouTube Channel. Before getting too far of ourselves, make sure to install the extension. It’s available for free at the Chrome Web Store and can be found by clicking here.

Once you’ve got it installed, make sure to see the Streamable icon/the infinity symbol. It will pop up on any page that is compatible with the extension, in this case the MLB channel.

streamable2

Now, after finding the clip you want — say, the glory days of seasons past — simply click on the Streamable icon and you’ll open a new page, specifically the editing page for Streamable.

streamable3

The extension allows users to re-name the clip (YouTube vids retain their name, but you can change it), crop, adjust the speed (as slow as x.50 or s fast as x2 with .25 increments) and to nail down a specific clip to one-tenth of a second of accuracy.

Note the picture below, at first glance it seems like I’m just ripping the clip, but I’ve cropped the black bars out of the original YouTube video and I’ve typed in exactly where in the video I want the clip to begin and end.

streamable4

Once the clip is finished, you’re given the direct URL or you can use an HTML embed code to share. Within the embed code options you can mute the clip, something I prefer doing because loud (not to mention auto-playing) clips are pretty terrible if you’re not expecting it. Don’t worry, muting the clip within the embed code doesn’t lose the audio. It’s just like a Vine or Instagram video where you can tap the speaker icon for the sound and full experience.

Armed with the perfect clip of the video, observe the final product:

Grab the extension and play around with it. You’ll be sharing priceless moments in non-terrible formats in no time.


How to Get the Most Out of “Format as a Table”

One of the Excel features that I feel too many people ignore is the “Format as a Table” tool. Let’s say we’ve got this data, 2014 team defense data from Pro-Football-Reference.com. I’ve cleaned it up a little — gotten rid of the totals at the bottom and consolidated the headers into one row — but it should still be a fairly recognizable table.

File -> PFR2014.csv

The process of formatting data as a table is really stupidly simple. Click anywhere within the table’s data (so anywhere from A1 to Z33). Now choose “Format as a Table” from the Home tab:

A variety of designs will appear when you click the button. They all have the same functionality, and you can change the colors easily at any time, so just pick one.
A variety of designs will appear when you click the button. They all have the same functionality, and you can change the colors easily at any time, so just pick one.

I am going to select the third from the top red design because red is the greatest color. When I do pick that design, I get a popup and a blinking border around my data:

Excel, being full of magical and distressing insights, knows where my data begins and ends. So all I need to do is make sure the "My table has headers" button is checked (it is) and then press "OK."
Excel, being full of magical and distressing insights, knows where my data begins and ends. So all I need to do is make sure the “My table has headers” button is checked (it is) and then press “OK.”

Presto magnifico, I have a table!

“But Bradley, you had a table all along.”

Shut up, and yes, I did — but this one is better. Lemme show you why.

The first thing I like to do is enable “Wrap Text” on the header row (select the top row, then Home > Wrap Text). It allows me to see the whole table much better. (This is a simple operation, but if you’re confused, Professor Google should have a plethora of resources on the matter.)

The second thing I like to do is rename this table — especially if I’m going to do work with multiple tables. To rename the table, head to Formulas > Name Manager. The name manager popup window should have only one table in it — Table1. Double click that line, and a subsequent pop up will enable us to edit the name:

Let's rename this one "Defense."
Let’s rename this one “Defense.” If we ever want to rename the table, we can follow these same steps without any harm to formulas throughout the spreadsheet.

Once you choose a name, click OK then Close. Now when we select all the data in the table, it will highlight the name “Defense” in a names drop-down menu:

Conversely, when we select "Defense" from the names drop-down menu, Excel will highlight the table data.
Conversely, when we select “Defense” from the names drop-down menu, Excel will highlight the table data.

Of course, we can name a table without it being formatted through the “Format as Table” function. But what’s nice about the “Format” table is that if I add a row or a column, the named table “Defense” automatically expands. Let’s try this out.

Go to cell AA1 and type “Pass – Rush AVG” (we’re going to create a new column). Excel will automatically create and format a new column, like this:

The new column will automatically carry the formatting to the end of the table.
The new column will automatically carry the formatting to the end of the table.

Now, I’m going to type a new formula into it:

=[Passing NY/A]-[Rushing Y/A]

Whoa! That’s not a normal Excel formula! That’s right. It’s a table formula. Now that we have a named table, we can use the column names to complete formulas within (and without) the table. More on that in a second.

Depending on your settings, the formula may have automatically filled to the bottom of the spreadsheet. If it didn’t, just click on the autofill icon that appeared at the bottom left corner, and choose to allow the formula to overwrite the contents of the column:

No need to click and drag a formula to the end of a spreadsheet. This is particularly useful with lots of data.
No need to click and drag a formula to the end of a spreadsheet. This is particularly useful when you have many rows of data.

And hey, remember that strange formula we made a second ago? Let’s do something similar on a different tab. I’m going to add another sheet, Sheet2, and type into any cell:

=AVERAGE(Defense[Pass - Rush AVG])

That’s a basic =AVERAGE formula, but because of the named table “Defense” and the named column “Pass – Rush AVG,” I’m able to write the formula without using the mouse or worrying about cells moving or data changing. In fact, I can go back and change the name of that column to “Net Pass AVG” and guess what happens to my formula? It changes automatically to:

=AVERAGE(Defense[Net Pass AVG])

One of the biggest advantages to using named tables (and editing those table names) is that when formula get REALLY complicated, you can read something that’s close to English, not a collection of meaningless cell references (“Wait, was A1:D22 the running data? Or was that in F2:Y54?” as opposed to “Oh, I have Running[AVG] not Running[Data] selected. Oops!”).

Consider this formula from my Scoresheet dataset:

=IF(VLOOKUP(CONCATENATE([@firstName]," ",[@lastName]),FGDC,19,0)=0,"",VLOOKUP(CONCATENATE([@firstName]," ",[@lastName]),FGDC,19,0))

“FGDC” is data pulled from the FanGraphs Depth Charts leaderboard. I have that data on a separate tab. Here’s how the above formula would look without named sections:

=IF(VLOOKUP(CONCATENATE(Combined!G2:G1010," ",Combined!H2:H1010),'FG DC'!A2:X1141,19,0)=0,"",VLOOKUP(CONCATENATE(Combined!G2:G1010," ",Combined!H2:H1010),'FG DC'!A2:X1141,19,0))

Both formulas are unwieldy, but at least the first one is intrinsically sensible. I’m combining the column “firstName” with “lastName” and looking them up in table FGDC. If I don’t find them, then I want Excel to put nothing into the cell (i.e. print “”).

If you want to learn more about structured references, I recommend this rundown of the syntax and various uses of structured refs.

It’s also important to note a named table automatically adds filters and applies those filters even to new columns and rows.

Filters allow us to sort and sift through the data much more easily.
Filters allow us to sort and sift through the data much more easily.

Another small, but useful component of tables is that the jump shortcuts (e.g. CTRL+→) will jump to the end of the table, even if the row or column is empty. In other words, in the empty Sc% column, if we press CTRL+↓, the cursor will move to X33 instead of X1048576, which is where it would normally and uselessly end.

There’s a multitude of other little handy features when it comes to structured tables, like being able to neatly and easily select whole columns of data without also selecting the header and the empty cells beneath the data. But for now, I hope this is enough to get the Excel newbie started with exploring this surprisingly robust feature.


How To Use R For Sports Stats, Part 3: Projections

In this series, we’ve walked through how exactly you can use R for statistical analysis, from the absolute basics of R coding (in part 1) to visualizing data and correlation tests (in part 2).

Since you’re reading this on TechGraphs, though, you might be interested in statistical projections, so that’s how we’ll wrap this up. If you’re just joining us, feel free to follow along, though looking through parts 1 and 2 first might help everything make more sense.

In this post, we’ll use R to create and test a few different projection systems, focusing on a bare-bones Marcel and a multiple linear regression model for predicting home runs. I’ve said a couple times before that we’re just scratching the surface of what you can do — but this is especially true in this case, since people write graduate theses on the sort of stuff we’re exploring here. At the end, though, I’ll point you to some places where you can learn more about both baseball projections and R programming.

Baseline

Let’s get everything set up. We’ll have to start by abandoning –well, modifying– that test data set that served us so well in Parts 1/2; we’ll add another two years of data (2011-14), trim out some unnecessary stats, and add a few which might prove useful later on. It’s probably easiest just to download this file.

Then we’ll load it:

fouryr = read.csv("FG1114.csv")

convert some of the percentage stats to decimal numbers:

fouryr$FB. = as.numeric(sub("%","",fouryr$FB.))/100
fouryr$K. = as.numeric(sub("%","",fouryr$K.))/100
fouryr$Hard. = as.numeric(sub("%","",fouryr$Hard.))/100
fouryr$Pull. = as.numeric(sub("%","",fouryr$Pull.))/100
fouryr$Cent. = as.numeric(sub("%","",fouryr$Cent.))/100
fouryr$Oppo. = as.numeric(sub("%","",fouryr$Oppo.))/100

and create subsets for each individual year.

yr11 = subset(fouryr, Season == "2011")
colnames(yr11) = c("2011", "Name", "Team11", "G11", "PA11", "HR11", "R11", "RBI11", "SB11", "BB11", "K11", "ISO11", "BABIP11", "AVG11", "OBP11", "SLG11", "WAR11", "FB11", "Hard11", "Pull11", "Cent11", "Oppo11", "playerid11")
yr12 = subset(fouryr, Season == "2012")
colnames(yr12) = c("2012", "Name", "Team12", "G12", "PA12", "HR12", "R12", "RBI12", "SB12", "BB12", "K12", "ISO12", "BABIP12", "AVG12", "OBP12", "SLG12", "WAR12", "FB12", "Hard12", "Pull12", "Cent12", "Oppo12", "playerid12")
yr13 = subset(fouryr, Season == "2013")
colnames(yr13) = c("2013", "Name", "Team13", "G13", "PA13", "HR13", "R13", "RBI13", "SB13", "BB13", "K13", "ISO13", "BABIP13", "AVG13", "OBP13", "SLG13", "WAR13", "FB13", "Hard13", "Pull13", "Cent13", "Oppo13", "playerid13")
yr14 = subset(fouryr, Season == "2014")
colnames(yr14) = c("2014", "Name", "Team14", "G14", "PA14", "HR14", "R14", "RBI14", "SB14", "BB14", "K14", "ISO14", "BABIP14", "AVG14", "OBP14", "SLG14", "WAR14", "FB14", "Hard14", "Pull14", "Cent14", "Oppo14", "playerid14")

(We’re renaming the columns for each subset because the merge() function has some problems if you try to merge too many sets with the same names. If you want to explore the less hacked-together way of reassembling data frames in R, take a look at the dplyr package.)

Anyway, we’ll merge these all back into one set:

set = merge(yr11, yr12, by = "Name")
set = merge(set, yr13, by = "Name")
set = merge(set, yr14, by = "Name")

Still with me? Good. Thanks for your patience. Let’s start testing projections.

Specifically, we’re going to see how well we can use the 2011-2013 data to predict the 2014 data. For simplicity’s sake, we’ll focus mostly on a single stat: the home run. It’s nice to test with–it’s a 5×5 stat, it has a decent amount of variation, it gives us experience with testing counting stats while being more player-controlled than R/RBI… and, come on, we all dig the long ball.

Now when you’re testing your model, it’s nice to have a baseline–a sense of the absolute worst that a reasonable model could do. For our baseline, we’ll use previous-year stats: we’ll project that a player’s 2013 HR count will be exactly what they hit in 2014.

To test how well this works, we’ll follow this THT post and use the mean absolute error–the average number of HRs that the model is off by per player. So if a system projects two players to each hit 10 homers, but one hits zero and the other hits 20, the MAE would be 10.

(If you end up doing more projection work yourself, you may want to try a more fine-tuned metric like r² or RMSE, but I like MAE for a basic overview because the value is directly measurable to the stat you’re examining.)

To find the mean absolute error, take the absolute value of the difference between the projected and actual stats, sum it up for every player, then divide by the number of players you’re projecting:

sum(abs(set$HR13 - set$HR14))/length(set$HR14)
> [1] 6.423729

So the worst projection system possible should be able to beat an average error of about six and a half homers per player.

Marcel, Marcel

Now let’s try a slightly-less-than-absolute-worst model.

Marcel is the gold standard of bare-bones baseball projections. At its core, Marcel predicts a player’s stats using the last 3 years of MLB data. The previous year (Year X) gets a weight of 5, the year before (X-1) gets a weight of 4, and X-2 gets a weight of 3. As originally created, Marcel also includes an adjustment for regression to the mean and an age factor, but we’ll set aside such fancies for this demonstration.

To find Marcel’s prediction, we’ll create a new column in our dataset weighing the last 3 years of HRs. Since our weights are 5 + 4 + 3 = 12, we’ll take 5/12 from the 2013 data, 4/12 from the 2012 data, and 3/12 from the 2011 data. Then we’ll round it to the nearest integer.

set$marHR = (set$HR13 * 5/12) + (set$HR12 * 4/12) + (set$HR11 * 3/12)
set$marHR = round(set$marHR,0)

Voila! Your first (real) projections. How do they perform?

sum(abs(set$marHR - set$HR14))/length(set$HR14)
> [1] 5.995763

Better by nearly half a home run. Not bad for two minutes’ work. 6 HR per player still seems like a lot, though, so let’s take a closer look at the discrepancies. We’ll create another column with the (absolute) difference between each player’s projected 2014 HRs and actual 2014 HRs, then plot a histogram displaying these differences.

set$mardiff = abs(set$marHR-set$HR14)
hist(set$mardiff, breaks=30, col="red")

Histogram of Marcel HR errors

Not as bad as you might have thought. Many players are only off by a few home runs, some off by 10+, and a few fun outliers hanging out at 20+. Who might those be?

set = set[order(-set$mardiff),]
head(set[c(1,72,90,91)], n=10)

(In that last line, we’re calling specific column names so we don’t have to search through 100 columns for the data we want when we display this. You can find the appropriate numbers using colnames(set).)

List of players with largest Marcel HR errors

A list headlined by a season-ending injury and two players released by their teams in July; fairly tough to predict in advance, IMO.

While we’re here, let’s go ahead and create Marcel projections for the other 5×5 batting stats:

set$marAVG = (set$AVG13 * 5/12) + (set$AVG12 * 4/12) + (set$AVG11 * 3/12)
set$marAVG = round(set$marAVG,3)
set$marR = (set$R13 * 5/12) + (set$R12 * 4/12) + (set$R11 * 3/12)
set$marR = round(set$marR,0)
set$marRBI = (set$RBI13 * 5/12) + (set$RBI12 * 4/12) + (set$RBI11 * 3/12)
set$marRBI = round(set$marRBI,0)
set$marSB = (set$SB13 * 5/12) + (set$SB12 * 4/12) + (set$SB11 * 3/12)
set$marSB = round(set$marSB,0)

And, for good measure, save it all in an external file. We’ll create a new data frame from the data we just created, rename the columns to look nicer, and write the file itself.

marcel = data.frame(set$Name, set$marHR, set$marR, set$marRBI, set$marSB, set$marAVG)
colnames(marcel) = c("Name", "HR", "R", "RBI", "SB", "AVG")
write.csv(marcel, "marcel.csv")

Before we move on, I want to quickly cover one more R skill: creating your own functions. We’re going to be using that absolute mean error command a couple more times, so let’s create a function to make writing it a bit easier.

modtest = function(stat){
 ame = sum(abs(stat - set$HR14))/length(set$HR14)
 return(ame)
}

The ‘stat’ inside function(stat) is the argument you’ll be including in the function (here, the column of projected data we’re testing); the ‘stat’ shows up inside the bracketed text where your projected data did when we originally used this command. The return() is what your function outputs to you. Let’s make sure it works by double-checking our Marcel HR projection:

modtest(set$marHR)
> [1] 5.995763

Now we can just use modtest() to find the absolute mean error. Functions can be as long or as short as you’d like, and are incredibly helpful if you’re using a certain set of commands repeatedly or doing any sort of advanced programming.

Hold The Line

With Marcel, we used three factors–HR counts from 2013, 2012, and 2011–with simple weights of 5, 4, and 3. For our last projection model, let’s take this same idea, but fine-tune the weights and look at some other stats which might help us project home runs. This, basically, is multiple linear regression. I’m going to handwave over a lot of the theory behind regressions, but Bradley’s how-to from last week does a fantastic job of going through the details.

Remember back in part 2, when we were looking at correlation tests in r² and we mentioned how we were basically modeling a y = mx + b equation? That’s basically what we did with Marcel just now, where ‘y’ was our projected HR count and we had three different mx values, one each for the 2013, 2012 and 2011 HR counts. (In this example, ‘b’, the intercept, is 0.)

So we can then use the same lm() function we did last time to model the different factors that can predict home run counts. We’ll give R the data and the factors we want it to use, and it’ll tell us how to best combine them to most accurately model the data. We can’t model the 2014 data directly in this example–since we’re testing our model against it, it’d be cheating to use it ‘in advance’–but we can model the 2013 HR data, then use that model to predict 2014 HR counts.

This is where things start to get more subjective, but let’s start by creating a model using the last two years (2013/2012) of HR data, plus the last year (2012) of ISO, Hard%, and Pull%. In the lm() function, the data we’re attempting to model will be on the left, separated by a ‘~’; the factors we’re including will be on the right, separated by plus signs.

hrmodel = lm(set$HR13 ~ set$HR12 + set$HR11 + set$Hard12 + set$Pull12 + set$ISO12)
summary(hrmodel)

Screenshot of initial linear model

There’s a lot of stuff to unpack here, but the first things to check out are those “Pr(>|t|)” values in the right corner. Very simply, a p-value less than .05 there means that that factor is significantly improving your model. (The r² for this model, btw, is .4611, so this is accounting for roughly 46% of the 2013 HR variance.) So basically, ISO and Pull% don’t seem to add much value to this model, but Hard% does.

It’s generally a good practice to remove any factors that don’t have a significant effect and re-run your model, so let’s do that:

hrmodel = lm(set$HR13 ~ set$HR12 + set$HR11 + set$Hard12)
summary(hrmodel)

Screenshot of R model with significant factors

And there’s your multiple linear regression model. The format for the actual projection formula is basically the same as what we did for Marcel, except your weights will take the coefficient estimates and you’ll include the intercept listed above them. Remember that “HR12”, “HR11”, etc., are standing in for “last year’s HR total”, “the year before that’s HR total”, etc., so make sure to increment the stats by a year to project for 2014.

set$betHR = (-5.3 + (set$HR13 * .32) + (set$HR12 * .13) + (set$Hard13 * 40))
set$betHR = round(set$betHR,0)

Survey says…?

modtest(set$betHR)
> [1] 5.95339

…oh. Yay. So that’s an improvement of, uh…

modtest(set$marHR) - modtest(set$betHR)
> [1] 0.04237288

1/20th of a home run per player. Isn’t this fun? Some reasons why we might not have seen the improvement we expected:

  • We probably overfit the data. Since we ran the model on 2013 data, it probably did really well on 2013 data, but not as great on 2014. If we check the model on the 2013 data:
set$fakeHR = (-5.3 + (set$HR12 * .33) + (set$HR11 * .13) + (set$Hard12 * 40))
set$fakeHR = round(set$fakeHR,0)
sum(abs(set$fakeHR - set$HR13))/length(set$HR13)
> [1] 4.877119

It runs pretty well.

  • We didn’t include useful factors we could have. We just tested a few obvious ones; maybe looking at Cent% or Oppo% would be more helpful than Pull%? (They aren’t, just so you know.) More abstract factors like age, ballpark, etc., would obviously help–but including these would also require a stronger model.
  • Finally, projections are hard. Even if you have an incredibly customized set of projections, you’re going to miss some stuff. Take a system like Steamer, one of the most accurate freely-available projection tools around. How did their 2014 preseason projections stack up?
steamer = read.csv("steamer.csv")
steamcomp = merge(yr14, steamer, by = "playerid14")
steamcomp$HR = as.numeric(paste(steamcomp$HR))
steamcomp$HR = round(steamcomp$HR, 0)
steamcomp$HR[is.na(steamcomp$HR)] = 0
sum(abs(steamcomp$HR - steamcomp$HR14))/length(steamcomp$HR14)
> [1] 4.892157

That said, the lesson you should not take away from this is “oh, our homemade model is only 1 HR/player worse than Steamer!” Our data set is looking at players for whom we have several seasons’ worth of data —   the easiest players to project. If we had to create a full-blown projection system including players recovering from injury, rookies, etc., we’d look even worse.

If anything, this hopefully shows how much work the Silvers, Szymborskis, and Crosses of the world have put in to making projections better for us all. Here’s the script with everything we covered.

This Is Where I Leave You

Well, that about wraps it up. There’s plenty, plenty more to learn, of course, but at this point you’ll do well to just experiment a little, do some Googling, and see where you want to go from here.

If you want to learn more about R coding, say, or predictive modeling, I’d definitely recommend picking up a book or trying an online class through somewhere like MIT OpenCourseWare or Coursera. (By the end of which, most likely, you’ll be way beyond anything I could teach you.) If there’s anything particular about R you’d still like to see covered, though, let me know and I’ll see if I can do a writeup in the future.

Thanks to everyone who’s joined us for this series — the kudos I’ve read here and elsewhere have been overwhelming — and thanks again to Jim Hohmann for being my perpetual beta tester/guinea pig. Have fun!


How To Run Sports Data Regressions in Microsoft Excel

The shorthand description of a regression: It’s the best possible trend line between a scatter of dots. Like this:

The orange line (and the connected equation) represent the most basic idea of a regression.
The orange line (and the connected equation) represent the most basic idea of a regression.

One of the fun things about regressions is that they give us formulas — line equations, specifically. So if we have a quarterback with a 100 QB rating, we can plug his 100 into our formula (y = 0.097 * 100 – 2.495) and get a reasonable estimation of what his Adjusted Net Yards per Pass Attempt (ANY/A) would be (about 7.2). The R2 tells us essentially how reliable the regression is — or, more specifically, how much of the variation in QB Rating is explained by ANY/A.

Of course, the problem with the data here (which I just kind of threw together as an example) is that QBR and ANY/A use almost the same inputs and attempt to do the same thing. It’s nice to see they have about 91% overlap (it basically says they’re just about interchangeable), but no one is going to use QB rating to derive or forecast a ANY/A.

Regressions are more useful when we start with something small and reliable, then move our way to more all-encompassing but volatile stats. This is like how we use contact or plate discipline data (small and reliable) to expand into an xBABIP calculator (big and more meaningful).

There are two ways to run some regressions in Excel:

  1. Use the scatterplot tool (as above) and create a simple, two-variable regression.
  2. Use the Data Analysis ToolPack to run a more complete and useful regression.

The first method is the easiest, but it doesn’t output the peripheral data that is essential to fully understanding a regression’s findings.

The Scatterplot Regression

For the first method, just select two columns of data and make a scatterplot (Insert > Scatter). That will give you something like this:

Here's a scatterplot of the 2015 Durham Bulls' strikeout and home run totals.
Here’s a scatterplot of the 2015 Durham Bulls’ strikeout and home run totals, min. 50 PA.

With the chart selected, choose to add a linear trendline (Layout > Trendline > Linear Trendline):

Adding a linear trendline will create a basic linear regression.
Adding a linear trendline will create a basic linear regression.

Now double-click the trendline to produce the “Format Trendline” window. In that menu, check the boxes for “Display Equation on chart” and “Display R-squared on chart”:

These two boxes give you the bare minimum of data necessary to interpret a regression.
These two boxes give you the bare minimum of data necessary to interpret a regression.

So now we have a regression! The formula (HR = 3.5367 * SO + 29.166) tell us there is a positive connection between home run totals and strikeout totals. And the R2 tells us the relationship between HR and SO explains 48% of the variation between the two of them.

What this regression doesn’t tell us:

  • What direction, if any, is the causality? Are homers causing players to strikeout? Or do more strikeouts make more homers?
  • Are their peculiarities in the residuals? This article does a great job of teaching how to interpret residuals plots.
  • Does the regression fit the data? And ANOVA analysis can be useful in augmenting what the R2 tells us.

The first issue is a matter of deeper research. A regression won’t tell us direction of causality. But we can still answer those other two questions — as well as add more variables — using Excel data Analysis ToolPack. The first thing we’ll need to do is enable that ToolPack.

In the File > Options > Add-Ins section, you’ll notice a “Go…” button at the bottom of the window.

This button opens a dialogue that allows us to turn on the data Analysis ToolPack. Why is this not enabled by default? Who knows? Maybe Bill Gates.
This button opens a dialogue that allows us to turn on the data Analysis ToolPack. Why is this not enabled by default? Who knows? Maybe Bill Gates.

Select the top option in the available Add-Ins (“Analysis ToolPack”) and then click “OK.”

You can also add in these other ones if you're feeling frisky. I rarely use them, though.
You can also add in these other ones if you’re feeling frisky. I rarely use them, though.

Now, after this first step, you should have a new option in your Data tab. Let’s explore that. Go to Data > Data Analysis. That will open a simple dialogue with a list of various operations. Choose “Regression” and click “OK”.

You should then get this screen:

The Y Range will be what you are regression against, so to speak.
The Y Range will be what you are regression against, so to speak.

In the Y Range text box, you will want to add only a single column of data. I prefer to include the column headings so that the output screen will be more easily understood. In this instance, I’m choosing a big column of completions percentage data from Pro-Football-Reference.com (from this data: NFL QB seasons since 1969 with min. 10 TD). I’m regression this Cmp% data against the quarterbacks sack total and yards per attempt (Y/A) total.

In short, I’m asking: Can Y/A and sack totals predict a QB’s accuracy?

So in the X Range, I’m going to select the Q and R columns (titles and all). The output is something like this:

Here are the big three components of a regression.
Here are the big three components of a regression.

So this is kinda what it will look like after a regression. Let’s break down the three big areas one at a time, in the typical order I look at them:

  1. Residual Plots: These look good! You want a shotgun blast looks. If you start to see anything other than a circle, in any of your residual plots, then you’ll need to rework your regression. (See that above article for more details.)
  2. R2 Results: What is a good R2? Well, higher is always (well, usually) better, but there’s no clear perfect R2. Truth is, we have to be as intellectually honest as possible and determine how much explanation is the right amount of explanation. With multiple variables, it’s important to look at Adjusted R2 because it helps combat the unintentional increase in R2 caused by just adding more data. In this case, though, R2 and Adjusted R2 are about the same, so whutevz.
  3. Coefficients: The coefficients tell us both the formula of the regression (Cmp% = 36 + 0.02 * Sacks + 3.03 * Y/A) but also the strength of the variables involved. And it doesn’t take much work realized which variable is more important. Even a QB who has been sacked will only see his Cmp% moved by 1.5%. What’s more: It increases 1.5%. That’s a red flag right there for a bad variable. Maybe Sack% would be a more useful tool because the sack totals are merely telling us he played more (and QBs who played more probably had better performances because otherwise they would have been benched).

Anyway, I hope this has been helpful. I encourage readers to learn more about regression before attempting any, as they are a complicated and tricky tool and can lead a researcher astray quickly if used incorrectly.

NOTE: For those wondering why I haven’t gone into detail about significance testing for P-values, it’s because I believe that field of statistical study is generally arbitrary and altogether intellectually bankrupt. But there are dozens of great tutorials out there on the subject. I just can’t in good ethical faith write one.