Author Archive

How To Use R For Sports Stats: Visualizing Projections

If you’re reading TechGraphs right now, there’s a good chance you’re prepping for fantasy baseball, and if you’re doing that, there’s a good chance you’re making use of projection systems like Steamer or ZiPS. In this post, we’ll explore some basic tools that might help you look at these projections in a new way — and brush up on those R skills that you probably haven’t touched since last fall.

(From a skills perspective, this post will assume that you’ve previously read through the “How To Use R For Sports Stats” series. Even if you haven’t, the insights below will hopefully still be worth your while. I’d also be terribly remiss if I didn’t point you towards Bill Petti’s recent THT post unveiling his baseballr R package.)

We’ll use Steamer projections for this post, though the methods we’ll look at can be used with ZiPS, FG Depth Charts, or, for that matter, actual by-season data. Download Steamer’s 2016 batting projections from FanGraphs, rename the file to “steamer16.csv”, and load it up in R. We’ll remove players projected for 100 or fewer PA to clean up the data a bit:

steamer = read.csv("steamer16.csv")
steamer = subset(steamer, PA > 100)

Visualizing Tiers

As fantasy baseball managers, we all have an innate ability to estimate a player’s value from their stats, judging how good a 30/10/.285 player is vs. a 15/15/.280. We get pretty good at this if we want to do well in our leagues — but we can still develop blind spots in our assessments, or hold on to an outdated idea of quality as MLB trends change. (For example, the average AVG in MLB has dropped from the high .260s 10 years ago to the low .250s today; if you’re still thinking a .255 hitter is below average, you might want to reconsider.)

The point, then, is that if you’re getting a sense of how good a player will be by looking at their projections, it can be helpful to step back and recalibrate your thinking from time to time by looking at the broader trends in an image or two.

Let’s look at Steamer’s projections for stolen bases, for example. We’ll draw on what we learned back in part 2 to make a quick-and-hasty histogram counting the number of MLB players who are projected for different SB totals:

hist(steamer$SB, breaks = 30)

Basic histogram of SB projections

Most of these players are projected for fewer than 10 SB. This is sort of interesting, but their huge counts are keeping us from seeing the trends on the right side. Let’s zoom in a bit:

hist(subset(steamer, SB > 10)$SB, breaks=30)

Histogram of SB projections for > 10 SB

Even among this crowd of speedsters, it’s uncommon to see someone projected for more than 20 SB, and incredibly rare to have more than 30.

You probably didn’t need to be reminded that the two folks on the far right (spoiler alert: Billy Hamilton and Dee Gordon) would stand out, though it’s useful to see just how distant they are from everyone else. But if you were thinking that players like Jarrod Dyson (35 projected SB) or Billy Burns (32) are solid, but not elite, on the basepaths, it may be time to reassess. (Did I mention that SB totals in MLB dropped 25% between 2011 and 2015?)

If you’re the kind of person who prefers boxplots instead, R’s got just the thing:
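As a minimal sketch, base R's boxplot() does the job; with the projections loaded, the call would be boxplot(steamer$SB). The toy vector below is made up purely so the example runs on its own, and boxplot.stats() is used to list the points that the plot draws as outliers.

```r
# Toy SB projections standing in for steamer$SB (made-up numbers, not Steamer data)
sb = c(0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 14, 22, 35, 58)

# Draw the boxplot; horizontal = TRUE lays it out along the x-axis
boxplot(sb, horizontal = TRUE, xlab = "Projected SB")

# boxplot.stats() returns the values plotted as outliers
# (by default, points more than 1.5 times the box length beyond the box)
outliers = boxplot.stats(sb)$out
print(outliers)
```

Swapping sb for steamer$SB gives the plot for the real projections.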


Boxplot of SB projections

This makes it as plain as possible that any player projected for more than about 15 SB is, quite literally, a statistical outlier.

20/20 Vision

The same idea goes for getting a grasp on multi-category players. Most of us are looking for players who can bring in both HR and SB, but how many of those are really available? Let’s do a quick 2D plot:

plot(steamer$SB, steamer$HR)

Basic plot of HR vs. SB projections

This isn’t bad, but unfortunately it doesn’t give us a good sense of how many players fall into each category, since there’s only one dot for all of the 5 HR/3 SB players, one dot for all the 2 HR/4 SB players, etc. A quick workaround for this is the jitter() command, which moves the points around by tiny increments to get rid of some of the overlap:

plot(jitter(steamer$SB), jitter(steamer$HR))

And, for good measure, let’s add a grid on top:
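One way to do this, as a sketch rather than the exact command used for the image: base R's grid() draws reference lines over the current plot at the axis tick marks. The toy vectors are invented stand-ins for steamer$SB and steamer$HR, and set.seed() is only there so the random jitter is repeatable.

```r
# Toy projections standing in for steamer$SB and steamer$HR (made-up numbers)
sb = c(3, 3, 3, 4, 4, 10, 20, 30)
hr = c(5, 5, 5, 2, 2, 30, 22, 10)

# jitter() adds a small random offset, so fix the seed for a repeatable picture
set.seed(1)
sb_j = jitter(sb)
hr_j = jitter(hr)

# Scatterplot of the jittered points, then a grid drawn over it
plot(sb_j, hr_j, xlab = "SB", ylab = "HR")
grid()
```

With the real data, the same two steps are plot(jitter(steamer$SB), jitter(steamer$HR)) followed by grid().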


Your plot should now look something like (but not exactly like) this:

Detailed plot of HR vs. SB projections

From the chart, we can see that it’s not impossible to find players projected for 30/10 or 10/30, but it looks like there’s only one 20/20 guy in Steamer’s projections:

subset(steamer, (SB >= 20 & HR >= 20))

            Name   Team  PA  AB   H X2B X3B HR  R RBI BB  SO HBP SB CS X.1   AVG   OBP   SLG   OPS
40 Carlos Correa Astros 636 571 157  33   3 22 80  82 54 110   4 20 11  NA 0.275 0.339 0.458 0.797

Of course. As if being a 21-year-old SS with plus average wasn’t enough.

Fun With Subsets

Let’s close this out by doing a bit more with subset() — possibly one of R’s most useful tools for our purposes because it’s just so much quicker and more customizable than online tools or Excel.

Say you want to find the prospective “five-category players”; you may have a sense of who some of the candidates are, but you might be surprised by what the numbers actually suggest. How many players, for example, are projected to do better than 10/80/80/10/.275?

subset(steamer, (HR > 10 & SB > 10 & R > 80 & RBI > 80 
 & AVG > .275))

               Name         Team  PA  AB   H X2B X3B HR   R RBI  BB  SO HBP SB CS X.1   AVG   OBP   SLG   OPS
1        Mike Trout       Angels 647 542 166  32   5 36 104 104  90 138   8 15  6  NA 0.307 0.410 0.585 0.995
5  Paul Goldschmidt Diamondbacks 652 543 158  36   2 30  93  93 100 142   3 14  7  NA 0.290 0.401 0.531 0.931
7  Andrew McCutchen      Pirates 653 554 165  34   3 23  88  87  84 123   9 12  6  NA 0.297 0.395 0.496 0.891
25    Manny Machado      Orioles 663 597 170  35   2 27  91  87  53  99   4 14  8  NA 0.285 0.345 0.484 0.829

Fewer than you may expect–which could well make them all the more valuable.


Projections, of course, are just projections, and you shouldn’t take one set — or even a combination of sets — to be a true predictor of what will happen this season. But if you typically look up projections player-by-player, or if you’re disinclined to take in a huge wall of stats at a single glance, looking at the broader trends in individual visualizations can help keep you on the right track as you prep for this fantasy season.

Here*, as always, is the code used for this post. If you have anything else you’d like to see us do with R as the new season comes near — or any suggestions with what you’ve used R for — let us know in the comments!

*download the ZIP and extract the R file.

Behind the Code: Ken Pomeroy

Behind the Code is an interview series centered around the sports-related web sites we use every day.

For many college basketball fans, Ken Pomeroy’s is the first–and often the last–word in statistics and analytics.

Pomeroy’s efficiency-based team ratings have received praise from the likes of Nate Silver and have led Pomeroy to consulting opportunities for several college and professional teams. Yet they only scratch the surface of the info “KenPom” provides, including player comparisons, posts on college basketball trends, and preseason game-by-game projections — the latter of which just went live for 2015-16.

TechGraphs writer Brice Russ spoke recently with Pomeroy about what’s new and what’s next for

Brice Russ: What’s new on for the 2015-16 season? It seems like every fall, there’s another half-dozen features on the site. What should we be looking for this time around?

Ken Pomeroy: The main offseason addition was expanding the player stats to break out performance against different levels of competition.

Marcus Paige stats
Marcus Paige career stats breakdown by level of competition

That is an extension of the conference-only player stats and minutes tracker that I added towards the end of last season. I’ve added some search boxes to the navigation bar so people can find teams and coaches a little more quickly.

There are some other things on the burner for this season, but they’ll be surprises for people when they appear. I’m usually reluctant to call my shots because I don’t want to over-promise something.

BR: What’s your motivation when you’re adding new stats to KenPom? Are you looking to incorporate what you think will be most popular? Most useful? Anything you’ve added as a result of your consulting work?

KP: There’s definitely a selfish motive. Usually, I’m thinking of things that I’d like to see, but that’s also consistent with doing something that will be useful to an audience interested in analytics.

I’m not opposed to adding what I’d call trivial stats that have little analytical value. I mean, I have team free throw percentage defense on the site. But I try to avoid adding trivial stats that might be misconstrued as useful. I always get a few requests for home/road splits and largely the differences in home/road performance over the course of the brief college hoops season is noise and not useful from a predictive standpoint, so that’s why they aren’t on the site.

BR: KenPom is unquestionably one of the oldest and most well-established sports analytics sites around, particularly in the basketball arena.

How have you seen the perception and role of analytics change in sports over the years, especially in the public eye? Has the field grown more competitive, or does it feel like there’s still plenty of ground for everyone to cover?

KP: It’s definitely more accepted than it was a decade ago. A lot more people understand the concept and utility of points per possession. But then again, almost every broadcaster and coach still cites regular field goal percentage to measure shooting accuracy, so it’s not like there has been a revolution.

As far as competition, there are certainly more people coming out of college with the goal of working in sports analytics. But it seems like most of the people interested in basketball gravitate to the NBA level, where the data is so much more granular and there are fewer teams to cover.

BR: One of the biggest stories for the upcoming season is the introduction of the new 30-second shot clock. Presumably, this will lead to an increase in tempo, but how else do you expect this to change the game from a metrics standpoint?

You took a brief look at the clock before it was tested in last year’s NIT; have you had a chance to look at the data since then?

KP: I haven’t looked at it any further, although my series of blog posts over the summer was partly inspired by trying to develop a theoretical framework to figure out how we got to this point. And what I found was some evidence that the offense deserves a good chunk of the blame for the slowing of the game.

The cool thing is that with 200+ games during the opening weekend we’ll get a real good idea of the impact of the clock (and the expanded charge circle) fairly quickly.

BR: 2015 will mark the fifth season that team-level KenPom data has gone behind the paywall. Is it fair to say this has been a successful experiment at this point? Any data or trends you can provide about subscriptions?

KP: It’s worked out well. I had a choice between appealing to a mass audience and blasting people with ads, or putting up the paywall and keeping the site clean and limiting the audience to folks who really wanted a source for advanced stats.

BR: What’s next?

KP: Usually the events of the season dictate this, so it’s difficult to say.  But I’m sure it won’t take long for something interesting to happen.

Thanks to Ken for speaking with us! You can follow Ken on Twitter at @kenpomeroy and, of course, at

An Update on Drones in Sports

Drones racing each other at breakneck speeds through an abandoned industrial complex, controlled through VR goggles that broadcast a first-person view simultaneously to the operators and fans alike.

This may seem futuristic, but it was actually one of the most imminent ideas recently discussed in a special panel dedicated to “Drones In Sports”. At last week’s On Deck Sports and Technology Conference, three drone sport experts came together to talk about the potential–and the potential pitfalls–that drones may provide for the future of sports technology.

As one panelist put it, drones in sports are currently “kind of a Wild West opportunity”, and Matt Higgins, CEO of RSE Ventures and vice-chairman of the Miami Dolphins, kicked the panel off with the example of the Drone Racing League. Earlier this year, the DRL started informally as groups of hobbyists flying drones through parking garages and abandoned buildings, but plans are in the works to turn it into a full-fledged sport, pitting sponsored teams of self-taught drone specialists against ex-military UAV pilots.

The DRL held their first two test flights earlier this year in the New York metro area, and plans to release their first official video “in the next few weeks.” When Higgins (an investor in the DRL) was asked how to make this a spectator sport, though, he responded candidly, “I have no clue.” He expects that streaming services like Twitch and Meerkat will be critical in building fan engagement with drone racing, and there certainly isn’t a lack of ideas to spice the sport up, as Bradley Woodrum outlined earlier this year.

Companies are also thinking about how drones can be used to improve the experience for players and fans in more traditional sports. In the eyes of the panelists, the “low-hanging fruit” was using drones to provide new angles for gameplay and practice, but the Amazon “drone-delivery” model was also a clear influence. The possibilities bandied about by the panel seemed limitless. Dispatching drones on golf courses for tee-side food delivery, handing out tickets with drones during games, the drone t-shirt cannon…

…wait a second, “the drone t-shirt cannon”?

It’s an undeniably cool idea, but taking a few seconds to visualize an unmanned aerial vehicle launching anything at fans may remind you that drones still face a number of psychological (and legal) obstacles. Eben Novy-Williams (a reporter for Bloomberg and the panel’s moderator) noted, as one example, that he was at a recent triathlon where a drone filming at the finish line elicited divided reactions–half the audience loved it, but the other half was clearly uncomfortable.

Chris Proudlove (an aerospace insurance expert with Global Aerospace) agreed that “we’re not quite there” when it comes to drones flying over crowds and that we need to “build a public sense that drones will be operated safely”, but that at the same time we don’t want to overstate the risk. The top concern of Proudlove’s clients when discussing drones was “invasion of privacy,” but he predicted this would be “moot” in less than 10 years. We’ve already become accustomed to ubiquitous smartphones that record video; from a certain perspective, drones aren’t terribly novel, just an extension of technology. The rest of the panel was even more bullish about the long-term prospects of us welcoming our new drone overlords.

But there’s still a lot to be done to make drones seem — and be — safer. Chris described geofencing technology to keep drones in and out of certain locations, such as ‘within the stadium, but at least 20 meters from the stands’, and better use of parachutes and other physical features. (Many of us probably remember the recent drone crash during the 2015 tennis U.S. Open, but the panelists were also well-aware of the 1979 mishap where a lawnmower-shaped model plane killed a fan during halftime at a Jets-Patriots game.) Chris also suggested that regulations distinguishing ‘micro-drones’ (of less than 2 pounds) and ‘the big ones’ could make drone safety easier.

In fact, the panel generally agreed that regulatory issues were the primary roadblock currently facing drones in sports. Right now, commercial use of drones (in the US) is illegal except when exemptions are provided by the FAA, and this is expected to remain the status quo until the FAA issues a full ruling on drones sometime in 2017. Until then, companies will need to build their usage of drones on a case-by-case basis. (Late last week, for example, the news broke that the FAA had allowed the NFL an exemption for drone usage — but only in empty stadiums.)

In short, it looks like we haven’t yet reached a drone-filled future of sport, but that day is likely drawing nigh. When Eben asked the panel for a bold “five-year prediction,” panelist Jon Ollwerther suggested that by 2020, we’d see drones being used in “every major American sport”– even down to the high school level.

TechGraphs Report: On Deck Sports and Technology Conference

Earlier this week, NYC’s Bohemian National Hall played host to hundreds of sports executives, entrepreneurs, and others looking to learn about the very latest in sports technology. Since 2013, the On Deck Sports and Technology Conference (organized and presented by SeatGeek) has provided a forum to showcase what products are “on deck” to help fans follow, analyze, and participate in sports.

On Deck has a slight bent towards sports startups, so a decent amount of the conference was geared more towards raising capital, scaling businesses, etc. Still, there were plenty of fascinating talks, panels and interviews for anyone interested in straight sports tech.

Statcast And Beyond

Possibly the most entertaining talk of the day was Joe Inzerillo’s (CTO, MLBAM) update on MLB’s Statcast, which is finally getting its moment in the sun this season. For those who needed a refresher on how Statcast operates, Inzerillo discussed its missile-technology radar system, its stereoscopically-placed cameras, and how these allow each Major League ballpark to track the movements of every player on the field (plus the ball) at any given time.

Once Statcast has this information, as Inzerillo pointed out, it can then provide real-time data on pitch velocity (actual and perceived), player velocity and reaction time, and a horde of other quantitative metrics, plus more advanced data on a 12-second delay, like fielding route efficiency. This data is just inherently cool (as you likely know if you’ve seen a Statcast-enhanced game or highlight on television), but it’s also already being used to both question and confirm existing baseball strategies.

For an example of the latter, Inzerillo looked at the fallacy of sliding into first using Statcast to plot Eric Hosmer’s 1B slide in Game 7 of the last World Series. Hosmer hit a peak speed of 20.9 MPH before sliding and being out by less than a tenth of a second. If he had just kept running, Statcast found, he would have been safe by nearly a foot. Statcast is already getting noticed by clubs, and even players — batters like to talk smack, apparently, over who has the highest exit velocity.

During questions, Inzerillo was slightly cautious about committing to the future of Statcast, but he did mention that minor league stadiums were a natural next step, and that there was plenty of work being done on developing new metrics. Statcast already tracks ‘defensive range’ for fielders, for example, but since a player doesn’t travel the same speed in every direction, there’s a need to find the more amorphous ‘effective defensive range’ and how it changes–such as during defensive shifts.

On the football side of things, Sportradar’s Tom Masterman talked about the NFL’s NGS (Next Gen Stats) platform, which is collecting data on every single game in 2015 to track, analyze, and visualize how players are moving on the field. NGS is already being distributed to clubs, media, and health and safety personnel; the long-term goal is to have X,Y,Z coordinates for every player and official, plus the ball.

Go Bucks

On Deck’s attendees weren’t just league officials and startup managers–the conference started with a live interview of Wes Edens, who became co-owner of the NBA’s Milwaukee Bucks in 2014. Much of the conversation focused on the new Bucks arena, which was being voted on by the Milwaukee city council literally as the interview was ongoing. As it’s currently planned, the presently-unnamed arena will focus heavily on keeping fans digitally connected — giving attendees plenty of WiFi, for example. At the same time, Edens noted, they want to avoid fans using technology to become distracted from the game going on in front of them. (Edens used the phrase “Instagram culture”, specifically, though he noted that he himself has had these sorts of problems before.)

Edens was similarly balanced when the discussion turned to analytics. One of the first things Edens did after buying the Bucks was to build their analytics program — bringing on employees, consultants and even discussing methodologies with other owners. There’s definitely a “golden age” of analytics in basketball going on.  Edens even thinks the NBA will end up surpassing the MLB as the leader in sports technology. But when he was asked about how the players feel?

“It’s a good question,” Edens replied. “There’s definitely lines that can be crossed” with having too much data being made public, at least when it can affect the privacy of the players (such as rest/injury issues).

Edens also briefly discussed the role of the referees and the potential benefits of replay and “the new center across the river“. Could we see yet more referee technology, even an Oculus Rift-type headset for NBA officials, in the future? “Totally possible.”

Era of Mobility

When it came time to look at how fans themselves were interacting with sports, technologically, it became clear that mobile is “it.” In that panel about growing sports startups I mentioned earlier, representatives from SeatGeek, FanDuel and Krossover all praised the importance of the mobile web for their companies–SeatGeek’s rep described it as a “tale of two companies”, pre- and post-mobile, and Krossover’s founder mentioned they’re considering dumping their web app altogether in favor of just being on smartphones and tablets. When Yahoo Sports’ VP of engineering presented a chart showing their fantasy football traffic from this season’s Week 1, the fraction of non-mobile data was a pretty small sliver at the top.

Yahoo fantasy data graph
Trust me, it’s there.

Even companies you might never expect to get in the mobile game are joining and succeeding. Jeremy Strauser had 20 years of gaming experience at EA Sports and Zynga before joining one of the most loved and enduring brands in the sports industry, Topps. Yep, they’re digital playing cards.

Topps first got into the digital game 4 years ago and now has three top-selling sports card apps, plus a newly launched Star Wars-themed set. Why should you be interested in buying trading cards on your phone? One starting point is the capabilities the digital platform provides: literally hundreds of thousands of different designs, the ability to create all manner of rare and unique cards, etc.

Topps is also rolling out a daily fantasy sports feature (DFS was a major topic of conversation at On Deck) that allows you to compete using the players in your card deck and swap them in and out in real time as they go up to pitch or bat. It probably doesn’t hurt, either, that they won’t take up space under your bed or get thrown out by your mom when you’re away at college.

Topps conference talk

Coming To Your Hometown

If you want to look for the next wave of sports technology, though, look to your neighborhood.

Rather than providing new tools or analytics for MLB, the NFL or the NBA, the newest sports apps want to help you participate in sports in your own town. On Deck wrapped up with a “Startup Pitch Contest” a la Shark Tank where teams had four minutes to present their groundbreaking app to a group of judges. The six competitors included:

  • Wooter – a search engine for finding and joining sports and activities like local rec leagues. Wooter provides profiles for leagues looking to form teams, players looking to join them, and the tools to process payment and set up other logistics.
  • NextPlay – helping youth coaches conduct tryouts and league drafts. For $15/month, instead of taking a stopwatch, a bunch of handwritten notes, and an Excel spreadsheet to put together youth rosters, NextPlay handles all the data collection and analytics itself. Their beta has been used by “a couple hundred organizations” and over 10,000 athletes.
  • ScoreStream – filling a gap in local journalism by crowdsourcing reports on high school sports.

With a really impressive presentation, broad coverage (10,300 HS games covered last week alone) and the #1 iOS app for high school sports, I really thought ScoreStream would walk away with the prize, but it ended up going to…

  • SidelineSwap, a P2P marketplace for sporting goods. SidelineSwap has over 43,000 registered users who’re interested in trading out sporting gear just collecting dust in their basement or garage. They’re working on building partnerships with youth organizations and promoting used college-branded material, which should play very well with their chief audience of high school students.

On the whole, On Deck was a whirlwind experience for learning about cutting-edge sports tech. This report only covers part of everything I caught there. Watch for further updates and profiles soon!

Preventing Concussions in the Next Generation of Football Players

Concussions are bad.

Nobody has ever really disputed this, but over the last decade, it has become increasingly apparent that repetitive head injuries, seen particularly often in football, can lead to significant long-term medical effects.

The “concussion debate” has largely taken place on the professional stage, from the controversies generated by League of Denial to Will Smith’s forthcoming feature film Concussion and beyond. Yet the true impact is being felt across the nation, as schools and innovators work to protect the more than 1,000,000 young adults who play college and high school football each season.

This fall, new devices large and small are being tested to reduce the frequency and effect of football-related concussions.

The Dartmouth Dummy

Five years ago, the Dartmouth Big Green football program eliminated athlete-on-athlete tackling during practices. Cutting out these collisions in favor of tackle sleds and dummies cuts down on injuries and concussions–which makes sense–but makes it harder to actually practice tackling against a moving target–which also makes sense.

Enter the MVP–the “Mobile Virtual Player”.

Designed by two Dartmouth engineering students, the MVP is a remote-controlled, human-sized dummy that resembles a cross between the Headless Horseman and a Weeble. Less bone-crushing than an actual human, the MVP allows for relatively realistic tackling simulations while significantly decreasing the risk of head and neck injuries.

Two MVPs were deployed in August, with a third on the way, and the experiment has received the attention of major media, tech blogs–and, reportedly, a few NFL teams.

New Helmets InSite

This doesn’t do much to prevent contact during games–and, as long as there’s tackling in football, there’s only so much you can do–but some new tools are being developed to limit the effects of major hits when they do happen.

The sporting company Riddell is in the process of bringing a new line of helmets to high schools around the country. The SpeedFlex helmets, equipped with Riddell’s InSite Impact Response System, use six built-in accelerometers to measure the individual and combined force of every impact a player receives. This data is sent live to a laptop on the sidelines, where trainers and staff can monitor players for potential danger signs.

As programs continue to adopt the system, one trainer says, this data will itself be useful for better understanding what leads to football brain injuries.

Watch Your Mouth

In fact, before long it might not even take a special helmet to easily detect potential concussions. Smithsonian reports on FITGuard, a mouth guard co-created by two Arizona State grads–one a veteran of the rugby team.

Like InSite, FITGuard uses sensors to measure hits to the head and can transfer data to a nearby computer. If FITGuard sees any signs of danger, though, it simply lights up the player’s mouth using LEDs. FITGuard is scheduled for release in early 2016.

Image courtesy of

These tools aren’t without their caveats. A recent Stanford study, for example, found that some currently existing concussion-measuring devices (particularly helmets) can significantly mismeasure the actual force of impact.

Nevertheless, with room for improvement and no end to the concussion crisis in sight, technology like this can still have great potential to help protect our next generation of football players.

(Featured Image via Dartmouth)

How To Use R For Sports Stats, Part 3: Projections

In this series, we’ve walked through how exactly you can use R for statistical analysis, from the absolute basics of R coding (in part 1) to visualizing data and correlation tests (in part 2).

Since you’re reading this on TechGraphs, though, you might be interested in statistical projections, so that’s how we’ll wrap this up. If you’re just joining us, feel free to follow along, though looking through parts 1 and 2 first might help everything make more sense.

In this post, we’ll use R to create and test a few different projection systems, focusing on a bare-bones Marcel and a multiple linear regression model for predicting home runs. I’ve said a couple times before that we’re just scratching the surface of what you can do — but this is especially true in this case, since people write graduate theses on the sort of stuff we’re exploring here. At the end, though, I’ll point you to some places where you can learn more about both baseball projections and R programming.


Let’s get everything set up. We’ll have to start by abandoning –well, modifying– that test data set that served us so well in Parts 1/2; we’ll add another two years of data (2011-14), trim out some unnecessary stats, and add a few which might prove useful later on. It’s probably easiest just to download this file.

Then we’ll load it:

fouryr = read.csv("FG1114.csv")

convert some of the percentage stats to decimal numbers:

fouryr$FB. = as.numeric(sub("%","",fouryr$FB.))/100
fouryr$K. = as.numeric(sub("%","",fouryr$K.))/100
fouryr$Hard. = as.numeric(sub("%","",fouryr$Hard.))/100
fouryr$Pull. = as.numeric(sub("%","",fouryr$Pull.))/100
fouryr$Cent. = as.numeric(sub("%","",fouryr$Cent.))/100
fouryr$Oppo. = as.numeric(sub("%","",fouryr$Oppo.))/100
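The six near-identical lines above can also be written as a loop over column names. This is a minimal sketch on a made-up data frame; the names toy and pct_cols are hypothetical, and the column values are invented rather than taken from FG1114.csv.

```r
# Toy data frame mimicking text columns that carry a "%" sign (made-up values)
toy = data.frame(Name = c("Player A", "Player B"),
                 FB. = c("35.1%", "40.0%"),
                 K.  = c("20.5%", "15.0%"),
                 stringsAsFactors = FALSE)

# Columns to convert from "35.1%"-style text to decimals like 0.351
pct_cols = c("FB.", "K.")

for (col in pct_cols) {
  # Strip the percent sign, convert to a number, then scale to a 0-1 decimal
  toy[[col]] = as.numeric(sub("%", "", toy[[col]])) / 100
}

print(toy)
```

For the real file, pct_cols would list FB., K., Hard., Pull., Cent. and Oppo., and toy would be fouryr.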

and create subsets for each individual year.

yr11 = subset(fouryr, Season == "2011")
colnames(yr11) = c("2011", "Name", "Team11", "G11", "PA11", "HR11", "R11", "RBI11", "SB11", "BB11", "K11", "ISO11", "BABIP11", "AVG11", "OBP11", "SLG11", "WAR11", "FB11", "Hard11", "Pull11", "Cent11", "Oppo11", "playerid11")
yr12 = subset(fouryr, Season == "2012")
colnames(yr12) = c("2012", "Name", "Team12", "G12", "PA12", "HR12", "R12", "RBI12", "SB12", "BB12", "K12", "ISO12", "BABIP12", "AVG12", "OBP12", "SLG12", "WAR12", "FB12", "Hard12", "Pull12", "Cent12", "Oppo12", "playerid12")
yr13 = subset(fouryr, Season == "2013")
colnames(yr13) = c("2013", "Name", "Team13", "G13", "PA13", "HR13", "R13", "RBI13", "SB13", "BB13", "K13", "ISO13", "BABIP13", "AVG13", "OBP13", "SLG13", "WAR13", "FB13", "Hard13", "Pull13", "Cent13", "Oppo13", "playerid13")
yr14 = subset(fouryr, Season == "2014")
colnames(yr14) = c("2014", "Name", "Team14", "G14", "PA14", "HR14", "R14", "RBI14", "SB14", "BB14", "K14", "ISO14", "BABIP14", "AVG14", "OBP14", "SLG14", "WAR14", "FB14", "Hard14", "Pull14", "Cent14", "Oppo14", "playerid14")

(We’re renaming the columns for each subset because the merge() function has some problems if you try to merge too many sets with the same names. If you want to explore the less hacked-together way of reassembling data frames in R, take a look at the dplyr package.)

Anyway, we’ll merge these all back into one set:

set = merge(yr11, yr12, by = "Name")
set = merge(set, yr13, by = "Name")
set = merge(set, yr14, by = "Name")
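For what it's worth, when you only need two seasons side by side, merge() can label the clashing columns for you through its suffixes argument. Here's a minimal sketch on a made-up toy data frame (the players and numbers are invented for illustration, not pulled from the FanGraphs file):

```r
# Toy data: three made-up players across two seasons (not real stats)
toy = data.frame(
  Season = c(2013, 2013, 2013, 2014, 2014),
  Name   = c("Player A", "Player B", "Player C", "Player A", "Player B"),
  HR     = c(20, 11, 5, 25, 9)
)

# Keep only the columns we need from each season
t13 = subset(toy, Season == 2013, select = c(Name, HR))
t14 = subset(toy, Season == 2014, select = c(Name, HR))

# 'suffixes' tags the clashing HR columns instead of renaming them by hand
both = merge(t13, t14, by = "Name", suffixes = c("13", "14"))
print(both)
```

Player C drops out because he only appears in one season. With three or more seasons the set of clashing names changes at every step, so renaming up front (as we did above) is probably still the simpler route.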

Still with me? Good. Thanks for your patience. Let’s start testing projections.

Specifically, we’re going to see how well we can use the 2011-2013 data to predict the 2014 data. For simplicity’s sake, we’ll focus mostly on a single stat: the home run. It’s nice to test with–it’s a 5×5 stat, it has a decent amount of variation, it gives us experience with testing counting stats while being more player-controlled than R/RBI… and, come on, we all dig the long ball.

Now when you’re testing your model, it’s nice to have a baseline–a sense of the absolute worst that a reasonable model could do. For our baseline, we’ll use previous-year stats: we’ll project that a player’s 2014 HR count will be exactly what they hit in 2013.

To test how well this works, we’ll follow this THT post and use the mean absolute error–the average number of HRs that the model is off by per player. So if a system projects two players to each hit 10 homers, but one hits zero and the other hits 20, the MAE would be 10.

(If you end up doing more projection work yourself, you may want to try a more fine-tuned metric like r² or RMSE, but I like MAE for a basic overview because the value is in the same units as the stat you’re examining.)

To find the mean absolute error, take the absolute value of the difference between the projected and actual stats, sum it up for every player, then divide by the number of players you’re projecting:

sum(abs(set$HR13 - set$HR14))/length(set$HR14)
> [1] 6.423729

So even the most bare-bones reasonable projection system should be able to beat an average error of about six and a half homers per player.
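If you'd like to check the arithmetic on something smaller, here's the two-player example from a few paragraphs up, worked out in R. The variable names are just placeholders I picked for this sketch, and the RMSE line is only there for comparison:

```r
# Two players, each projected for 10 HR; one hits 0, the other hits 20
projected = c(10, 10)
actual    = c(0, 20)

errors = abs(projected - actual)      # size of each miss, in home runs
mae_value = sum(errors) / length(errors)
mae_value                             # same formula as above

mean(errors)                          # mean() is a shortcut for sum/length

# RMSE squares the misses before averaging, so big misses count for more
rmse_value = sqrt(mean((projected - actual)^2))
rmse_value
```

Here both measures come out the same only because the two misses are the same size; with uneven misses, RMSE will be the larger of the two.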

Marcel, Marcel

Now let’s try a slightly-less-than-absolute-worst model.

Marcel is the gold standard of bare-bones baseball projections. At its core, Marcel predicts a player’s stats using the last 3 years of MLB data. The previous year (Year X) gets a weight of 5, the year before (X-1) gets a weight of 4, and X-2 gets a weight of 3. As originally created, Marcel also includes an adjustment for regression to the mean and an age factor, but we’ll set aside such fancies for this demonstration.

To find Marcel’s prediction, we’ll create a new column in our dataset weighing the last 3 years of HRs. Since our weights are 5 + 4 + 3 = 12, we’ll take 5/12 from the 2013 data, 4/12 from the 2012 data, and 3/12 from the 2011 data. Then we’ll round it to the nearest integer.

set$marHR = (set$HR13 * 5/12) + (set$HR12 * 4/12) + (set$HR11 * 3/12)
set$marHR = round(set$marHR,0)
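As a quick sanity check on the weighting, here's the same calculation for a single made-up player (30, 20, and 10 homers over the last three years; the numbers are invented). R's built-in weighted.mean() should give the same answer as the by-hand version:

```r
# One hypothetical player: most recent season first
hr      = c(30, 20, 10)
weights = c(5, 4, 3)

by_hand = (hr[1] * 5/12) + (hr[2] * 4/12) + (hr[3] * 3/12)
by_hand                        # about 21.67

weighted.mean(hr, weights)     # same value, weights normalized for you

round(by_hand, 0)              # 22
```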

Voila! Your first (real) projections. How do they perform?

sum(abs(set$marHR - set$HR14))/length(set$HR14)
> [1] 5.995763

Better by nearly half a home run. Not bad for two minutes’ work. 6 HR per player still seems like a lot, though, so let’s take a closer look at the discrepancies. We’ll create another column with the (absolute) difference between each player’s projected 2014 HRs and actual 2014 HRs, then plot a histogram displaying these differences.

set$mardiff = abs(set$marHR-set$HR14)
hist(set$mardiff, breaks=30, col="red")

Histogram of Marcel HR errors

Not as bad as you might have thought. Many players are only off by a few home runs, some off by 10+, and a few fun outliers hanging out at 20+. Who might those be?

set = set[order(-set$mardiff),]
head(set[c(1,72,90,91)], n=10)

(In that last line, we’re calling specific columns by number so we don’t have to search through 100 columns for the data we want when we display this. You can find the appropriate numbers using colnames(set).)

List of players with largest Marcel HR errors

A list headlined by a season-ending injury and two players released by their teams in July; fairly tough to predict in advance, IMO.

While we’re here, let’s go ahead and create Marcel projections for the other 5×5 batting stats:

set$marAVG = (set$AVG13 * 5/12) + (set$AVG12 * 4/12) + (set$AVG11 * 3/12)
set$marAVG = round(set$marAVG,3)
set$marR = (set$R13 * 5/12) + (set$R12 * 4/12) + (set$R11 * 3/12)
set$marR = round(set$marR,0)
set$marRBI = (set$RBI13 * 5/12) + (set$RBI12 * 4/12) + (set$RBI11 * 3/12)
set$marRBI = round(set$marRBI,0)
set$marSB = (set$SB13 * 5/12) + (set$SB12 * 4/12) + (set$SB11 * 3/12)
set$marSB = round(set$marSB,0)

And, for good measure, save it all in an external file. We’ll create a new data frame from the data we just created, rename the columns to look nicer, and write the file itself.

marcel = data.frame(set$Name, set$marHR, set$marR, set$marRBI, set$marSB, set$marAVG)
colnames(marcel) = c("Name", "HR", "R", "RBI", "SB", "AVG")
write.csv(marcel, "marcel.csv")

Before we move on, I want to quickly cover one more R skill: creating your own functions. We’re going to be using that mean absolute error command a couple more times, so let’s create a function to make writing it a bit easier.

modtest = function(stat){
 ame = sum(abs(stat - set$HR14))/length(set$HR14)
 return(ame)
}

The ‘stat’ inside function(stat) is the argument you’ll be including in the function (here, the column of projected data we’re testing); the ‘stat’ shows up inside the bracketed text where your projected data did when we originally used this command. The return() is what your function outputs to you. Let’s make sure it works by double-checking our Marcel HR projection:

modtest(set$marHR)
> [1] 5.995763

Now we can just use modtest() to find the mean absolute error. Functions can be as long or as short as you’d like, and are incredibly helpful if you’re using a certain set of commands repeatedly or doing any sort of advanced programming.
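To push the idea a little further, here's a sketch of two more small functions: one that wraps the 5/4/3 weighting we kept retyping, and one that takes the actual values as an argument instead of hard-coding set$HR14. The names marcel3() and mae() and the toy numbers are my own inventions for this example, not part of the original script:

```r
# Weighted three-year average with the 5/4/3 weights
# (no regression-to-the-mean or age adjustment, same as above)
marcel3 = function(last, prev, prev2, digits = 0) {
  proj = (last * 5/12) + (prev * 4/12) + (prev2 * 3/12)
  return(round(proj, digits))
}

# Mean absolute error with the actual values passed in explicitly
mae = function(projected, actual) {
  return(sum(abs(projected - actual)) / length(actual))
}

# Toy data: made-up HR totals for three players
toy = data.frame(HR11 = c(10, 25, 4), HR12 = c(20, 30, 6),
                 HR13 = c(30, 28, 9), HR14 = c(24, 31, 4))

toy$marHR = marcel3(toy$HR13, toy$HR12, toy$HR11)
toy$marHR                    # 22 28 7
mae(toy$marHR, toy$HR14)     # misses of 2, 3 and 3, so 8/3
```

Passing digits = 3 would give you the AVG version, and since mae() no longer cares which column it's compared against, the same function works for R, RBI, and SB.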

Hold The Line

With Marcel, we used three factors–HR counts from 2013, 2012, and 2011–with simple weights of 5, 4, and 3. For our last projection model, let’s take this same idea, but fine-tune the weights and look at some other stats which might help us project home runs. This, basically, is multiple linear regression. I’m going to handwave over a lot of the theory behind regressions, but Bradley’s how-to from last week does a fantastic job of going through the details.

Remember back in part 2, when we were looking at correlation tests in r² and we mentioned how we were basically modeling a y = mx + b equation? That’s basically what we did with Marcel just now, where ‘y’ was our projected HR count and we had three different mx values, one each for the 2013, 2012 and 2011 HR counts. (In this example, ‘b’, the intercept, is 0.)

So we can then use the same lm() function we did last time to model the different factors that can predict home run counts. We’ll give R the data and the factors we want it to use, and it’ll tell us how to best combine them to most accurately model the data. We can’t model the 2014 data directly in this example–since we’re testing our model against it, it’d be cheating to use it ‘in advance’–but we can model the 2013 HR data, then use that model to predict 2014 HR counts.

This is where things start to get more subjective, but let’s start by creating a model of 2013 HR counts using the two years before that (2012/2011) of HR data, plus the prior year (2012) of ISO, Hard%, and Pull%. In the lm() function, the data we’re attempting to model will be on the left, separated by a ‘~’; the factors we’re including will be on the right, separated by plus signs.

hrmodel = lm(set$HR13 ~ set$HR12 + set$HR11 + set$Hard12 + set$Pull12 + set$ISO12)
summary(hrmodel)

Screenshot of initial linear model

There’s a lot of stuff to unpack here, but the first things to check out are those “Pr(>|t|)” values in the right corner. Very simply, a p-value less than .05 there means that that factor is significantly improving your model. (The r² for this model, btw, is .4611, so this is accounting for roughly 46% of the 2013 HR variance.) So basically, ISO and Pull% don’t seem to add much value to this model, but Hard% does.

It’s generally a good practice to remove any factors that don’t have a significant effect and re-run your model, so let’s do that:

hrmodel = lm(set$HR13 ~ set$HR12 + set$HR11 + set$Hard12)
summary(hrmodel)

Screenshot of R model with significant factors

And there’s your multiple linear regression model. The format for the actual projection formula is basically the same as what we did for Marcel, except your weights will take the coefficient estimates and you’ll include the intercept listed above them. Remember that “HR12”, “HR11”, etc., are standing in for “last year’s HR total”, “the year before that’s HR total”, etc., so make sure to increment the stats by a year to project for 2014.

set$betHR = (-5.3 + (set$HR13 * .32) + (set$HR12 * .13) + (set$Hard13 * 40))
set$betHR = round(set$betHR,0)
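Typing rounded coefficients in by hand works, but it's easy to slip up. One alternative, sketched here on simulated data (random numbers, not real players), is to fit the model on generically named columns and let predict() apply the unrounded coefficients to the next year's inputs. The column names hr_last, hr_prev, and hard_last are placeholders I made up for this example:

```r
# Simulated toy data (random numbers, not real players)
set.seed(42)
n = 60
toy = data.frame(HR11 = rpois(n, 15), HR12 = rpois(n, 15),
                 HR13 = rpois(n, 15),
                 Hard12 = runif(n, 0.25, 0.40),
                 Hard13 = runif(n, 0.25, 0.40))

# Training frame: model 2013 HR from the two seasons before it
train = data.frame(target = toy$HR13, hr_last = toy$HR12,
                   hr_prev = toy$HR11, hard_last = toy$Hard12)
fit = lm(target ~ hr_last + hr_prev + hard_last, data = train)
coef(fit)    # intercept and weights, unrounded

# Shift every input forward one year to project 2014
newdata = data.frame(hr_last = toy$HR13, hr_prev = toy$HR12,
                     hard_last = toy$Hard13)
raw_proj = predict(fit, newdata = newdata)
toy$projHR14 = round(raw_proj, 0)
head(toy$projHR14)
```

Because the inputs here are pure noise, the fitted weights won't mean anything; the point is only the mechanics of fitting on one year and projecting the next.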

Survey says…?

modtest(set$betHR)
> [1] 5.95339

…oh. Yay. So that’s an improvement of, uh…

modtest(set$marHR) - modtest(set$betHR)
> [1] 0.04237288

1/20th of a home run per player. Isn’t this fun? Some reasons why we might not have seen the improvement we expected:

  • We probably overfit the data. Since we ran the model on 2013 data, it probably did really well on 2013 data, but not as great on 2014. If we check the model on the 2013 data:
set$fakeHR = (-5.3 + (set$HR12 * .33) + (set$HR11 * .13) + (set$Hard12 * 40))
set$fakeHR = round(set$fakeHR,0)
sum(abs(set$fakeHR - set$HR13))/length(set$HR13)
> [1] 4.877119

It runs pretty well.

  • We didn’t include useful factors we could have. We just tested a few obvious ones; maybe looking at Cent% or Oppo% would be more helpful than Pull%? (They aren’t, just so you know.) More abstract factors like age, ballpark, etc., would obviously help–but including these would also require a stronger model.
  • Finally, projections are hard. Even if you have an incredibly customized set of projections, you’re going to miss some stuff. Take a system like Steamer, one of the most accurate freely-available projection tools around. How did their 2014 preseason projections stack up?
steamer = read.csv("steamer.csv")
steamcomp = merge(yr14, steamer, by = "playerid14")
steamcomp$HR = as.numeric(paste(steamcomp$HR))
steamcomp$HR = round(steamcomp$HR, 0)
steamcomp$HR[is.na(steamcomp$HR)] = 0
sum(abs(steamcomp$HR - steamcomp$HR14))/length(steamcomp$HR14)
> [1] 4.892157

That said, the lesson you should not take away from this is “oh, our homemade model is only 1 HR/player worse than Steamer!” Our data set is looking at players for whom we have several seasons’ worth of data: the easiest players to project. If we had to create a full-blown projection system including players recovering from injury, rookies, etc., we’d look even worse.

If anything, this hopefully shows how much work the Silvers, Szymborskis, and Crosses of the world have put in to making projections better for us all. Here’s the script with everything we covered.

This Is Where I Leave You

Well, that about wraps it up. There’s plenty, plenty more to learn, of course, but at this point you’ll do well to just experiment a little, do some Googling, and see where you want to go from here.

If you want to learn more about R coding, say, or predictive modeling, I’d definitely recommend picking up a book or trying an online class through somewhere like MIT OpenCourseWare or Coursera. (By the end of which, most likely, you’ll be way beyond anything I could teach you.) If there’s anything particular about R you’d still like to see covered, though, let me know and I’ll see if I can do a writeup in the future.

Thanks to everyone who’s joined us for this series — the kudos I’ve read here and elsewhere have been overwhelming — and thanks again to Jim Hohmann for being my perpetual beta tester/guinea pig. Have fun!

How To Use R For Sports Stats, Part 2: Visualization and Analysis

Welcome back! In Part 1 of this series, we went over the bare bones of using R–loading data, pulling out different subsets, and doing basic statistical tests. This is all cool enough, but if you’re going to take the time to learn R, you’re probably looking for something… more out of your investment.

One of R’s greatest strengths as a programming language is how it’s both powerful and easy-to-use when it comes to data visualization and statistical analysis. Fortunately, both of these are things we’re fairly interested in. In this post, we’ll work through some of the basic ways of visualizing and analyzing data in R–and point you towards where you can learn more.

(Before we start, one commenter reminded me that it can be very helpful to use an IDE when coding. Integrated development environments, like RStudio, work similarly to the basic R console, but provide helpful features like code autocompletion, better-integrated documentation, etc. I’ll keep taking screenshots in the R console for consistency, but feel free to try out an IDE and see if it works for you.)

Look At That Data

We’ll be using the same set of 2013-14 batter data that we did last time, so download that (if you haven’t already) and load it back up in R:

fgdata = read.csv("FGdat.csv")

Possibly my favorite thing about R is how, often, all it takes is a very short function to create something pretty cool. Let’s say you want to make a histogram–a chart that plots the frequency counts of a given variable. You might think you have to run a bunch of different commands to name the type of chart, load your data into the chart, plot all the points, and so on? Nope:

hist(fgdata$wRC)

Basic R histogram

This Instant Histogram(™) displays how many players have a wRC+ in the range a given bar takes up in the x-axis. This histogram looks like a pretty normal, bell-curveish distribution, with an average a bit over 100–which makes sense, since the players with a below-average wRC+ won’t get enough playing time to qualify for our data set.

(You can confirm this quantitatively by using a function like summary(fgdata$wRC).)

The hist() function, right out of the box, displays the data and does it quickly–but it doesn’t look that great. You can spend endless amounts of time customizing charts in R, but let’s add a few parameters to make this look nicer.

hist(fgdata$wRC, breaks=25, main="Distribution of wRC+, 2013 - 2014", xlab="wRC+", ylab= NULL, col="darkorange2")

In this command, ‘breaks’ is the number of bars in the chart, ‘main’ is the chart title, ‘xlab’ and ‘ylab’ are the axis titles, and ‘col’ is the color. R recognizes a pretty wide range of colors, though you can use RGB, hex, etc. if you’re more familiar with them.

Anyway, here’s the result:

Visually appealing R histogram

A bit better, right? The distribution doesn’t look quite as normal now, but it’s still pretty close–we can actually add a bell curve to eyeball how far off it is.

hist(fgdata$wRC, breaks=25, freq = FALSE, main="Distribution of wRC+, 2013 - 2014", xlab="wRC+", ylab= NULL, col="darkorange2")
curve(dnorm(x, mean=mean(fgdata$wRC), sd=sd(fgdata$wRC)), add=TRUE, col="darkblue", lwd=2)

Visually appealing R histogram with curve

(In the first line above, “freq = FALSE” indicates that the y-axis will be a probability density rather than a frequency count; the second line creates a normal curve with the same mean and standard deviation as your data set. Also, it’s blue.)

You can also plot multiple charts at the same time–use the par() function’s mfrow argument with the preferred number of rows and columns:

par(mfrow=c(2,2))
hist(fgdata$wOBA, breaks=25)
hist(fgdata$wRC, breaks=25)
hist(fgdata$Off, breaks=25)
hist(fgdata$BABIP, breaks=25)

2x2 grid of R histograms

When you want to save your plots, you can copy them to your clipboard–or create and save an image file directly from R:

hist(fgdata$WAR, breaks=25)

(It’ll show up in the same directory you’re loading your data set from.)
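In case it helps, here's a self-contained sketch of the save-to-file pattern using png() and dev.off(). The data is random, and the file name war_hist.png is a hypothetical one I picked; I'm also writing to a temporary folder so the example doesn't clutter your working directory:

```r
# Made-up values standing in for fgdata$WAR
set.seed(1)
war = rnorm(200, mean = 3, sd = 1.5)

# Hypothetical file name, written to a temporary folder here
outfile = file.path(tempdir(), "war_hist.png")

png(outfile, width = 800, height = 600)  # open the image file as the plot device
hist(war, breaks = 25)                   # draw into the file instead of the screen
dev.off()                                # close the device to finish writing

file.exists(outfile)
```

If you pass png() a bare file name instead of a full path, the image lands in R's current working directory.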

So that covers histograms. You can create bar charts, pie charts, and all of that, but you’re probably more interested in everyone’s favorite, the scatterplot.

At its most basic, the plot function is literally plot() with the two variables you want to compare:

plot(fgdata$SLG, fgdata$ISO)
Basic R scatterplot
Unsurprisingly, slugging percentage and ISO are fairly well-correlated. Results-wise, we’re starting to push against the limits of our data set–too many of these stats are directly connected to find anything interesting.

So let’s take a different tack and look at year-over-year trends. There are several ways you could do this in R, but we’ll use a fairly straightforward one. Subset your data into 2013 and 2014 sets,

fg13 = subset(fgdata, Season == "2013")
fg14 = subset(fgdata, Season == "2014")

then merge() the two by name. This will create one large dataset with two sets of columns: one with a player’s 2013 stats and one with their 2014 stats. (Players who only appeared in one season will be omitted automatically.)

yby= merge(fg13, fg14, by=("Name"))

Year-by-year data

As you can see, 2013 stats have an .x after them and 2014 stats have a .y. So instead of comparing ISO to SLG, let’s see how ISO holds up year-to-year:

plot(yby$ISO.x, yby$ISO.y, pch=20, col="red", main="ISO year-over-year trends", xlab="ISO 2013", ylab="ISO 2014")

Visually appealing R scatterplot

(The ‘pch’ argument sets the shape of the data points; if you want to set the extremes of each axis yourself, add the optional ‘xlim’ and ‘ylim’ arguments.)

Again, a decent correlation–but just *how* decent? Let’s turn to the numbers.

Relations and Correlations

If you’re a frequent FanGraphs reader, you’re probably familiar with at least one statistical metric: r², the square of the correlation coefficient. An r² near 1 indicates that two variables are highly-correlated; an r² near 0 indicates they aren’t.

As a refresher without getting too deep into the stats: when you’re ‘finding the r²’ of a plot like the one above, what you’re usually doing is saying there’s a linear relationship between the two variables, that could be described in a y = mx + b equation with an intercept and slope; the r² is then basically measuring how accurately the data fits that equation.

So to find the r² that we all know and love, you want R to create a linear model between the two variables you’re interested in. You can access this by getting a summary of the lm() function:

summary(lm(yby$ISO.x ~ yby$ISO.y))

R linear model summary

The coefficients, p-values, etc., are interesting and would be worth examining in a more theory-focused post, but you’re looking for the “Multiple R-squared” value near the bottom–turns out to be .4715 here, which is fairly good if not incredible. How does this compare to other stats?

summary(lm(yby$BsR.x ~ yby$BsR.y))
> Multiple R-squared:  0.4306
summary(lm(yby$WAR.x ~ yby$WAR.y))
> Multiple R-squared:  0.1568
summary(lm(yby$BABIP.x ~ yby$BABIP.y))
> Multiple R-squared:  0.2302

BsR is about as consistent as ISO, but WAR has a smaller year-to-year correlation than you might expect. BABIP, less surprisingly, is also only weakly correlated.
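If you just want the r² and not the whole summary, you can pull it out of the model directly, or square the output of cor(). Here's a small sketch on simulated numbers (x and y are random stand-ins for a stat in back-to-back seasons, not real data):

```r
set.seed(7)
x = rnorm(100, mean = 0.150, sd = 0.040)            # stand-in for a 2013 stat
y = 0.6 * x + rnorm(100, mean = 0.060, sd = 0.030)  # stand-in for 2014

fit = summary(lm(y ~ x))
fit$r.squared      # the "Multiple R-squared" line from the summary

cor(x, y)^2        # square of the correlation coefficient: same number

# Swapping which variable is on the left doesn't change r-squared
summary(lm(x ~ y))$r.squared
```

That last line is why it doesn't matter for r² purposes whether the .x or the .y column goes on the left of the ‘~’.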

Let’s do one more basic statistical test: the t-test, which is often used to see if two sets of numeric data are significantly different from one another. This isn’t as commonly seen in sports analysis (because it doesn’t often tell us much for the data we most often work with), but just to run through how it works in R, let’s compare the ISO of low-K versus high-K hitters. First, we need to convert the percentages in the K% column to actual numbers:

fgdata$K. = as.numeric(sub("%","",fgdata$K.))/100

then subset out the low-K% and high-K% hitters:

lowk = subset(fgdata, K. < .15)
highk = subset(fgdata, K. > .2)

Then, finally, run the t-test:

t.test(lowk$ISO, highk$ISO)

R T-test results

The “p-value” here is about 4.5 x 10^-11 (or 0.000000000045); a p-value less than .05 is generally considered significant, so we can consider this evidence that the ISO of high-K% hitters is significantly different than that of low-K% hitters. We can check this out visually with a boxplot–and you thought we were done with visualization, didn’t you?

boxplot(lowk$ISO, highk$ISO, names=c("K% < 15%","K% > 20%"), ylab="ISO", main="Comparing ISO of low-K% vs. high-K% batters", col="goldenrod1")

Visually appealing R boxplot

So now you can do some standard statistical tests in R–but be careful. It’s incredibly tempting to just start testing every variable you can get your hands on, but doing so makes it much more likely that you’ll run into a false positive or a random correlation. So if you’re testing something, try to have a good reason for it.
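To see the false-positive problem in action, here's a rough simulation sketch: 100 t-tests where both groups are drawn from the exact same distribution, so any "significant" result is pure luck. The setup (100 tests, 30 values per group) is arbitrary, and the Holm correction via p.adjust() is just one common way to compensate, not something the rest of this series relies on:

```r
set.seed(2015)

# 100 t-tests comparing two groups drawn from the SAME distribution,
# so every "significant" result here is a false positive
pvals = replicate(100, t.test(rnorm(30), rnorm(30))$p.value)

raw_hits = sum(pvals < .05)                  # false positives at the .05 level
raw_hits

# Adjusting for the number of tests makes each one harder to pass
adj_hits = sum(p.adjust(pvals, method = "holm") < .05)
adj_hits
```

At a .05 cutoff you'd expect around five of the 100 to sneak through by chance, which is exactly why testing everything in sight is risky.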

…And Beyond

We’ve covered a fair amount, but again, this only begins to cover the potential R provides for visual and statistical analysis. For one example of what’s possible in both these areas, check out this analysis of an online trivia league that was done entirely within R.

If you want to replicate his findings, though (which you can, since he’s posted the code and data online!), you’ll need to install packages, extensions for R that give you even more functionality. The ggplot2 package, for example, is incredibly popular for people who want to create especially cool-looking charts. You can install it with the command

install.packages("ggplot2")

and visit the package’s website to learn more. If R doesn’t do something you want it to out of the box, odds are there’s a package out there that will help you.

That’s probably enough for this week; here’s the script with all of this week’s code. In our next (last?) part of this series, we’ll look at taking one more step: using R to create (very) basic projections.

How To Use R For Sports Stats, Part 1: The Absolute Basics

If you’ve spent a sufficient amount of time messing around with sports statistics, there’s a good chance the following two things have happened, in order:

  1. You probably started off with Excel, because Excel does a lot of stuff pretty easily and everyone has Microsoft Office.
  2. At some point, you mentioned to someone that you use Excel to do statistical analysis and got a response along the lines of, “Oh, that’s cool, but you should really be using R.”

Politeness issues aside, they might well be right.

R is a programming language and software platform commonly used, particularly in research and academia, for data analysis and visualization. Because it’s a programming language, the learning curve is a bit steeper than it is for something like Excel–but if you dig into it, you’ll find that R makes it possible to do a wider variety of tasks more quickly. If you’re interested in finding interesting insights with just a few lines of code, if you want to easily work with large sets of data, or if you’re interested in using most any statistical test known to man, you should take a look at R.

Also, R is totally free, both as in “open-source” and as in “costs no money”. So that’s nice.

In this series, we’ll learn the basics of working in R with the goal of exploring sports data—baseball, in particular. I’m going to presume that you have no background whatsoever in coding or programming, but to keep things moving, I’ll try not to get too bogged down in the details (like how “=” does something different from “==”) unless absolutely necessary. This guide was made using R on Windows 7, but most everything should be the same on whatever OS you use.

Okay, let’s do this.

Getting Started

You can download R from

You’ll have to click on a few links (you want the ‘base’ install) and actually install R, but once that’s done you should have a screen that looks like:

Screenshot #1: R console

The “R console” is where your code is soon going to run–but first, we need some data. Let’s take FanGraphs’ standard dashboard data for qualifying MLB batters in 2013 and 2014. Save it as something short, like “FGdat.csv”. (If you have a custom FG dashboard or just want to take a shortcut, you can just download the data we’ll be using here.)

In R, we’ll be focusing mostly on functions (that look like, say, function(arg1, arg2)), which are what actually do things, and naming the output of these functions so we can refer back to it later. For example, a line of R code might look like this:

fgdata = read.csv("FGdat.csv")

The function here is the read.csv(), which basically means “read this CSV file into R”, and the argument inside is the file that we want to read. The left part (fgdata =) is us saying that we want to take the data we’re reading and name it “fgdata”.

This is, in fact, the first line we want to run in R to load our data, so type/paste it in and hit Enter to execute it.

(You may get an error like cannot open file ‘FGdat.csv’: No such file or directory; if you do, you likely need to change the directory that R is trying to read files from. Go to “File” -> “Change dir”, and change the working directory to the folder you saved the CSV in, or just move the CSV to the folder R has listed as the working directory.)
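If you'd rather sort this out from the console than from the menus, the same check can be sketched in code. The setwd() path below is a made-up placeholder, so it's left commented out; swap in your own folder:

```r
getwd()                            # show the folder R is currently reading from

# setwd("C:/path/to/your/folder")  # hypothetical path: point R at the folder with the CSV

list.files(pattern = "\\.csv$")    # list the CSV files R can see in that folder
file.exists("FGdat.csv")           # TRUE once the file is where R is looking
```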

If you didn’t get an error and R simply moves on to the next line, you should be good to go!

Basic Stats

The head() function returns the first 6 rows of data; since our data set is named “fgdata”, we can try this out with the line of code:

> head(fgdata)

R Screenshot #2: head(fgdata)

And to get a basic overview of the entire data set, there’s the summary() function:

> summary(fgdata)

R Screenshot #3: summary(fgdata)

See! Already, data on 20 variables in the blink of an eye.

“1st Qu.” and “3rd Qu.” are the first and third quartiles; the mean, median, minimum and maximum should be self-explanatory. So we can see that the average player in this data set had roughly a .270 average with 17 dingers and 10 steals in 146 games–not far from Alex Gordon’s 2014, basically.

Want to compare how the 2013 and 2014 stats stack up? R makes it pretty easy to pick out subsets of data. It’s called, reasonably, the “subset” function, and all you need to include is the data set you’re taking a subset of and the criteria the subset data should conform to.

Since we have “Season” as a field in the table, we just need to say “Season == “2013”” to get the 2013 players and “Season == “2014”” to get the 2014 players. We’ll name these new data sets ‘fg13’ and ‘fg14’:

> fg13 = subset(fgdata, Season == "2013")
> fg14 = subset(fgdata, Season == "2014")

A quick check should confirm that, yes, the data did subset correctly:

> summary(fg13)

R Screenshot #4: summary(fg13)

And now we can do some basic statistical comparisons, like comparing the mean BABIPs between 2013 and 2014. (To single out a specific column in a data set, use the $ symbol.)

> mean(fg13$BABIP)
> mean(fg14$BABIP)

You can do whatever basic statistical tests you like–sd() for the standard deviation, et cetera–and pull out different subsets of the data based on whatever criteria you like. So “HR > 20” for all players who hit more than 20 home runs, or “Name == “Mike Trout”” to get data for all players named Mike Trout:

> fgtrout = subset(fgdata, Name == "Mike Trout")
> fgtrout

R Screenshot #5: fgtrout

Lastly, it’s not too common to need to reorder your data in R, but if you do, you can do so with the order() function. This line sorts the data by wRC+, ascending order:

> fgdata = fgdata[order(fgdata$wRC.),]

then returns the top 10 rows:

> head(fgdata, n = 10)

You can sort in descending order by placing a minus sign before the column:

> fgdata = fgdata[order(-fgdata$wRC.),]

R Screenshot #6: head(fgdata, n = 10)

And, as you’ve probably noticed, most of these functions can be tweaked or expanded depending on the different arguments you use–adding “n = 10” to head(), for example, to view 10 rows instead of 6. One of the more fascinating and infuriating things about R is that pretty much every function is like that–but at least they’re all documented!

And, of course, you can access the documentation through a function. Use help() (help(head), help(summary), etc.) and a page will pop up with the arguments, and more additional details than you probably ever wanted.
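If you want to poke at subset(), order(), and head() without downloading anything, here's a sketch on a tiny made-up data frame (the players and numbers are invented, not FanGraphs data). It also shows two things we didn't cover above, combining criteria with & and getting a per-season mean with tapply(), so treat those as optional extras:

```r
# Made-up numbers for illustration, not real FanGraphs data
toy = data.frame(
  Name   = c("Player A", "Player B", "Player C", "Player D"),
  Season = c(2013, 2013, 2014, 2014),
  HR     = c(31, 12, 24, 8),
  BABIP  = c(.310, .295, .330, .280)
)

# Combine criteria with & (and) or | (or)
sluggers14 = subset(toy, Season == 2014 & HR > 20)
sluggers14

# Sort by HR, highest first, and show the top two rows
toy = toy[order(-toy$HR), ]
top2 = head(toy, n = 2)
top2

# Mean BABIP for each season in a single call
tapply(toy$BABIP, toy$Season, mean)
```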


One final note: typing code directly into the console is fine, but it gets a bit annoying if you want to write more than a line or two. Instead, you can create a new window within R to load, edit and run scripts. In Windows, use “Ctrl+N” to open a new script window. Type some code; to run it, highlight the lines you want to run and hit “Ctrl+R”.

You can also use these windows to save your R script in R files–as I’ve done here for all the code used in this article. Feel free to download and start tinkering.

So those are the basics of R; not enough to really show its potential, but enough to start experimenting and exploring as you wish. For Part 2, we’ll start some data plotting and correlation tests, and in Part 3 we’ll try to recreate some basic baseball projection models. I actually haven’t done this before in R, so it should be interesting. Stay tuned!

(Thanks to Jim Hohmann for helping test this article.)

On-Court Headsets for NBA Referees Might be Coming

Most of the attention in the NBA right now is focused off the court, as NBA teams and free agents continue to negotiate. Recently, though, we saw the start of actual play in the NBA summer leagues, where the new arrivals included Jahlil Okafor, Justise Winslow–and some new tech for the refs.

This summer, NBA referees are experimenting with Bluetooth-esque wireless headsets to quickly communicate with each other across the court. The headsets, which were first tested in the D-League this spring, also allow the three-man crew to confer with an outside reviewer–this summer, that’s a courtside ‘sideline supervisor’, but down the road the headsets could provide instant contact with the NBA’s replay center in Secaucus, N.J.

Veteran NBA referee Scott Foster, who served as a sideline supervisor during testing in the D-League, had good things to say about the referee headsets in an interview with

“[We can] hear them talking to one another and can understand when they’re telling one another, ‘Hey, I’m watching the ball right now.’ It’s easier, it’s better than having them screaming across the floor. […] We’ll be able to communicate in loud arenas in critical situations during live play. We’ll be able to make sure the entire crew is at a higher level of concentration.”

As it turns out, though, the NBA is somewhat late to the game when it comes to testing referee headsets. The NFL, as you may recall, provided wireless headsets to on-field officials starting last fall, though their impact was a bit overlooked amid the megahype for the sideline Surface tablets. The NHL tested wireless communication for its referees as early as 2011, but ultimately chose not to move forward; off-ice headsets are instead used for reviewing goals. The MLB uses a similar system to handle instant replay.

One referee who tested the NHL’s system pointed out a few of the cons, including volume calibration (if a referee blows his whistle next to his mike, you can imagine the other refs would pick it up a little loud) and physical issues caused by the headset itself:

“[I]t blocks your hearing on one side. There was one time where a player came out of the penalty box and I couldn’t hear him coming, and he almost ran me over.”

Not to mention, of course, the problems with interference that any wireless headset could have, as has been known to happen with quarterback helmet receivers in the NFL.

So should we expect any huge referee communication developments in the major leagues? Probably not for the MLB — tradition aside, there just isn’t as much need for umpires to confer mid-play as there is elsewhere — though it wouldn’t be surprising to see the NHL give it another go. And of course, the jury’s still out on the NBA experiment. Though there has been discussion of introducing referee headsets in the NBA regular season as soon as 2015-16, no formal announcement has yet been made.

(Image via Keith Allison)