Archive for August, 2015

How to Follow the PGA Championship Online This Week

Golf’s final men’s major championship of the year kicks off today in my home state of Wisconsin at Whistling Straits Golf Course. The main headline revolves around two young stars, Rory McIlroy and Jordan Spieth, and their reluctant rivalry coming into the tail end of the PGA season. Whether you are a fan of those two or of other competitors in the field, or just like to follow along with the event, you have multiple options for staying in the know during the tournament.

Watching

If you plan on being a couch potato all weekend, then a mixture of the TNT network and CBS will have you covered. You can check the tournament’s main page for broadcasting schedules. If you have a cool boss or a strategically-placed cubicle, you can also use PGA.com to stream video on Friday. You will be able to watch the traditional broadcast (when available), or follow a featured group around the course.

The PGA is also offering dedicated apps for both Android and iOS. The apps will allow you to watch much of the same content as the web site, though it does appear that they will make you register with an email address. At the time of this writing, I couldn’t find out if cable/satellite credentials are necessary once CBS takes over coverage for the weekend.

The PGA Championship app for iOS.

Other Ways to Follow Along

If you’ll be out and about and unable to glue yourself to a screen, you will have other options. The aforementioned apps have leaderboard functionality baked in, and also allow you to select favorite players to follow. You can set up alerts for these players, or for tournament news in general. A helpful buzz might be more convenient than having to pull out your phone every five minutes. The app also curates tweets for you, if you wish, so you can see what journalists and other big names in golf are saying about the course and players’ performances.

If you don’t feel like adding yet another app to your growing stable, any sports news app that you currently have should suffice in keeping you up to date. I am a big fan of the CBS Sports app in general, and as that network is covering the tournament, you can bet they’ll be on the ball with updates. The CBS app also provides curated tweets from people of import in golf.

The PGA section of the CBS Sports app on Android.

A difficult course coupled with some challenging weather should make for some interesting golf. Whether one of the household names or a more unknown pro makes a run at the Wanamaker Trophy remains to be seen, but armed with the proper tech, you should have no problem following along.


How To Run Sports Data Regressions in Microsoft Excel

The shorthand description of a regression: It’s the best possible trend line between a scatter of dots. Like this:

The orange line (and the connected equation) represent the most basic idea of a regression.

One of the fun things about regressions is that they give us formulas (line equations, specifically). So if we have a quarterback with a 100 QB rating, we can plug his 100 into our formula (y = 0.097 * 100 – 2.495) and get a reasonable estimation of what his Adjusted Net Yards per Pass Attempt (ANY/A) would be (about 7.2). The R2 tells us essentially how reliable the regression is, or, more specifically, how much of the variation in ANY/A is explained by QB rating.
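
As a quick sanity check on that arithmetic, here is a minimal sketch in R. The slope and intercept are the ones quoted from the trendline above; the function name is made up for illustration.

```r
# Coefficients quoted from the trendline equation in the text.
slope <- 0.097
intercept <- -2.495

# Hypothetical helper: estimate ANY/A from a QB rating using the fitted line.
estimate_anya <- function(qb_rating) {
  slope * qb_rating + intercept
}

estimate_anya(100)  # 7.205, i.e. about 7.2
```

Because the function is vectorized, something like estimate_anya(c(80, 90, 100)) would give estimates for several ratings at once.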

Of course, the problem with the data here (which I just kind of threw together as an example) is that QB rating and ANY/A use almost the same inputs and attempt to do the same thing. It’s nice to see they have about 91% overlap (it basically says they’re just about interchangeable), but no one is going to use QB rating to derive or forecast an ANY/A.

Regressions are more useful when we start with something small and reliable, then move our way to more all-encompassing but volatile stats. This is like how we use contact or plate discipline data (small and reliable) to expand into an xBABIP calculator (big and more meaningful).

There are two ways to run some regressions in Excel:

  1. Use the scatterplot tool (as above) and create a simple, two-variable regression.
  2. Use the Analysis ToolPak add-in to run a more complete and useful regression.

The first method is the easiest, but it doesn’t output the peripheral data that is essential to fully understanding a regression’s findings.

The Scatterplot Regression

For the first method, just select two columns of data and make a scatterplot (Insert > Scatter). That will give you something like this:

Here’s a scatterplot of the 2015 Durham Bulls’ strikeout and home run totals, min. 50 PA.

With the chart selected, choose to add a linear trendline (Layout > Trendline > Linear Trendline):

Adding a linear trendline will create a basic linear regression.

Now double-click the trendline to produce the “Format Trendline” window. In that menu, check the boxes for “Display Equation on chart” and “Display R-squared on chart”:

These two boxes give you the bare minimum of data necessary to interpret a regression.

So now we have a regression! The formula (HR = 3.5367 * SO + 29.166) tells us there is a positive connection between home run totals and strikeout totals. And the R2 tells us that one of the two variables explains about 48% of the variation in the other.
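
For readers who also use R, the same kind of two-variable fit can be sketched in a few lines. The data below is randomly generated for illustration only (it is not the Durham Bulls data), and the generic names x and y are placeholders.

```r
# Made-up data for illustration only; not the Durham Bulls numbers.
set.seed(42)
x <- round(runif(20, min = 0, max = 20))
y <- round(30 + 3.5 * x + rnorm(20, sd = 15))

# Fit the same kind of straight line that Excel's linear trendline draws.
fit <- lm(y ~ x)

coef(fit)                # intercept and slope of the line
summary(fit)$r.squared   # the R2 that Excel prints on the chart

# Scatterplot with the fitted line drawn on top.
plot(x, y, pch = 20)
abline(fit, col = "darkorange2", lwd = 2)
```

The two numbers from coef() correspond to the "Display Equation on chart" box, and r.squared corresponds to the "Display R-squared on chart" box.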

What this regression doesn’t tell us:

  • What direction, if any, is the causality? Are homers causing players to strike out? Or do more strikeouts make more homers?
  • Are there peculiarities in the residuals? This article does a great job of teaching how to interpret residuals plots.
  • Does the regression fit the data? An ANOVA analysis can be useful in augmenting what the R2 tells us.

The first issue is a matter of deeper research. A regression won’t tell us direction of causality. But we can still answer those other two questions, and add more variables, using Excel’s Analysis ToolPak. The first thing we’ll need to do is enable that ToolPak.

In the File > Options > Add-Ins section, you’ll notice a “Go…” button at the bottom of the window.

This button opens a dialogue that allows us to turn on the Analysis ToolPak. Why is this not enabled by default? Who knows? Maybe Bill Gates.

Select the top option in the available Add-Ins (“Analysis ToolPak”) and then click “OK.”

You can also add in these other ones if you’re feeling frisky. I rarely use them, though.

Now, after this first step, you should have a new option in your Data tab. Let’s explore that. Go to Data > Data Analysis. That will open a simple dialogue with a list of various operations. Choose “Regression” and click “OK”.

You should then get this screen:

The Y Range will be what you are regressing against, so to speak.

In the Y Range text box, you will want to add only a single column of data. I prefer to include the column headings so that the output screen will be more easily understood. In this instance, I’m choosing a big column of completion percentage data from Pro-Football-Reference.com (from this data: NFL QB seasons since 1969 with min. 10 TD). I’m regressing this Cmp% data against the quarterback’s sack total and yards per attempt (Y/A) total.

In short, I’m asking: Can Y/A and sack totals predict a QB’s accuracy?

So in the X Range, I’m going to select the Q and R columns (titles and all). The output is something like this:

Here are the big three components of a regression.

So this is kinda what it will look like after a regression. Let’s break down the three big areas one at a time, in the typical order I look at them:

  1. Residual Plots: These look good! You want a shotgun-blast look. If you start to see anything other than a circle in any of your residual plots, then you’ll need to rework your regression. (See that above article for more details.)
  2. R2 Results: What is a good R2? Well, higher is always (well, usually) better, but there’s no clear perfect R2. Truth is, we have to be as intellectually honest as possible and determine how much explanation is the right amount of explanation. With multiple variables, it’s important to look at Adjusted R2 because it helps combat the unintentional increase in R2 caused by just adding more variables. In this case, though, R2 and Adjusted R2 are about the same, so whutevz.
  3. Coefficients: The coefficients tell us both the formula of the regression (Cmp% = 36 + 0.02 * Sacks + 3.03 * Y/A) and the strength of the variables involved. And it doesn’t take much work to realize which variable is more important. Even a frequently sacked QB will only see his Cmp% moved by 1.5%. What’s more: It increases 1.5%. That’s a red flag right there for a bad variable. Maybe Sack% would be a more useful tool because the sack totals are merely telling us he played more (and QBs who played more probably had better performances because otherwise they would have been benched).
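
The same three areas can be inspected outside of Excel too. Below is a rough sketch in R that uses simulated stand-in data (the Pro-Football-Reference table is not reproduced here, and the column names Cmp, Sacks and YA are assumptions made for this example).

```r
# Simulated stand-in data; values are random, not real quarterback seasons.
set.seed(1)
n <- 200
sacks <- round(runif(n, 5, 60))
ya <- rnorm(n, mean = 7, sd = 0.8)
cmp <- 36 + 0.02 * sacks + 3 * ya + rnorm(n, sd = 2.5)
qb <- data.frame(Cmp = cmp, Sacks = sacks, YA = ya)

# Regress completion percentage on sack totals and yards per attempt.
fit <- lm(Cmp ~ Sacks + YA, data = qb)
s <- summary(fit)

# 1. Residual plot: look for a shapeless cloud around zero.
plot(fitted(fit), resid(fit), pch = 20,
     xlab = "Fitted Cmp%", ylab = "Residual")
abline(h = 0, lty = 2)

# 2. R2 and Adjusted R2.
s$r.squared
s$adj.r.squared

# 3. Coefficients: intercept, then one slope per variable.
coef(fit)

# ANOVA table for the fitted model, similar in spirit to Excel's ANOVA block.
anova(fit)
```

Swapping Sacks for a rate stat such as a sack percentage column would only require changing the formula passed to lm().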

Anyway, I hope this has been helpful. I encourage readers to learn more about regressions before attempting any, as they are a complicated and tricky tool and can lead a researcher astray quickly if used incorrectly.

NOTE: For those wondering why I haven’t gone into detail about significance testing for P-values, it’s because I believe that field of statistical study is generally arbitrary and altogether intellectually bankrupt. But there are dozens of great tutorials out there on the subject. I just can’t in good ethical faith write one.


GameOn Releases Sports Social Networking App

In a continued effort to personalize and curate the world of sports to each person’s preference, GameOn Technologies, the developer of the GameOn app, recently secured funding for additional expansion across the iOS and Android platforms. Among the backers who helped raise 1.5 million dollars were Hall of Fame quarterback Joe Montana and West Indian cricket player Dwayne Bravo. In addition to Montana and Bravo, a number of other athletes have signed on to give exclusive content, including USWNT striker Sydney Leroux, former USMNT midfielder Cobi Jones and current Denver Broncos safety T.J. Ward.

The app itself ties news feeds from other sources in one convenient place called The Five. Think of The Five like an aggregator or RSS for all sports.


From browsing GameOn, which draws articles and tweets from ESPN, Grantland, SB Nation, BBC, CBS Sports and others, there is plenty of reading material for the major sports teams. Unfortunately, there is no way to customize what appears on The Five; it seems to pick whichever articles or tweets are currently getting the most attention. The good news is that just about anything else can be tweaked to show which teams you’d like to follow.

It’s a personal preference, but when opening a story from The Five, it does not give the option to open in Chrome — I don’t have an iOS device to see if Safari was an option — as tapping a link simply opens up the story within GameOn. It isn’t a hindrance or particularly inconvenient, I just like opening links in new tabs and windows. Call it a hangover effect from years of opening links in Chrome with my Mouse3 button.

In addition to specific teams and featured articles, there are individual “Featured Public Huddles” where fans can join and debate with each other, related links are posted, and a host of emoji-like stickers can be used.

For an example of what can only be assumed to be a clearly unbiased opinion, user Cristiano The Beaut (presumably named for Real Madrid star Cristiano Ronaldo) calls Lionel Messi a “dirty and overrated player.” I’d recommend a spoon to take all those grains of salt with that opinion.


Huddles is just the name GameOn gives team or game threads, and in addition to the public “Featured” ones, any user can create private Huddles as well.

As I find myself more and more interested in following the German soccer league, the Bundesliga, I decided to create a pair of Huddles for their upcoming fixtures against Hoffenheim and Hannover. From my phonebook or friends within the GameOn app, I can invite people to join in the Huddle.


The stickers — there are hundreds if not thousands of them — are a unique feature and I really like the multi-site integration and aggregate feed. But really, the stickers are awesome.


Even with the fun and smack-talk-integration the stickers offer, GameOn really doesn’t differentiate itself from other sports apps, specifically Fancred. Additionally, Twitter’s influence on the sports social network scene, despite not being marketed as a sports-centric app, looks as strong as ever given numbers from their second quarter of this year. According to the financial report, Twitter increased their average monthly active user base from 308 million in Q2 2014 to 316 million in Q2 2015. GameOn has a solid beginning; however, with a modest 50,000 or so downloads in its first year of beta testing, it has a long way to go to reach the top of the sports social network world.


TechGraphs News Roundup: 8/7/2015

It’s a busy time for sports these days. The baseball pennant chase is heating up, NFL training camps are starting, the EPL is about to start another season, and the final PGA major of the season begins in just a few days. We can understand if you missed news, so here are all the sports-tech stories we found interesting this week.

Daily fantasy sites are making money hand over fist, but not everything is sunshine and rainbows for them. DraftKings is facing some class-action lawsuits, and the Boston Globe has some details.

SAP teamed up with women’s professional tennis a while back to provide in-game stats and insight to coaches and players. It seems that, now, coaches will be able to use these numbers, via an iPad, on the court.

Our very own (and very brave) Bradley Woodrum experimented with the meal replacement program from Soylent to mixed reviews. Well, Soylent is back with a new formula and a new ready-to-drink delivery method.

Microsoft partnered up with the NFL last year to provide Surface Pro 2 tablets to teams to use on the sidelines. It went … not as well as expected. But, the Surfaces are back for another NFL season, this time in their fancy Pro 3 form.

While on the topic of Microsoft, they also debuted a fancy new app for both Windows 10 and the Xbox One. At the heart of the new tech is something called Next Gen Stats, a hyper-granular replay system utilizing RFID chips embedded within players’ shoulder pads.

EPL teams are getting their own emoji on Twitter this season. So … that’s a thing.

Finally, Disney’s Bob Iger had an interview with the Wall Street Journal in which he opined that ESPN will be a direct-to-customer product at some point down the road. We, of course, have figured this for some time, but it’s nice to hear that from a muckety-muck. No news on when that will happen yet, of course.

That’s all for this week. Have a great weekend, and be excellent to each other.


Heart Rate Sensor Assists U.S. Women’s National Team

The United States women’s national team won their third World Cup title this summer in Canada. That same Women’s World Cup, along with this summer’s Under 20 World Cup in New Zealand, marked the first time FIFA allowed players to wear tracking devices during game action. The success of the devices during these events led FIFA to greenlight the use of wearables in future competitions, subject to the approval of individual leagues.

Part of the team’s success was the players’ dedication to the training plans developed by strength and fitness coach Dawn Scott.

“I think it’s a testament to the players that they trusted us and stuck to the program, and did what they needed to even when they had their commitments with their [club] teams,” Scott said in a previous interview.

An earlier Wired article discussed the USWNT’s relationship with Polar Global devices. When reached for comment, Polar Global’s Josh Simonsen confirmed that the players were wearing the H7 heart rate monitor on the pitch, and using M400 GPS watches in training sessions. Simonsen, the company’s national training resource specialist for the U.S., said the Finnish company had worked with Scott and the USWNT since 2010.

“What that gave Dawn was the ability to track speed, distance, and activity of the athlete while they’re away,” Simonsen said. “And they were able to send it her in a much easier environment than the previous models.”

The H7, the heart rate monitor worn during the games, consists of a strap worn across the chest and a small transmitter a few inches wide. The strap contains an electrode that collects the ECG signal from the athlete; after some basic processing, the transmitter then sends out a Bluetooth signal. The system reports heart rate on a per-second basis, using only basic peak-to-peak measurements, which are less susceptible to the kind of movement artifacts you would expect with an athlete wearing the device during competitions.
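
To make the peak-to-peak idea concrete, here is a minimal sketch in R of how a heart rate can be derived from the time between successive ECG peaks. Polar’s actual processing is not described in this article, so the timestamps and variable names below are purely hypothetical and only illustrate the textbook calculation.

```r
# Hypothetical timestamps (in seconds) of successive ECG peaks.
peak_times <- c(0.00, 0.42, 0.83, 1.25, 1.66, 2.08)

# Peak-to-peak intervals in seconds.
rr_intervals <- diff(peak_times)

# Beats per minute implied by each interval, and by the average interval.
bpm_per_beat <- 60 / rr_intervals
bpm_mean <- 60 / mean(rr_intervals)

round(bpm_mean)  # 144
```

Because only the timing of the peaks matters, the rest of the waveform can be noisy without changing the result much, which fits the point about movement artifacts.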

Polar’s top-of-the-line system, the Team Pro, also includes a GPS and IMU sensor. Switching to Bluetooth Smart also allows the transmitters to communicate directly with a tablet. But Simonsen said the USWNT was still using the older Team2 solution this summer.

“They didn’t want to transition prior to the World Cup,” he said.

The company has a long history with heart rate sensors, having built the first monitor for an athlete in 1977. But it was not until the early 2000s that Polar began developing systems for whole teams, rather than for individuals.

“Essentially the coach would log into the software and each player would have their own page, but they really weren’t able to compare the team as a whole,” Simonsen said. “We couldn’t look at the big picture.”

This functionality would not become available until Polar’s Team2 system was introduced in 2009. Unlike their previous offerings, Team2 allowed coaches to collect and analyze data in much less time. The addition of Bluetooth transmitters also allowed coaches to monitor their players in real time.

“[Team2] was like a 50 percent cut in time that it took to do everything,” Simonsen said. “Everything was exponentially faster.”

The Team2 system is currently used by “about 450 teams” in the U.S., and Simonsen said new coaches are typically surprised by the feedback provided by the data.

“A lot of the time they’re just blown away at how long things were or how hard things truly are,” he said. “Or that their easy day really wasn’t that easy, or that their hard day was a lot harder than they really thought it was.”

Simonsen argues that this experience in the field is what separates Polar Global from the plethora of other companies offering heart rate monitors.

“We created heart rate,” Simonsen said. “And with us using HR from the beginning, accuracy is always our number one thing.”


Could MLBAM Be Making a Push to Become the Next Sling TV?

By now, you may have heard that the NHL has partnered up with MLB Advanced Media (MLBAM) in a new distribution deal. It’s a bold move for both leagues, and one that shows that the NHL is serious about increasing their online presence and offerings. But hidden in the details are even bigger revelations about the future of MLBAM. From the CBS article:

NHL COO John Collins would not confirm these figures, but word is the league valued the deal at $200 million per year.

The annual breakdown: a $100M rights fee to the NHL, $20M in savings from the league not having to invest in the capital resources/expertise it would take to go on its own, and $80M in equity in MLBAM’s technology business.

The equity portion may not figure in revenue calculations for the purposes of the salary cap. “We were told to expect $120M per year in added revenue… $4M per team,” one governor said.

This new deal is indicative of a fairly serious pivot. MLBAM made a name for itself as a content partner — a company that provided the infrastructure for those who wished to offer online content streaming. The massive system that they built to host their MLB.tv service was essentially leased out to the likes of ESPN, HBO, and WWE.

But with the NHL deal, MLBAM is no longer serving as the back end. They aren’t the ones being paid for hosting, they are paying for distribution rights. And they are buying everything lock, stock, and barrel. Besides being in charge of streaming NHL Center Ice, MLBAM is taking over NHL Network, NHL.com, and individual team web sites. Basically, if you want to view NHL content online, you have to go through MLBAM.

And there’s more. According to Forbes:

Along with the deal, the NHL would have equity in what is now called BAM Tech, a wholly new digital company that will be spun-off of MLB Advanced Media.

There’s the other shoe. What once started as a distribution channel for baseball games has become a lucrative technology business.

But I doubt this is the end for BAM Tech. They could certainly take their subscription fees from baseball and hockey fans along with their licensing fees from HBO and be content being a very profitable company for some time. But if that was the plan, they would have just taken the NHL’s money for distribution rather than paying them for the rights. Sure, they’ll make money from Center Ice and the NHL Network, but it could be indicative of a bigger move.

The “problem” with MLBAM’s business model is that it’s easily repeatable. Any company with enough capital and infrastructure can get in on the action. What MLBAM has that others don’t is partnerships. Right now, it works with two major sports leagues, *edit: I originally neglected to mention that MLBAM also hosts streaming for the PGA*, the biggest name in professional wrestling, the largest sports media company in the world, and the most popular premium cable channel. If one were inclined to, say, start their own over-the-top online TV provider, this would be a pretty good start.

It’s speculation at this point, but it wouldn’t be surprising to see a BAM app available on smartphones and set-top boxes in the near future. While they’d be competing against the likes of Sling TV and PlayStation Vue, the already-formed partnerships along with their world-class technology platform would certainly make them a formidable opponent. And don’t forget that HBO owns Cinemax and is a subsidiary of Time Warner (which just merged with Charter)*, while ESPN happens to be owned by Disney. There are a lot of fingers in a lot of pies here.

* – a studious commenter pointed out my mistake here.

Say you want to pay $120 for MLB.tv. What if BAM Tech could offer that plus HBO, NHL games, the Disney channel, and ESPN offerings for an extra $40 a month? Would the availability of live sports be enough to convince you to cut the cord?

MLBAM already built the gun, and now they’re starting to buy the bullets. There isn’t much stopping the once-quaint sports video service from becoming one of the biggest players in TV.


Bundesliga Gaining Traction in Attendance and Streaming Services

Things are on the rise in the top German soccer league, the Bundesliga. Not just the level of play, but the depth of teams as well as popularity have been trending upwards for several years now. The Union of European Football Associations (UEFA) noticed the league’s rising talent as well, increasing its bids to the UEFA Champions League (the highest level of international club play in Europe) to three automatic slots plus one playoff bid.

With local fans already showing up to Bundesliga games in person more than for any other league in the world save for the NFL, it’s hard to overstate the league’s current impact and potential growth. Via Statista, the graph below displays the 2013-14 average game attendance for the 11 top ranking leagues.

[Chart: Average attendance of major sports leagues around the world, 2013-14]

Not even the English Premier League juggernauts of Manchester United, Arsenal, Man City or Liverpool, nor the Spanish La Liga one-two punch of Barcelona and Real Madrid, could draw more fans than the Bundesliga’s top draw in the 2013-14 season. Somewhat surprisingly, Borussia Dortmund (there is a typo in the table below) leads all soccer clubs in Europe for attendance.

[Table: European football clubs’ average attendance, 2013-14]

Beyond local fans and UEFA, Fox Soccer has also taken note of the German league. The broadcasting network already streamed some Bundesliga fixtures on the Fox Soccer 2Go platform, but never all 306 matches. In addition to streaming every single league match, Fox has doubled down on the league by adding televised games as well. A total of 58 matches will be shown on Fox Sports 1, 60 on Fox Sports 2 and the remaining 188 on Fox Sports Plus.

The United States Men’s National Team (USMNT) has also seen an increased presence in Germany, as six active members are currently on Bundesliga squads, with three more US players currently on clubs in Germany’s second tier. The German league has the second-most US players of any foreign league, barely edging out the Championship, England’s second tier, and trailing only Mexico’s Liga MX, which boasts seven US-capped players.

According to the Fox Soccer schedule, the opening match will be three-time reigning champion Bayern Munich against the near-relegated Hamburg side. The match is set to be broadcast on Fox Sports 2 at 2:30 pm eastern on Friday, August 14. While up-and-coming 20-year-old USMNT member Julian Green is under contract with Bayern until 2017, it’s possible, albeit unlikely, he could make an appearance and further boost the Bundesliga’s profile in the United States. If you happen to be busy on that Friday, tune in the next day, as Werder Bremen just signed USMNT striker Aron Johannsson. Werder is set to kick off at 9:30 eastern on Saturday the 15th. While correlation does not imply causation, the United States has seen the profile of the national men’s team rise in recent years, possibly due to Tim Howard, Clint Dempsey and Landon Donovan all crossing the ocean to play in the Premier League. Hopefully a similar level of fan interest will happen with the Bundesliga.

(Header image via Wikipedia)

How To Use R For Sports Stats, Part 2: Visualization and Analysis

Welcome back! In Part 1 of this series, we went over the bare bones of using R–loading data, pulling out different subsets, and doing basic statistical tests. This is all cool enough, but if you’re going to take the time to learn R, you’re probably looking for something… more out of your investment.

One of R’s greatest strengths as a programming language is how it’s both powerful and easy-to-use when it comes to data visualization and statistical analysis. Fortunately, both of these are things we’re fairly interested in. In this post, we’ll work through some of the basic ways of visualizing and analyzing data in R–and point you towards where you can learn more.

(Before we start, one commenter reminded me that it can be very helpful to use an IDE when coding. Integrated development environments, like RStudio, work similarly to the basic R console, but provide helpful features like code autocompletion, better-integrated documentation, etc. I’ll keep taking screenshots in the R console for consistency, but feel free to try out an IDE and see if it works for you.)

Look At That Data

We’ll be using the same set of 2013-14 batter data that we did last time, so download that (if you haven’t already) and load it back up in R:

fgdata = read.csv("FGdat.csv")

Possibly my favorite thing about R is how, often, all it takes is a very short function to create something pretty cool. Let’s say you want to make a histogram–a chart that plots the frequency counts of a given variable. You might think you have to run a bunch of different commands to name the type of chart, load your data into the chart, plot all the points, and so on? Nope:

hist(fgdata$wRC)

This Instant Histogram(™) displays how many players have a wRC+ in the range a given bar takes up in the x-axis. This histogram looks like a pretty normal, bell-curveish distribution, with an average a bit over 100, which makes sense, since the players with a below-average wRC+ won’t get enough playing time to qualify for our data set.

(You can confirm this quantitatively by using a function like summary(fgdata$wRC).)

The hist() function, right out of the box, displays the data and does it quickly–but it doesn’t look that great. You can spend endless amounts of time customizing charts in R, but let’s add a few parameters to make this look nicer.

hist(fgdata$wRC, breaks=25, main="Distribution of wRC+, 2013 - 2014", xlab="wRC+", ylab= NULL, col="darkorange2")

In this command, ‘breaks’ is the (approximate) number of bars in the chart, ‘main’ is the chart title, ‘xlab’ and ‘ylab’ are the axis titles, and ‘col’ is the color. R recognizes a pretty wide range of color names, though you can use RGB, hex, etc. if you’re more familiar with them.

Anyway, here’s the result:

A bit better, right? The distribution doesn’t look quite as normal now, but it’s still pretty close. We can actually add a bell curve to eyeball how far off it is.

hist(fgdata$wRC, breaks=25, freq = FALSE, main="Distribution of wRC+, 2013 - 2014", xlab="wRC+", ylab= NULL, col="darkorange2")
curve(dnorm(x, mean=mean(fgdata$wRC), sd=sd(fgdata$wRC)), add=TRUE, col="darkblue", lwd=2)


(In the first line above, “freq = FALSE” indicates that the y-axis will be a probability density rather than a frequency count; the second line creates a normal curve with the same mean and standard deviation as your data set. Also, it’s blue.)

You can also plot multiple charts at the same time–use the par(mfrow) function with the preferred number of rows and columns:

par(mfrow=c(2,2)) 
hist(fgdata$wOBA, breaks=25) 
hist(fgdata$wRC, breaks=25) 
hist(fgdata$Off, breaks=25) 
hist(fgdata$BABIP, breaks=25)

When you want to save your plots, you can copy them to your clipboard, or create and save an image file directly from R:

png(file="whatisitgoodfor.png",width=400,height=350)
hist(fgdata$WAR, breaks=25)
dev.off()

(It’ll show up in the same directory you’re loading your data set from.)

So that covers histograms. You can create bar charts, pie charts, and all of that, but you’re probably more interested in everyone’s favorite, the scatterplot.

At its most basic, the plot function is literally plot() with the two variables you want to compare:

plot(fgdata$SLG, fgdata$ISO)
Unsurprisingly, slugging percentage and ISO are fairly well-correlated. Results-wise, we’re starting to push against the limits of our data set–too many of these stats are directly connected to find anything interesting.

So let’s take a different tack and look at year-over-year trends. There are several ways you could do this in R, but we’ll use a fairly straightforward one. Subset your data into 2013 and 2014 sets,

fg13 = subset(fgdata, Season == "2013")
fg14 = subset(fgdata, Season == "2014")

then merge() the two by name. This will create one large dataset with two sets of columns: one with a player’s 2013 stats and one with their 2014 stats. (Players who only appeared in one season will be omitted automatically.)

yby = merge(fg13, fg14, by="Name")
head(yby)
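To see what merge() is doing, here is a small sketch on a toy table. The players and numbers are invented, and the custom `suffixes` are only there to show that the default `.x`/`.y` labels can be changed.

```r
# Toy data (made-up players and numbers), two seasons stacked like fgdata
toy <- data.frame(
  Name   = c("Player A", "Player B", "Player C",
             "Player A", "Player B", "Player D"),
  Season = c(2013, 2013, 2013, 2014, 2014, 2014),
  ISO    = c(0.150, 0.210, 0.095, 0.170, 0.190, 0.120)
)

s13 <- subset(toy, Season == 2013)
s14 <- subset(toy, Season == 2014)

# By default merge() keeps only names present in both data frames;
# suffixes is optional and defaults to c(".x", ".y")
both <- merge(s13, s14, by = "Name", suffixes = c(".2013", ".2014"))
print(both)

# all = TRUE keeps everyone and fills the missing season with NA
everyone <- merge(s13, s14, by = "Name", all = TRUE)
print(nrow(everyone))
```

Player C and Player D each appear in only one season, so they drop out of `both` but survive in `everyone`. One caveat: merging by name assumes names are unique, so two different players who share a name would get matched to each other.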

Year-by-year data

As you can see, 2013 stats have an .x after them and 2014 stats have a .y. So instead of comparing ISO to SLG, let’s see how ISO holds up year-to-year:

plot(yby$ISO.x, yby$ISO.y, pch=20, col="red", main="ISO year-over-year trends", xlab="ISO 2013", ylab="ISO 2014")

Visually appealing R scatterplot

(The ‘pch’ argument sets the shape of the data points; the optional ‘xlim’ and ‘ylim’ arguments, not used here, set the extremes of each axis.)
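As a sketch of how xlim and ylim fit in, the example below puts both seasons on the same scale and adds a reference line. The values are simulated (the vectors `iso13` and `iso14` are hypothetical stand-ins for `yby$ISO.x` and `yby$ISO.y`), and the reference line is my own addition rather than part of the chart above.

```r
# Simulated year-over-year values (hypothetical, not real ISO figures)
set.seed(7)
iso13 <- runif(150, 0.05, 0.30)
iso14 <- 0.7 * iso13 + 0.05 + rnorm(150, sd = 0.03)

# Use the same limits on both axes so the two seasons share one scale
lims <- range(c(iso13, iso14))

plot(iso13, iso14, pch = 20, col = "red",
     xlim = lims, ylim = lims,
     main = "ISO year-over-year (simulated)",
     xlab = "ISO 2013", ylab = "ISO 2014")

# Dashed y = x line: points below it declined in the second year
abline(a = 0, b = 1, lty = 2)
```

With matching axes, a player who repeated his ISO exactly would sit right on the dashed line.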

Again, a decent correlation–but just *how* decent? Let’s turn to the numbers.

Relations and Correlations

If you’re a frequent FanGraphs reader, you’re probably familiar with at least one statistical metric: r², the square of the correlation coefficient. An r² near 1 indicates that two variables are highly correlated; an r² near 0 indicates they aren’t.

As a refresher without getting too deep into the stats: when you’re ‘finding the r²’ of a plot like the one above, what you’re usually doing is saying there’s a linear relationship between the two variables, one that could be described by a y = mx + b equation with a slope and an intercept; the r² is then basically measuring how well the data fit that equation.

So to find the r² that we all know and love, you want R to create a linear model between the two variables you’re interested in. You can access this by getting a summary of the lm() function:

summary(lm(yby$ISO.x ~ yby$ISO.y))

R linear model summary

The coefficients, p-values, etc., are interesting and would be worth examining in a more theory-focused post, but you’re looking for the “Multiple R-squared” value near the bottom–turns out to be .4715 here, which is fairly good if not incredible. How does this compare to other stats?

summary(lm(yby$BsR.x ~ yby$BsR.y))
> Multiple R-squared:  0.4306
summary(lm(yby$WAR.x ~ yby$WAR.y))
> Multiple R-squared:  0.1568
summary(lm(yby$BABIP.x ~ yby$BABIP.y))
> Multiple R-squared:  0.2302

BsR is about as consistent as ISO, but WAR has a smaller year-to-year correlation than you might expect. BABIP, less surprisingly, is also weakly correlated, though by these numbers it holds up slightly better than WAR.
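If you’d rather not read the value off the printed summary each time, you can pull it out of the summary object directly. The sketch below uses a simulated table (the `toy` data frame is made up, and `yoy_r2` is a hypothetical helper, not a built-in function) and also confirms that “Multiple R-squared” really is the squared correlation coefficient.

```r
# Simulated year-over-year table (made-up numbers) using merge()'s .x/.y naming
set.seed(123)
n <- 200
toy <- data.frame(ISO.x = runif(n, 0.05, 0.30),
                  BABIP.x = rnorm(n, 0.300, 0.030))
toy$ISO.y   <- 0.6 * toy$ISO.x + 0.06 + rnorm(n, sd = 0.04)
toy$BABIP.y <- 0.3 * toy$BABIP.x + 0.21 + rnorm(n, sd = 0.03)

fit <- lm(ISO.y ~ ISO.x, data = toy)

# 1) the value printed as "Multiple R-squared" in summary(fit)
r2_summary <- summary(fit)$r.squared

# 2) the squared correlation coefficient
r2_cor <- cor(toy$ISO.x, toy$ISO.y)^2

# 3) one minus (unexplained variation / total variation)
r2_manual <- 1 - sum(residuals(fit)^2) / sum((toy$ISO.y - mean(toy$ISO.y))^2)

print(c(summary = r2_summary, cor = r2_cor, manual = r2_manual))

# Hypothetical helper: year-over-year r-squared for any stat with .x/.y columns
yoy_r2 <- function(df, stat) {
  cor(df[[paste0(stat, ".x")]], df[[paste0(stat, ".y")]])^2
}
print(sapply(c("ISO", "BABIP"), yoy_r2, df = toy))
```

All three calculations should agree, and since r² is symmetric, it doesn’t matter which season goes on which side of the `~`.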

Let’s do one more basic statistical test: the t-test, which is often used to see if two sets of numeric data are significantly different from one another. This isn’t as commonly seen in sports analysis (because it doesn’t often tell us much for the data we most often work with), but just to run through how it works in R, let’s compare the ISO of low-K versus high-K hitters. First, we need to convert the percentages in the K% column to actual numbers:

fgdata$K. = as.numeric(sub("%","",fgdata$K.))/100

then subset out the low-K% and high-K% hitters:

lowk = subset(fgdata, K. < .15)
highk = subset(fgdata, K. > .2)

Then, finally, run the t-test:

t.test(lowk$ISO, highk$ISO)

R T-test results

The “p-value” here is about 4.5 x 10^-11 (or 0.000000000045); a p-value less than .05 is generally considered significant, so we can consider this evidence that the ISO of high-K% hitters is significantly different from that of low-K% hitters. We can check this out visually with a boxplot. And you thought we were done with visualization, didn’t you?

boxplot(lowk$ISO, highk$ISO, names=c("K% < 15%","K% > 20%"), ylab="ISO", main="Comparing ISO of low-K% vs. high-K% batters", col="goldenrod1")

Visually appealing R boxplot

So now you can do some standard statistical tests in R–but be careful. It’s incredibly tempting to just start testing every variable you can get your hands on, but doing so makes it much more likely that you’ll run into a false positive or a random correlation. So if you’re testing something, try to have a good reason for it.
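Here’s a quick simulation of that false-positive problem. It is only a sketch with made-up data: both groups are drawn from the same distribution, so there is no real difference to find, yet roughly one test in twenty should still come out “significant” at the .05 level. The Bonferroni adjustment at the end is one common remedy, not something used elsewhere in this post.

```r
# Both groups come from the SAME distribution, so every "significant"
# result below is a false positive (all numbers are simulated)
set.seed(2015)
n_tests <- 1000

p_values <- replicate(n_tests, {
  a <- rnorm(50)
  b <- rnorm(50)
  t.test(a, b)$p.value   # the p-value is stored in the returned object
})

# Share of tests that cross the usual .05 threshold purely by chance
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)

# One common remedy: adjust the p-values for the number of tests run
adjusted <- p.adjust(p_values, method = "bonferroni")
print(sum(adjusted < 0.05))
```

Note the `$p.value` at the end of the t.test() call: like lm(), t.test() returns an object you can pick apart instead of just reading the printout.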

…And Beyond

We’ve covered a fair amount, but again, this only begins to cover the potential R provides for visual and statistical analysis. For one example of what’s possible in both these areas, check out this analysis of an online trivia league that was done entirely within R.

If you want to replicate his findings, though (which you can, since he’s posted the code and data online!), you’ll need to install packages, extensions for R that give you even more functionality. The ggplot2 package, for example, is incredibly popular among people who want to create especially cool-looking charts. You can install it with the command

install.packages("ggplot2")

and visit http://ggplot2.org/ to learn more. If R doesn’t do something you want it to out of the box, odds are there’s a package out there that will help you.
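Once a package is installed, you load it with library() at the top of your script. As a rough sketch of what that looks like, here is a ggplot2 version of a year-over-year scatterplot. The data are simulated (the `toy` data frame is hypothetical), and the code first checks that ggplot2 is actually installed so it won’t stop with an error if it isn’t.

```r
# Check for the package first; installing it needs an internet connection
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
} else {
  library(ggplot2)

  # Simulated data (hypothetical), standing in for yby$ISO.x and yby$ISO.y
  set.seed(7)
  toy <- data.frame(iso13 = runif(150, 0.05, 0.30))
  toy$iso14 <- 0.7 * toy$iso13 + 0.05 + rnorm(150, sd = 0.03)

  # Build the chart in layers: data and axes, then points, then labels
  p <- ggplot(toy, aes(x = iso13, y = iso14)) +
    geom_point(colour = "red") +
    labs(title = "ISO year-over-year (simulated)",
         x = "ISO 2013", y = "ISO 2014")
  print(p)
}
```

You only need to run install.packages() once per machine, but library() has to be called in every new R session.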

That’s probably enough for this week; here’s the script with all of this week’s code. In our next (last?) part of this series, we’ll look at taking one more step: using R to create (very) basic projections.