How to Split Strings of Sports Data into Individual Cells in Excel

May 6, 2015

This is a really simple trick that I think a lot of people will never need, but a few people will love forever. It’s this problem:

Year   Age  Tm  ERA   G     IP  FIP
1906    29 CHC 1.04  36  277.1 2.08
1907    30 CHC 1.39  34  233.0 1.97
1908    31 CHC 1.47  44  312.1 1.87

Here’s Mordecai Brown’s 1906 through 1908 stats from Baseball-Reference. But some nimcompoop has pasted them as pre-formatted values. Paste this biz into Excel and you get something like this:

Well, you’re in luck. Because there’s a super easy way to make this into legit data. Go to the appropriately-named “Data” tab and you’ll find the “Text to Columns” function:

With the data selected, click on this “Text to Columns” button. A pop up window will then prompt you whether you want to treat the text as “Fixed width” or “Delimited.” These concepts are weird-sounding, but actually quite simple:

Fixed-width: The data has spaces inserted into it so that it visually lines up. When you’re pasting from pre-formatted text, this is almost always the right choice.
Delimited: This is text that has special characters within it to help signify the differences between columns. So, if you’re pasting something from a CSV file (a file that uses commas between each data point), then delimited would be the right choice. You can also use delimited if there are spaces between data points — and this can be very helpful when translating a list not intended for spreadsheet use into a spreadsheet.

For this instance, let’s use Delimited to see what happens:

Highlight the text and then open the conversion wizard.

Step 2 asks us what our delimiters are. We can choose one of the usual delimiters (like tabs and commas) or we can use spaces or even insert our own characters (this can be usual for when, say, we scrape data off a Korean baseball site and special Korean characters divide the columns).

With “Tab” selected, you’ll probably notice the data in the preview box doesn’t change — as in, there are no vertical lines indicating the creation of new columns. If you click the “Space” box though, you should see something like this:

Hy! Look! New columns! — Hey! Look! New columns!

Note the check next to “Treat consecutive delimiters as one.” That is important, because without it, we end up with a lot of columns we don’t need:

Yup, that's too many columns. — Yup, that’s too many columns.

In the third step, we have the option to change the data formats and choose an alternative location for the new separated data to appear — both of which are handy — but in general, our work his here is done. Hit “Finish.”

That should give us something like:

Now the data is separated into unique columns. Note how Excel automatically identified the numbers as numerical despite originating within a text string.

So, the alternative to delimited data is fixed-width. Let’s go Pro on this thing and used the fixed-width setting. But this time, I’m going to use data from my Scoresheet league. I encourage any other Scoresheet users to play along with their own data.

I really like Scoresheet. It is one of the very few fantasy baseball sites to include defense. In fact, it might be the only one. But it has problem. The site’s interface was designed by DARPA to fend off the Soviets, or at least that’s what it feels like:

The 1980s called, and they want to warn us about a time-fissure in the land-line system of the 1980s that allows them to communicate with the future. They also browsed our internet, saw the Scoresheet interface, and they want their design principles back.

So, here’s my undrafted players list (and before you league-mates start thinking you’ve got a peak at my undrafted players list, just know: I’m no longer in the market for a second catcher or extra bat, so ogle all you want). In the middle, big frame, I have a list of players organized first by position and then by playing time accrued (i.e. their real world stats). But what I want to do is combine this data with, say, some projection data like the FanGraphs Depth Charts leaderboard.

But in order to match these undrafted player with the FanGraphs DC projections, I need to first suss out the names of the players. I can achieve that through scraping the numbers of available players here in this undrafted screen and then combine them with the Scoresheet master roster (downloadable here).

For those keeping track at home, I’m combining data like this:

The undrafted list will connect to the master roster which will connect to the FanGraphs projections which will connect to my league championship.

We need to scrape the data from the undrafted list. This amounts to clicking in the middle frame of that previous screen, pressing CTRL+A (select all) and then CTRL+C (copy).

Remember, you only need the stuff in the middle frame.

Pasting that into Excel gets something like this (note, I deliberately pasted using source formatting):

First, we can get rid of those top two rows. Really don’t need them. BUT DON’T DELETE THAT THIRD ROW. We need that.

Let’s go ahead and split our data. Select the entirety of Column A (just click the “A” at the top). Then run our handy converter (Data > Text to Columns). This time, let’s go with Fixed Width (though, once again, we could probably make it work with either).

We see reach a screen like this:

We can click and drag the arrows to change where the columns start. We can also click on spaces in the ruler section to create a new column break.

The key here is that the numbers on the far left are isolated into their own column. It’s fine that the word “Pitchers” is getting sliced into multiple columns; we don’t really need it anyway.

Scrolling down the preview panel, I can confirm that all the numbers are indeed isolated into the far left column. Now I’m going to actually use the Step 3 of the conversion wizard.

In Step 3, I can choose an alternative destination for the conversion. I want to do this. So I’m going to change the “Destination” section to the next available column — in other words, change it from $A$1 to $B$1:

We can change the destination by either typing in a new one or just clicking on where we want the first row and first column to start.

Great! Now I’ve got my data separated out:

Now I've got my original data (on the left) sitting next to my split data (on the right). — Now I’ve got my original data (on the left) sitting next to my split data (on the right).

Now I want to get rid of all the blank or white-font lines. These are leftovers from the column titles on the Scoresheet website. First I select all the data (CTRL+A and then CTRL+A again to select everything), and then I sort by the first column.

Sorting Z to A, I find — midway down the data — this blank spot where white-font titles have gather together:

Let's get rid of these now useless titles. — Let’s get rid of these now useless titles.

After I purge those rows, the data should be clean. Row B should be nothing but numbers. Beautiful, delicious numbers.

So let’s go back to the top and add a title row. I really only need the first two columns for this, so I’m going to name them Original and Number and then delete the rest. Then I’m going — once again in the “Data” tab — add filters (Data > Filter). If you had any one of the data cells selected (i.e. not a blank cell selected), then Excel should add a neat little pair of gray arrows at the top like such:

We can now sort this data in more fun ways.

You may also notice my data is already sorted in the preceding image (note the little up arrow inside the gray box). But this data is clearly not sorted numerically or alphabetically — No. 444 Greg Bird is near the top. It is sorted, however, by color.

See how the items with gray fonts are now all in the top together? These are players already drafted or selected for my queue. This is why we need the original formatting in the first pasting job.

So now I’ve got my numbers isolated. Using VLOOKUP (or INDEX and MATCH), matching these players to their full names should be a cinch (also, a CONCATENATE formula will help join the first and last names of the players). And with first and last names, we can do one more VLOOKUP / INDEX-MATCH to join them to their projections.

The net result, with some formatting and careful data pollination, can look something like this:

Now the real trick will be for me to find a way to hide this article from the guys in my league.

7 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Eric F

8 years ago

Wonderful stuff as usual Bradley. Side note: I haven’t really had time to play fantasy the past couple years, but Scoresheet seems basically like an online OOTP league that uses current week stats instead of past performance, that sounds pretty awesome. The only thing I saw that I probably wouldn’t like is no live drafts, just makes it seem like you don’t have real, full control of your team.

obsessivegiantscompulsive

I’ve always wondered how to tie together two different data sources and match up data sets. I’ve never used VLOOKUP, will have to check out that article you linked. Thanks!

I did know abou that Text Conversion in Excel, and agree that it is WONDERFUL, but wonder if any such function exists in Google’s Sheet application. Are you aware of any? Or if there is another way to get it done in that app?

Bradley Woodrummember

Reply to obsessivegiantscompulsive

Well I can’t recommend this article enough:

http://www.techgraphs.com/using-vlookup-for-sports-data/

The author is like stupid handsome.

Reply to Bradley Woodrum

Yes, he is quite the handsome dude, I agree.

And thanks for the link too!

Scott

Tech Graphs keeps bringing excellent content. This article and the VLOOKUP article are phenomenal. If my leaguemates are reading I am about to lose what I consider to be one of my big edges in the fantasy game: facile number crunching and indexing 🙂

Reply to Scott

Thanks! I’m glad you liked it.

Now go forth and outnerd everyone, outnerd them to their graves.

Joe

An important note here (just in the very very rare case that this is something you’d encounter) is that this CAN be automated in VBA as a macro (which is what you had to do in the bad old days). Now, if the set has enough “rows” to be an issue it’s probably stored in a .csv friendly framework.

If you’re using something like R to analyze data, the original csv is just going to be more efficient to work with. I’m not sure it’s ever a problem in MLB contexts, but on the SMALL side sportsVU data is a gig per file, which simply won’t open in excel. I’m actually not sure about that- though I can guarantee 8 gigs won’t open from experience. In most cases I think Excel is great. Once the data starts getting medium sized, though, it starts to get slow. If you’d like a deeper primer on how to clean data, btw, Johns Hopkins BioStat has a FANTASTIC coursera course on data science.

BAL	CHW	LAA
BOS	CLE	OAK
NYY	DET	SEA
TBR	KCR	TEX
TOR	MIN	HOU

ATL	CHC*	ARI
MIA	CIN	COL
WSN	MIL	LAD
NYM*	PIT	SDP*
PHI	STL	SFG