In one of my “from the vault” posts, I discuss how to use Python to convert a fixed-width file to a tab-delimited file. I explained that when I tried to use the R function read.fwf, it was disgustingly slow. In fact, if I use too large a buffersize it was actually crashing R at work today! I made an unfair comparison, though. The function I devised when I used Python was in accord to how I wanted to store the data–i.e., in an archived (compressed) file. There is merit to this, but it does not reflect the process I am about to describe.

The general idea of the conversion is simple. You read in a line of data and you fracture into pieces the length of the specified field widths. As fixed-width files require padding the entry with white spaces to fill the field width, it makes sense to remove them at this point. You then put the pieces back together, separating them by your delimiter. Then, write that line to your output file. Repeat for the length of the input file.

A simplified version of the Python script is to follow. I’m also going to show another formulation using a *nix tool called awk, which is very c-esque in its syntax. Lastly, I’m going to show you my new read.fwf for R that I developed today. For each of these, assume that ‘w’ is a space-delimited string of field width values, either hard coded in the script or, say, kept on the clipboard for the run. The mechanism of getting the widths into the process depends on the user’s work flow. Read on to learn more!