In one of my “from the vault” posts, I discuss how to use Python to convert a fixed-width file to a tab-delimited file. I explained that when I tried to use the R function read.fwf, it was disgustingly slow. In fact, if I use too large a buffersize it was actually crashing R at work today! I made an unfair comparison, though. The function I devised when I used Python was in accord to how I wanted to store the data–i.e., in an archived (compressed) file. There is merit to this, but it does not reflect the process I am about to describe.

The general idea of the conversion is simple. You read in a line of data and you fracture into pieces the length of the specified field widths. As fixed-width files require padding the entry with white spaces to fill the field width, it makes sense to remove them at this point. You then put the pieces back together, separating them by your delimiter. Then, write that line to your output file. Repeat for the length of the input file.

A simplified version of the Python script is to follow. I’m also going to show another formulation using a *nix tool called awk, which is very c-esque in its syntax. Lastly, I’m going to show you my new read.fwf for R that I developed today. For each of these, assume that ‘w’ is a space-delimited string of field width values, either hard coded in the script or, say, kept on the clipboard for the run. The mechanism of getting the widths into the process depends on the user’s work flow.

Python

widths = w
widths = widths.split()  # Splits by space delimiter; stores character
widths = [int(x) for x in widths]  # Converts character list to numeric

rf = open(infile, "r")  # 'infile' is the name of the fixed-width file
wf = open(outfile, "w")  # 'outfile' is the name of the output file
for rline in rf:  # For each line read in, convert it to a tab-delimited string and print
  wline = ""  # The line to be written out, starts empty
  a = 0
  for k in widths:
    b = a + k
    wline = wline + rline[a:b].strip() + "\t"  # Append a given fracture, without whitespaces, separated by tab
    a = b
  wline = wline[:-1] + "\n"  # Replace final tab with a newline
  wf.write(wline)  # Write the line to the write-file

rf.close()
wf.close()

Awk (at command prompt, type “awk -f function-file infile > outfile”)

BEGIN {
  FIELDWIDTHS = w;
}

{
// For all but last (i.e., not equal to NF) field ...
// Remove white spaces from the start (left)
// Remove white spaces from the end (right)
// Print the cleaned up fields tab-delimited
  for (i=1; i < NF; i++) {
    sub(/^[ ]+/, "", $i);
    sub(/[ ]+$/, "", $i);
    printf "%s\t", $i;
  }
// Increment up to NF; remove white spaces
// Print last field followed by newline ("\r\n" is Windows system newline)
   i++;
   sub(/^[ ]+/, "", $i);
   sub(/[ ]+$/, "", $i);
   printf "%s\r\n", $i;
 }

R

fixedSplit <- function(x, start, end, sep = "\t"){
  strings <- substring(x, start, end)            # Fracture string by widths
  strings <- gsub("[[:space:]]+$", "", strings)  # Remove white from right
  strings <- gsub("^[[:space:]]+", "", strings)  # Remove white from left
  strings <- paste(strings, collapse=sep)        # Make delimited string

  return(strings)
}

fix2del <- function(infile, fieldwidths, nlines, outfile = "outfile.txt",
                    buffersize = 2000, sep = "\t") {
    start_pts <- head(cumsum(c(1, fieldwidths)), -1)
    end_pts   <- start_pts + fieldwidths - 1
    index     <- 0
    repeat {
        lines <- scan(infile, what = "character", sep = "\n",
                      skip = index, n = buffersize, quiet = TRUE)
        index <- index + buffersize

        cat(unlist(lapply(lines, fixedSplit,
                          start = start_pts, end = end_pts, sep = sep)),
            file   = outfile,
            sep    = "\n",
            append = TRUE)

        if (index > nlines)
            break
    }  # end repeat
}  # end fix2del
w <- scan("clipboard", sep = " ")  # Read space-delimited list of widths
fix2del(infile, fieldwidths = w, nlines = 239478)  # Requires you to know your data
x <- read.delim("outfile.txt", header = FALSE)

I was never able to figure out how to put together this R script, and I always figured it would still be slow by comparison. The Python script, which read from archives, took upwards of 3 minutes. The awk function was nice and simple. It took less than 1.5 minutes. With the simplified Python function here, I was absolutely blown away by its speed! The results are:

Read takes approximately 30-35 seconds
awk: 1:20 (write)
R: 2:20 (write+read)
Python: 0:20 (write)

I would have thought awk to be the fastest, but I was impressed with the R script. The extra minute is worth it to have everything in an R work flow. Nevertheless, Python was the most efficient. Since Python is also easy to have on any system, it is readily available. I call it the analyst’s Swiss army knife. You can do a system call from within R to run the given script, and then have R pull in the file. With the script parameterized, you can have R handle all the variables, encapsulating everything in one R function, since the system call in R is just a string.

Honestly, I think fixed-width files should just go. I see no real benefit in them over delimited files or other storage means. Nevertheless, reading in fixed-width tabular data has its benefits if we look at it another way. By this I mean we can think of it is as “bin importing.” It is most effective when your tabular data is made for presentation, not transportation. As long as you can specify field widths that capture all the contents of some displayed textual data, you can use these methods to very quickly import it for further processing. The details of this will have to wait until tomorrow!