Converting Fixed-Width Files: There Are Many Forms

In one of my “from the vault” posts, I discuss how to use Python to convert a fixed-width file to a tab-delimited file. I explained that when I tried to use the R function read.fwf, it was disgustingly slow. In fact, if I use too large a buffersize it was actually crashing R at work today! I made an unfair comparison, though. The function I devised when I used Python was in accord to how I wanted to store the data–i.e., in an archived (compressed) file. There is merit to this, but it does not reflect the process I am about to describe.

The general idea of the conversion is simple. You read in a line of data and you fracture into pieces the length of the specified field widths. As fixed-width files require padding the entry with white spaces to fill the field width, it makes sense to remove them at this point. You then put the pieces back together, separating them by your delimiter. Then, write that line to your output file. Repeat for the length of the input file.

A simplified version of the Python script is to follow. I’m also going to show another formulation using a *nix tool called awk, which is very c-esque in its syntax. Lastly, I’m going to show you my new read.fwf for R that I developed today. For each of these, assume that ‘w’ is a space-delimited string of field width values, either hard coded in the script or, say, kept on the clipboard for the run. The mechanism of getting the widths into the process depends on the user’s work flow. Read on to learn more!

My Melt or Melt without Reshape

Lately, I’ve become interested in trying to perform R processes without depending on other packages. This often means I have to create a few wrappers, include a few extra lines of code, or just manage a few extra variables in the environment. Anyone that deals in data manipulation has had to have come across Wickham’s Reshape package. In all honesty, it is full of great utilities. However, one can quickly become complacent with the ease at which they process stuff in R using these tools. For instance, when I want to take some data in a “wide format” (i.e., has different columns for different fields/variables) and move it into a “long format” (i.e., a column for values and a factor column to specify which variable it is), I could often just say something like melt(x). It was the easiest way to make the conversion instead of using the base tools (which I speak of as including everything loaded into the environment by default–e.g., stats and utils), such as stack or reshape.

Over at the Talk Stats Forums, I brought up this question: How to do melt in base. Ultimately, I figured this one out on my own with a little playing around. By the end, I simplified it into a basic wrapper for stack. If you are unsure what I mean by that, a wrapper in programmatic terms is simply a function to call another function, possibly including a different interface and producing different results.

The problem with simply using stack is that it removes all the factor classes out of your data frame, leaving you with what I mentioned in the “long format” above: the values and the variable factors. But suppose my data columns were the vector (county, 2000, 2001, 2002), with county holding, say, county FIPS codes and the years holding unemployment rates under those years. My long format should be (county, year, value), with county and year as factor levels. The mere use of stack alone would result in (year, value). We need to append the county factor back to the new data frame, but it will clearly need to be duplicated enough times for each variable put into the new year factor. This will have to be repeated if we had more initial factor levels, too–e.g., we could have started with (state, county, 2000, 2001, 2002).

My wrapper solves this problem by two properties found in R. First, I know that the factor levels stack produces will be the fields in the original data frame that were non-factors. Thus, I can use this to pick out the names of the fields that were factors. Easy! The other property is that of recycling. If I join the vectors (1, 2) and (a, b, c, d), it is effectively like joining the vectors (1, 2, 1, 2) and (a, b, c, d). The shorter vector is recycled appropriately to fit the length of the longer. Intuitively, we should recognize that this recycling should appropriately fit the factor levels of whatever fields were discarded by stack back into the resulting data frame. Thus, “my melt” uses those properties to act as a wrapper for stack to produce a complete data frame that “melted” the original wide format to its long form.

melt <- function(df) {
  long <- stack(df)
  vars <- unique(levels(long$ind))
  cbind(df[, which(!names(df) %in% vars)], long)
}

From the Vault: SAS Datalines in R

To be honest, I’m totally ripping this from somewhere else. Usually I think I saw it as a trick in Data Manipulation with R. Even if I conceived of it on my own, the code was from somewhere else. I just happened to think “I could basically do the SAS datalines with this code.” For those of you familiar with SAS, you should have come across cards or datalines in some examples out there on the net. For instance, the Academic Technology Services at UCLA demonstrate reading data this way by this example:

DATA survey;
   INPUT id sex $ age inc r1 r2 r3 ;
   DATALINES;
 1  F  35 17  7 2 2
17  M  50 14  5 5 3
33  F  45  6  7 2 7
49  M  24 14  7 5 7
65  F  52  9  4 7 7
81  M  44 11  7 7 7
2   F  34 17  6 5 3
18  M  40 14  7 5 2
34  F  47  6  6 5 6
50  M  35 17  5 7 5
;
RUN;

It is pretty clear what it does. You define a data set called “survey” that has the specified fields. The datalines statement is then followed by a series of lines of data to match the input statement in content. It provides a quick and easy way to read in small data sets. This is especially useful for the case you are dealing with small data. You can keep the data import with the code itself. It is also apt for examples.

R does not explicitly possess a similar behavior; though you can conveniently enter data into the R environment by copying a body of data like the above to the clipboard and use a read.table(“clipboard”, …) statement. Nevertheless, to recreate the SAS datalines or cards example, we can use a simple wrapper along with scan as follows.

 survey <- data.frame(
   scan(what =
     list(id = 0, sex = “factor”, age = 0,
          inc = 0, r1 = 0, r2 = 0, r3 = 0)
   ) # end scan
 ) # end data.frame
 1  F 35 17 7 2 2
 17 M 50 14 5 5 3
 33 F 45 6  7 2 7
 49 M 24 14 7 5 7
 65 F 52 9  4 7 7
 81 M 44 11 7 7 7
 2  F 34 17 6 5 3
 18 M 40 14 7 5 2
 34 F 47 6  6 5 6
 50 M 35 17 5 7 5

There should be a extra line below the last data line. The reason is that scan terminates once a blank entry is specified. The scan command has a what parameter that is similar to the input statement in SAS. The parameter takes a list object that contains data type identifiers for each field in the data set. For ease, you can use a data entity to represent the type, such as 0 for numeric or “” for character. We wrap this scan statement with a call to data.frame so the assigned object has the desired structure. The extra call wrapped around scan, however, can make the code daunting as the number of parentheses rises. Thus, we can make a wrapper for this function.

datalines <- function(...) data.frame(scan(...))
datalines(what = list(x = 0, y = 0))
 20  114
 196 921
 115 560
 50  245
 122 575
 100 475
 33  138
 154 727
 80  375
 147 670
 182 828
 160 762

This datalines function is effectively nothing more than a call to scan that returns a data.frame object. It makes the code come out cleaner and more closely aligns to what we see in the SAS datalines command. While this is not world shattering news in the R community, it can serve the same utility that the SAS datalines command does. It can also help those in transition between SAS and R to think about “how can I make that SAS behavior in R?”