Converting Fixed-Width Files: There Are Many Forms

In one of my “from the vault” posts, I discuss how to use Python to convert a fixed-width file to a tab-delimited file. I explained that when I tried to use the R function read.fwf, it was disgustingly slow. In fact, if I use too large a buffersize it was actually crashing R at work today! I made an unfair comparison, though. The function I devised when I used Python was in accord to how I wanted to store the data–i.e., in an archived (compressed) file. There is merit to this, but it does not reflect the process I am about to describe.

The general idea of the conversion is simple. You read in a line of data and you fracture into pieces the length of the specified field widths. As fixed-width files require padding the entry with white spaces to fill the field width, it makes sense to remove them at this point. You then put the pieces back together, separating them by your delimiter. Then, write that line to your output file. Repeat for the length of the input file.

A simplified version of the Python script is to follow. I’m also going to show another formulation using a *nix tool called awk, which is very c-esque in its syntax. Lastly, I’m going to show you my new read.fwf for R that I developed today. For each of these, assume that ‘w’ is a space-delimited string of field width values, either hard coded in the script or, say, kept on the clipboard for the run. The mechanism of getting the widths into the process depends on the user’s work flow. Read on to learn more!

My Melt or Melt without Reshape

Lately, I’ve become interested in trying to perform R processes without depending on other packages. This often means I have to create a few wrappers, include a few extra lines of code, or just manage a few extra variables in the environment. Anyone that deals in data manipulation has had to have come across Wickham’s Reshape package. In all honesty, it is full of great utilities. However, one can quickly become complacent with the ease at which they process stuff in R using these tools. For instance, when I want to take some data in a “wide format” (i.e., has different columns for different fields/variables) and move it into a “long format” (i.e., a column for values and a factor column to specify which variable it is), I could often just say something like melt(x). It was the easiest way to make the conversion instead of using the base tools (which I speak of as including everything loaded into the environment by default–e.g., stats and utils), such as stack or reshape.

Over at the Talk Stats Forums, I brought up this question: How to do melt in base. Ultimately, I figured this one out on my own with a little playing around. By the end, I simplified it into a basic wrapper for stack. If you are unsure what I mean by that, a wrapper in programmatic terms is simply a function to call another function, possibly including a different interface and producing different results.

The problem with simply using stack is that it removes all the factor classes out of your data frame, leaving you with what I mentioned in the “long format” above: the values and the variable factors. But suppose my data columns were the vector (county, 2000, 2001, 2002), with county holding, say, county FIPS codes and the years holding unemployment rates under those years. My long format should be (county, year, value), with county and year as factor levels. The mere use of stack alone would result in (year, value). We need to append the county factor back to the new data frame, but it will clearly need to be duplicated enough times for each variable put into the new year factor. This will have to be repeated if we had more initial factor levels, too–e.g., we could have started with (state, county, 2000, 2001, 2002).

My wrapper solves this problem by two properties found in R. First, I know that the factor levels stack produces will be the fields in the original data frame that were non-factors. Thus, I can use this to pick out the names of the fields that were factors. Easy! The other property is that of recycling. If I join the vectors (1, 2) and (a, b, c, d), it is effectively like joining the vectors (1, 2, 1, 2) and (a, b, c, d). The shorter vector is recycled appropriately to fit the length of the longer. Intuitively, we should recognize that this recycling should appropriately fit the factor levels of whatever fields were discarded by stack back into the resulting data frame. Thus, “my melt” uses those properties to act as a wrapper for stack to produce a complete data frame that “melted” the original wide format to its long form.

melt <- function(df) {
  long <- stack(df)
  vars <- unique(levels(long$ind))
  cbind(df[, which(!names(df) %in% vars)], long)
}

From the Vault: SAS Datalines in R

To be honest, I’m totally ripping this from somewhere else. Usually I think I saw it as a trick in Data Manipulation with R. Even if I conceived of it on my own, the code was from somewhere else. I just happened to think “I could basically do the SAS datalines with this code.” For those of you familiar with SAS, you should have come across cards or datalines in some examples out there on the net. For instance, the Academic Technology Services at UCLA demonstrate reading data this way by this example:

DATA survey;
   INPUT id sex $ age inc r1 r2 r3 ;
   DATALINES;
 1  F  35 17  7 2 2
17  M  50 14  5 5 3
33  F  45  6  7 2 7
49  M  24 14  7 5 7
65  F  52  9  4 7 7
81  M  44 11  7 7 7
2   F  34 17  6 5 3
18  M  40 14  7 5 2
34  F  47  6  6 5 6
50  M  35 17  5 7 5
;
RUN;

It is pretty clear what it does. You define a data set called “survey” that has the specified fields. The datalines statement is then followed by a series of lines of data to match the input statement in content. It provides a quick and easy way to read in small data sets. This is especially useful for the case you are dealing with small data. You can keep the data import with the code itself. It is also apt for examples.

R does not explicitly possess a similar behavior; though you can conveniently enter data into the R environment by copying a body of data like the above to the clipboard and use a read.table(“clipboard”, …) statement. Nevertheless, to recreate the SAS datalines or cards example, we can use a simple wrapper along with scan as follows.

 survey <- data.frame(
   scan(what =
     list(id = 0, sex = “factor”, age = 0,
          inc = 0, r1 = 0, r2 = 0, r3 = 0)
   ) # end scan
 ) # end data.frame
 1  F 35 17 7 2 2
 17 M 50 14 5 5 3
 33 F 45 6  7 2 7
 49 M 24 14 7 5 7
 65 F 52 9  4 7 7
 81 M 44 11 7 7 7
 2  F 34 17 6 5 3
 18 M 40 14 7 5 2
 34 F 47 6  6 5 6
 50 M 35 17 5 7 5

There should be a extra line below the last data line. The reason is that scan terminates once a blank entry is specified. The scan command has a what parameter that is similar to the input statement in SAS. The parameter takes a list object that contains data type identifiers for each field in the data set. For ease, you can use a data entity to represent the type, such as 0 for numeric or “” for character. We wrap this scan statement with a call to data.frame so the assigned object has the desired structure. The extra call wrapped around scan, however, can make the code daunting as the number of parentheses rises. Thus, we can make a wrapper for this function.

datalines <- function(...) data.frame(scan(...))
datalines(what = list(x = 0, y = 0))
 20  114
 196 921
 115 560
 50  245
 122 575
 100 475
 33  138
 154 727
 80  375
 147 670
 182 828
 160 762

This datalines function is effectively nothing more than a call to scan that returns a data.frame object. It makes the code come out cleaner and more closely aligns to what we see in the SAS datalines command. While this is not world shattering news in the R community, it can serve the same utility that the SAS datalines command does. It can also help those in transition between SAS and R to think about “how can I make that SAS behavior in R?”

From the Vault: Convert Fixed-Width Files to Delimited

Since I plan to extinguish my Xanga Blog in the near future, I decided a couple of technical entries from there should find their way here. Therefore, over the next few weeks, I’m going to begin a series of “From the Vault” blogs to effect that migration. These will include scripts and mathematically oriented blogs that have received a good deal of attention (traffic, particularly from Google searches).

The case I faced with fixed-width files, which will reappear in future blogs, is that the government will often present you with data in this format. To the novice, one might think of using a text-to-columns like import with Excel or similar Access import would suffice. This is true, but it is not automated, and it requires you to point-and-click every field width you require. This is burdensome when your fields range well over 50 and impossible when Excel just cannot contain the required number of records (at least in Office 2003 I believe they maxed out at around 65,000 rows?).

I’m an R user, as this website should attest to, and I wanted an R solution. Unfortunately, the data was so large it was a very slow process. In fact, once I finally figured out the problems I had with R–these included the fact “#” was in my data and I needed to indicate the parameter in read.fwf to ignore it, among other things, like specifying appropriate data types as some numeric categorical fields did have alphabetic codes in it–the process took about 10 minutes to get the data into the R environment (or a few seconds more to save as Rdata). This was unacceptable. Enter Python.

The Python script below did two things I particularly enjoyed. First of all, my fixed-width file is plain ASCII text of nearly 100 MB. I can store this in a compressed ZIP archive to save space and group together other data of the same type. My Python script accesses these ZIP archives, as well as takes in the location to a space-delimited text file that a string of width values for each field of the data. This converts the fixed-width data to a tab-delimited file for easy (read.delim) import into R or other software. The whole process took me just under 2 minutes. Future awk scripts I’ll demonstrate cut that time down slightly more, and are far more simple than this script. Stay tuned.

import zipfile

f = open("widths file location", "r")
widths = f.readline()
widths = widths.split()
widths = [int(x) for x in widths]
f.close()

zf = zipfile.ZipFile("archive file location", "r")
rf = zf.open("dataset name in zf")
wf = open("output file location", "w")
zf.close()

for rline in rf:
  wline = ""
  a = 0
  for k in widths:
    b = a + k
    wline = wline + rline[a:b] + "\t"
    a = b
  wline = wline[:-1] + "\n"
  wf.write(wline)

rf.close()
wf.close()

The only thing required here is to change the “… file location” to where you have the files stored. A more robust version of this script should take in parameters from the user, either graphically or at the command-line, to specify those file locations. It would also be wise to provide options for the delimiter. In this case, the “\t” on line 19 (including blank lines) specifies this as tab-delimited. I prefer tabs because they’re often not included in data cells. If you do a comma-delimited file, the data value might include a comma that can throw off future importing (annoying!!). Nevertheless, this should be parameterized. I may update this in the future or create a list of “blogged about code snippets” for my projects page.