Since I plan to extinguish my Xanga Blog in the near future, I decided a couple of technical entries from there should find their way here. Therefore, over the next few weeks, I’m going to begin a series of “From the Vault” blogs to effect that migration. These will include scripts and mathematically oriented blogs that have received a good deal of attention (traffic, particularly from Google searches).

The case I faced with fixed-width files, which will reappear in future blogs, is that the government will often present you with data in this format. To the novice, one might think of using a text-to-columns like import with Excel or similar Access import would suffice. This is true, but it is not automated, and it requires you to point-and-click every field width you require. This is burdensome when your fields range well over 50 and impossible when Excel just cannot contain the required number of records (at least in Office 2003 I believe they maxed out at around 65,000 rows?).

I’m an R user, as this website should attest to, and I wanted an R solution. Unfortunately, the data was so large it was a very slow process. In fact, once I finally figured out the problems I had with R–these included the fact “#” was in my data and I needed to indicate the parameter in read.fwf to ignore it, among other things, like specifying appropriate data types as some numeric categorical fields did have alphabetic codes in it–the process took about 10 minutes to get the data into the R environment (or a few seconds more to save as Rdata). This was unacceptable. Enter Python.

The Python script below did two things I particularly enjoyed. First of all, my fixed-width file is plain ASCII text of nearly 100 MB. I can store this in a compressed ZIP archive to save space and group together other data of the same type. My Python script accesses these ZIP archives, as well as takes in the location to a space-delimited text file that a string of width values for each field of the data. This converts the fixed-width data to a tab-delimited file for easy (read.delim) import into R or other software. The whole process took me just under 2 minutes. Future awk scripts I’ll demonstrate cut that time down slightly more, and are far more simple than this script. Stay tuned.

import zipfile

f = open("widths file location", "r")
widths = f.readline()
widths = widths.split()
widths = [int(x) for x in widths]
f.close()

zf = zipfile.ZipFile("archive file location", "r")
rf = zf.open("dataset name in zf")
wf = open("output file location", "w")
zf.close()

for rline in rf:
  wline = ""
  a = 0
  for k in widths:
    b = a + k
    wline = wline + rline[a:b] + "\t"
    a = b
  wline = wline[:-1] + "\n"
  wf.write(wline)

rf.close()
wf.close()

The only thing required here is to change the “… file location” to where you have the files stored. A more robust version of this script should take in parameters from the user, either graphically or at the command-line, to specify those file locations. It would also be wise to provide options for the delimiter. In this case, the “\t” on line 19 (including blank lines) specifies this as tab-delimited. I prefer tabs because they’re often not included in data cells. If you do a comma-delimited file, the data value might include a comma that can throw off future importing (annoying!!). Nevertheless, this should be parameterized. I may update this in the future or create a list of “blogged about code snippets” for my projects page.