My Melt or Melt without Reshape

Lately, I’ve become interested in trying to perform R processes without depending on other packages. This often means I have to create a few wrappers, include a few extra lines of code, or just manage a few extra variables in the environment. Anyone that deals in data manipulation has had to have come across Wickham’s Reshape package. In all honesty, it is full of great utilities. However, one can quickly become complacent with the ease at which they process stuff in R using these tools. For instance, when I want to take some data in a “wide format” (i.e., has different columns for different fields/variables) and move it into a “long format” (i.e., a column for values and a factor column to specify which variable it is), I could often just say something like melt(x). It was the easiest way to make the conversion instead of using the base tools (which I speak of as including everything loaded into the environment by default–e.g., stats and utils), such as stack or reshape.

Over at the Talk Stats Forums, I brought up this question: How to do melt in base. Ultimately, I figured this one out on my own with a little playing around. By the end, I simplified it into a basic wrapper for stack. If you are unsure what I mean by that, a wrapper in programmatic terms is simply a function to call another function, possibly including a different interface and producing different results.

The problem with simply using stack is that it removes all the factor classes out of your data frame, leaving you with what I mentioned in the “long format” above: the values and the variable factors. But suppose my data columns were the vector (county, 2000, 2001, 2002), with county holding, say, county FIPS codes and the years holding unemployment rates under those years. My long format should be (county, year, value), with county and year as factor levels. The mere use of stack alone would result in (year, value). We need to append the county factor back to the new data frame, but it will clearly need to be duplicated enough times for each variable put into the new year factor. This will have to be repeated if we had more initial factor levels, too–e.g., we could have started with (state, county, 2000, 2001, 2002).

My wrapper solves this problem by two properties found in R. First, I know that the factor levels stack produces will be the fields in the original data frame that were non-factors. Thus, I can use this to pick out the names of the fields that were factors. Easy! The other property is that of recycling. If I join the vectors (1, 2) and (a, b, c, d), it is effectively like joining the vectors (1, 2, 1, 2) and (a, b, c, d). The shorter vector is recycled appropriately to fit the length of the longer. Intuitively, we should recognize that this recycling should appropriately fit the factor levels of whatever fields were discarded by stack back into the resulting data frame. Thus, “my melt” uses those properties to act as a wrapper for stack to produce a complete data frame that “melted” the original wide format to its long form.

melt <- function(df) {
  long <- stack(df)
  vars <- unique(levels(long$ind))
  cbind(df[, which(!names(df) %in% vars)], long)
}

From the Vault: SAS Datalines in R

To be honest, I’m totally ripping this from somewhere else. Usually I think I saw it as a trick in Data Manipulation with R. Even if I conceived of it on my own, the code was from somewhere else. I just happened to think “I could basically do the SAS datalines with this code.” For those of you familiar with SAS, you should have come across cards or datalines in some examples out there on the net. For instance, the Academic Technology Services at UCLA demonstrate reading data this way by this example:

DATA survey;
   INPUT id sex $ age inc r1 r2 r3 ;
   DATALINES;
 1  F  35 17  7 2 2
17  M  50 14  5 5 3
33  F  45  6  7 2 7
49  M  24 14  7 5 7
65  F  52  9  4 7 7
81  M  44 11  7 7 7
2   F  34 17  6 5 3
18  M  40 14  7 5 2
34  F  47  6  6 5 6
50  M  35 17  5 7 5
;
RUN;

It is pretty clear what it does. You define a data set called “survey” that has the specified fields. The datalines statement is then followed by a series of lines of data to match the input statement in content. It provides a quick and easy way to read in small data sets. This is especially useful for the case you are dealing with small data. You can keep the data import with the code itself. It is also apt for examples.

R does not explicitly possess a similar behavior; though you can conveniently enter data into the R environment by copying a body of data like the above to the clipboard and use a read.table(“clipboard”, …) statement. Nevertheless, to recreate the SAS datalines or cards example, we can use a simple wrapper along with scan as follows.

 survey <- data.frame(
   scan(what =
     list(id = 0, sex = “factor”, age = 0,
          inc = 0, r1 = 0, r2 = 0, r3 = 0)
   ) # end scan
 ) # end data.frame
 1  F 35 17 7 2 2
 17 M 50 14 5 5 3
 33 F 45 6  7 2 7
 49 M 24 14 7 5 7
 65 F 52 9  4 7 7
 81 M 44 11 7 7 7
 2  F 34 17 6 5 3
 18 M 40 14 7 5 2
 34 F 47 6  6 5 6
 50 M 35 17 5 7 5

There should be a extra line below the last data line. The reason is that scan terminates once a blank entry is specified. The scan command has a what parameter that is similar to the input statement in SAS. The parameter takes a list object that contains data type identifiers for each field in the data set. For ease, you can use a data entity to represent the type, such as 0 for numeric or “” for character. We wrap this scan statement with a call to data.frame so the assigned object has the desired structure. The extra call wrapped around scan, however, can make the code daunting as the number of parentheses rises. Thus, we can make a wrapper for this function.

datalines <- function(...) data.frame(scan(...))
datalines(what = list(x = 0, y = 0))
 20  114
 196 921
 115 560
 50  245
 122 575
 100 475
 33  138
 154 727
 80  375
 147 670
 182 828
 160 762

This datalines function is effectively nothing more than a call to scan that returns a data.frame object. It makes the code come out cleaner and more closely aligns to what we see in the SAS datalines command. While this is not world shattering news in the R community, it can serve the same utility that the SAS datalines command does. It can also help those in transition between SAS and R to think about “how can I make that SAS behavior in R?”

From the Vault: Convert Fixed-Width Files to Delimited

Since I plan to extinguish my Xanga Blog in the near future, I decided a couple of technical entries from there should find their way here. Therefore, over the next few weeks, I’m going to begin a series of “From the Vault” blogs to effect that migration. These will include scripts and mathematically oriented blogs that have received a good deal of attention (traffic, particularly from Google searches).

The case I faced with fixed-width files, which will reappear in future blogs, is that the government will often present you with data in this format. To the novice, one might think of using a text-to-columns like import with Excel or similar Access import would suffice. This is true, but it is not automated, and it requires you to point-and-click every field width you require. This is burdensome when your fields range well over 50 and impossible when Excel just cannot contain the required number of records (at least in Office 2003 I believe they maxed out at around 65,000 rows?).

I’m an R user, as this website should attest to, and I wanted an R solution. Unfortunately, the data was so large it was a very slow process. In fact, once I finally figured out the problems I had with R–these included the fact “#” was in my data and I needed to indicate the parameter in read.fwf to ignore it, among other things, like specifying appropriate data types as some numeric categorical fields did have alphabetic codes in it–the process took about 10 minutes to get the data into the R environment (or a few seconds more to save as Rdata). This was unacceptable. Enter Python.

The Python script below did two things I particularly enjoyed. First of all, my fixed-width file is plain ASCII text of nearly 100 MB. I can store this in a compressed ZIP archive to save space and group together other data of the same type. My Python script accesses these ZIP archives, as well as takes in the location to a space-delimited text file that a string of width values for each field of the data. This converts the fixed-width data to a tab-delimited file for easy (read.delim) import into R or other software. The whole process took me just under 2 minutes. Future awk scripts I’ll demonstrate cut that time down slightly more, and are far more simple than this script. Stay tuned.

import zipfile

f = open("widths file location", "r")
widths = f.readline()
widths = widths.split()
widths = [int(x) for x in widths]
f.close()

zf = zipfile.ZipFile("archive file location", "r")
rf = zf.open("dataset name in zf")
wf = open("output file location", "w")
zf.close()

for rline in rf:
  wline = ""
  a = 0
  for k in widths:
    b = a + k
    wline = wline + rline[a:b] + "\t"
    a = b
  wline = wline[:-1] + "\n"
  wf.write(wline)

rf.close()
wf.close()

The only thing required here is to change the “… file location” to where you have the files stored. A more robust version of this script should take in parameters from the user, either graphically or at the command-line, to specify those file locations. It would also be wise to provide options for the delimiter. In this case, the “\t” on line 19 (including blank lines) specifies this as tab-delimited. I prefer tabs because they’re often not included in data cells. If you do a comma-delimited file, the data value might include a comma that can throw off future importing (annoying!!). Nevertheless, this should be parameterized. I may update this in the future or create a list of “blogged about code snippets” for my projects page.

New Beginnings

Since I purchased this website (hosted by iPage) toward the end of summer (2011), I haven’t done too much with it. I quickly put up my personal information, my R scripts for my ALSM project, and some basic formatting. It has been idle most of the time. That is going to change.

First of all, I just installed ScribeFire for FireFox, from which I am posting this now. Thus, I can avoid the hassle of signing in when I want to publish a post. I also find the WordPress editor to run slowly on my old computer. I’m not sure how well this integrates with WordPress, but I’m hoping it’ll be enough for what I’m going to do; namely, I want to be able to write in LaTeX when appropriate, and once I get my CSS set up for a “code block” to improve readability when I publish code snippets, it will not require much. I notice ScribeFire has a “publish custom html tag,” so I’m hoping that will suffice.

Some things to expect as I begin adding more content to my website.

  • I found my old R code from my Numerical Analysis class. My R programming skills were horrible back then, at least from my current perspective. I’m going to rewrite them, entirely. I’ll include both an interactive script and a command-line (executable) script for each project. On that note, I’m also going to refresh my memory of Matlab by rewriting the functions using Octave, a GNU clone of Matlab. 
  • I will expand on my ALSM project, since I haven’t touched it in months. I’m going to rewrite a good portion of it, especially the commentary. I’ll write a function that will turn them into pretty HTML pages and include those for a better way to view my scripts.
  • I am proud of my writing and have yet to publish it here. I’ll add to the document section by including my old work and adding the current project I’m doing for a GIS class. It’ll be accompanied by a (required for the class) web page version.
  • There is much I can talk about with regards to programming as of late. Not only am I finding interesting things to discuss with regards to R, but also with regard to data analysis utilities in general. For instance, I’ve been playing around with Linux tools since I’ve migrated to Ubuntu a couple months ago. I am excited with the capabilities things like awk and cut provide. I’ll begin talking about those uses on my blog, along with other topics such as regular expressions and whatever else I encounter. 
  • Lastly, I also am going to put together a nifty icon sequence at the right-hand side of my “menu bar” that will link to the various places you can find me online that are listed under my “About” page. I know how to do the HTML/CSS for it. I don’t know how to make images! I’m not a graphics guy lol

Stay tune, as I need an outlet for my work to be recognized, and this website is definitely a work in progress. There is much I can share, and I know it can be of use to those out there that have to deal with data. As I learn, I want to contribute that to the broader community of analysts. Join me in discovering new ways to do the work we have a passion for!