Lately, I’ve become interested in trying to perform R processes without depending on other packages. This often means I have to create a few wrappers, include a few extra lines of code, or just manage a few extra variables in the environment. Anyone who does data manipulation has probably come across Wickham’s reshape package. In all honesty, it is full of great utilities. However, one can quickly become complacent with the ease with which these tools let you process data in R. For instance, when I want to take some data in a “wide format” (i.e., one with a separate column for each field/variable) and move it into a “long format” (i.e., a column for the values and a factor column to specify which variable each value belongs to), I could often just say something like melt(x). That was the easiest way to make the conversion, rather than using base tools such as stack or reshape (by “base” I mean everything loaded into the environment by default, e.g., stats and utils).
Over at the Talk Stats Forums, I brought up this question: How to do melt in base. Ultimately, I figured this one out on my own with a little playing around. By the end, I had simplified it into a basic wrapper for stack. If you are unsure what I mean by that, a wrapper, in programming terms, is simply a function that calls another function, possibly with a different interface and different results.
The problem with simply using stack is that it drops every factor column from your data frame, leaving you with only what I described as the “long format” above: the values and the variable factor. But suppose my data had the columns (county, 2000, 2001, 2002), with county holding, say, county FIPS codes and the year columns holding the unemployment rates for those years. My long format should be (county, year, value), with county and year as factors. Using stack alone would give me just (year, value). We need to append the county factor back onto the new data frame, and it clearly has to be repeated once for each variable that went into the new year factor. The same has to be done for every additional factor column we started with, e.g., (state, county, 2000, 2001, 2002).
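To see the problem concretely, here is a small sketch. The data frame wide, its FIPS codes, and its rates are made up for illustration; the only thing taken from base R is stack itself, which may warn about the column it ignores in current versions.

```r
# Hypothetical wide data: one factor column plus one numeric column per year.
# The codes and rates below are invented for this example.
wide <- data.frame(
  county = factor(c("01001", "01003", "01005")),
  `2000` = c(4.1, 3.7, 5.6),
  `2001` = c(5.0, 4.4, 7.2),
  `2002` = c(5.7, 5.1, 7.9),
  check.names = FALSE  # keep the year names as-is instead of X2000, ...
)

# stack() skips the factor column; suppressWarnings() hides the
# "non-vector columns will be ignored" warning that newer R versions give
long <- suppressWarnings(stack(wide))

names(long)       # "values" "ind"  (county is gone)
levels(long$ind)  # "2000" "2001" "2002"
nrow(long)        # 9, i.e. three counties times three years
```

The county column is missing from the result, which is exactly the gap the wrapper has to fill.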
My wrapper solves this problem using two properties of R. First, I know that the factor levels stack produces will be the names of the fields in the original data frame that were non-factors. Thus, I can use them to pick out the names of the fields that were factors. Easy! The other property is recycling. If I join the vectors (1, 2) and (a, b, c, d), it is effectively like joining the vectors (1, 2, 1, 2) and (a, b, c, d). The shorter vector is recycled to fit the length of the longer. Because stack lays out the values one original column after another, recycling the discarded fields lines each factor value up with the right rows of the resulting data frame. Thus, “my melt” uses those properties to act as a wrapper for stack and produce a complete data frame that “melts” the original wide format into its long form.
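The recycling property is easy to check on its own. A minimal sketch, with made-up vectors and data frames (the names short, long_vec, ids and vals are mine, purely for illustration):

```r
# Joining a length-2 vector with a length-4 vector in a data frame:
# the shorter one is repeated until it fits
short    <- c(1, 2)
long_vec <- c("a", "b", "c", "d")
joined   <- data.frame(short, long_vec)
joined$short  # 1 2 1 2

# The same happens when cbind() joins two data frames, as long as the
# longer one has a whole multiple of the shorter one's rows
ids     <- data.frame(county = factor(c("x", "y")))
vals    <- data.frame(values = 1:4)
stacked <- cbind(ids, vals)
as.character(stacked$county)  # "x" "y" "x" "y"
```

With that in hand, the wrapper itself: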
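melt <- function(df) {
  # stack() keeps only the non-factor columns and returns `values` and `ind`
  long <- stack(df)
  # the levels of `ind` are the names of the columns that were stacked
  vars <- levels(long$ind)
  # bind the leftover factor columns back on (recycled to nrow(long));
  # drop = FALSE keeps a single leftover column as a named column
  cbind(df[, !names(df) %in% vars, drop = FALSE], long)
}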
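A quick way to try it out, as a sketch rather than a full test. The wrapper is restated here, with drop = FALSE so that a lone factor column keeps its name, just so the block runs on its own. The wide data frame is hypothetical, and the final renaming to (state, county, year, value) is my own addition to match the long format described earlier.

```r
# The wrapper, restated so this example is self-contained
melt <- function(df) {
  long <- stack(df)
  vars <- levels(long$ind)
  cbind(df[, !names(df) %in% vars, drop = FALSE], long)
}

# Hypothetical data: two factor columns and two year columns
wide <- data.frame(
  state  = factor(c("AL", "AL", "AL")),
  county = factor(c("01001", "01003", "01005")),
  `2000` = c(4.1, 3.7, 5.6),
  `2001` = c(5.0, 4.4, 7.2),
  check.names = FALSE
)

# stack() may warn about the factor columns it ignores
molten <- suppressWarnings(melt(wide))
names(molten)  # "state" "county" "values" "ind"

# Optional tidy-up: reorder and rename to (state, county, year, value)
molten <- molten[, c("state", "county", "ind", "values")]
names(molten)[3:4] <- c("year", "value")
```

Note that this relies on the id columns being factors. stack treats character columns as stackable, so ids stored as character strings would need converting with factor() first.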