Wednesday, January 4, 2017

Package 'colr': Selecting Columns in 'R'


If you develop in 'R', like me, you may find yourself frequently selecting named data from a data set, such as a data frame or matrix. 'R' supplies many ways to do this, but the syntax depends on the data type. So I wrote this package (named 'colr') with one syntax for all types. It uses a very powerful syntax called 'regex'. You may not be familiar with 'regex', but it is a DSL for finding, and finding and replacing, text. It is implemented across a very wide range of languages, and 'R' ships the 'Perl'-compatible 'regex' engine in the base package.

You can use this new package in a simple way with limited knowledge of 'regex', but if you learn more 'regex' you can also use it in complex situations. The package is available on CRAN, and in 'R' you can install it with install.packages('colr').
The simplest case is selecting every column whose name contains a given string, such as all columns with 'length' in the column name. With this package that can be done with: cgrep(x, 'length').

Functions provided in 'colr'

Package 'colr' provides two functions: 'cgrep' and 'csub'. 'cgrep' selects all columns whose names match the pattern; 'csub' changes the matching column names according to the replacement. The syntax for both functions is identical, except that csub takes one more argument than cgrep.

cgrep

'cgrep' is a function to select columns or rows from a data frame, list or matrix with named columns. For lists the selection is by the names of the list, and cgrep acts only on the top level of the list. The return type is the same as the input type, containing the selected columns. The exception is a list or matrix from which only a single column was selected: then a flattened list or vector is returned.
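To make the type dependence concrete, here is a minimal base 'R' sketch of the same kind of selection done by hand on a data frame, a matrix and a list. It is not the code of 'colr'; it only illustrates the differing syntax that cgrep is meant to hide. The variable names (pattern, m, l and so on) are made up for the example.

```r
# Base R sketch (not colr's implementation): select columns whose
# names match a regex, once per data type.
pattern <- "Sepal"

# Data frame: index with the positions grep() finds in names().
df_sel <- iris[grep(pattern, names(iris))]

# Matrix: column names live in colnames(), and indexing needs the comma.
m <- as.matrix(iris[1:4])
m_sel <- m[, grep(pattern, colnames(m))]

# List: names() again, applied to the top level only.
l <- as.list(iris)
l_sel <- l[grep(pattern, names(l))]

# A single matching column of a matrix is dropped to a plain vector.
m_one <- m[, grep("Sepal.Width", colnames(m), fixed = TRUE)]

print(names(df_sel))
print(colnames(m_sel))
print(names(l_sel))
print(is.matrix(m_one))
```

With 'colr' the same call, cgrep(x, "Sepal"), is used for all three inputs.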

csub

'csub' is a function to change column or row names in a data frame, list or matrix with named columns. It returns the same data frame, list or matrix, with every matching column name changed according to the replacement. If no match is found, the returned data frame, list or matrix is exactly the same as the input. For lists csub acts on the names in the top level of the list.
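As a rough illustration of the renaming described above, this base 'R' sketch does it by hand on a matrix and on a list. It is an analogue under my own assumptions, not the code of 'colr', and the variable names are made up for the example.

```r
# Base R sketch (not colr's implementation): rename matching names.

# Matrix: column names are read and written through colnames().
m <- as.matrix(iris[1:4])
colnames(m) <- gsub("Width", "Girth", colnames(m))

# List: top-level names are read and written through names().
l <- as.list(iris)
names(l) <- gsub("Width", "Girth", names(l))

# A pattern that matches nothing leaves the names as they were.
unchanged <- gsub("Height", "Girth", names(iris))

print(colnames(m))
print(names(l))
print(identical(unchanged, names(iris)))
```

The matrix needs colnames() where the list needs names(); that difference is what a single csub call is meant to smooth over.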

regex

'Regex' is a DSL (grammar) for pattern matching and replacement in strings. A pattern consists of a quoted string of characters and meta characters. In 'R' the pattern is written as a quoted string, here with ". The meta characters have a special meaning in the grammar. An example of a meta character is the '.' (the dot), which stands for any character except a newline. So to match 'ow', 'ox' or 'ov' in the string 'The quick brown fox jumps over the hedge' you can use the pattern "o.". There is much more to say about 'regex'; the vignette of the package 'colr' will give you some introduction. I find it most instructive to learn 'regex' by example, and I use the website 'regexr.com' for that.
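The dot example can be tried directly in base 'R'. The sketch below uses gregexpr and regmatches to list every match of "o." in the sentence; the variable names s and hits are just for illustration.

```r
# Find every match of the pattern "o." (an 'o' followed by any character).
s <- "The quick brown fox jumps over the hedge"
hits <- regmatches(s, gregexpr("o.", s))[[1]]
print(hits)
# [1] "ow" "ox" "ov"

# To match a literal dot instead, the meta character has to be escaped.
print(grepl("o\\.", s))
# [1] FALSE
```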

Comparison

How is this package different from other methods? I frequently use the 'dplyr::select' function, which works on data frames but not on other types. Let's compare a simple selection from the Iris dataset (a data frame). Use of the 'magrittr' pipe style and of the relevant packages is implied in the examples.
with 'colr':
iris %>% cgrep("Sepal") %>% head()
with 'dplyr': 
iris %>% select(contains("Sepal")) %>% head()
using subsetting:
iris %>% .[c('Sepal.Width', 'Sepal.Length')] %>% head()
As you can see the cgrep function from 'colr' is concise (it is also precise but I don't demonstrate that here).

Let's do an example with find and replace.

with 'colr':
iris %>% csub("Width", "Girth") %>% head()
with 'dplyr':
iris %>% rename(Sepal.Girth = Sepal.Width, Petal.Girth = Petal.Width) %>% head()
using subsetting:
names(iris)[names(iris) %in% c("Sepal.Width", "Petal.Width")] <- c("Sepal.Girth", "Petal.Girth")
Again 'colr' is much more concise.

Other

Under the hood 'colr' uses 'grep' and 'gsub'. For the renaming example, that amounts to the following code:
{x <- iris
names(x) <- gsub("Width", "Girth", names(x))
x}

You could use that too, but it is much longer, and because it relies on names() it does not carry over to matrices, which keep their column names in colnames(). csub (and cgrep) will work just as well on numeric types (such as matrices) and on lists. I hope you give 'colr' a try and let me know what you think.
