Tuesday, April 30, 2013

Data Frames for GSL Shell

Lately I've done a lot of work to implement "General Data Tables" in GSL Shell. I chose this name to designate what is otherwise called a data frame in GNU R and other environments.

The differences between data tables and matrices are:
  • each column is identified by a name
  • each cell can hold a number, a string, or an undefined value
Being able to store strings in a cell is very useful, for the obvious reason that not all data is numeric.

In addition, the fact that each column has a name greatly simplifies many operations, since you can refer to the data by name instead of dealing with anonymous columns identified by an index.
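To make the two points above concrete, here is a minimal sketch in plain Python. It is not GSL Shell code and says nothing about how GSL Shell implements its tables; the `DataTable` class and its methods are hypothetical names used only for illustration, and one cell is left undefined on purpose.

```python
class DataTable:
    """Toy table: named columns, cells hold a number, a string or None."""

    def __init__(self, headers):
        self.headers = list(headers)
        self.rows = []

    def append(self, *values):
        # Every row must supply one value per named column.
        if len(values) != len(self.headers):
            raise ValueError("expected %d values" % len(self.headers))
        self.rows.append(list(values))

    def column(self, name):
        # Look the column up by name instead of by numeric index.
        j = self.headers.index(name)
        return [row[j] for row in self.rows]


t = DataTable(["student", "teacher", "sex", "final"])
t.append("S-1", "john", "male", 58)
t.append("S-7", "john", "female", 60)
t.append("S-11", "jane", "male", None)  # None stands for an undefined cell

print(t.column("final"))
print(t.column("teacher"))
```

The point of the sketch is only that `column("final")` reads better than "column 4", and that strings and missing values can sit next to numbers in the same table.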

Here is an example taken from the excellent |STAT user manual by Gary Perlman.

student  teacher  sex     m1  m2  final
S-1      john     male    56  42  58
S-2      john     male    96  90  91
S-3      john     male    70  59  65
S-4      john     male    82  75  78
S-5      john     male    85  90  92
S-6      john     male    69  60  65
S-7      john     female  82  78  60
S-8      john     female  84  81  82
S-9      john     female  89  80  68
S-10     john     female  90  93  91
S-11     jane     male    42  46  65
S-12     jane     male    28  15  34
S-13     jane     male    49  68  75
S-14     jane     male    36  30  48
S-15     jane     male    58  58  62
S-16     jane     male    72  70  84
S-17     jane     female  65  61  70
S-18     jane     female  68  75  71
S-19     jane     female  62  50  55
S-20     jane     female  71  72  87

The data above can be used to show some of the plotting functions.

What is very interesting is that, with the data in tabular format, many operations become very easy. For example, to create a histogram of the "final" column (with the table stored in the variable ms) you can simply type:
> gdt.hist(ms, "final")

to obtain the following plot:

Given the data above, you may wish to have a more expressive plot based on the teacher and the sex of the students. This is where the "gdt.plot" function, which I'm very proud of, comes to help. You can use it very simply:
> gdt.plot(ms, "final ~ teacher, sex, student")

to obtain the following plot:
The function "gdt.plot" uses a sort of mini language that lets you specify what should be plotted (the y variables) and in terms of which variables.

Something interesting is that the function figures out by itself whether the x variable is numeric or an enumeration, as in the example above. In addition, you can "layer up" several enumeration variables, just as you can with Excel's pivot tables.
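As a rough illustration of these two ideas, the following plain Python sketch (not GSL Shell code, and only a guess at the general approach, not the actual logic of gdt.plot) decides whether a column is an enumeration and combines several enumeration columns into one layered label per row. The helper names `column`, `is_enumeration` and `layered_labels` are hypothetical; the four rows are copied from the table in this post.

```python
headers = ("student", "teacher", "sex", "final")
rows = [
    ("S-1", "john", "male", 58),
    ("S-7", "john", "female", 60),
    ("S-11", "jane", "male", 65),
    ("S-17", "jane", "female", 70),
]


def column(name):
    """Return the cells of the column with the given name."""
    j = headers.index(name)
    return [r[j] for r in rows]


def is_enumeration(values):
    """Assumed rule: a column is an enumeration if any defined cell is not a number."""
    return any(v is not None and not isinstance(v, (int, float)) for v in values)


def layered_labels(names):
    """Combine several enumeration columns into one tuple label per row."""
    return list(zip(*(column(n) for n in names)))


print(is_enumeration(column("teacher")))
print(is_enumeration(column("final")))
for label, y in zip(layered_labels(["teacher", "sex", "student"]), column("final")):
    print(label, y)
```

Each tuple label plays the role of one position on the x axis, with the outer variables (teacher, then sex) grouping the inner one (student), much like nested row labels in a pivot table.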

The following form can also be used:

> gdt.plot(ms, "final ~ sex, student | teacher")

to create multiple lines grouped by the teacher field.
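The grouping idea can be sketched in plain Python as follows. This is not GSL Shell code and not how gdt.plot is implemented; it only assumes that the variable after "|" splits the rows into one series per distinct value. The function name `split_series` is hypothetical, and the rows are copied from the table in this post.

```python
# (student, teacher, sex, final)
rows = [
    ("S-1", "john", "male", 58),
    ("S-7", "john", "female", 60),
    ("S-11", "jane", "male", 65),
    ("S-17", "jane", "female", 70),
]


def split_series(rows):
    """Group rows by teacher; each series holds ((sex, student), final) points."""
    series = {}
    for student, teacher, sex, final in rows:
        series.setdefault(teacher, []).append(((sex, student), final))
    return series


series = split_series(rows)
for teacher in sorted(series):
    print(teacher, series[teacher])
```

Each key of the resulting dictionary would correspond to one line in the plot, with the (sex, student) pairs along the x axis.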

The mini language is actually quite flexible. You can use arbitrary mathematical expressions, not just variable names. If you want, you can explore its possibilities yourself: there is a specific chapter about it in the GSL Shell user manual.

I hope you find this interesting. In the next post I will talk about the linear regression function, modelled after GNU R's "lm" function...