Optimizing data structures and memory usage: advanced data.table

The data.table package has two major goals:

  1. Reduce programming time
  2. Reduce compute time

It has a flexible and consistent syntax and is both extremely fast and memory efficient, which makes it a very powerful tool for wrangling large data *in memory* (e.g., 100GB in RAM). It inherits from data.frame, R's 2D data structure, and extends it with fast and memory-efficient query facilities. data.table supports the following activities:

  • Aggregations (split-apply-combine type operations)
  • Add/update/delete columns without any unnecessary copies (by reference)
  • File reader (fread)
  • Ordered and rolling joins
  • Overlapping range/interval joins
  • Reshaping etc.
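
As a rough illustration of the first two items in the list above, the following minimal sketch aggregates by group and then adds and deletes a column by reference. The table `DT` and its columns `id`, `value`, and `double_value` are made up for this example; only `data.table()`, `.()`, `by`, and `:=` come from the package itself.

```r
library(data.table)

# Hypothetical toy data; the column names are made up for illustration.
DT <- data.table(
  id    = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5)
)

# Aggregation (split-apply-combine): sum of value within each id.
agg <- DT[, .(total = sum(value)), by = id]

# Add a column by reference with `:=`; DT is modified in place,
# so no `DT <- ...` reassignment is needed.
DT[, double_value := value * 2]

# Delete the same column by reference by assigning NULL.
DT[, double_value := NULL]
```

Note that `:=` changes `DT` itself rather than returning a modified copy, which is where much of the memory saving comes from.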

These features cover more than 90% of the day-to-day data wrangling tasks you will encounter. While that is great news, some of the concepts and features data.table provides can be challenging at first, not because they are difficult, but because they demand a radically different way of thinking. The aim of this course is to get you to think in data.table so that you can understand its nuances, appreciate it, and perform data wrangling seamlessly. This will let you spend more time extracting meaning from your data.

The course covers the following topics:

  • Start with data.table by looking at some simple operations - subset rows and select columns
  • Atomic vectors and operations on them
  • Compute on columns (first enhanced feature of data.table)
  • A general form of data.table syntax and minor differences to data.frames
  • Perform operations combined with groups
  • Add/update/delete operations, alone and combined with grouping
  • Extend the concept of subsets to joins (new concept in data.table)
  • Ordered and rolling joins in data.table
  • Overlapping range/interval joins
  • Reshaping and other helpful functions
  • Finally, complete the circle with some tasks that bring together almost all the concepts covered, so that we can compare and contrast them with the base R tasks from the very beginning
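
To give a flavour of the general form of the syntax and of joins as an extension of subsets mentioned in the topics above, here is a minimal sketch. The tables `flights` and `carriers`, their columns, and their values are hypothetical and serve only as an illustration.

```r
library(data.table)

# Hypothetical example data, not taken from the course material.
flights <- data.table(
  carrier = c("AA", "AA", "DL", "DL", "UA"),
  delay   = c(10, -5, 20, 0, 15)
)

# General form DT[i, j, by]: take the rows selected in i,
# compute j on them, grouped by `by`.
late <- flights[delay > 0,
                .(mean_delay = mean(delay), n = .N),
                by = carrier]

# A join written like a subset: the rows of `carriers` are used
# to look up matching rows of `flights` on the shared column.
carriers <- data.table(
  carrier = c("AA", "UA"),
  name    = c("American", "United")
)
joined <- flights[carriers, on = "carrier"]
```

Here the join occupies the same `i` position as an ordinary row subset, which is why the course treats joins as a natural extension of subsetting.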