Memory usage and pointers in R


In order to develop more efficient code, it is important to understand how R uses memory and how to take better advantage of it.

Dependencies

To run the examples in this post, we will need the pryr and pointr packages. You can easily install and load both with the help of ezloader :), available here.

  library(ezloader)
  ezloader("pointr")
  ezloader("pryr")

Memory and addresses

Every object you create in R uses memory. Even empty objects (length 0) use a minimum amount of memory to store the object description. We can use the functions pryr::object_size() and pryr::address() to learn how much memory an object uses and its current address in memory.

  x <- 1:1e6
  object_size(x) # how much memory is used
  ## 4,000,728 B
  address(x) # where in memory is the object stored
  ## [1] "0x563883c67698"

As you can see, x takes 4 MB of memory. What if we create a new object, a list containing x three times? Is it going to take 3 times more memory?

  y <- list(x, x, x)
  object_size(y)
  ## 4,000,808 B
  address(y)
  ## [1] "0x5638846239e8"

Interesting! Even though y has 3 times the contents of x, it uses virtually the same amount of memory as x. How is that possible? Simple: even though y is an object on its own, its elements simply point to the same address in memory as x. If you create a new object from another (like a copy), then as long as both are unchanged, they will use the same space in memory.

  x_copy <- x
  identical(address(x), address(x_copy))
  ## [1] TRUE

And even if we take into account the size of both objects together, the memory size remains the same.

  object_size(x, x_copy, y)
  ## 4,000,808 B

On the other hand, if we create another object that is identical to y, but using only values instead of "references" to x, it will effectively use more memory:

  z <- list(1:1e6, 1:1e6, 1:1e6)
  identical(y,z)
  ## [1] TRUE
  object_size(y)
  ## 4,000,808 B
  object_size(z)
  ## 12,001,208 B
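It is worth noting why these examples rely on pryr::object_size() rather than base R's object.size(). The sketch below uses only base R and a smaller vector than the one above. It suggests that object.size() adds up each list element separately, so it does not account for shared memory. The names v and shared are made up for this illustration.

```r
# Illustration only: 'v' and 'shared' are hypothetical names, not from the post.
v <- numeric(1e5)          # a plain double vector, 8 bytes per element
shared <- list(v, v, v)    # three references to the same vector

size_v <- as.numeric(utils::object.size(v))
size_shared <- as.numeric(utils::object.size(shared))

# object.size() counts v once per list element, so the ratio is close to 3
print(size_shared / size_v)
```

If the ratio comes out near 3, base R is reporting the "unshared" size, which is the kind of number z shows above rather than y.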

Strings work similarly. Each unique string goes in a shared pool, so the total memory use of a string vector is not a simple multiplication of the repetitions of that string.

  object_size("banana")
  ## 112 B
  object_size(rep("banana", 10))
  ## 232 B
  banana <- "banana"
  ten_bananas <- rep(banana, 10)
  object_size(banana)
  ## 112 B
  object_size(ten_bananas)
  ## 232 B

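Here are a few more examples. As you can see, this is not exactly the same as the earlier example with x, y, and z: the size is the same whether the vector is built from the banana object or from string literals.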

  ten_bananas <- rep(banana, 10)
  object_size(ten_bananas)
  ## 232 B
  ten_bananas <- c(banana, banana, banana, banana, banana, banana, banana, banana,
                   banana, banana)
  object_size(ten_bananas)
  ## 232 B
  ten_bananas <- rep("banana", 10)
  object_size(ten_bananas)
  ## 232 B
  ten_bananas <- c("banana", "banana", "banana", "banana", "banana", "banana",
                   "banana", "banana", "banana", "banana")
  object_size(ten_bananas)
  ## 232 B

Considering all that information, let's compare two lists of banana strings. One will have a character vector in each row ("ba" "na" "na" "na" ..., with the number of "na" elements increasing in each row), while the other will have a single string "bananana..." in each row. Which one will use more memory?

  ba_na_na_vec <- lapply(0:1000, function(i) c("ba", rep("na", i)))
  banana_list <- lapply(ba_na_na_vec, paste0, collapse = "")
  length(ba_na_na_vec)
  ## [1] 1001
  length(banana_list)
  ## [1] 1001
  object_size(ba_na_na_vec)
  ## 4,068,472 B
  object_size(banana_list)
  ## 1,121,160 B

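Even though the list of vectors contains many repetitions of the same two strings, it uses more memory. Is it because it has considerably more strings per row? Yes: the strings themselves are shared, but every element of a character vector still needs its own pointer into the string pool (8 bytes on a 64-bit system), and the longer rows hold hundreds of such pointers.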
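A rough back-of-the-envelope check of that comparison can be done in base R. The sketch below rebuilds the two lists and counts elements and characters. The 8 bytes per element is an assumption (a 64-bit build of R), and the variable names n_elements, pointer_bytes, and n_chars are made up for this illustration. Object headers are ignored, so the numbers only approximate the sizes reported above.

```r
ba_na_na_vec <- lapply(0:1000, function(i) c("ba", rep("na", i)))
banana_list <- lapply(ba_na_na_vec, paste0, collapse = "")

# Total number of vector elements across all rows of the first list
n_elements <- sum(lengths(ba_na_na_vec))

# Assumption: each element is an 8-byte pointer into the string pool (64-bit R)
pointer_bytes <- n_elements * 8

# Total number of characters stored by the single-string version
n_chars <- sum(nchar(unlist(banana_list)))

print(c(elements = n_elements, pointer_bytes = pointer_bytes, characters = n_chars))
```

Under that assumption, the pointers alone account for about 4 MB in the first list, while the second list stores about 1 MB of characters, which is in line with the two object_size() results.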

Copy on change

Now let's see what happens when we change a single value in x. Because y and x_copy point to the same memory, R will not perform an "in-place" change. It will copy the object to a new memory position with the new value. Other objects that were pointing to the previous memory address (in our case, y and x_copy) will keep that address, and that memory usage. Also, x will be using more memory now: 1:1e6 is an integer vector (4 bytes per element), but 0 is a double, so the modified copy is stored as doubles (8 bytes per element). Check it out:

  x[1]
  ## [1] 1
  y[[1]][1]
  ## [1] 1
  x[1] <- 0
  x[1]
  ## [1] 0
  y[[1]][1]
  ## [1] 1
  object_size(x)
  ## 8,000,048 B
  object_size(y)
  ## 4,000,808 B
  object_size(x, y)
  ## 12,000,856 B
  identical(address(x), address(x_copy))
  ## [1] FALSE
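The jump from roughly 4 MB to 8 MB comes from a type change, not from the copy alone. Here is a minimal base R sketch of that effect on small vectors. The names a and b are hypothetical, and utils::object.size() is used here instead of pryr so that the block has no dependencies.

```r
# Illustration only: 'a' and 'b' are hypothetical names.
a <- 1:10
b <- 1:10

a[1] <- 0     # 0 is a double, so the whole vector is coerced to double
b[1] <- 0L    # 0L is an integer, so the vector keeps its type

print(typeof(a))
print(typeof(b))

# The double version needs more bytes per element than the integer version
print(utils::object.size(a))
print(utils::object.size(b))
```

Writing 0L instead of 0 is one way to avoid the extra memory when the data really are integers.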

However, R can also modify in place. If the object is the only one using that address in memory, that is, if no other object points to it, then the change is made in place and no copy is created. So, usually, the first change will generate a copy, ensuring that x is the only object pointing at the new address, and further changes will keep the same address. (In the example below, the first assignment also converts x from integer to double, which requires a new vector in any case.) Also, the change has to be executed as a block of code, especially if you're using RStudio, most likely because its environment pane keeps its own reference to each object (BTW, try ESS with Emacs ;)). If the change is performed by running one line at a time, R may create a copy even if there are no other objects pointing to the original.

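  # start from a clean slate
  if ("x" %in% ls()) rm(x)
  if ("y" %in% ls()) rm(y)
  if ("z" %in% ls()) rm(z)
  x <- 1:10
  # run the whole block at once, not line by line
  {
    print(x[1])
    print(address(x))
    x[1] <- 0        # first change: new address
    print(x[1])
    print(address(x))
    x[1] <- -1       # second change: same address (modified in place)
    print(x[1])
    print(address(x))
  }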
  ## [1] 1
  ## [1] "0x5582340fc188"
  ## [1] 0
  ## [1] "0x5582348dc488"
  ## [1] -1
  ## [1] "0x5582348dc488"
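Another way to watch for copies, without comparing addresses by hand, is base R's tracemem(), which prints a message whenever the traced object is duplicated. The sketch below is an illustration under assumptions: p and q are hypothetical names, tracemem() is only available when R was built with memory profiling (hence the capabilities("profmem") guard), and the exact behaviour may differ between R versions and front ends.

```r
# Illustration only: 'p' and 'q' are hypothetical names.
can_trace <- capabilities("profmem")

p <- c(1, 2, 3)                     # already a double vector, so no coercion later
if (can_trace) invisible(tracemem(p))

p[1] <- 10    # p is the only name for this vector: expected to happen in place
q <- p        # now two names refer to the same vector
p[1] <- 20    # expected to trigger a copy, so q keeps the old values

if (can_trace) untracemem(p)

print(p)
print(q)
```

If tracing is available, a tracemem message should show up only around the second assignment to p, matching the rule that a copy is made when another object points to the same memory.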

Pointers

With the help of the pointr package, we can create and use objects as pointers in R. The function is ptr("pointer_name", "target_obj_name"). Note that the object names must be passed as character strings! Alternatively, a pointer can be created with the %=% operator.

  x <- 1 : 10
  ptr("point_to_x", "x")
  ## OR
  point_to_x %=% x
  point_to_x
  ## [1]  1  2  3  4  5  6  7  8  9 10
  address(point_to_x) == address(x)
  ## [1] TRUE
  object_size(point_to_x) == object_size(x)
  ## [1] TRUE
  object_size(x, point_to_x)
  ## 680 B
  x[1]
  ## [1] 1
  x[1] <- 0
  x[1]
  ## [1] 0
  point_to_x[1]
  ## [1] 0
  address(point_to_x) == address(x)
  ## [1] TRUE
  object_size(point_to_x) == object_size(x)
  ## [1] TRUE
  object_size(x, point_to_x)
  ## 176 B

To find out where the pointer is pointing, we can use the function where.ptr():

  where.ptr(point_to_x)
  ## [1] "x"

Modifications made to the pointer object will also be reflected in the target. The pointr package works in the background to keep the two objects connected.

  point_to_x[1]
  ## [1] 0
  x[1]
  ## [1] 0
  point_to_x[1] <- -1
  x[1]
  ## [1] -1
  address(point_to_x) == address(x)
  ## [1] TRUE
  object_size(point_to_x) == object_size(x)
  ## [1] TRUE
  object_size(x, point_to_x)
  ## 176 B
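To get a feel for how this kind of two-way link can be built, here is a minimal base R sketch using an active binding (makeActiveBinding()). Whether pointr does exactly this internally is an assumption here, and the names target and alias are hypothetical.

```r
# Illustration only: 'target' and 'alias' are hypothetical names, and this
# is not pointr's code, just one way to get pointer-like behaviour in base R.
target <- c(1, 2, 3)

makeActiveBinding(
  "alias",
  function(value) {
    if (missing(value)) {
      target              # reading alias returns the current value of target
    } else {
      target <<- value    # writing to alias is redirected to target
    }
  },
  environment()
)

alias[1] <- -1            # modify through the alias...
print(target)             # ...and the change shows up in target
```

The alias never stores data of its own: every read and write goes through the function, which is why changes on either side are visible on the other.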

Here is an example of how to take advantage of pointers by creating a subset of a data frame:

  df <- data.frame(list(numbers = c(1,2,3,4,5),
                        chars = c("A", "B", "C", "D", "E")),
                   stringsAsFactors = FALSE)
  i <- 2
  secret_agent %=% df[i,]

  subset_of_df <- df[2, ]
  address(subset_of_df) == address(secret_agent)
  ## [1] FALSE
  identical(subset_of_df, secret_agent)
  ## [1] TRUE
  object_size(subset_of_df)
  ## 920 B
  object_size(secret_agent)
  ## 920 B
  object_size(subset_of_df, secret_agent)
  ## 1,320 B

  secret_agent[1] <- 007
  secret_agent[2] <- "Bond"
  df
  ##     numbers chars
  ## 1       1     A
  ## 2       7  Bond
  ## 3       3     C
  ## 4       4     D
  ## 5       5     E

  identical(df[2,], secret_agent)
  ## [1] TRUE
  object_size(df[2,], secret_agent)
  ## 1,320 B

Have you noticed that the pointer uses the row i? This is a very interesting property of this pointer (and something you should be careful about when using it). If the value of i changes…

  secret_agent
  ##   numbers chars
  ## 2       7  Bond
  i <- 3
  secret_agent
  ##     numbers chars
  ## 3       3     C
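This behaviour is what you would expect if the pointer stores the expression df[i, ] and re-evaluates it on every access, instead of storing a fixed row. The base R sketch below mimics that with a read-only active binding. It is only an analogy, not pointr's implementation, and the names d, k, current_row, first, and second are hypothetical.

```r
# Illustration only: all names here are hypothetical.
d <- data.frame(numbers = c(1, 2, 3), chars = c("A", "B", "C"))
k <- 2

# The function body d[k, ] is evaluated again each time current_row is read
makeActiveBinding("current_row", function() d[k, ], environment())

first <- current_row     # evaluated with k = 2
k <- 3
second <- current_row    # evaluated again, now with k = 3

print(first)
print(second)
```

If you need the pointer to stay on a fixed row, one option is to use a literal index (such as df[2, ]) instead of a variable when creating it.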

You may have noticed that the pointer object itself also uses memory, even though it only refers to another object. So it may mimic a pointer, but it is not a pointer in the sense you might expect from other programming languages.