10X single-cell data & HDF5Array performance

Earlier this year 10X Genomics released a single-cell RNA-sequencing dataset containing data from 1.3 million mouse brain cells. The blog post accompanying the release contained the provocative statement “We do not recommend loading the file into R, due to the file size and the lack of 64 bit integers support in R.” This is a bit of a non-sequitur, and naturally there has been a push within the Bioconductor community to address such concerns and show how to work with such datasets efficiently. Here we look at some basic benchmarks of R & Bioconductor’s performance on this dataset.

Update – January 2018

The article below mentions performance improvements identified for rhdf5 and DelayedArray. These have now been incorporated into the versions of the packages available via Bioconductor. I have updated the accompanying scripts in the Github repository to use the versions that were available at the time of writing, so the original results remain reproducible, but anyone using the current versions of the packages will benefit from the improvements.

Introduction

Aaron Lun has done a nice job of making this data available via Bioconductor’s ExperimentHub. You can access his data package at https://github.com/LTLA/TENxBrainData and install it with the following command:

BiocInstaller::biocLite('LTLA/TENxBrainData')

The most striking feature of this dataset is its sheer size, with over 36 billion entries in the complete matrix (≈28,000 genes by ≈1,300,000 cells). Aaron’s TENxBrainData package makes this available as a SingleCellExperiment object, using an HDF5Array to provide an on-disk representation of the data.
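For orientation, a quick inspection along these lines shows what the object looks like (output omitted; the exact classes reported may differ slightly between package versions):

library(TENxBrainData)

tenx <- TENxBrainData()   # fetches/loads the data via ExperimentHub
tenx                      # a SingleCellExperiment
counts(tenx)              # the count matrix, backed by an HDF5 file on disk
dim(counts(tenx))         # roughly 28,000 genes by 1,300,000 cells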

Datasets of this magnitude are only going to become more common, but when I first started to explore TENxBrainData my initial reaction was one of slight disappointment at the time required to perform some of the operations described in the vignette that accompanies the dataset. The purpose of this post is to explore how well R performs when working with matrices of this scale, and whether there are any improvements that can easily be made.

All code examples on this page should be executable, but the complete scripts used to carry out this analysis can be found at https://github.com/grimbough/TENxBenchmarking.

Subsetting the data

Since the slowest approaches I used required nearly an hour to run on the complete set of 1.3 million cells, all the examples shown here use the first 130,000 cells in the matrix. In my testing it is reasonable to simply multiply the resulting timings by 10 to get a fairly reliable figure for working with the whole dataset. Here’s the code for loading the complete dataset and then selecting our subset.

library(TENxBrainData)
tenx <- TENxBrainData()
tenx.sub <- tenx[,1:130000]

Default everything

The first example in the TENxBrainData vignette is calculating the sum of counts for every cell in the dataset, achieved using the colSums() function. My naive first approach involved skipping over any useful hints in the vignette and jumping straight into the computation. Let’s look at how long that takes using the default settings immediately after starting a fresh R session.

system.time(res1 <- colSums(counts(tenx.sub)))
##    user  system elapsed 
## 298.630  11.671 329.071

We can see that the elapsed time is well over 5 minutes – and if we extrapolate, that’s roughly 55 minutes to produce the column sums for the whole dataset. Not exactly conducive to interactive exploratory data analysis.

Comparing to in-memory performance

However, maybe my expectations are too high, so let’s try loading the entire set of counts into a standard R matrix and performing the same calculation on that. Working with the entire matrix this way requires ≈145GB of memory, obviously not a resource available to everyone, and it takes a considerable amount of time just to read the data from disk. Here are the stats for our subset of the data:

system.time(tenx.inmem <- as.matrix(counts(tenx.sub)))
##    user  system elapsed 
## 114.901  60.033 182.270
pryr::object_size(tenx.inmem)
## 14.6 GB

Again extrapolating from these values, we can see it would take over 30 minutes to load the complete matrix into memory.

Now we can again benchmark the colSums() function on the in-memory representation:

system.time(res2 <- colSums(tenx.inmem))
##   user  system elapsed 
##  6.557   0.005   6.563

This is clearly much faster, taking about 6.5 seconds, or just over a minute for the whole dataset (assuming you have enough RAM). Even with the long load time, this is significantly quicker than the first attempt.

Increasing the DelayedArray block size

If we go back and read Aaron’s vignette more carefully, we’ll see there’s already a suggestion for improving the performance – namely increasing the DelayedArray block size. In order to keep memory usage low, DelayedArray works by processing objects in chunks (blocks), reading a subset of the data from disk, processing it, and then reading the next block. The block size parameter defines how much memory it is allowed to use for each block. The larger the block size, the fewer separate read operations from disk will be required, which will hopefully be faster. We can check the current block size using the following command:

options()$DelayedArray.block.size
##  [1] 4500000

The value shown here is the block size in bytes, which equates to 4.5MB. In our matrix each column is approximately 112KB (4 bytes * 28,000 rows), so with the default block size we can read roughly 40 columns at once, which would require 32,500 read operations to process the full dataset.
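As a rough back-of-the-envelope check of those figures (the exact numbers depend on the integer storage size, so treat them as approximate):

bytes.per.col  <- 4 * 28000                            # ~112KB per column of 4-byte integers
block.size     <- options()$DelayedArray.block.size    # 4500000 bytes by default
cols.per.block <- floor(block.size / bytes.per.col)    # roughly 40 columns per block
ceiling(1300000 / cols.per.block)                      # roughly 32,500 read operations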

N.B. it’s actually more complex than this, since within this HDF5 file the data are stored in 100×100 chunks, and all values in a chunk must be read in one go. So when we load 40 columns into R, we’re actually reading 100 columns’ worth of chunks and discarding some of the data, with the result that most of the values end up being read at least twice.
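If you want to confirm the chunk layout yourself, something along these lines should work with rhdf5’s low-level interface (the file path and dataset name below are placeholders for the local copy downloaded by TENxBrainData, and the set of exported low-level functions has changed between rhdf5 versions, so treat this as a sketch):

library(rhdf5)

# Placeholders: point these at the HDF5 file backing the counts matrix
tenx.h5 <- "path/to/tenx_counts.h5"
fid <- H5Fopen(tenx.h5)
did <- H5Dopen(fid, "counts")
pid <- H5Dget_create_plist(did)
H5Pget_chunk(pid)                # expected to report 100 x 100 for this dataset
H5Pclose(pid); H5Dclose(did); H5Fclose(fid)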

We should follow Aaron’s advice and increase the block size to something more appropriate (2GB), which should still be fine for a regular computer, and then run the benchmark again.

options(DelayedArray.block.size=2e9)
system.time(res3 <- colSums(counts(tenx.sub)))
##    user  system elapsed 
## 159.022  63.334 225.434

This makes a clear difference, but we’re still looking at more than 35 minutes when working with the complete set of cells – roughly 35 times slower than the in-memory version, though comparable if we include the time taken to load the data into RAM. With this in mind I thought I’d look at some of the options for improving the performance.

Uncompressed data in the HDF5 file

After some profiling with the excellent profvis tool (https://github.com/rstudio/profvis) it seems that a large portion of the run time was being spent in rhdf5’s H5Dread() function, or more specifically the .Call() routine found within it. This in turn calls code in the HDF5 system library for reading from a dataset, which I haven’t tried to profile, but which I’m going to assume is reasonably optimized.

It may be that the time here is all spent reading data from disk; however, a quick look at my system monitor suggests the CPU is working at 100% and that this, rather than disk IO, is the bottleneck. The HDF5 dataset we’re using is compressed with the gzip DEFLATE algorithm, so it may be that decompression is the rate-limiting step.
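For reference, a profile along these lines (a minimal sketch, assuming tenx.sub from earlier is still in the session) is enough to reproduce that observation:

library(profvis)

# Profile the colSums() call to see where the time is spent;
# in my runs the bulk of it was attributed to rhdf5's H5Dread()
profvis({
  res <- colSums(counts(tenx.sub))
})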

If we suspect that decompression is a bottleneck, one way to test this is to create an on-disk representation that is not compressed and then try reading from that instead. The code below uses the rhdf5 package to create this uncompressed file, but bear in mind that it requires 145GB of disk space and, in my experience, double that in memory during creation. There’s probably a more efficient way to do this, but since I have access to a machine that can cope with it, I haven’t investigated. One alternative would be to write just the subset we’re using to disk and modify the code accordingly (a sketch of this appears after the code block below).

library(rhdf5)
tenx <- TENxBrainData()
tenx.inmem <- as.matrix(counts(tenx))

h5File <- 'tenx_uncompressed.h5'
h5createFile(h5File)
h5createDataset(file = h5File, dataset = "counts", 
                dims = dim(tenx.inmem), chunk = c(100,100),
                level = 0, storage.mode = "integer")
h5write(tenx.inmem, file = h5File, name = "counts" )
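Alternatively, if that much disk space and memory isn’t available, writing just the 130,000-cell subset is a more modest exercise. The sketch below is untested and the file name is arbitrary; the code that follows would then need to point at this file and skip the column subsetting.

# Untested sketch: write only the first 130,000 cells uncompressed,
# requiring roughly a tenth of the disk space and memory
tenx.sub.inmem <- as.matrix(counts(tenx[, 1:130000]))

h5subFile <- 'tenx_uncompressed_subset.h5'
h5createFile(h5subFile)
h5createDataset(file = h5subFile, dataset = "counts",
                dims = dim(tenx.sub.inmem), chunk = c(100, 100),
                level = 0, storage.mode = "integer")
h5write(tenx.sub.inmem, file = h5subFile, name = "counts")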

We can then create an HDF5Array object based on our uncompressed dataset, select the same subset as before, and run our same benchmark.

h5.uncmp <- HDF5Array(file = 'tenx_uncompressed.h5', 
                      name = "counts")
tenx.uncmp <- SingleCellExperiment(
    list(counts = h5.uncmp), 
    rowData = rowData(tenx), colData = colData(tenx)
)
tenx.sub.uncmp <- tenx.uncmp[,1:130000]

system.time(res4 <- colSums(counts(tenx.sub.uncmp)))
##    user  system elapsed 
## 119.739  73.032 197.672

The time taken here is noticeably faster than using the increased block size alone, so it appears that decompression of the data is somewhat of a bottleneck when reading from disk. We should, however, bear in mind how significant the difference between 5GB for our original file and 145GB for the uncompressed version is when it comes to storage and distribution of a dataset such as this.

Tweaks to rhdf5 & DelayedArray

The profvis profiling also revealed a few areas where matrices were being copied in memory unnecessarily (at least based on my brief investigation). I made modified versions of both packages, which try to determine whether the copying steps are really necessary and skip them if not. We can install these modified versions using the code below, which specifies the precise Github commits to work with.

N.B. I ran into problems running this in the same R session as the previous code, as many loaded packages depended upon these two, and I found it was simplest to just start a new session before doing the installation. This is why there are many separate R scripts in the Github repository.

devtools::install_github(repo = "grimbough/rhdf5", 
                         ref = "5e1f0be", quiet = TRUE)
devtools::install_github(repo = "grimbough/DelayedArray", 
                         ref = "d61695b", quiet = TRUE)

After installation, we then load the TENxBrainData data, set the DelayedArray block size, and check we’re using the developmental package versions.

library(TENxBrainData)
options(DelayedArray.block.size=2e9)
sapply(c("rhdf5", "DelayedArray"), packageVersion, simplify = FALSE)
##  $rhdf5
##  [1] ‘2.23.1.1’
##
##  $DelayedArray
##  [1] ‘0.5.1.1’

Having confirmed we’re using the modified versions of the packages, we load the data, select the same subset as before, and calculate the column sums.

tenx <- TENxBrainData()
tenx.sub <- tenx[,1:130000]

system.time(res5 <- colSums(counts(tenx.sub)))
##   user  system elapsed 
## 84.424  25.046 109.485

The performance improvements here seem to outweigh the move to uncompressed data, reducing the run time to roughly a third of where we started.

Combining approaches

Finally we can combine the modified software packages with the uncompressed dataset. We can keep the two developmental versions of the package installed previously, but now run the calculation on our existing uncompressed HDF5 dataset. We can reuse the same code as before to create the SingleCellExperiment object.

h5.uncmp <- HDF5Array(file = 'tenx_uncompressed.h5', 
                      name = "counts")
tenx.uncmp <- SingleCellExperiment(
    list(counts = h5.uncmp), 
    rowData = rowData(tenx), colData = colData(tenx)
)
tenx.sub.uncmp <- tenx.uncmp[,1:130000]

system.time(res5 <- colSums(counts(tenx.sub.uncmp)))
##    user  system elapsed 
##  37.570  27.771  65.352

Again we see a significant improvement in performance, with the combination of the two approaches getting the computation time down to just under 20% of the original 329 seconds. It’s still slower than when working on the in-memory copy, but it’s certainly an improvement.

Checking the results

Finally we want to make sure that these various approaches, particularly those where we’re using modified packages, are actually producing the same results. Since we’ve been saving the results at each step in our res# objects, we can easily check them against one another using the code below.

sapply(list(res2, res3, res4, res5),
       FUN = identical,
       y = res1)
## [1] TRUE TRUE TRUE TRUE

Conclusions

Here’s a plot summarising the timings reported above for each approach.

Clearly, working with on-disk representations of matrices is considerably slower than when they are held in memory. If you’re aiming to carry out a number of operations, and have the memory capacity, loading the entirety of the data into memory gives the fastest response. In a similar vein, if disk space is cheap then working from an uncompressed version of the data on disk can yield appreciable performance benefits.

If we stick to working with the compressed dataset, then increasing the DelayedArray block size immediately gives a significant performance boost (I think the default could certainly be raised from 4.5MB).

The modifications to rhdf5 and DelayedArray also seem to yield noticeable improvements. Although we immediately combined the changes to both packages in the section above, I’ve also included entries in the plot where the modified versions were used separately to see if the benefits could be attributed only to one or came from both. Here the latter appears to be the case. This is encouraging, but my modifications require some more rigorous checking before being committed to the official release version of those packages.

Finally, if we combine the code improvements with an uncompressed dataset, we see a reduction in runtime of nearly 80%, dropping the gap between my first naive approach and the in-memory version from a factor of roughly 50 to a factor of roughly 10 (based on the subset timings above).

Local caching

It’s also worth noting that for most of the timings a significant proportion of the time is attributed to system. Repeated runs of the same command reduce this somewhat, suggesting that waiting for data to be read from disk is still something of an issue, and that some component of my system (OS, disk, network – I don’t know which) is caching the data during repeated runs. The plot below shows timings for six runs, with the first calculation seen in the previous figure highlighted, and we can see that for most scenarios the first run is also the slowest (including in-memory). However, since one would most likely want to compute summary statistics like this only once on a dataset, taking the first value as our reference timing seems valid.
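A simple way to gather such repeated timings (a sketch; six runs is an arbitrary choice) is:

# Repeat the same benchmark six times; the first run is typically
# the slowest, before any caching has taken place
elapsed <- replicate(6, system.time(colSums(counts(tenx.sub)))["elapsed"])
elapsed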

Follow up work

  • I haven’t included any comparisons with in-memory sparse matrix representations like dgCMatrix, which may provide a better balance between memory usage and speed.
  • These comparisons have all focused on a single operation – computing the column sums. It certainly isn’t a given that the same performance characteristics will be seen for operations that work on rows, or where individual data entries are selected.
  • Similarly, the subset of data selected in all these examples is contiguous. It’s just the first 10%, kept in the same order as it is stored on disk. I suspect the performance will be noticeably worse if we select columns at random from throughout the matrix, particularly if they are out of order, since many chunks in the HDF5 file will be read multiple times, but I haven’t checked this (see the sketch after this list).
  • In limited experimentation I see improved performance when working directly on an HDF5Array that doesn’t have the meta-data associated with a SingleCellExperiment object. There may be scope to improve how meta-data (like column and row names) are handled, since it shouldn’t impact a straightforward computation like the one we’re using here.
  • Since gzip decompression is clearly something of a bottleneck, perhaps there is a more appropriate compression algorithm for single-cell data. It’s very sparse, with long runs of zeros, implying compression should be fairly easy for almost any algorithm. Investigating whether there are any more efficient or multi-threaded decompression approaches available to HDF5 may be beneficial.
  • Even if we retain gzip compression, the size of the chunk in the HDF5 file (currently 100×100) can be varied almost infinitely – and the selection of the most appropriate chunk dimensions will always be a trade off between the frequency of operations that read entire rows or columns, and those that are more selective.
  • The local caching that makes subsequent runs of the same benchmark faster may also be influencing the comparisons between approaches. It would be good to verify this by running the benchmarks out of order.
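On the point about non-contiguous selections, a benchmark along these lines (an untested sketch; the seed is arbitrary) would be one way to check:

# Untested sketch: select 130,000 columns at random, out of order,
# and repeat the colSums() benchmark on that selection
set.seed(42)
random.cols <- sample(ncol(tenx), 130000)
tenx.rand <- tenx[, random.cols]
system.time(colSums(counts(tenx.rand)))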
