Obtaining chromosome lengths in Bioconductor

The following question was asked in the Bioconductor slack channel

Many different people suggested strategies, and it turns out there are quite a lot of ways to get this information. Here’s a quick summary broken down by whether you want to get the information from UCSC or Ensembl (chr1 vs 1) and by the principle package used. Since the original question mentioned the ‘mm9’ assembly of the mouse genome that’s what’s used in the following examples, although the same strategies should work for many other organisms and assemblies.

UCSC

First we’ll start with methods for getting the chromosome lengths from UCSC.

GenomicFeatures (Martin Morgan)

This relies on a working internet connection, but the download is tiny so it’s super fast. Returns a simple data.frame with chromosome names and lengths - pretty much the perfect answer to the question.

library(GenomicFeatures)
getChromInfoFromUCSC("mm9") %>%
    head()
  chrom    length
1  chr1 197195432
2  chr2 181748087
3  chr3 159599783
4  chr4 155630120
5  chr5 152537259
6  chr6 149517037

rtracklayer (Malte Thodberg)

Also requires a working internet connection and returns a Seqinfo object. If the aim is to use the chromosome lengths to build a GRanges object you can use similar function GRangesForUCSCGenome("mm9") to get straight there - a nice shortcut.

library(rtracklayer)
SeqinfoForUCSCGenome("mm9")
Seqinfo object with 35 sequences from mm9 genome:
  seqnames     seqlengths isCircular genome
  chr1          197195432       <NA>    mm9
  chr2          181748087       <NA>    mm9
  chr3          159599783       <NA>    mm9
  chr4          155630120       <NA>    mm9
  chr5          152537259       <NA>    mm9
  ...                 ...        ...    ...
  chr16_random       3994       <NA>    mm9
  chr17_random     628739       <NA>    mm9
  chrX_random     1785075       <NA>    mm9
  chrY_random    58682461       <NA>    mm9
  chrUn_random    5900358       <NA>    mm9

GenomicInfoDb (Ludwig Geistlinger)

We can use the GenomicInfoDb package in conjunction with the TxDb package for the genome in question. As a general statement this will be a bigger download than any of the other solutions, although the package download only needs to happen once, and then this approach can be used offline. To check if your assembly is avaiable, all TxDb packages can be listed with BiocManager::available("^TxDb"). The return type is a named vector where the name of each entry corresponds to a chromosome name.

library(GenomicInfoDb)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
seqlengths(TxDb.Mmusculus.UCSC.mm9.knownGene)
        chr1         chr2         chr3         chr4         chr5         chr6         chr7 
   197195432    181748087    159599783    155630120    152537259    149517037    152524553 
        chr8         chr9        chr10        chr11        chr12        chr13        chr14 
   131738871    124076172    129993255    121843856    121257530    120284312    125194864 
       chr15        chr16        chr17        chr18        chr19         chrX         chrY 
   103494974     98319150     95272651     90772031     61342430    166650296     15902555 
        chrM  chr1_random  chr3_random  chr4_random  chr5_random  chr7_random  chr8_random 
       16299      1231697        41899       160594       357350       362490       849593 
 chr9_random chr13_random chr16_random chr17_random  chrX_random  chrY_random chrUn_random 
      449403       400311         3994       628739      1785075     58682461      5900358 

Ensembl

If you prefer to work with references from Ensembl, there are plenty of options there too, although note that you probably won’t be able to use the ‘mm9’ nomenclature.

GenomicFeatures

Again we can use the GenomicFeatures package, but specifying that we want to get the data from Biomart. For me is the slowest of the options listed and that time is required for each run, nothing is cached. Note that we have to use the name of the dataset within Ensembl, rather than the genome ID or organism name. Also, to return results from mm9 (NCBI m37) we have to use the archive of Ensembl 67, from May 2012. Otherwise using the default parameters will return the most recent genome build. Again you get the simple two column data.frame result.

library(GenomicFeatures)
getChromInfoFromBiomart(dataset = 'mmusculus_gene_ensembl',
                        host = "http://may2012.archive.ensembl.org") %>%
    dplyr::arrange(chrom) %>%
    head()
  chrom    length
1     1 197195432
2    10 129993255
3    11 121843856
4    12 121257530
5    13 120284312
6    14 125194864

Ensembl REST API

There’s a bit more code here as it isn’t wrapped inside a package (it’s not really a Bioconductor solution, but I’m a fan of the Ensmbl API). The results also comeback in an unexpected order and it’s not super simple to sort as this is a character vector. This may or may not matter depending upon the intended use. This time we just use the English language name of the organism e.g. ‘mouse’, but it will also work with ‘mus_musculus’. Unfortunately this approach can not get us details for the mm9 genome build. You can use the REST API to find the available assemblies, which at the time of writing is only GRCm38.p6.

library(httr)
library(jsonlite)
library(xml2)
library(dplyr)

GET("https://rest.ensembl.org/info/assembly/mouse?", 
    content_type("application/json")) %>% 
    httr::content(type = "text/plain", encoding = "UTF-8") %>%
    fromJSON() %$%
    top_level_region %>%
    dplyr::filter(coord_system == "chromosome") %>%
    head()
  name coord_system    length
1    X   chromosome 171031299
2   18   chromosome  90702639
3    6   chromosome 149736546
4    7   chromosome 145441459
5    5   chromosome 151834684
6    1   chromosome 195471971

Conclusion

I’m sure there must be many more approaches (this doesn’t even touch on CRAN packages) and there isn’t an obvious best choice, although GenomicFeatures::getChromInfoFromUCSC() seems both quick and simple. Ultimately the decision comes down to many factors: which packages you’re happy for your code to depend on; speed; reliance on an internet connection; convenience of the output format for your task; etc. On thing that is clear is that there’s no need to reinvent the wheel for this particular task.

Updated: