---
title: "Reading from SOMA objects"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reading from SOMA objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Overview

In this tutorial we'll learn how to read data from various SOMA objects. We assume familiarity with SOMA objects, so it is recommended to go through `vignette("soma-objects")` first.

A core feature of SOMA is the ability to read _subsets_ of data from disk into memory as slices. SOMA uses [Apache Arrow](https://arrow.apache.org/) as its intermediate in-memory storage format. From there, the slices can be further converted into native R objects, like data frames and matrices.

```{r}
library(tiledbsoma)
```

## Example data

Load the bundled `SOMAExperiment` containing a subsetted version of the 10x Genomics [PBMC dataset](https://satijalab.github.io/seurat-object/reference/pbmc_small.html) provided by SeuratObject. This will return a `SOMAExperiment` object. This is a small dataset that easily fits into memory, but we'll focus on operations that scale to larger datasets as well.

```{r}
experiment <- load_dataset("soma-exp-pbmc-small")
```

## SOMA DataFrame

We'll start with the `obs` dataframe. Calling `read()$concat()` loads all of the data into memory as an [Arrow Table](https://arrow.apache.org/docs/r/reference/table.html).

```{r}
obs <- experiment$obs
obs$read()$concat()
```

This is easily converted into a `data.frame` using Arrow's methods:

```{r}
obs$read()$concat()$to_data_frame()
```

### Slicing

Slices of data can be read by passing coordinates to the `read()` method. Before we do that, let's take a look at the schema of `obs`:

```{r}
obs$schema()
```

With any SOMA object, you can only slice across an indexed column (a "dimension" in TileDB parlance). You can use `dimnames()` to retrieve the names of any SOMA object's indexed dimensions:

```{r}
obs$dimnames()
```

In this case, there is a single dimension called `soma_joinid`.
From the schema above we can see that it contains integers.

Let's look at a few ways to slice the dataframe.

Select a single row:

```{r}
obs$read(coords = 0)$concat()
```

Select multiple, non-contiguous rows:

```{r}
obs$read(coords = c(0, 2))$concat()
```

Select multiple, contiguous rows:

```{r}
obs$read(coords = 0:4)$concat()
```

### Selecting columns

As TileDB is a columnar format, it is possible to read only a subset of columns by using the `column_names` argument:

```{r}
obs$read(coords = 0:4, column_names = c("obs_id", "groups"))$concat()
```

### Filtering

In addition to slicing by coordinates, you can also apply filters to the data using the `value_filter` argument. These expressions are pushed down to the TileDB engine and efficiently applied to the data on disk. Here are a few examples.

Identify all cells in the `"g1"` group:

```{r}
obs$read(value_filter = "groups == 'g1'")$concat()$to_data_frame()
```

Identify all cells in the `"g1"` or `"g2"` group:

```{r}
obs$read(value_filter = "groups == 'g1' | groups == 'g2'")$concat()$to_data_frame()
```

Alternatively, you can use the `%in%` operator:

```{r}
obs$read(value_filter = "groups %in% c('g1', 'g2')")$concat()$to_data_frame()
```

Identify all cells in the `"g1"` group with more than 60 features:

```{r}
obs$read(value_filter = "groups == 'g1' & nFeature_RNA > 60")$concat()$to_data_frame()
```

## SOMA SparseNDArray

For `SOMASparseNDArray`, let's consider the `X` layer containing the `"counts"` data:

```{r}
counts <- experiment$ms$get("RNA")$X$get("counts")
counts
```

Similar to `SOMADataFrame`, we can load the data into memory as an Arrow Table:

```{r}
counts$read()$tables()$concat()
```

Or as a `Matrix::sparseMatrix()`:

```{r}
counts$read()$sparse_matrix()$concat()
```

### Slicing

Just as with a `SOMADataFrame`, we can also retrieve subsets of the data from a `SOMASparseNDArray` that can fit in memory.
Unlike `SOMADataFrame`s, `SOMASparseNDArray`s are always indexed using a zero-based offset integer on each dimension. The dimensions are always named `soma_dim_N`, where `N` is the dimension number (`soma_dim_0`, `soma_dim_1`, and so on). The `read()` method therefore accepts a list of coordinate vectors, one per dimension, that specifies how to slice the array.

As before, you can use `schema()` or `dimnames()` to retrieve the dimension names.

```{r}
counts$schema()
```

For example, here's how to fetch the first 5 rows of the matrix:

```{r}
counts$read(coords = list(soma_dim_0 = 0:4))$tables()$concat()
```
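
Because `read()` takes one set of coordinates per dimension, the same pattern should extend to slicing rows and columns together. The chunk below is a minimal sketch rather than a definitive recipe: it assumes that `coords` accepts a named list with an entry for both `soma_dim_0` and `soma_dim_1`, and the 5-by-10 window is an arbitrary choice for illustration.

```{r}
library(tiledbsoma)

# Reload the bundled example so this chunk can run on its own
experiment <- load_dataset("soma-exp-pbmc-small")
counts <- experiment$ms$get("RNA")$X$get("counts")

# Assumption: one named entry per dimension restricts both axes at once.
# Coordinates are zero-based, so this requests rows 0-4 and columns 0-9.
sliced <- counts$read(
  coords = list(soma_dim_0 = 0:4, soma_dim_1 = 0:9)
)$tables()$concat()

# Convert the Arrow Table to a data.frame to inspect the stored entries
sliced_df <- as.data.frame(sliced)
sliced_df
```

Only the non-zero entries stored in the array are returned, so the result will usually have fewer rows than the 50 cells in the requested window.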