---
title: "Reading from SOMA objects"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reading from SOMA objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Overview

In this tutorial we'll learn how to read data from various SOMA objects. We will assume familiarity with SOMA objects, so it is recommended to go through the `vignette("soma-objects")` first.

A core feature of SOMA is the ability to read _subsets_ of data from disk into memory as slices. SOMA uses [Apache Arrow](https://arrow.apache.org/) as an intermediate in-memory storage. From here, the slices can be further converted into native R objects, like data frames and matrices.

```{r}
library(tiledbsoma)
```

## Example data

Load the bundled `SOMAExperiment` containing a subsetted version of the 10X genomics [PBMC dataset](https://satijalab.github.io/seurat-object/reference/pbmc_small.html) provided by SeuratObject. This will return a `SOMAExperiment` object. This is a small dataset that easily fits into memory, but we'll focus on operations that can easily scale to larger datasets as well.

```{r}
experiment <- load_dataset("soma-exp-pbmc-small")
```

## SOMA DataFrame

We'll start with the `obs` dataframe. Simply calling the `read()$concat()` method will load all of the data in memory as an [Arrow Table](https://arrow.apache.org/docs/r/reference/table.html).


```{r}
obs <- experiment$obs
obs$read()$concat()
```

This is easily converted into a `data.frame` using Arrow's methods:

```{r}
obs$read()$concat()$to_data_frame()
```

### Slicing

Slices of data can be read by passing coordinates to the `read()` method. Before we do that, let's take a look at the schema of `obs`:

```{r}
obs$schema()
```

With any SOMA object, you can only slice across an indexed column (a "dimension" in TileDB parlance). You can use `dimnames()` to retrieve the names of any SOMA object's indexed dimensions:

```{r}
obs$dimnames()
```

In this case, there is a single dimension called `soma_joinid`. From the schema above we can see this contains integers.

Let's look at a few ways to slice the dataframe.

Select a single row:

```{r}
obs$read(coords = 0)$concat()
```

Select multiple, non-contiguous rows:

```{r}
obs$read(coords = c(0, 2))$concat()
```

Select multiple, contiguous rows:

```{r}
obs$read(coords = 0:4)$concat()
```

### Selecting columns

As TileDB is a columnar format, it is possible to select a subset of columns to read by using the `column_names` argument:

```{r}
obs$read(coords = 0:4, column_names = c("obs_id", "groups"))$concat()
```

### Filtering

In addition to slicing by coordinates you can also apply filters to the data using the `value_filter` argument. These expressions are pushed down to the TileDB engine and efficiently applied to the data on disk. Here are a few examples.

Identify all cells in the `"g1"` group:

```{r}
obs$read(value_filter = "groups == 'g1'")$concat()$to_data_frame()
```

Identify all cells in the `"g1"` or `"g2"` group:

```{r}
obs$read(value_filter = "groups == 'g1' | groups == 'g2'")$concat()$to_data_frame()
```

Altenratively, you can use the `%in%` operator:

```{r}
obs$read(value_filter = "groups %in% c('g1', 'g2')")$concat()$to_data_frame()
```

Identify all cells in the `"g1"` group with more than more than 60 features:

```{r}
obs$read(value_filter = "groups == 'g1' & nFeature_RNA > 60")$concat()$to_data_frame()
```

## SOMA SparseNDArray

For `SOMASparseNDArray`, let's consider the `X` layer containing the `"counts"` data:

```{r}
counts <- experiment$ms$get("RNA")$X$get("counts")
counts
```

Similar to `SOMADataFrame`, we can load the data into memory as an Arrow Table:

```{r}
counts$read()$tables()$concat()
```

Or as a [`Matrix::sparseMatrix()`]:

```{r}
counts$read()$sparse_matrix()$concat()
```

### Slicing

Just as with a `SOMADataFrame`, we can also retrieve subsets of the data from a `SOMASparseNDArray` that can fit in memory.

Unlike `SOMADataFrame`s, `SOMASparseNDArray`s are always indexed using a zero-based offset integer on each dimension, named `soma_dim_N`. Therefore, if the array is `N`-dimensional, the `read()` method can accept a list of length `N` that specifies how to slice the array.

`SOMASparseNDArray` dimensions are always named `soma_dim_N` where `N` is the dimension number. As before you could use `schema()` or `dimnames()` to retrieve the dimension names.

```{r}
counts$schema()
```

For example, here's how to fetch the first 5 rows of the matrix:

```{r}
counts$read(coords = list(soma_dim_0 = 0:4))$tables()$concat()
```