---
title: "Querying a SOMA experiment"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Querying a SOMA experiment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

In this notebook, we'll take a quick look at how to query a `SOMAExperiment` using the `SOMAExperimentAxisQuery` class. This allows for easy selection of data from a `SOMAMeasurement` by filtering on annotations stored in each axis data frame (i.e., `obs` and `var`).

```{r}
library(tiledbsoma)
```

## Example data

Load the bundled `SOMAExperiment` containing a subsetted version of the 10X genomics [PBMC dataset](https://satijalab.github.io/seurat-object/reference/pbmc_small.html) provided by SeuratObject. This will return a `SOMAExperiment` object.

```{r}
experiment <- load_dataset("soma-exp-pbmc-small")
experiment
```

## Querying basics

To perform a query we'll need to initialize a new `SOMAExperimentAxisQuery` object, specifying the `SOMAExperiment` and the `SOMAMeasurement` within the experiment we want to query.

We can see that our current experiment contains only a single measurement: `"RNA"`.

```{r}
experiment$ms
```

To use larger (or smaller) buffer sizes:

```{r}
ctx <- SOMATileDBContext$new(c(soma.init_buffer_bytes=as.character(2 * 1024**3)))
experiment <- SOMAExperimentOpen(experiment$uri, tiledbsoma_ctx = ctx)
```

Alternatively, you can have in your environment `export TILEDB_SOMA_INIT_BUFFER_BYTES=2147483648` before loading the data.


Now we can construct our query object.

```{r}
query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA"
)
```

Once it's created, we can use the `query` object to inspect, select, and extract filtered data from the experiment.

For example, we can use `n_obs` and `n_vars` to determine the number of observations and variables that passed our filtering criteria. Since we didn't specify any filtering criteria, these numbers will match the full size of the experiment.

Number of observations:

```{r}
query$n_obs
```

Number of variables:

```{r}
query$n_vars
```

We can also extract any data component from the experiment. Here we'll read in the `obs` data frame from the query using `obs()` which returns an iterator of `arrow::Table`. The iterator is useful when the data is too large to load in memory allowing to stream the data in chunks. This applies to `var()` as well.

To load the data in memory you can concatenate all chunks of the iterator as shown below.

```{r}
iterator <- query$obs()
obs <- iterator$concat()
obs
```
As a reminder `arrow:Table` can be easily cast into a `tibble`

```{r}
obs$to_data_frame()
```

Alternatively, you can use the iterator, which retrieves data in chunks that are smaller than the `soma.init_buffer_bytes` context field. You can use the iterator's method `$read_next()` to load a chunk in memory.

```{r}
iterator <- query$obs()
iterator$read_next()
```

In this example the full `obs` table is relatively small and fits all in one chunk.

For a bigger `SOMADataFrame` you can check if the iteration has finished by checking the logical `$read_complete()`.

Here we demonstrate by creating a new iterator.

```{r}
iterator <- experiment$obs$read()
iterator$read_complete()
```
```{r}
iterator$read_next()
iterator$read_complete()
```

We can also access the expression via `X()`.

Similarly to `obs()` and `var()`, `X()` is intended for iteration, but in this case we have access to two different iterators, and thus `X()` returns a reader that gives you access to an iterator for `arrow::Table` and one for `Matrix::sparse_matrix`.

Let's take a look at the Arrow Table iterator:

```{r}
reader <- query$X(layer_name = "counts")
table_irerator <- reader$tables()
table_irerator$read_next()
```

As in the `obs` example the data is small enough to fit in one chunk. For bigger data you can user `iterator$read_complete()` to check the status of iteration and `iterator$concat()` to concatenate the rest of the chunks.

The iterator for  `Matrix::sparse_matrix` works in the same way. Keep in mind that the matrix format is `dgTMatrix` as it is the most memory-efficient and the only format type that can be easily iterated. And most importantly, the resulting object is a "view" of the full matrix with the original shape and indexes but only with data corresponding to the query coordinates or filters (see section below).

```{r}
reader <- query$X(layer_name = "counts")
iterator <- reader$sparse_matrix()
str(iterator$read_next())
```

## Adding filters

Adding filters requires creating a `SOMAAxisQuery` object that allows you to define coordinates, value filters, or both for an axis.

Here we'll create a query for `obs` that slices the first 40 rows, and then filters that subset based on the `nCount_RNA` column.

```{r}
obs_query <- SOMAAxisQuery$new(
  coords = list(soma_joinid = 0:39),
  value_filter = "nCount_RNA > 100"
)
```

To apply this filter we'll pass it to a new `SOMAExperimentAxisQuery` object.

```{r}
query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA",
  obs_query = obs_query
)
```

Let's see how many observations this query identified.

```{r}
query$n_obs
```

As before, we can load the `obs` data frame into memory but now it only includes the filtered observations.

```{r}
obs <- query$obs(column_names = c("obs_id", "nCount_RNA"))$concat()
obs$to_data_frame()
```

As well as the X matrix in two different formats:

`arrow::Table`

```{r}
query$X("counts")$tables()$concat()
```

`Matrix::sparse_matrix` in `dgTMatrix` format.

```{r}
str(query$X("counts")$sparse_matrix()$concat())
```

For a re-indexed and re-shaped sparse matrix see the section below.

## Export to an annotated re-indexed sparse matrix

Any component of the queried `SOMAExperiment` can be exported to a [sparse matrix][Matrix::sparseMatrix-class] using the `to_sparse_matrix()` method.

For example, let's create a sparse matrix of the filtered expression data. We'll create a new query that returns a smaller subset of the data to make the output easier to read.

```{r}
query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA",
  obs_query = SOMAAxisQuery$new(coords = 0:9),
  var_query = SOMAAxisQuery$new(coords = 0:9)
)
```

Then we indicate that we want to access the `"counts"` layer of the `"X"` collection.

```{r}
query$to_sparse_matrix(
  collection = "X",
  layer = "counts"
)
```

By default, the dimensions are named using `soma_joinid` values which are unique to each observation and variable. However, dimension names can come from any column in the `obs` and `var` arrays that uniquely identifies each record. For an expression matrix it makes sense to name the dimensions using cell barcodes and gene names, which are stored in the `obs_id` and `var_id` columns, respectively.

```{r}
query$to_sparse_matrix(
  collection = "X",
  layer = "counts",
  obs_index = "obs_id",
  var_index = "var_id"
)
```

We can use this method for any of the `SOMAExperiment`'s collections. Let's access the t-SNE coordinates stored in the `obsm` collection's `X_tsne` layer, populating the row names with cell barcodes.

```{r}
query$to_sparse_matrix(
  collection = "obsm",
  layer = "X_tsne",
  obs_index = "obs_id"
)
```

## Export to Seurat

The `query` object also contains methods for loading in results as a Seurat object (or any of Seurat's component classes). As with the `to_sparse_matrix()` method, we can specify the `obs_index` and `var_index` to use for naming the dimensions of the resulting object.

```{r eval=requireNamespace("SeuratObject", quietly = TRUE)}
query <- SOMAExperimentAxisQuery$new(
  experiment = experiment,
  measurement_name = "RNA"
)

query$to_seurat(
  X_layers = c(counts = "counts", data = "data"),
  obs_index = "obs_id",
  var_index = "var_id"
)
```