Encoding RNA data.

The evolution of the latent representation of BRCA data throughout the training procedure.

Background

Transcriptomic (RNAseq) data is very high dimensional and difficult to handle with classical statistical techniques. Autoencoders, a kind of neural network that reconstructs its input through a bottleneck, have been of tremendous help in tackling this issue. They let us work with smaller representations, enabling subsequent stratification, inference and visualization.

Method

I implemented a highly modular autoencoder approach in which different parts can be swapped out, making it possible to assess the performance of multiple approaches. These modules range from the type of layers used (conventional multilayer perceptron (MLP) or convolutional neural network (CNN)) to the type of latent space implemented. The following approaches were considered: no variational approach (plain autoencoder), Variational Autoencoder (VAE) and Vector-Quantized VAE (VQ-VAE).
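To make the idea of a swappable latent space concrete, here is a minimal NumPy sketch. It is not the code from my repository: the function names (`identity_latent`, `variational_latent`, `make_vq_latent`, `encode`), the single linear layer and the toy sizes are all assumptions chosen for illustration, and the decoder, losses and training loop are left out.

```python
import numpy as np

rng = np.random.default_rng(0)


def identity_latent(h):
    """Plain autoencoder: the encoder output is used directly as the latent code."""
    return h


def variational_latent(h):
    """VAE-style latent: split h into mean and log-variance halves, then
    sample with the reparameterization trick z = mu + sigma * eps."""
    mu, log_var = np.split(h, 2, axis=1)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def make_vq_latent(codebook):
    """VQ-VAE-style latent: snap each encoding to its nearest codebook vector."""
    def vq_latent(h):
        # Squared Euclidean distance from every row of h to every code.
        dist = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return codebook[dist.argmin(axis=1)]
    return vq_latent


def encode(x, weights, latent_fn):
    """Hypothetical encoder: one linear layer + tanh, then the chosen latent module."""
    return latent_fn(np.tanh(x @ weights))


# Toy data: 5 "patients" with 20 "genes" each (random numbers, not real RNAseq).
x = rng.standard_normal((5, 20))
w = rng.standard_normal((20, 8)) * 0.1
codebook = rng.standard_normal((4, 8))

z_plain = encode(x, w, identity_latent)         # shape (5, 8)
z_vae = encode(x, w, variational_latent)        # shape (5, 4): half mean, half log-variance
z_vq = encode(x, w, make_vq_latent(codebook))   # shape (5, 8), rows taken from the codebook
```

Because every latent module takes the encoder output and returns a code, the rest of the pipeline does not need to know which variant is in use. The same pattern would apply to swapping MLP layers for convolutional ones.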

Results

Each point corresponds to one patient's representation by the encoder at a given step of the training process. This is achieved by training an autoencoder on RNAseq data from ~1200 patients, provided openly by the TCGA project. At a given set of steps, we compute the PCA projection of the latent representation of the whole dataset, then assemble the projections into a smooth animation.
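The projection step can be sketched as follows, assuming the latent representation of the whole dataset has been saved at a few training steps. The `snapshots` dictionary, the `pca_project` helper and the random stand-in data are hypothetical; only the PCA itself (centering followed by an SVD) is standard.

```python
import numpy as np


def pca_project(latents, n_components=2):
    """Project latent codes onto their first principal components."""
    centered = latents - latents.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in for latent codes saved during training: step -> (patients, latent_dim).
rng = np.random.default_rng(0)
snapshots = {step: rng.standard_normal((100, 16)) for step in (0, 100, 200)}

# One 2D scatter per saved step; each entry is one frame of the animation.
frames = {step: pca_project(z) for step, z in snapshots.items()}
```

One thing to keep in mind with this kind of sketch: the sign of each principal component is arbitrary, so when PCA is fitted independently per frame, consecutive frames can appear mirrored unless the axes are aligned in some way.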

Going further

Some automatic parameter search methods have been implemented. This was the goal of my thesis, where I assessed the clustering capabilities on one dataset for a given set of hyperparameters. I would then use those same hyperparameters on a different dataset, with similar processing steps, hoping to discover an underlying structure within the target dataset.

You can find the code used to build this dynamic representation, and more, on my GitHub, along with its documentation.