Unsupervised autoencoder learning on gene expression data from the Human Cell Atlas
Installation
- Clone the git repository
- Install the requirements
To reproduce the workflow, run the following scripts; for more information, read section 3:
- run:
python main_download_loom.py
- run:
python main.py
You may also run the following script; see section 4 for more information:
- run:
python main_on_kaggle.py
The code is meant to be run on a local machine. We have not yet resolved the dependency conflicts on the cluster; they arise because the project uses Python 3.9, while the newest version available on the cluster is Python 3.6.
Usage
1. API Access
This part of the project enables the automated download of loom files from the Human Cell Atlas. To filter the desired files, the following parameters can be specified:
- numberFiles (default: 100)
- sort (default: 'lastModifiedDate')
- order (default: 'asc')
In addition, the following optional parameters can be specified, either individually as a string, e.g. donor_species = "Homo sapiens", or as a list if several values are desired, e.g. organ = ["heart", "blood"]:
- projectId
- projectTitle
- organ
- donor_species
- sampleDisease
- cellType (default: False)
Two approaches are implemented: an interactive one, which runs in the console and lets you view the metadata and select files to download, and a non-interactive one. Open the script "api_access_main.py" for more detailed information and to run the code. For the workflow below you do not need to run "api_access_main.py"; API access is handled by the script "main_download_loom.py".
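For illustration, a non-interactive call could look roughly like the sketch below; the module name and the exact signature of get_files_metadata() are assumptions, so check "api_access_main.py" for the authoritative usage.

# Hypothetical usage sketch; module name and exact signature are assumptions.
from api_access import get_files_metadata

metadata = get_files_metadata(
    numberFiles=50,                  # number of files to request (default: 100)
    sort='lastModifiedDate',         # sort key (default)
    order='asc',                     # sort order (default)
    donor_species="Homo sapiens",    # single value as a string
    organ=["heart", "blood"],        # several values as a list
)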
1.2 For future use
If the dataset in the Human Cell Atlas changes, the following aspect should be considered:
- The functions is_valid_input() and filter_error() check whether the filter inputs of the function get_files_metadata() are valid. In the event of an invalid input, the list of valid inputs is shown to the user; this supports the user and helps to avoid spelling mistakes. Unfortunately, this check could only be implemented for a few variables, because variables such as projectId or projectTitle have too many valid entries, and the set of valid entries changes dynamically as projects are added to or removed from the website. The same problem may eventually affect the variables "organ" and "donor_species"; for future use of the code it therefore makes sense to keep the lists in the function get_validation_list() up to date, or to deactivate the checks in get_files_metadata() in case of problems. The valid values can be viewed in the API response; it is helpful to save them as a JSON file for this purpose.
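The sketch below illustrates the idea behind these checks; it is not the actual implementation, and the value lists are only examples.

# Sketch of the validation logic described above (not the actual code).
def get_validation_list():
    # These lists must be kept up to date with the Human Cell Atlas;
    # the valid values can be read from the API response (e.g. saved as JSON).
    return {
        "organ": ["blood", "brain", "heart", "kidney", "lung"],
        "donor_species": ["Homo sapiens", "Mus musculus"],
    }

def is_valid_input(variable, value):
    valid = get_validation_list().get(variable)
    if valid is None:
        return True  # variables such as projectId are not checked
    values = value if isinstance(value, list) else [value]
    return all(v in valid for v in values)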
2. Autoencoder architecture
The autoencoder has the following dimensions: (2048, 128, 16, 128, 2048), where 2048 is the number of input and output features, 128 the size of the hidden layers, and 16 the size of the latent space. After extensive research we chose to implement the same architecture as T. Geddes, Kim et al., who selected it for single-cell RNA-seq data analysis after performing hyperparameter tuning. We use a vanilla autoencoder, which according to Simidjievski et al. performs well for our use case compared to other autoencoder architectures. See "autoencoder_methods.py" for the architecture specification.
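A minimal sketch of such an architecture, assuming a PyTorch implementation, is shown below; see "autoencoder_methods.py" for the actual code.

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features=2048, hidden=128, latent=16):
        super().__init__()
        # encoder: 2048 -> 128 -> 16
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),
        )
        # decoder: 16 -> 128 -> 2048
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent representation used for t-SNE
        return self.decoder(z)   # reconstruction of the input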
3. Data analysis on Human Cell Atlas data
We downloaded several loom files from the Human Cell Atlas, preprocessed the data, split it into training and test sets, and trained an autoencoder on it. Afterwards, we performed a t-SNE on the latent space of the test data, which has the dimensions (n_samples, 16). To validate and compare our approach, we also performed a PCA on the same, identically preprocessed data set and ran a t-SNE on the test data transformed through the first 16 principal components. Both the autoencoder and the PCA therefore reduce the data set to 16 dimensions.
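The following sketch summarises this comparison, reusing the Autoencoder class from the sketch in section 2 and random data as a placeholder for the preprocessed test set.

import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# placeholder for the preprocessed test data (n_samples x 2048 genes)
X_test = np.random.rand(200, 2048).astype(np.float32)

model = Autoencoder()  # in the real workflow this is the trained model
with torch.no_grad():
    latent = model.encoder(torch.from_numpy(X_test)).numpy()  # (n_samples, 16)

pcs = PCA(n_components=16).fit_transform(X_test)              # (n_samples, 16)

# t-SNE on both 16-dimensional representations
tsne_latent = TSNE(n_components=2).fit_transform(latent)
tsne_pca = TSNE(n_components=2).fit_transform(pcs)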
To reproduce the workflow, run the following scripts:
- run:
python main_download_loom.py
- run:
python main.py
3.1 Results
The results do not show a clear clustering according to the organ types the cells come from. We assume that the cells are from different organs, but that a single organ may contain cells of different tissue types; we cannot verify this because detailed information on the individual cell types is not provided. To check whether our autoencoder architecture suits the use case of clustering single-cell RNA-seq data, we used an alternative dataset.
4. Data analysis on alternative data
Because the results we obtained from the Human Cell Atlas data were not very informative, we set up the same pipeline using another data set.
4.1 ICMR Dataset (Indian Council of Medical Research)
The dataset is published on Kaggle. It contains 801 observations, each assigned to a person with a specific type of cancer: breast, kidney, colon, lung, or prostate cancer.
Each sample contains expression values for 20,531 genes.
We trained the same autoencoder architecture on the Kaggle data and, as before, performed a t-SNE on the resulting latent space. As with the Human Cell Atlas data, we also applied a PCA to the Kaggle data, using the first 16 principal components for dimensionality reduction and clustering with t-SNE. To reproduce the workflow, run the following script (a data-loading sketch follows below):
- run:
python main_on_kaggle.py
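The sketch below shows how the expression matrix and labels might be loaded; the file names and the label layout are assumptions, see "main_on_kaggle.py" for the actual loading code.

import pandas as pd

# file names are assumptions and may differ from those used in main_on_kaggle.py
data = pd.read_csv("data.csv", index_col=0)      # 801 samples x 20,531 genes
labels = pd.read_csv("labels.csv", index_col=0)  # cancer type per sample

print(data.shape)                  # expected: (801, 20531)
print(labels.iloc[:, 0].unique())  # the five cancer types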