---
license: gpl-3.0
language:
- en
library_name: sklearn
---

Prediction of aerobicity (whether a bacterium or archaeon is aerobic) based on gene copy numbers. The prediction is posed as a 2-class problem: each genome is predicted to be either aerobic or anaerobic.

This predictor was used in the following manuscript (currently pre-publication). Please cite it if appropriate:

Davin, Adrian A., Ben J. Woodcroft, Rochelle M. Soo, Ranjani Murali, Dominik Schrempf, James Clark, Bastien Boussau et al. "An evolutionary timescale for Bacteria calibrated using the Great Oxidation Event." bioRxiv (2023): 2023-08.
https://www.biorxiv.org/content/10.1101/2023.08.08.552427v1.full

## Installation

First ensure you have installed git-lfs (including running `git lfs install`), as described at https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

Then clone this repository and fetch the files tracked by git-lfs:

```
git clone https://huggingface.co/wwood/aerobicity
git -C aerobicity lfs fetch --all
git -C aerobicity lfs pull
```

Then set up the conda environment:

```
cd aerobicity
mamba env create -p env -f env-apply.yml
conda activate ./env
```

Finally, download the eggNOG database. We use version 2.1.3, as specified in the `env-apply.yml` conda environment file, because this is what the predictor was trained on. The eggNOG database is large, so it is not included in the repository. To download it, run:

```
mkdir eggNOG
download_eggnog_data.py --data_dir ./eggNOG
```

## Usage
To apply the predictor to the included test genome, run:

```
./17_apply_to_proteome.py --protein-fasta data/RS_GCF_000515355.1_protein.faa --eggnog-data-dir eggNOG/ \
    --models XGBoost.model --output-predictions predictions.csv
```

The predictions are written to `predictions.csv`. In this output file, a prediction of `0` corresponds to an anaerobic prediction, and `1` corresponds to an aerobic prediction.
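The 0/1 coding can be turned into readable labels with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes the file has a header row and is comma-separated, and the column names `genome` and `prediction` are hypothetical, so check the header of your own `predictions.csv` and adjust them.

```python
import csv
import io

# Coding described in this README: 0 = anaerobic, 1 = aerobic.
LABELS = {0: "anaerobic", 1: "aerobic"}


def label_predictions(handle, column="prediction"):
    """Yield each CSV row as a dict with an added 'label' field.

    `column` is an assumed name for the field holding the 0/1 value.
    """
    for row in csv.DictReader(handle):
        value = int(float(row[column]))  # accept "1" as well as "1.0"
        yield {**row, "label": LABELS[value]}


# Illustrative in-memory data, not real output from the predictor.
example = io.StringIO("genome,prediction\ngenome_a,1\ngenome_b,0\n")
labelled = list(label_predictions(example))
for row in labelled:
    print(row["genome"], row["label"])
```

For a real run, replace the in-memory example with `open("predictions.csv", newline="")`.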

To run on your own genome, provide its protein FASTA file (i.e. the result of running `prodigal` on it) in place of `data/RS_GCF_000515355.1_protein.faa` in the above command.
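For several genomes, the two steps (gene calling, then prediction) can be chained. The sketch below only builds the command lines. It assumes `prodigal` is on your `PATH`, that it is run from the repository root, and that the inputs are nucleotide FASTA files. The helper names, the `out` directory and the output file naming are hypothetical choices made for illustration; the predictor flags are the same ones used in the command above.

```python
import subprocess
from pathlib import Path


def build_commands(genome_fasta, out_dir="."):
    """Return [prodigal_cmd, predict_cmd] for one nucleotide FASTA file."""
    stem = Path(genome_fasta).stem
    out = Path(out_dir)
    proteins = out / f"{stem}_protein.faa"
    prodigal_cmd = [
        "prodigal",
        "-i", str(genome_fasta),               # input nucleotide FASTA
        "-a", str(proteins),                   # protein translations
        "-o", str(out / f"{stem}_genes.gbk"),  # gene coordinates
    ]
    predict_cmd = [
        "./17_apply_to_proteome.py",
        "--protein-fasta", str(proteins),
        "--eggnog-data-dir", "eggNOG/",
        "--models", "XGBoost.model",
        "--output-predictions", str(out / f"{stem}_predictions.csv"),
    ]
    return [prodigal_cmd, predict_cmd]


def run_all(genome_fastas, out_dir="."):
    """Run both steps for each genome, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for genome in genome_fastas:
        for cmd in build_commands(genome, out_dir):
            subprocess.run(cmd, check=True)
```

Calling `run_all(["genomes/my_genome.fna"], out_dir="out")` would then write `out/my_genome_predictions.csv`, with one predictions file per input genome.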