Mutation library
The current library is hosted in a separate GitHub repository here. This repository contains a CSV file with all the mutations and associated information.
The original mutation list
Originally, the mutations used for TB-Profiler were collated from the literature (detailed here). Since then, the library has been continuously updated and curated with contributions from the community.
WHO mutation catalogue
In 2021 the WHO published a list of mutations associated with drug resistance in TB. This was followed up in 2023 with a more comprehensive list, which can be found here.
Which mutations are used by TB-Profiler?
There are two main mutation databases maintained for use with tb-profiler, both hosted in a dedicated repository. They are kept on different GitHub branches and are detailed below:
Branch name | Description |
---|---|
who | All mutations from the 2nd edition WHO mutation catalogue. |
tbdb | A combination of the original library and the WHO catalogue. |
If you install tb-profiler, it will come with the tbdb library. If you want to use the WHO catalogue you will need to run the following command:
tb-profiler update_tbdb --branch who
You should then be able to see the database by running tb-profiler list_db. If you then want to use this database during the profiling process, you can use the --db who flag.
What is the difference between the two libraries?
The WHO catalogue is a comprehensive list of mutations associated with drug resistance in TB. These were defined based on statistical support from a large number of isolates. While this means that the listed mutations are well supported, rare resistance mutations may not have been captured. The tbdb library combines this list with the original library, which was curated from the literature, so it may contain mutations that are not in the WHO catalogue but have been reported in the literature. In addition, the tbdb database contains mutations for drugs that were not evaluated by the WHO, such as para-aminosalicylic acid (PAS) and cycloserine (CS). To give an idea of the differences between the two libraries, the following table shows the sensitivity and specificity of each library when run on a test set of isolates with paired phenotypic and genotypic data.
Important
This dataset is an amalgamation of data from many different published studies which may have different methodologies for performing the DST. The dataset is not perfect and should be taken with a grain of salt. It does however give an idea of the differences between the two libraries.
Drug | Sensitive isolates | Resistant isolates | Sensitivity - tbdb | Specificity - tbdb | Sensitivity - who | Specificity - who |
---|---|---|---|---|---|---|
rifampicin | 23022 | 10544 | 0.97 | 0.97 | 0.97 | 0.97 |
isoniazid | 20501 | 12838 | 0.95 | 0.98 | 0.92 | 0.98 |
ethambutol | 24688 | 5475 | 0.93 | 0.90 | 0.87 | 0.92 |
pyrazinamide | 12340 | 2842 | 0.89 | 0.96 | 0.85 | 0.97 |
moxifloxacin | 13446 | 2534 | 0.92 | 0.92 | 0.91 | 0.92 |
levofloxacin | 12620 | 3131 | 0.92 | 0.97 | 0.91 | 0.97 |
bedaquiline | 11045 | 100 | 0.40 | 0.98 | 0.40 | 0.98 |
delamanid | 10829 | 178 | 0.12 | 1.00 | 0.12 | 1.00 |
linezolid | 13218 | 174 | 0.40 | 1.00 | 0.40 | 1.00 |
streptomycin | 7527 | 3907 | 0.84 | 0.93 | 0.80 | 0.93 |
amikacin | 16107 | 1652 | 0.84 | 0.99 | 0.84 | 0.99 |
kanamycin | 16047 | 2407 | 0.85 | 0.97 | 0.85 | 0.98 |
capreomycin | 9242 | 1027 | 0.79 | 0.97 | 0.78 | 0.97 |
clofazimine | 12934 | 483 | 0.16 | 0.99 | 0.16 | 0.99 |
ethionamide | 11783 | 2993 | 0.85 | 0.89 | 0.83 | 0.90 |
PAS | 1123 | 102 | 0.39 | 0.95 | Not tested | Not tested |
cycloserine | 960 | 157 | 0.29 | 0.97 | Not tested | Not tested |
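As can be seen, the tbdb library has equal or higher sensitivity than the WHO catalogue for every drug evaluated with both libraries. This is because it contains mutations which are not in the WHO catalogue. However, this comes at the cost of a small drop in specificity for some drugs (ethambutol, pyrazinamide, kanamycin and ethionamide). The largest differences in sensitivity are seen for ethambutol (0.93 vs 0.87), pyrazinamide (0.89 vs 0.85), streptomycin (0.84 vs 0.80) and isoniazid (0.95 vs 0.92).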
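For reference, sensitivity is the fraction of phenotypically resistant isolates that are predicted resistant, and specificity is the fraction of phenotypically sensitive isolates that are predicted sensitive. The following is a minimal Python sketch, not part of TB-Profiler, that applies these definitions and computes the per-drug differences for a few rows copied from the table above. The function names and the counts used in the usage lines are illustrative assumptions.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of resistant isolates that are predicted resistant."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of sensitive isolates that are predicted sensitive."""
    return true_neg / (true_neg + false_pos)


# Values copied from the table above, in the order:
# (sensitivity tbdb, specificity tbdb, sensitivity who, specificity who)
TABLE = {
    "isoniazid": (0.95, 0.98, 0.92, 0.98),
    "ethambutol": (0.93, 0.90, 0.87, 0.92),
    "pyrazinamide": (0.89, 0.96, 0.85, 0.97),
    "streptomycin": (0.84, 0.93, 0.80, 0.93),
}


def library_differences(table):
    """Return (tbdb - who) for sensitivity and specificity, per drug."""
    return {
        drug: (round(sens_tbdb - sens_who, 2), round(spec_tbdb - spec_who, 2))
        for drug, (sens_tbdb, spec_tbdb, sens_who, spec_who) in table.items()
    }


if __name__ == "__main__":
    # Hypothetical counts, used only to show the definitions in action.
    print(sensitivity(true_pos=95, false_neg=5))
    print(specificity(true_neg=98, false_pos=2))
    for drug, (d_sens, d_spec) in library_differences(TABLE).items():
        print(f"{drug}: sensitivity {d_sens:+.2f}, specificity {d_spec:+.2f}")
```

A positive sensitivity difference together with a negative specificity difference, as for ethambutol, reflects the trade-off described above.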
Note
There is no right choice between the two libraries and the choice of library will depend on the specific use case.
Want to contribute?
If you think a mutation should be removed or added, please raise an issue here. If you want to help curate the library, leave a comment here.
Adding/removing mutations
Mutations can be added by submitting a pull request from a branch containing a modified tbdb.csv file. If that previous sentence made no sense to you, then you can suggest a change using an issue and we will try to help.
How does it work?
The mutations are listed in mutations.csv. These are parsed by tb-profiler create_db to generate the JSON-formatted database used by TB-Profiler, along with a few more files. Mutations can be added to or removed from tbdb.csv and a new library can be built using tb-profiler create_db.
tbdb.csv
This is a CSV file which must contain the following column headings:
1. Gene - These can be the gene names (e.g. rpoB) or locus tag (e.g. Rv0667).
2. Mutation - These must follow the HGVS nomenclature or be a valid sequence ontology term. More info on this down below.
3. type
4. drug - Name of the drug
5. original_mutation
6. confidence - The confidence as given by the WHO mutation catalogue
7. source - A reference to the source of the mutation (e.g. WHO catalogue v2 or tbdb)
8. comment - Any additional information about the mutation
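Before building a library it can help to check that an edited CSV still has all of the headings listed above. The following is a minimal sketch using only the Python standard library. It is not how tb-profiler create_db validates its input, the helper name missing_columns is hypothetical, and the values in the example row for the type and confidence columns are placeholders rather than real database values.

```python
import csv
import io

# Column headings listed above.
REQUIRED_COLUMNS = [
    "Gene",
    "Mutation",
    "type",
    "drug",
    "original_mutation",
    "confidence",
    "source",
    "comment",
]


def missing_columns(csv_text: str) -> list:
    """Return the required headings that are absent from the CSV header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in present]


# Illustrative content only: the type and confidence values are placeholders.
EXAMPLE = (
    "Gene,Mutation,type,drug,original_mutation,confidence,source,comment\n"
    "rpoB,p.Ser450Leu,example_type,rifampicin,S450L,example_confidence,tbdb,illustrative row\n"
)

if __name__ == "__main__":
    print(missing_columns(EXAMPLE))
    print(missing_columns("Gene,Mutation\nrpoB,p.Ser450Leu\n"))
```

To check a file on disk, read its contents into a string and pass that to the function.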
Mutation format
HGVS nomenclature
Mutations must follow the HGVS nomenclature. Information on this format can be found here. The following types of mutations are currently allowed:
- Amino acid substitutions. Example: S450L in rpoB would be p.Ser450Leu
- Deletions in genes. Example: Deletion of nucleotide 758 in tlyA would be c.758del
- Insertion in genes. Example: Insertion of GT between nucleotide 1850 and 1851 in katG would be c.1850_1851insGT
- SNPs in non-coding RNAs. Example: A to G at position 1401 in rrs would be n.1401A>G
- SNPs in gene promoters. Example: A to G 7 bases 5' of the start codon in pncA would be c.-7A>G
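The mutation types listed above can be told apart by their shape. The sketch below uses simplified regular expressions to classify the example strings from this list. It is an illustration only: the patterns are assumptions that cover just these five shapes, they do not implement the full HGVS nomenclature, and they are not the parser used by tb-profiler.

```python
import re

# Simplified patterns for the five mutation types listed above (assumptions,
# not the full HGVS grammar).
PATTERNS = {
    "amino acid substitution": re.compile(r"p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}"),
    "deletion": re.compile(r"c\.-?\d+(_-?\d+)?del[ACGT]*"),
    "insertion": re.compile(r"c\.-?\d+_-?\d+ins[ACGT]+"),
    "non-coding RNA SNP": re.compile(r"n\.\d+[ACGTacgt]>[ACGTacgt]"),
    "promoter SNP": re.compile(r"c\.-\d+[ACGT]>[ACGT]"),
}


def classify(mutation: str):
    """Return the name of the first pattern that matches the whole string, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(mutation):
            return name
    return None


if __name__ == "__main__":
    for example in ["p.Ser450Leu", "c.758del", "c.1850_1851insGT", "c.-7A>G", "S450L"]:
        print(example, "->", classify(example))
```

A string such as S450L is not matched, because the shorthand has to be written in HGVS form (p.Ser450Leu) first.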
Sequence ontology terms
If the mutation is not a simple amino acid substitution or indel then it can be described using a sequence ontology term. The sequence ontology is a standardised ontology for describing genomic features. All terms detected by snpEff are valid sequence ontology terms and can be found here.
Epistasis rules
Epistasis rules can be added to the rules.txt file. These are rules that allow tb-profiler to negate the resistance effect of certain mutations if another mutation is present. For example, if an mmpL5 loss of function mutation is present then it will inactivate the effect of any mmpR5 resistance mutation. The format of the rules.txt file is as follows:
Variant(gene_name=mmpL5,type=lof) inactivates_resistance Variant(gene_name=mmpR5)
This will inactivate the effect of any mmpR5 resistance mutation if an mmpL5 loss of function mutation is present. The rules.txt file can contain multiple rules.
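To make the structure of a rule line concrete, the following sketch splits the example rule into its source variant, action and target variant, and then checks it against a list of observed variants. This is only an illustration of the format shown above, not the way tb-profiler evaluates rules. The function names and the dictionary shape used for observed variants are hypothetical.

```python
import re

# Matches lines of the form: Variant(...) <action> Variant(...)
RULE_RE = re.compile(
    r"Variant\((?P<source>[^)]*)\)\s+(?P<action>\w+)\s+Variant\((?P<target>[^)]*)\)"
)


def parse_fields(text: str) -> dict:
    """Turn 'gene_name=mmpL5,type=lof' into {'gene_name': 'mmpL5', 'type': 'lof'}."""
    return dict(item.strip().split("=", 1) for item in text.split(",") if item.strip())


def parse_rule(line: str) -> dict:
    """Parse one rule line into its source, action and target parts."""
    match = RULE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised rule: {line!r}")
    return {
        "source": parse_fields(match["source"]),
        "action": match["action"],
        "target": parse_fields(match["target"]),
    }


def inactivated_gene(rule: dict, variants: list):
    """Return the target gene if any observed variant matches every source field."""
    source = rule["source"]
    for variant in variants:
        if all(variant.get(key) == value for key, value in source.items()):
            return rule["target"].get("gene_name")
    return None


if __name__ == "__main__":
    rule = parse_rule(
        "Variant(gene_name=mmpL5,type=lof) inactivates_resistance Variant(gene_name=mmpR5)"
    )
    print(rule)
    # Hypothetical observed variant, described with the same keys as the rule.
    print(inactivated_gene(rule, [{"gene_name": "mmpL5", "type": "lof"}]))
```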
Confidence values
The confidence values are taken from the WHO catalogue but this can be changed to whatever you want.
Generating a new library
Clone the repository using git clone https://github.com/jodyphelan/tbdb.git, which will create a folder with all the required files. You can then run tb-profiler create_db inside that folder:
Usage: tb-profiler create_db [-h] --prefix PREFIX [--csv CSV [CSV ...]] [--watchlist WATCHLIST] [--spoligotypes SPOLIGOTYPES] [--spoligotype_annotations SPOLIGOTYPE_ANNOTATIONS] [--barcode BARCODE]
[--bedmask BEDMASK] [--rules RULES] [--amplicon_primers AMPLICON_PRIMERS] [--match_ref MATCH_REF] [--custom] [--db_name DB_NAME] [--db_commit DB_COMMIT]
[--db_author DB_AUTHOR] [--db_date DB_DATE] [--include_original_mutation] [--load] [--no_overwrite] [--dir DIR] [--temp TEMP] [--version]
[--logging {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Options:
-h, --help show this help message and exit
--prefix, -p PREFIX The input CSV file containing the mutations (default: None)
--csv, -c CSV [CSV ...]
The prefix for all output files (default: ['mutations.csv'])
--watchlist, -w WATCHLIST
A csv file containing genes to profile but without any specific associated mutations (default: watchlist.csv)
--spoligotypes SPOLIGOTYPES
A file containing a list of spoligotype spacers (default: spoligotype_spacers.txt)
--spoligotype_annotations SPOLIGOTYPE_ANNOTATIONS
--barcode BARCODE A bed file containing lineage barcode SNPs (default: barcode.bed)
--bedmask BEDMASK A bed file containing a list of low-complexity regions (default: mask.bed)
--rules RULES A file containing python rules (default: rules.txt)
--amplicon_primers AMPLICON_PRIMERS
A file containing a list of amplicon primers (default: None)
--match_ref MATCH_REF
The prefix for all output files (default: None)
--custom Tells the script this is a custom database, this is used to alter the generation of the version definition (default: False)
--db_name DB_NAME Overrides the name of the database in the version file (default: None)
--db_commit DB_COMMIT
Overrides the commit string of the database in the version file (default: None)
--db_author DB_AUTHOR
Overrides the author of the database in the version file (default: None)
--db_date DB_DATE Overrides the date of the database in the version file (default: None)
--include_original_mutation
Include the original mutation (before reformatting) as part of the variant annotaion (default: False)
--load Automaticaly load database (default: False)
--no_overwrite Don't load if existing database with prefix exists (default: False)
--dir, -d DIR Storage directory (default: .)
--temp TEMP Temp firectory to process all files (default: .)
--version show program's version number and exit
--logging {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Logging level (default: INFO)
The database files are generated with the prefix given by the --prefix argument (for example, --prefix tbdb). Note that the usage above lists --prefix as a required argument.
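If the build is scripted, the command line can be assembled programmatically. The sketch below only builds and prints the command and does not run it. It uses flags that appear in the usage text above (--prefix, --csv, --match_ref and --load). The helper name create_db_command is hypothetical.

```python
import shlex


def create_db_command(prefix, csv_files=("mutations.csv",), match_ref=None, load=False):
    """Assemble a tb-profiler create_db command as a list of arguments."""
    command = ["tb-profiler", "create_db", "--prefix", prefix, "--csv", *csv_files]
    if match_ref is not None:
        # Path to a reference FASTA whose sequence name should be used.
        command += ["--match_ref", match_ref]
    if load:
        # Ask create_db to load the database after building it.
        command.append("--load")
    return command


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(create_db_command("tbdb", load=True)))
```

The resulting list can be passed to subprocess.run from inside the cloned repository folder if you want the script to execute the build as well.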
Using alternate reference names
The tbdb database will assume you have mapped your data to a reference with "Chromosome" as the sequence name. If your reference sequence is the same but has a different name, e.g. "NC_000962.3", you can generate a database with an alternate sequence name using the --match_ref /path/to/your/reference.fasta flag.
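One way to find out whether this applies to your data is to look at the sequence names in your reference FASTA. The following is a small sketch under the assumption that the reference is a plain-text FASTA file. The function names are hypothetical and the sequence in the example is a placeholder.

```python
import io


def sequence_names(fasta_handle):
    """Return the sequence names (first word of each '>' header line)."""
    names = []
    for line in fasta_handle:
        if line.startswith(">"):
            header = line[1:].strip()
            names.append(header.split()[0] if header else "")
    return names


def needs_match_ref(fasta_handle, expected="Chromosome"):
    """True if the reference is not a single sequence with the expected name."""
    return sequence_names(fasta_handle) != [expected]


if __name__ == "__main__":
    # Placeholder reference with a short dummy sequence.
    example = io.StringIO(">NC_000962.3 example description\nACGT\n")
    print(needs_match_ref(example))
```

For a real reference, open the file and pass the handle, e.g. with open("/path/to/your/reference.fasta") as handle.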
Watchlist
For some genes it may be of interest to record mutations even if we do not have any specific associated mutations. To allow this functionality we have included a "watchlist" file. To include genes, just add them and the associated drugs to the watchlist.csv file.