genbank_parsing.Rmd
library(floundeR)
#> floundeR v0.0.5
The Genbank file format is a standard for the sharing and distribution of whole genome sequences with their accompanying annotations. The Genbank
parser in floundeR
may not be the most technically competent parser and have been implemented with a couple of ambitions that are related to the analysis and identification of genetic variants in microbial genomes. This will undoubtedly change with time …
There is an R6 object defined in floundeR
called GenbankGenome
. Let’s identify the packaged genbank file that corresponds to a Tuberculosis reference genome and create the corresponding object.
TB_reference = flnDr("TB_H37Rv.gb.gz")
tb <- GenbankGenome$new(TB_reference)
#> ℹ Parsing genbank file [/tmp/Rtmpqr71Kn/temp_libpath14ba6f8cb076/floundeR/extdata/TB_H37Rv.gb.gz]
#> → processing sequence [73527] lines
#> → setting sequence
#> ✓ Done! [213179] lines parsed
This has created a TB GenbankGenome
object. A number of functions are provided by the object to access the annotation content of this genome. The fundamental methods provided a briefly introduced below.
The CDS
features annotated in the Genbank
file have been collated as a GenomicRanges
object. This can be accessed using
tb$list_cds()
#> GRanges object with 4031 ranges and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> [1] AL123456 BX842572-BX.. 1-1524 + | dnaA
#> [2] AL123456 BX842572-BX.. 2052-3260 + | dnaN
#> [3] AL123456 BX842572-BX.. 3280-4437 + | recF
#> [4] AL123456 BX842572-BX.. 4434-4997 + | Rv0004
#> [5] AL123456 BX842572-BX.. 5240-7267 + | gyrB
#> ... ... ... ... . ...
#> [4027] AL123456 BX842572-BX.. 4408334-4408897 - | Rv3920c
#> [4028] AL123456 BX842572-BX.. 4408969-4410069 - | Rv3921c
#> [4029] AL123456 BX842572-BX.. 4410053-4410415 - | Rv3922c
#> [4030] AL123456 BX842572-BX.. 4410412-4410789 - | rnpA
#> [4031] AL123456 BX842572-BX.. 4410786-4410929 - | rpmH
#> translation
#> <character>
#> [1] MTDDPGSGFTTVWNAVVSEL..
#> [2] MDAATTRVGLTDLTFRLLRE..
#> [3] MYVRHLGLRDFRSWACVDLE..
#> [4] MTGSVDRPDQNRGERSMKSP..
#> [5] MAAQKKKAQDEYGAASITIL..
#> ... ...
#> [4027] MADADTTDFDVDAEAPGGGV..
#> [4028] MSLLFDFFSLDFIYYPVSWI..
#> [4029] MSLSRQSCGRVVRVTGRASA..
#> [4030] MIATPGLFAVLRARNRMRRS..
#> [4031] MTKGKRTFQPNNRRRARVHG..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome
or for atomic entries (I am interested in the gene identified as eis
we can use the more targetted method method
tb$get_cds(feature_id="eis")
#> → extracting cds [eis]
#> GRanges object with 1 range and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> 2485 AL123456 BX842572-BX.. 2714124-2715332 - | eis
#> translation
#> <character>
#> 2485 MTVTLCSPTEDDWPGMFLLA..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome
A clinical collaborator has provided me with some information on a variant that we should be checking in some whole genome sequencing studies. The variant is a C->T transition located 12 nucleotides upstream of the eis gene. I would like to sanity check the provided information and identify an absolute coordinate within the reference genome.
The same collaborator as above is also interested in a mutation that has been described as a peptide change from A->E at amino acid number three of the pncA protein.
tb$get_cds(feature_id="pncA")
#> → extracting cds [pncA]
#> GRanges object with 1 range and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> 2094 AL123456 BX842572-BX.. 2288681-2289241 - | pncA
#> translation
#> <character>
#> 2094 MRALIIVDVQNDFCEGGSLA..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome