genbank_parsing.Rmd
library(floundeR)
#> floundeR v0.0.5The Genbank file format is a standard for the sharing and distribution of whole genome sequences with their accompanying annotations. The Genbank parser in floundeR may not be the most technically competent parser and have been implemented with a couple of ambitions that are related to the analysis and identification of genetic variants in microbial genomes. This will undoubtedly change with time …
There is an R6 object defined in floundeR called GenbankGenome. Let’s identify the packaged genbank file that corresponds to a Tuberculosis reference genome and create the corresponding object.
TB_reference = flnDr("TB_H37Rv.gb.gz")
tb <- GenbankGenome$new(TB_reference)
#> ℹ Parsing genbank file [/tmp/Rtmpqr71Kn/temp_libpath14ba6f8cb076/floundeR/extdata/TB_H37Rv.gb.gz]
#> → processing sequence [73527] lines
#> → setting sequence
#> ✓ Done! [213179] lines parsedThis has created a TB GenbankGenome object. A number of functions are provided by the object to access the annotation content of this genome. The fundamental methods provided a briefly introduced below.
The CDS features annotated in the Genbank file have been collated as a GenomicRanges object. This can be accessed using
tb$list_cds()
#> GRanges object with 4031 ranges and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> [1] AL123456 BX842572-BX.. 1-1524 + | dnaA
#> [2] AL123456 BX842572-BX.. 2052-3260 + | dnaN
#> [3] AL123456 BX842572-BX.. 3280-4437 + | recF
#> [4] AL123456 BX842572-BX.. 4434-4997 + | Rv0004
#> [5] AL123456 BX842572-BX.. 5240-7267 + | gyrB
#> ... ... ... ... . ...
#> [4027] AL123456 BX842572-BX.. 4408334-4408897 - | Rv3920c
#> [4028] AL123456 BX842572-BX.. 4408969-4410069 - | Rv3921c
#> [4029] AL123456 BX842572-BX.. 4410053-4410415 - | Rv3922c
#> [4030] AL123456 BX842572-BX.. 4410412-4410789 - | rnpA
#> [4031] AL123456 BX842572-BX.. 4410786-4410929 - | rpmH
#> translation
#> <character>
#> [1] MTDDPGSGFTTVWNAVVSEL..
#> [2] MDAATTRVGLTDLTFRLLRE..
#> [3] MYVRHLGLRDFRSWACVDLE..
#> [4] MTGSVDRPDQNRGERSMKSP..
#> [5] MAAQKKKAQDEYGAASITIL..
#> ... ...
#> [4027] MADADTTDFDVDAEAPGGGV..
#> [4028] MSLLFDFFSLDFIYYPVSWI..
#> [4029] MSLSRQSCGRVVRVTGRASA..
#> [4030] MIATPGLFAVLRARNRMRRS..
#> [4031] MTKGKRTFQPNNRRRARVHG..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genomeor for atomic entries (I am interested in the gene identified as eis we can use the more targetted method method
tb$get_cds(feature_id="eis")
#> → extracting cds [eis]
#> GRanges object with 1 range and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> 2485 AL123456 BX842572-BX.. 2714124-2715332 - | eis
#> translation
#> <character>
#> 2485 MTVTLCSPTEDDWPGMFLLA..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genomeA clinical collaborator has provided me with some information on a variant that we should be checking in some whole genome sequencing studies. The variant is a C->T transition located 12 nucleotides upstream of the eis gene. I would like to sanity check the provided information and identify an absolute coordinate within the reference genome.
The same collaborator as above is also interested in a mutation that has been described as a peptide change from A->E at amino acid number three of the pncA protein.
tb$get_cds(feature_id="pncA")
#> → extracting cds [pncA]
#> GRanges object with 1 range and 2 metadata columns:
#> seqnames ranges strand | gene
#> <Rle> <IRanges> <Rle> | <character>
#> 2094 AL123456 BX842572-BX.. 2288681-2289241 - | pncA
#> translation
#> <character>
#> 2094 MRALIIVDVQNDFCEGGSLA..
#> -------
#> seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome