library(floundeR)
#> floundeR v0.0.5

Genbank parsing

The Genbank file format is a standard for the sharing and distribution of whole genome sequences with their accompanying annotations. The Genbank parser in floundeR may not be the most technically competent parser and have been implemented with a couple of ambitions that are related to the analysis and identification of genetic variants in microbial genomes. This will undoubtedly change with time …

Load the Genbank flatfile

There is an R6 object defined in floundeR called GenbankGenome. Let’s identify the packaged genbank file that corresponds to a Tuberculosis reference genome and create the corresponding object.

TB_reference = flnDr("TB_H37Rv.gb.gz")
tb <- GenbankGenome$new(TB_reference)
#> ℹ Parsing genbank file [/tmp/Rtmpqr71Kn/temp_libpath14ba6f8cb076/floundeR/extdata/TB_H37Rv.gb.gz]
#> → processing sequence [73527] lines
#> → setting sequence
#> ✓ Done! [213179] lines parsed

This has created a TB GenbankGenome object. A number of functions are provided by the object to access the annotation content of this genome. The fundamental methods provided a briefly introduced below.

Accessing the CDS annotations

The CDS features annotated in the Genbank file have been collated as a GenomicRanges object. This can be accessed using

tb$list_cds()
#> GRanges object with 4031 ranges and 2 metadata columns:
#>                        seqnames          ranges strand |        gene
#>                           <Rle>       <IRanges>  <Rle> | <character>
#>      [1] AL123456 BX842572-BX..          1-1524      + |        dnaA
#>      [2] AL123456 BX842572-BX..       2052-3260      + |        dnaN
#>      [3] AL123456 BX842572-BX..       3280-4437      + |        recF
#>      [4] AL123456 BX842572-BX..       4434-4997      + |      Rv0004
#>      [5] AL123456 BX842572-BX..       5240-7267      + |        gyrB
#>      ...                    ...             ...    ... .         ...
#>   [4027] AL123456 BX842572-BX.. 4408334-4408897      - |     Rv3920c
#>   [4028] AL123456 BX842572-BX.. 4408969-4410069      - |     Rv3921c
#>   [4029] AL123456 BX842572-BX.. 4410053-4410415      - |     Rv3922c
#>   [4030] AL123456 BX842572-BX.. 4410412-4410789      - |        rnpA
#>   [4031] AL123456 BX842572-BX.. 4410786-4410929      - |        rpmH
#>                     translation
#>                     <character>
#>      [1] MTDDPGSGFTTVWNAVVSEL..
#>      [2] MDAATTRVGLTDLTFRLLRE..
#>      [3] MYVRHLGLRDFRSWACVDLE..
#>      [4] MTGSVDRPDQNRGERSMKSP..
#>      [5] MAAQKKKAQDEYGAASITIL..
#>      ...                    ...
#>   [4027] MADADTTDFDVDAEAPGGGV..
#>   [4028] MSLLFDFFSLDFIYYPVSWI..
#>   [4029] MSLSRQSCGRVVRVTGRASA..
#>   [4030] MIATPGLFAVLRARNRMRRS..
#>   [4031] MTKGKRTFQPNNRRRARVHG..
#>   -------
#>   seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome

or for atomic entries (I am interested in the gene identified as eis we can use the more targetted method method

tb$get_cds(feature_id="eis")
#> → extracting cds [eis]
#> GRanges object with 1 range and 2 metadata columns:
#>                      seqnames          ranges strand |        gene
#>                         <Rle>       <IRanges>  <Rle> | <character>
#>   2485 AL123456 BX842572-BX.. 2714124-2715332      - |         eis
#>                   translation
#>                   <character>
#>   2485 MTVTLCSPTEDDWPGMFLLA..
#>   -------
#>   seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome

Accessing the genome sequence

The genome sequence from the Genbank entry can be accessed using an active binding called sequence.

tb$sequence
#> 4411532-letter DNAString object
#> seq: TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTG...AGACCCGGAACTAACGAGAACCAGGGAGATACGTCG

Case studies

Variant analysis in tuberculosis

A clinical collaborator has provided me with some information on a variant that we should be checking in some whole genome sequencing studies. The variant is a C->T transition located 12 nucleotides upstream of the eis gene. I would like to sanity check the provided information and identify an absolute coordinate within the reference genome.

Variant definition in tubeculosis

The same collaborator as above is also interested in a mutation that has been described as a peptide change from A->E at amino acid number three of the pncA protein.

tb$get_cds(feature_id="pncA")
#> → extracting cds [pncA]
#> GRanges object with 1 range and 2 metadata columns:
#>                      seqnames          ranges strand |        gene
#>                         <Rle>       <IRanges>  <Rle> | <character>
#>   2094 AL123456 BX842572-BX.. 2288681-2289241      - |        pncA
#>                   translation
#>                   <character>
#>   2094 MRALIIVDVQNDFCEGGSLA..
#>   -------
#>   seqinfo: 1 sequence from Mycobacterium tuberculosis H37Rv complete genome. genome