Visualizing Bioinformatics: Crafting Engaging Scientific Graphics
Written on
Creating captivating and informative graphics in the field of bioinformatics can dispel the stereotype that science is dull and overly complex. With the right visualizations, raw DNA data can be transformed into clear and striking illustrations that convey substantial information.
Biopython stands out as the leading open-source bioinformatics toolkit for Python. It offers tailored classes and methods, supports standard biological file formats, and allows integration with other bioinformatics tools like BLAST, EMBOSS, and Clustalw. This article will showcase how Biopython can generate impressive visuals from biological data.
We will begin by examining Biopython's capabilities for manipulating DNA, RNA, and protein sequences. Following that, we will escalate in complexity to encompass entire genomes and how these genomes evolve over time through phylogenetic tree representations. This journey will enable you to analyze any genomes of interest and derive insights from the raw data!
Sequences
Biopython features a Seq (Sequence) base class, which is akin to Python's built-in string class but is enriched with additional methods for biological applications. For instance, we can import a DNA sequence as follows:
>>> my_sequence = Seq('AGTCC')
You can utilize traditional string methods on this object, such as iterating through each character or employing len() to determine the length of the sequence.
# Print length >>> print(len(my_sequence)) 5
# Print each basepair in the sequence >>> for basepair in my_sequence: ... print(basepair) A G T C C
You can also access basepairs by their index and slice the Sequence as you would normally. Slicing produces a new Seq() object, which can be stored in a new variable or replace the original sequence.
# Select basepair by index >>> my_sequence[2] 'T'
# Slice sequence (returns a new Seq() object) >>> my_sequence[1:4] Seq('GTC')
Next, we will explore some additional biologically significant methods available in the Seq() class. You can easily create the complement and reverse complement of a DNA sequence, as well as calculate the proportion of G and C basepairs within the sequence.
Complement
During DNA replication, each strand separates to synthesize the other strand. Adenine (A) pairs with thymine (T), while guanine (G) pairs with cytosine (C). Thus, a sequence like 'ACTG' will have a complementary sequence of 'TGAC'. For RNA, uracil (U) replaces thymine (T), leading to an RNA sequence of 'UGAC' derived from the DNA sequence 'ACTG'.
>>> my_sequence = Seq('AGTCC')
# Generate complement sequence >>> my_sequence.complement() Seq('TCAGG')
Reverse Complement
In the transcription process, DNA is read from the template strand running 3' to 5', and the reverse complement provides the mRNA sequence. However, we typically analyze DNA from the coding strand, which runs 5' to 3'.
Although the biological process involves two steps—first obtaining the reverse complement of the template strand, then transcribing it into RNA—we can simplify it by working directly from the coding strand and substituting T basepairs for U. Both methods yield an mRNA sequence.
# Generate reverse complement sequence >>> my_sequence.reverse_complement() Seq('GGACT')
# Two-step solution from template strand >>> coding_strand = Seq('ATGTAG') >>> template_strand = coding_strand.reverse_complement() >>> template_strand.reverse_complement().transcribe() Seq('AUGUAG')
# From coding strand >>> coding_strand = Seq('ATGTAG') >>> coding_strand.transcribe() Seq('AUGUAG')
GC-content
Biopython can also compute the ratio of GC bases to AT (or U) bases in a DNA or RNA sequence. This information is biologically significant since the GC-content varies between species, making it useful for identifying unknown genomes by comparing them to known species' GC-content.
Within a genome, coding regions (often genes) typically exhibit a higher GC-content compared to non-coding regions. Therefore, GC-content can assist in identifying genes within an unknown genome.
# Calculate GC proportion >>> from Bio.SeqUtils import GC >>> GC(my_sequence) 60.0
Translation
After generating mRNA from DNA, the next step is to translate it into a protein sequence! Similar to transcription, Biopython provides various methods for this, whether using the biologically accurate approach or directly translating from the coding strand DNA.
# In the biologically correct way after creating the mRNA sequence >>> mRNA = Seq('AUGGCCAUUGUA') >>> mRNA.translate() Seq('MAIV')
# From the coding strand directly >>> coding_strand = Seq('ATGGCCATTGTA') >>> coding_strand.translate() Seq('MAIV')
It’s crucial to note that different taxonomic groups require distinct tables for accurately synthesizing protein sequences from mRNA. Refer to NCBI resources on genetic codes for more information. A keyword argument can be specified in the .translate() method, such as .translate(table="Vertebrate Mitochondrial").
Genomic Data
Having understood the sequence manipulation in Biopython, let's elevate our focus to visualizing entire genomes. The GenomeDiagram module is specifically designed for depicting genomes in either linear or circular formats (the latter being common for most prokaryotes).
Biopython includes several sample genomes available for download from GitHub. You can also obtain genomes of interest from GenBank, the NIH genetic sequence database, in various file formats like .gbk or .fasta.
# Import Libraries from reportlab.lib import colors from reportlab.lib.units import cm from Bio.Graphics import GenomeDiagram from Bio import SeqIO from Bio.SeqFeature import SeqFeature, FeatureLocation
# Read in our genome record = SeqIO.read("NC_005816.gb", "genbank")
gd_diagram = GenomeDiagram.Diagram(record.id) gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features") gd_feature_set = gd_track_for_features.new_set()
# Color code every other gene for clarity for feature in record.features:
if feature.type != "gene":
continuecolor = colors.blue if len(gd_feature_set) % 2 == 0 else colors.lightblue
gd_feature_set.add_feature(
feature, sigil="ARROW", color=color, label=True, label_size=14, label_angle=0)
# Add restriction sites for popular enzymes for site, name, color in [
("GAATTC", "EcoRI", colors.green),
("CCCGGG", "SmaI", colors.orange),
("AAGCTT", "HindIII", colors.red),
("GGATCC", "BamHI", colors.purple)]:
index = 0
while True:
index = record.seq.find(site, start=index)
if index == -1:
breakfeature = SeqFeature(FeatureLocation(index, index + len(site)))
gd_feature_set.add_feature(
feature,
color=color,
name=name,
label=True,
label_size=10,
label_color=color)
index += len(site)
# Create our diagram! gd_diagram.draw(
format="circular",
circular=True,
pagesize=(20 * cm, 20 * cm),
start=0,
end=len(record),
circle_core=0.5)
# Save to any standard photo format gd_diagram.write("plasmid_circular_nice1.png", "PNG")
Now, let's embark on crafting beautiful visualizations! The genome for Yersinia pestis biovar Microtus, an avirulent strain of the bubonic plague, developed as a vaccine, is illustrated here. Genes are represented in blue and light blue, with restriction enzyme sites marked by colored lines.
Visualizing a more intricate genome, the chloroplast of thale cress (Arabidopsis thaliana), showcases many more genes and a greater number of EcoRI restriction sites. This genome comprises 154,478 base pairs, while the previous example had only 9,609 base pairs.
This is a thale cress, a model organism in genetics. By specifying the genome from GenBank, we can apply the same foundational code to generate a circular DNA sequence.
# Specify the thale cress chloroplast genome record = SeqIO.read("NC_000932.gb", "genbank")
This represents the peak complexity for genome presentation while maintaining clarity. Moreover, it is visually striking. Indeed, by adding more restriction enzyme sites, as done in the first genome, we can produce even more beautiful visualizations.
Phylogeny
Having journeyed from raw DNA sequences to entire genomes, we recognize that genomes are not as stable as one might hope. Mutations occur frequently, especially in viruses, posing a significant challenge in vaccine development due to their continually changing genomes.
By comparing new sequences to a reference sequence, we can monitor how variants emerge over time. For the SARS-CoV-2 (Covid-19) virus, the reference sequence is the original strain from Wuhan, China.
Nextstrain provides several packages to visualize phylogeny, enabling public health officials and researchers to analyze new Covid-19 strains within their local communities. Animated visuals can be built to represent various features, including the number of mutations, country of origin, genotype, and clades.
Refer to this guide to create your own Covid-19 phylogeny. Setting up a new conda environment for Nextstrain is advisable, as it involves downloading numerous dependencies.
Conclusion
Biopython is an essential library for bioinformatics in Python. This article has provided a glimpse into its capabilities for analyzing sequences, transcribing and translating DNA to mRNA to protein sequences, visualizing complete linear and circular genomes, and depicting chromosomes. Future articles will delve deeper into Biopython applications.
Biopython boasts comprehensive documentation and can utilize NCBI resources, including whole genomes, variant strain genomes, and chromosomes via FASTA files. Experiment with genomes that pique your interest and see what you can create! You might end up with an effective and stunning visualization.
Connect
I am eager to connect and explore collaborative projects! Feel free to follow me on GitHub or LinkedIn, and check out my other stories on Medium. You can also find me on Twitter!