Pattern Genomics has kicked off a study involving enterovirus. To help our collaborator visualize the data, we have produced multiple sequence alignments in regions of interest. Multiple sequence alignments are complex to produce, and as every well-trained bioinformatician knows, no known efficient algorithm can guarantee an optimal alignment of a large data set in a reasonable amount of time. Heuristics are therefore employed. Nevertheless, multiple sequence alignments are frequently used to search for DNA sequences that are conserved across many samples and can serve as primer and probe locations. A common strategy is to find regions of conservation and design primers and probes within them, which helps ensure the sensitivity of the reaction as long as the aligned sequences are representative of the target and its variations. BLAST is then typically run on the primers and probes to help ensure specificity.
In our study, we used our Daydreamer platform to find regions of interest and then used multiple sequence alignment to visualize these short, ~100 bp segments. Restricting the alignments to short segments allowed us to align over 2,000 viral sequences with the built-in methods in CLC Sequence Viewer in a reasonable amount of time. But we noticed something interesting. Examine the fragment of the alignment below:
In this example, note that the short segment “GAAGAG” appears to be well conserved, with an extra base in a very small number of sequences introducing an extra column between “GAA” and “GAG” and just a few sequences varying at the tail end of “GAG”. But now look at the alignment below:
This alignment is generated from the exact same sequences in the same location, except that it includes one extra column to the left. In this view of the alignment, however, the gap between “GAA” and “GAG” appears to be more common and the entire segment “GAG” appears to be less well conserved. The culprit? The bottom several samples (among others, not shown) have the “GAG” segment shifted to the right relative to “GAG” and its variants in the samples higher up in the alignment. This is a simple artifact of the alignment heuristic and its treatment of the extra column, but it completely changes the user’s perception of how well the segment “GAAGAG” is conserved in the data.
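The effect can be reproduced with a toy example. The sketch below is illustrative only: the four short sequences and both gap placements are invented for this purpose, not taken from the enterovirus data, and the helper names `column_conservation` and `ungapped` are hypothetical. It shows that two alignments of identical sequences can report different per-column conservation purely because of where a gap is placed.

```python
from collections import Counter

def column_conservation(rows):
    """Fraction of rows sharing the most common character in each column."""
    return [Counter(col).most_common(1)[0][1] / len(rows) for col in zip(*rows)]

def ungapped(rows):
    """Strip gap characters to recover the underlying sequences."""
    return [row.replace("-", "") for row in rows]

# Two hypothetical alignments of the SAME four sequences.
# Only the gap placement in the last row differs.
alignment_a = ["GAA-GAG", "GAA-GAG", "GAATGAG", "GAA-GAG"]
alignment_b = ["GAA-GAG", "GAA-GAG", "GAATGAG", "GAAGAG-"]

same_sequences = ungapped(alignment_a) == ungapped(alignment_b)
conservation_a = column_conservation(alignment_a)
conservation_b = column_conservation(alignment_b)

print(same_sequences)
print(conservation_a)
print(conservation_b)
```

In the second alignment the shifted row lowers the apparent conservation of every column from the gap onward, even though no sequence has changed.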
For this reason, Pattern Genomics and its Daydreamer platform do not rely upon multiple sequence alignments to generate candidate primers and probes. Instead, we consider the overall frequency of oligo sequences (for example, “GAAGAG”) across the sample population. Our custom engineering of the Daydreamer platform in C++ allows us to handle hundreds of millions of candidate oligos efficiently. Our approach allows us to simultaneously assess the sensitivity and specificity of many candidate oligos on a genome-wide scale without the artifacts of multiple sequence alignment influencing our selections.
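As a rough illustration of the alignment-free idea, the following sketch counts the fraction of samples that contain each oligo of a given length. It is not the Daydreamer implementation, which is written in C++ and is not shown here. The function name `oligo_sample_frequency` and the sample sequences are assumptions made up for the example.

```python
from collections import Counter

def oligo_sample_frequency(sequences, k):
    """Map each k-mer to the fraction of sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each oligo is counted at most once per sample.
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(present)
    total = len(sequences)
    return {oligo: count / total for oligo, count in counts.items()}

# Hypothetical samples; the third carries an extra base between GAA and GAG.
samples = [
    "TTGAAGAGCC",
    "TTGAAGAGCA",
    "TTGAATGAGC",
    "ATGAAGAGCC",
]

freq = oligo_sample_frequency(samples, 6)
print(freq["GAAGAG"])  # 0.75
```

Because the count depends only on whether the oligo occurs in each sample, it is the same no matter how an aligner would have placed gaps around it.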
If you do rely on multiple sequence alignments to select primer and probe locations, we recommend re-aligning the data in potential target regions with different region boundaries and settings, and also removing any “outlier” samples that might confuse the alignment algorithm with a large number of indels and/or mismatches. This should give a clearer picture of the conservation of different regions in the majority of the sample sequences.
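One simple way to spot candidate outliers before re-aligning is to compare each aligned row with the column-wise consensus. The sketch below assumes the alignment is available as equal-length strings with “-” for gaps. The function name `flag_outlier_rows`, the 20% threshold, and the example rows are all hypothetical choices for illustration, not settings from CLC Sequence Viewer or Daydreamer.

```python
from collections import Counter

def flag_outlier_rows(aligned_rows, max_fraction=0.2):
    """Return indices of rows that differ from the column consensus
    (gaps included) in more than max_fraction of the columns."""
    if not aligned_rows or not aligned_rows[0]:
        return []
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    # Most common character in each column, gaps counted like any other symbol.
    consensus = [Counter(col).most_common(1)[0][0] for col in zip(*aligned_rows)]
    outliers = []
    for index, row in enumerate(aligned_rows):
        differences = sum(1 for a, b in zip(row, consensus) if a != b)
        if differences / length > max_fraction:
            outliers.append(index)
    return outliers

# Hypothetical alignment fragment; the last row has several indels and mismatches.
rows = [
    "GAA-GAGC",
    "GAA-GAGC",
    "GAA-GAGT",
    "GAATGAGC",
    "G-ATTCGC",
]

outliers = flag_outlier_rows(rows)
print(outliers)  # [4]
```

Rows flagged this way are candidates for removal before re-aligning. Because the consensus itself comes from the alignment, it is worth repeating the check after each re-alignment.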