Lesson 2_7: Phylogenetic Trees

Assignment 2_7

Read the background information on phylogeny concepts and programs. As you will learn from the reading, many algorithms for identifying phylogenetic trees depend on the order in which sequences are analyzed. The program Phylip is freely available, and can be used to explore this issue.

Q: Is the order of the sequences in the data file important for parsimony programs?

  1. Download Phylip from its WWW site. You should also get the documentation files.
  2. Use an ASCII text editor and the data in all14d.aln to create an input file for Phylip (read the sequence.doc file to learn about the proper format). This data file is the interleaved output from Clustal W; it contains 65 sequences, each 157 characters long. The data are unpublished, but they are real data for a family of bacterial transcription factors. Note that since I used this file to generate the outputs described below, I have come to believe that a better alignment might be obtained by deleting the first and last characters of all sequences, as they are gaps in all but 2 of the sequences. (If you are really frustrated, here is my first, interleaved data file edited for Phylip, and here is the second, non-interleaved version without the gaps.)
  3. Run the Protpars.exe program to determine a maximum parsimony tree. For this first run, set the control options so that the sequences are analyzed in the order in which they appear in your data file.
  4. Rename the output files so that you will not overwrite them in the steps that follow.
  5. Run the Phylip program Drawtree.exe, choosing to make a PostScript file, and then use Ghostview to visualize your tree.
  6. Run Protpars.exe a second time, but use option J to have it randomize the sequence order 10 times (this may need to run overnight; it takes ~1 hour on a 75 MHz Pentium running in the background under Windows NT).
  7. Examine the output files, using Drawtree.exe and Ghostview as well as a text editor.
  8. Write an HTML document describing your exercise, explaining what the outcome was, and how you interpret it.
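
If editing the alignment by hand in step 2 becomes tedious, the conversion can be scripted. The following is only a minimal sketch in Python, not part of Phylip or Clustal W. It assumes the usual Clustal W layout (a header line beginning with CLUSTAL, blocks of "name sequence" lines, and a conservation line that starts with blanks) and the Phylip layout of a count line followed by names padded to 10 columns; check sequence.doc before relying on it. The function names parse_clustal, jumble, and to_phylip are made up for this illustration. The jumble helper simply writes the sequences in a shuffled order so that you can prepare reordered input files yourself; it does not reproduce what option J does inside Protpars.

```python
import random


def parse_clustal(text):
    """Collect sequences from interleaved Clustal W text, keeping input order."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the conservation line
        # (the conservation line starts with blanks).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]  # a trailing residue count is ignored
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def jumble(seqs, seed):
    """Return the same sequences in a reproducibly shuffled order."""
    names = list(seqs)
    random.Random(seed).shuffle(names)
    return {name: seqs[name] for name in names}


def to_phylip(seqs, width=60):
    """Format sequences as non-interleaved (sequential) Phylip text."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not all the same length")
    if len({name[:10] for name in seqs}) != len(seqs):
        raise ValueError("names are not unique in their first 10 characters")
    nchar = lengths.pop()
    out = [f"{len(seqs):5d} {nchar:5d}"]
    for name, seq in seqs.items():
        # Names occupy exactly 10 columns: truncate or pad with blanks.
        out.append(f"{name[:10]:<10}{seq[:width]}")
        for i in range(width, nchar, width):
            out.append(seq[i:i + width])
    return "\n".join(out) + "\n"
```

To use it, read all14d.aln into a string, pass it to parse_clustal, and write the result of to_phylip to a new file. Passing the parsed sequences through jumble with different seeds gives input files that differ only in sequence order, which is exactly the variable that the question above asks about.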

A better strategy: Distance Matrix Trees

A better strategy might be to use a distance method for determining the tree, given the large number of sequences. Why is this so?
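
To make the idea of a distance matrix concrete, here is a small Python sketch. It uses the simplest possible measure, the fraction of aligned positions at which two sequences differ (often called the p-distance), skipping columns where either sequence has a gap. This is an assumption made for illustration only; the distance programs in Phylip use evolutionary models that are more elaborate than this. Note that the matrix needs only one value per pair of sequences, so 65 sequences give 65 x 64 / 2 = 2080 pairwise distances.

```python
def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns in common")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Return {(name1, name2): distance} for every ordered pair of sequences."""
    names = list(seqs)
    return {(m, n): p_distance(seqs[m], seqs[n]) for m in names for n in names}
```

A tree-building method such as neighbor-joining then works from this table of numbers rather than from the sequences themselves.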

Confident About Your Groups?

(Don't try this exercise on the lab computers; it is therefore optional.)

One method of estimating your confidence in the assigned groups is to perform a bootstrap on the original data file, determine a distance matrix for each bootstrapped data set, use the neighbor-joining program to establish groups for each data set, and then use the Consensu.exe program to derive the majority-rule consensus tree for all of the bootstrapped data sets. Doing so for 610 bootstrapped data sets took several days on the Pentium PC mentioned above. Note that the final tree here represents the bootstrap frequencies, not the evolutionary distances: the internode distances reflect the bootstrap frequencies, while the final branch leading to each sequence has the same length for all sequences, chosen only to spread out the names so that they can be read.
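
The resampling step at the heart of the bootstrap can be sketched in a few lines of Python. This is only an illustration of the idea, under the assumption that a bootstrapped data set is made by drawing alignment columns at random, with replacement, until the new alignment is as long as the original. Use Phylip's own programs for the real exercise. The function names and the seed parameter are invented for this example.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement.

    The same randomly chosen columns are used for every sequence,
    so each resampled column is still a real column of the alignment.
    """
    nchar = len(next(iter(seqs.values())))
    cols = [rng.randrange(nchar) for _ in range(nchar)]
    return {name: "".join(seq[c] for c in cols) for name, seq in seqs.items()}


def bootstrap_replicates(seqs, n, seed=1):
    """Return n bootstrapped data sets; the seed makes the run repeatable."""
    rng = random.Random(seed)
    return [bootstrap_alignment(seqs, rng) for _ in range(n)]
```

Each replicate would then get its own distance matrix and tree, and the consensus step counts how often each group appears among those trees.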

Alternate (or additional optional) Exercise

The NCBI now shows data coming in from the effort to sequence the genome of Pseudomonas aeruginosa. If you do a BLAST search for WPGNVREL, a highly conserved motif in the sigma 54-dependent transcriptional activators that was used for one of the PCR primers used to obtain many of the sequences in the file all14d.aln, there are 12 hits. You can download the sequence contigs that are presently available, and use BLAST and a translation program to determine the amino acid sequence of the 12 putative activators. (Use the WordPad editor to access each of the 12 contigs; then cut and paste them into the BLAST search site.) After translating the sequences, you could eliminate any that were already in all14.dnd, add the new ones, and use ClustalX with Phylip to see if they cluster with any previously known genetic functions. Your BLAST search may also turn up some predicted functional genes that these activators are regulating. (Can you find the system that is a very likely candidate for regulating the expression of a transport gene? It may help to confine your search to sequences that flank the sigma 54-dependent activator region by removing the latter.) Also, you could use the program SEQSCAN that I provide to look for sigma 54-dependent promoters that may or may not be adjacent to IHF binding sites. Any such promoters should mark a point in the sequence close to the beginning of regulated genes, and indicate the direction of transcription. If you find any novel relationships, you are among the first in the world to know about them!
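
If you do not have a translation program at hand, the translation step can be sketched in Python as follows. This is a minimal illustration, not a replacement for BLAST or a dedicated tool. It assumes the standard genetic code, treats any codon containing an ambiguous base as X, and translates each of the six reading frames from end to end without looking for start or stop codons. The function names are invented for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_motif(dna, motif="WPGNVREL"):
    """Return (strand, frame, protein_offset) for each motif hit in six frames."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(motif)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(motif, pos + 1)
    return hits
```

Running find_motif on a contig reports the strand and reading frame in which the motif occurs, which tells you which of the six translations to carry forward into the alignment.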

