Practical 2

Pairwise sequence alignment

Aims

To reinforce the basic concepts of pairwise sequence alignment described in the lectures.
To give practice in Perl programming.

Objectives

After this practical you will:

understand how filtering can reduce the noise in a dotplot;
understand how a dynamic programming algorithm finds an optimal pairwise sequence alignment;
understand the difference between global and local alignment algorithms;
be familiar with Perl programs for pairwise sequence alignment.

Exercises

Copy the example programs from directory /chalmers/users/kemp/MVE360/practical2.
The program dotplot.pl is incomplete. It should print the matching letter at row i, column j whenever the character at position i in the first sequence is the same as the character at position j in the second sequence, and a space otherwise. For the given values of $seq1 and $seq2 it should produce the following output:
```
D                D
 O O     O  OO  O
  R     R
 O O     O  OO  O
    T         T
     H         H
      Y
     H         H
 O O     O  OO  O
D                D
                  G
                   K
                    I
                     N
```
Complete program dotplot.pl so that it produces the desired output for any strings $seq1 and $seq2.
Modify the program dotplot.pl so that letters are only printed if the characters at positions i+1 and j+1 also match. Observe how this reduces the noise in the dotplot.
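The filtering idea can be sketched independently of dotplot.pl, whose internals are not shown here. In this minimal sketch the helper name dotplot_rows and the $window parameter are hypothetical: a window of 1 gives the unfiltered plot, and a window of 2 corresponds to also requiring a match at positions i+1 and j+1. Treating windows that run past the end of a sequence as non-matching is an assumption.

```perl
use strict;
use warnings;

# Return the dotplot as a list of text rows (hypothetical helper, not
# taken from dotplot.pl). A letter is placed at row $i, column $j only
# when $window consecutive characters agree, starting at $i in $seq1
# and at $j in $seq2.
sub dotplot_rows {
    my ($seq1, $seq2, $window) = @_;
    my @rows;
    for my $i (0 .. length($seq1) - 1) {
        my $row = '';
        for my $j (0 .. length($seq2) - 1) {
            my $w1 = substr($seq1, $i, $window);
            my $w2 = substr($seq2, $j, $window);
            # Windows that run off the end of a sequence are shorter
            # than $window and count as non-matching (an assumption).
            my $hit = length($w1) == $window && $w1 eq $w2;
            $row .= $hit ? substr($seq1, $i, 1) : ' ';
        }
        $row =~ s/\s+$//;    # drop trailing blanks
        push @rows, $row;
    }
    return @rows;
}

# Compare the unfiltered and filtered plots for two short test strings.
print "$_\n" for dotplot_rows('GATTACA', 'ATTAC', 1);
print "\n";
print "$_\n" for dotplot_rows('GATTACA', 'ATTAC', 2);
```

Isolated chance matches disappear in the second plot, while the diagonal runs that indicate real similarity survive, apart from their final letter.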
Modify the program global_alignment.pl so that an extra line of output is printed between the two aligned sequences, indicating exact matches with the character "|", e.g.
```
    AT-CGAT
    || || |
    ATACG-T
```
Modify the program global_alignment.pl so that the percent identity between the two sequences is written out.
Add a comment to your program explaining how you have decided to calculate the percent identity.
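Percent identity has no single agreed definition, which is why the choice needs a comment. The sketch below is one possible illustration and is not taken from global_alignment.pl: the helper names match_line and percent_identity and the convention labels 'alignment' and 'shorter' are hypothetical. It assumes the two aligned sequences are available as equal-length strings that use "-" for gaps.

```perl
use strict;
use warnings;

# Build the line printed between two aligned sequences: '|' where both
# rows hold the same residue, ' ' elsewhere (mismatch or gap).
sub match_line {
    my ($aligned1, $aligned2) = @_;
    my $line = '';
    for my $k (0 .. length($aligned1) - 1) {
        my $c1 = substr($aligned1, $k, 1);
        my $c2 = substr($aligned2, $k, 1);
        $line .= ($c1 eq $c2 && $c1 ne '-') ? '|' : ' ';
    }
    return $line;
}

# Percent identity under two possible conventions:
#   'alignment' : matches / number of alignment columns, gaps included
#   'shorter'   : matches / length of the shorter ungapped sequence
sub percent_identity {
    my ($aligned1, $aligned2, $convention) = @_;
    my $line    = match_line($aligned1, $aligned2);
    my $matches = ($line =~ tr/|//);    # count the '|' characters
    my $denominator;
    if ($convention eq 'alignment') {
        $denominator = length($aligned1);
    }
    else {
        (my $s1 = $aligned1) =~ tr/-//d;    # strip gaps from a copy
        (my $s2 = $aligned2) =~ tr/-//d;
        $denominator = length($s1) < length($s2) ? length($s1) : length($s2);
    }
    return 0 unless $denominator;
    return 100 * $matches / $denominator;
}

my ($a1, $a2) = ('AT-CGAT', 'ATACG-T');
print "    $a1\n    ", match_line($a1, $a2), "\n    $a2\n";
printf "%.1f%% of alignment columns\n", percent_identity($a1, $a2, 'alignment');
printf "%.1f%% of the shorter sequence\n", percent_identity($a1, $a2, 'shorter');
```

For the example alignment there are 5 matches in 7 columns, so the first convention gives 71.4%, while the second gives 83.3% because each ungapped sequence has 6 residues.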
Copy the program global_alignment.pl to the file local_alignment.pl. Modify this program so that it implements the Smith-Waterman algorithm for finding an optimal local alignment.
Test your program with the sequences "PAWHEAE" and "HDAGAWGHEQ".
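The changes that turn a global scoring scheme into a local one can be seen in a score-only sketch. This is not the code of global_alignment.pl: the function name and the $gap parameter are hypothetical, and the scores used in the example call are arbitrary. Three things differ from the global case: the first row and column are zero, no cell is allowed to go below zero, and the best score is taken from anywhere in the matrix rather than from the bottom-right corner.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Score of the best local alignment (score only, no traceback).
sub local_alignment_score {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    my ($n, $m) = (length($seq1), length($seq2));
    my @score;
    $score[$_][0] = 0 for 0 .. $n;    # first column: no gap penalties
    $score[0][$_] = 0 for 0 .. $m;    # first row: no gap penalties
    my $best = 0;
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $s = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                  ? $match : $mismatch;
            $score[$i][$j] = max(
                0,                               # start a new local alignment
                $score[$i - 1][$j - 1] + $s,     # diagonal
                $score[$i - 1][$j] + $gap,       # gap in $seq2
                $score[$i][$j - 1] + $gap,       # gap in $seq1
            );
            $best = max($best, $score[$i][$j]);  # best cell may be anywhere
        }
    }
    return $best;
}

print local_alignment_score('PAWHEAE', 'HDAGAWGHEQ', 1, -1, -2), "\n";
```

A full solution would also remember where the best cell is and trace back from there until a cell with score zero is reached.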
(2015-03-05: You do not need to do question 7 to be approved on Practical 2.)

Modify the program global_alignment.pl so that it counts the total number of optimal alignments for the two sequences.
Test your program with the sequences "ATTA" and "ATTTTA".
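One way to count optimal alignments, sketched here under assumptions, is to keep a second matrix alongside the score matrix that records how many optimal paths reach each cell. The function name and the $gap parameter are hypothetical, and the scores in the example call (match 1, mismatch -1, gap -1) are arbitrary and need not match those in global_alignment.pl.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Return (optimal global score, number of optimal alignments).
sub count_optimal_alignments {
    my ($seq1, $seq2, $match, $mismatch, $gap) = @_;
    my ($n, $m) = (length($seq1), length($seq2));
    my (@score, @count);
    # Along the edges there is exactly one path: all gaps.
    for my $i (0 .. $n) { $score[$i][0] = $i * $gap; $count[$i][0] = 1; }
    for my $j (0 .. $m) { $score[0][$j] = $j * $gap; $count[0][$j] = 1; }
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $s = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                  ? $match : $mismatch;
            # Each candidate is [score via this move, paths to its source cell].
            my @candidates = (
                [ $score[$i - 1][$j - 1] + $s,   $count[$i - 1][$j - 1] ],
                [ $score[$i - 1][$j]     + $gap, $count[$i - 1][$j] ],
                [ $score[$i][$j - 1]     + $gap, $count[$i][$j - 1] ],
            );
            my $best = max(map { $_->[0] } @candidates);
            $score[$i][$j] = $best;
            # Add up the path counts of every move that ties for the best.
            $count[$i][$j] = sum(map { $_->[1] }
                                 grep { $_->[0] == $best } @candidates);
        }
    }
    return ($score[$n][$m], $count[$n][$m]);
}

my ($score, $count) = count_optimal_alignments('ATTA', 'ATTTTA', 1, -1, -1);
print "score $score, $count optimal alignments\n";
```

With these assumed scores the two gaps can be placed against any two of the four T's in the longer sequence, so the count for this pair should be 6. The comparison with == relies on integer scores.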
Copy the program global_alignment.pl to the file levenshtein.pl. Modify this program so that it calculates the Levenshtein distance (edit distance) between the two sequences.
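The Levenshtein distance uses the same matrix layout as global alignment, but it minimises a cost instead of maximising a score: a match costs 0, and a substitution, insertion or deletion each cost 1. The following is a minimal standalone sketch with a hypothetical function name. It does not reuse any code from global_alignment.pl.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Minimum number of single-character edits turning $seq1 into $seq2.
sub levenshtein {
    my ($seq1, $seq2) = @_;
    my ($n, $m) = (length($seq1), length($seq2));
    my @dist;
    $dist[$_][0] = $_ for 0 .. $n;    # delete every character of the prefix
    $dist[0][$_] = $_ for 0 .. $m;    # insert every character of the prefix
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $cost = substr($seq1, $i - 1, 1) eq substr($seq2, $j - 1, 1)
                     ? 0 : 1;
            $dist[$i][$j] = min(
                $dist[$i - 1][$j - 1] + $cost,    # match or substitution
                $dist[$i - 1][$j] + 1,            # deletion
                $dist[$i][$j - 1] + 1,            # insertion
            );
        }
    }
    return $dist[$n][$m];
}

print levenshtein('ATTA', 'ATTTTA'), "\n";    # two insertions
```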
File substitution_matrix.pl contains a piece of Perl code that initialises an associative array with values for a simple substitution matrix for aligning a pair of DNA sequences:
```
%substitution_matrix = (
  "AA" =>  2, "AC" => -1, "AG" => -1, "AT" => -1,
  "CA" => -1, "CC" =>  2, "CG" => -1, "CT" => -1,
  "GA" => -1, "GC" => -1, "GG" =>  2, "GT" => -1,
  "TA" => -1, "TC" => -1, "TG" => -1, "TT" =>  2,
);
```
Copy the program global_alignment.pl to the file dna.pl. Add the code that initialises the substitution matrix associative array to this file, and take the diagonal scores from this associative array instead of from the variables $MATCH and $MISMATCH.
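Looking up a score amounts to joining the two residues into a two-letter key. The sketch below shows only the lookup. The helper name diagonal_score is hypothetical, and stopping with an error for a pair that is missing from the matrix (for example one involving "N") is an assumption about the desired behaviour.

```perl
use strict;
use warnings;

my %substitution_matrix = (
  "AA" =>  2, "AC" => -1, "AG" => -1, "AT" => -1,
  "CA" => -1, "CC" =>  2, "CG" => -1, "CT" => -1,
  "GA" => -1, "GC" => -1, "GG" =>  2, "GT" => -1,
  "TA" => -1, "TC" => -1, "TG" => -1, "TT" =>  2,
);

# Score for aligning residue $c1 with residue $c2.
sub diagonal_score {
    my ($c1, $c2) = @_;
    my $key = $c1 . $c2;    # e.g. "A" and "G" give the key "AG"
    die "No score for pair '$key'\n"
        unless exists $substitution_matrix{$key};
    return $substitution_matrix{$key};
}

print diagonal_score('A', 'A'), " ", diagonal_score('A', 'G'), "\n";
```

Inside the alignment loop the two arguments would be the characters at the current positions of the two sequences, extracted for instance with substr.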

Modify the substitution matrix to reflect that transitions are more common than transversions.
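A transition swaps two purines (A and G) or two pyrimidines (C and T); a transversion swaps a purine with a pyrimidine. Rather than typing sixteen values by hand, the matrix can be generated from that rule, as in the sketch below. The three score values are arbitrary illustrations chosen only so that transitions are penalised less than transversions, and the variable names $TRANSITION and $TRANSVERSION are hypothetical.

```perl
use strict;
use warnings;

my %is_purine = (A => 1, G => 1, C => 0, T => 0);

# Illustrative values only: any choice with
# $MATCH > $TRANSITION > $TRANSVERSION expresses the same idea.
my ($MATCH, $TRANSITION, $TRANSVERSION) = (2, -1, -2);

my %substitution_matrix;
for my $x (qw(A C G T)) {
    for my $y (qw(A C G T)) {
        $substitution_matrix{"$x$y"} =
              $x eq $y                         ? $MATCH
            : $is_purine{$x} == $is_purine{$y} ? $TRANSITION      # A<->G, C<->T
            :                                    $TRANSVERSION;   # purine<->pyrimidine
    }
}

# Print the matrix as a 4 x 4 table.
for my $x (qw(A C G T)) {
    print join(" ", map { sprintf "%2d", $substitution_matrix{"$x$_"} } qw(A C G T)), "\n";
}
```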

What to demonstrate or hand in

Demonstrate your versions of these programs with suitable test data:

dotplot.pl with the modifications described in exercises 2 and 3;

global_alignment.pl with the modifications described in exercises ~~4, 5 and 7~~ 4 and 5 (2015-03-05);

local_alignment.pl as described in exercise 6.

Ensure that your names are included in a comment in your program.

Last Modified: 5 March 2015 by Graham Kemp