Bioinformatics (2011/2012)

# Practical 6

## Pairwise alignment and Perl revision

### Aims

• To revise the concepts of pairwise sequence alignment.
• To give practice in using and writing simple Perl scripts.

### Objectives

After this practical you will:

• be able to apply a dynamic programming algorithm for pairwise sequence alignment;
• be able to use and write simple Perl scripts.

### Exercises

1. (This question does not involve programming.)

Using a gap score of -2 and match/mismatch scores taken from the PAM250 substitution matrix (given below), derive the score matrix for a global alignment of "GFQW" with "GNW".

In this case, what is the score of an optimal global alignment?
Give the alignment(s) with this score.

PAM250 substitution matrix:

```
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  2
R -2  6
N  0  0  2
D  0 -1  2  4
C -2 -4 -4 -5  4
Q  0  1  1  2 -5  4
E  0 -1  1  3 -5  2  4
G  1 -3  0  1 -3 -1  0  5
H -1  2  2  1 -3  3  1 -2  6
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
```
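
The score matrix in question 1 follows the usual global-alignment recurrence: each cell is the best of the diagonal neighbour plus the substitution score, or the cell above or to the left plus the gap score, with the first row and column initialised to multiples of the gap score. As a rough illustration only, the sketch below fills such a matrix for two made-up DNA strings with a toy scoring scheme (+1 match, -1 mismatch) rather than PAM250, so it is not the answer to the exercise. The sub names `toy_score` and `global_score_matrix` are hypothetical.

```perl
use strict;
use warnings;

# Toy substitution score (an assumption for illustration; the exercise uses PAM250).
sub toy_score {
    my ($p, $q) = @_;
    return $p eq $q ? 1 : -1;
}

# Fill the global alignment score matrix for strings $x and $y.
# $gap is the (negative) linear gap score; $score is a code ref s(p, q).
sub global_score_matrix {
    my ($x, $y, $gap, $score) = @_;
    my @x = split //, $x;
    my @y = split //, $y;
    my ($n, $m) = (scalar @x, scalar @y);
    my @F;
    $F[$_][0] = $_ * $gap for 0 .. $n;    # first column: leading gaps in $y
    $F[0][$_] = $_ * $gap for 0 .. $m;    # first row: leading gaps in $x
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $diag = $F[$i-1][$j-1] + $score->($x[$i-1], $y[$j-1]);
            my $up   = $F[$i-1][$j] + $gap;
            my $left = $F[$i][$j-1] + $gap;
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $F[$i][$j] = $best;
        }
    }
    return \@F;
}

my $F = global_score_matrix("ACGT", "AGT", -2, \&toy_score);
print join("\t", @$_), "\n" for @$F;
print "Optimal global score: $F->[4][3]\n";
```

The bottom-right cell holds the optimal global score; the alignment(s) are recovered by tracing back from that cell through every neighbour that could have produced each value.
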

2. A five-residue motif, "GPGXX" (where X stands for any residue), is believed to be important for the elastic properties of spider silk. This motif occurs 64 times in UniProt entry SPD2_NEPCL (see /chalmers/users/kemp/MVE360/practical6/SPD2_NEPCL).

1. Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the number of occurrences of this motif in the UniProt entry.
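
One possible starting point, offered as a sketch rather than a model answer: in a UniProt flat file the sequence lines sit between the `SQ` header line and the `//` terminator, so they can be concatenated with the whitespace removed and then scanned with a regular expression. The sub names and the in-memory toy entry below are made up for illustration, and the count assumes that matches do not overlap.

```perl
use strict;
use warnings;

# Return the sequence from a filehandle opened on a UniProt flat-file entry.
# Sequence lines follow the "SQ" header line and stop at the "//" terminator.
sub read_uniprot_sequence {
    my ($fh) = @_;
    my ($in_seq, $seq) = (0, '');
    while (my $line = <$fh>) {
        if    ($line =~ /^SQ\s/) { $in_seq = 1 }
        elsif ($line =~ m{^//})  { last }
        elsif ($in_seq)          { $line =~ s/\s+//g; $seq .= $line }
    }
    return $seq;
}

# Count non-overlapping matches of "GPG" followed by any two residues.
sub count_motif {
    my ($seq) = @_;
    my $count = () = $seq =~ /GPG../g;
    return $count;
}

# Demonstration on a made-up entry held in memory (not real UniProt data).
my $entry = <<'END';
ID   TOY_ENTRY   Reviewed;   20 AA.
SQ   SEQUENCE   20 AA;  0 MW;  0000000000000000 CRC64;
     AAGPGQQGPG YAGPGSAAAA
//
END
open my $fh, '<', \$entry or die "Cannot open in-memory file: $!";
print count_motif(read_uniprot_sequence($fh)), "\n";
```

In the real program the filehandle would be opened on the file named in `$ARGV[0]` instead of the in-memory string. If overlapping occurrences should also be counted, a lookahead such as `/(?=GPG..)/g` is one alternative.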

2. Suppose we want a program that finds what actual amino acid residues correspond to the "XX" positions in this motif (e.g. "QQ", "GY", etc.), counts how often each "XX" pair occurs, and prints the most common pair. The output of the program should look as follows:

```
QQ 25
RY 1
SA 12
SQ 1
GY 24
IA 1
Most common pair is QQ
```

Write a Perl program that performs this task.
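
The counting step maps naturally onto a Perl hash keyed by the two captured residues. The fragment below is a minimal sketch under the same assumption of non-overlapping matches; it works on a made-up sequence string, and the sub names `tally_pairs` and `most_common_pair` are hypothetical.

```perl
use strict;
use warnings;

# Tally the two residues that follow each "GPG" (non-overlapping matches).
sub tally_pairs {
    my ($seq) = @_;
    my %count;
    $count{$1}++ while $seq =~ /GPG(..)/g;
    return \%count;
}

# Return the pair with the highest count (ties broken alphabetically).
sub most_common_pair {
    my ($count) = @_;
    my ($best) = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    return $best;
}

my $pairs = tally_pairs("AAGPGQQGPGYAGPGQQAAA");    # made-up sequence
print "$_ $pairs->{$_}\n" for sort keys %$pairs;
print "Most common pair is ", most_common_pair($pairs), "\n";
```

The sketch sorts the keys only to make its output reproducible. The sample output above is not in sorted order, which is what an unsorted loop over `keys` would typically give.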

3. (Please don't spend long on this question. I don't want to see your solution to it. However, I would like you to think about how you would perform this task using Perl.)

G-protein coupled receptors (GPCRs) are membrane proteins with seven alpha-helical transmembrane (TM) regions. GPCRs can be classified as rhodopsin-like receptors (class I), secretin-like receptors (class II) or metabotropic-like receptors (class III) based on the sequences of their seven TM regions. The patterns associated with TM regions of the different classes are:

```
family          TM pattern

class I         2  LA..D
                3  [D/E]R[Y/H]
                5  [F/Y]..P.......Y
                6  [F/Y]...W.P
                7  [N/D]P..Y

class II        1  G...S...L
                2  H.[H/N/Q]....[F/Y]..[N/R/K]
                3  W...E...L
                4  GW..P
                6  [K/R]....L.P..G
                7  QG.......C

class III       2  [K/R]....[E/D].[C/S][F/Y]
                3  [S/A]....KT
                4  Q......[W/L]
                5  Y...L...C
                6  E.[K/R]...F.M......W....P
```

Write a Perl program that reads a UniProtKB file and classifies the protein in that file as belonging to class I, class II or class III (or no class) according to the classification rules used by Bissantz et al. (2004):

"The family is considered determined if either (i) two patterns (motifs) of one family are found and none of another family, or (ii) if 3 patterns (motifs) of one family are found and not more than 1 pattern of the other families, or (iii) if 4 or more patterns (motifs) of one family are found and not more than 2 patterns of the other families."

4. Find the accession codes of some UniProtKB entries for human GPCR proteins by searching UniProtKB for "GPCR" with the organism set to "Human". These entries can be used as input to your program.
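
When thinking about question 3, two pieces may be worth separating: turning the table notation into Perl regular expressions, and applying rules (i) to (iii) to the number of patterns found per family. The sketch below is one reading of those rules, not the method of Bissantz et al. It assumes that the limits on "other families" apply to each other family separately, and it leaves out the step of checking that a pattern lies in the right TM region. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Convert table notation such as "[D/E]R[Y/H]" into a Perl regex
# by dropping the "/" separators inside the brackets.
sub pattern_to_regex {
    my ($pattern) = @_;
    (my $re = $pattern) =~ s{/}{}g;
    return qr/$re/;
}

# Apply rules (i)-(iii) to a hash ref of { family => number of patterns found }.
# Assumption: each other family is compared with the limit on its own.
sub classify {
    my ($hits) = @_;
    for my $family (sort keys %$hits) {
        my $n = $hits->{$family};
        my $max_other = 0;
        for my $other (grep { $_ ne $family } keys %$hits) {
            $max_other = $hits->{$other} if $hits->{$other} > $max_other;
        }
        return $family if $n == 2 && $max_other == 0;    # rule (i)
        return $family if $n == 3 && $max_other <= 1;    # rule (ii)
        return $family if $n >= 4 && $max_other <= 2;    # rule (iii)
    }
    return 'no class';
}

print "pattern found\n" if "ADRYV" =~ pattern_to_regex('[D/E]R[Y/H]');
print classify({ 'class I' => 3, 'class II' => 1, 'class III' => 0 }), "\n";
```

A fuller program would first count, for each family, how many of its patterns match the protein, ideally restricted to the corresponding transmembrane segment, and then pass those counts to `classify`.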

### Supplementary Material

Question 2 relates to the article: Liu, Y., Sponner, A., Porter, D. and Vollrath, F. (2008) "Proline and processing of spider silks", Biomacromolecules, 9, 116-121.

Question 3 relates to the article: Bissantz, C., Logean, A. and Rognan, D. (2004) "High-throughput modeling of human G-protein coupled receptors: amino acid sequence alignment, three-dimensional model building, and receptor library screening", J. Chem. Inf. Comput. Sci., 44, 1162-1176.

### What to demonstrate or hand in

You do not have to demonstrate or hand in anything in connection with this set of exercises, but you should ensure that you have completed the earlier computer labs, and check that your completed assignments have been recorded correctly.