After this practical you will:
Try out some of the scripts from the lecture. Some of these are in the directory /chalmers/users/kemp/UMF018/lecture4/.
Write a Perl program that predicts whether a short stretch of genomic
sequence comes from a CpG island by summing the log-likelihood ratios of
transition probabilities between every pair of consecutive nucleotides
in the sequence.
Test your program with the sequences in files /chalmers/users/kemp/UMF018/practical3/test_seq1 and /chalmers/users/kemp/UMF018/practical3/test_seq2.
Your program should give the values -12.866 and 49.275 for these data files.
For your convenience, the file /chalmers/users/kemp/UMF018/practical3/cpg_islands.pl defines an associative array containing log-likelihood ratios of transition probabilities for dinucleotides in putative human CpG islands, and has code for reading a FASTA format sequence into a string variable.
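The summation described above can be sketched as follows. This is a minimal illustration, not the code in cpg_islands.pl: the subroutine name log_odds_score and the hash %toy are hypothetical, and the toy values are placeholders rather than the real log-likelihood ratios. It assumes the sequence has already been read into a single string with no newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the log-odds scores of all overlapping dinucleotides in $seq.
# $table is a reference to a hash keyed by dinucleotide (e.g. "CG").
sub log_odds_score {
    my ($seq, $table) = @_;
    $seq = uc $seq;
    my $score = 0;
    for my $i (0 .. length($seq) - 2) {
        my $pair = substr($seq, $i, 2);
        # Skip pairs that have no entry in the table (e.g. containing N).
        next unless exists $table->{$pair};
        $score += $table->{$pair};
    }
    return $score;
}

# Toy values for illustration only; NOT the ratios from cpg_islands.pl.
my %toy = (CG => 2.0, GC => 0.5, AT => -1.0, TA => -1.0);
printf "%.3f\n", log_odds_score("CGCGAT", \%toy);
```

A sequence of length n contributes n-1 overlapping pairs, so the loop runs from the first base to the second-to-last one. A positive total suggests the CpG island model, a negative total the background model.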
The human alpha globin gene cluster located on chromosome 16 contains five putative CpG islands (NCBI entry). Download the sequence of this region in FASTA format, and use the program written in part (a) to predict the locations of CpG islands. Plot a graph showing the log-odds score for a sliding window. Experiment with different window sizes. Compare the results obtained using your program with the annotations in the NCBI data file.
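One way to produce data for the sliding-window plot is sketched below. The subroutine names pair_score_sum and window_scores, the step parameter, and the toy table are all hypothetical; the real ratios would come from cpg_islands.pl. The output is two tab-separated columns (window start, score), which could be fed to a plotting tool of your choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum of table entries over all overlapping dinucleotides in $seq.
sub pair_score_sum {
    my ($seq, $table) = @_;
    my $score = 0;
    for my $i (0 .. length($seq) - 2) {
        my $pair = uc substr($seq, $i, 2);
        $score += $table->{$pair} if exists $table->{$pair};
    }
    return $score;
}

# Score every window of length $window, moving $step bases at a time.
# Returns a list of [start, score] pairs, with start counted from 1.
sub window_scores {
    my ($seq, $table, $window, $step) = @_;
    my @rows;
    for (my $start = 0; $start + $window <= length($seq); $start += $step) {
        my $score = pair_score_sum(substr($seq, $start, $window), $table);
        push @rows, [$start + 1, $score];
    }
    return @rows;
}

# Toy values for illustration only; NOT the ratios from cpg_islands.pl.
my %toy = (CG => 2.0, GC => 0.5, AT => -1.0, TA => -1.0);
for my $row (window_scores("ATATCGCGCGATAT", \%toy, 4, 2)) {
    printf "%d\t%.3f\n", @$row;
}
```

Rescoring each window from scratch is simple but repeats work when windows overlap. For a long genomic region it may be worth keeping a running total instead, adding the pair that enters the window and subtracting the pair that leaves it.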
In answering this question you can reuse your solution to Question 4 in the Perl 1 practical.
Write a Perl program that reads a UniProt file (Swiss-Prot format) and writes out the sequences of the alpha helices. There should be one line of output for each alpha helix in the protein. The accession code of the UniProt entry (e.g. P00784) should be given as an argument on the command line, and your Perl program should retrieve the UniProt entry from the ExPASy web site using the lynx program.
unix> ./uniprot_helices.pl P00784
Write a Perl program that generalises your solution to part (a) by taking the name of the feature type of interest as the second command line argument, e.g.
unix> ./uniprot_features.pl P00784 HELIX
unix> ./uniprot_features.pl Q9NS75 TRANSMEM
According to Pyagay et al. (2005):
"A short collagen motif with 12 Gly-X-Y repeats appears to be responsible for trimerization of the protein and this renders the molecule susceptible to cleavage by collagenase."
Here, X and Y represent any arbitrary amino acid residues. This motif is present in the FASTA format file /chalmers/users/kemp/UMF018/practical3/NP_612464.fasta
Write a Perl program that reads a FASTA format file whose name is specified on the command line, and finds whether the protein sequence in that file contains a motif consisting of 12 (or more) Gly-X-Y repeats.
UniProt entry CO1A1_HUMAN contains several motifs consisting of two or more Gly-X-Y repeats. Look at the sequence in file /chalmers/users/kemp/UMF018/practical3/CO1A1_HUMAN and find the first subsequence that matches this motif. At what position in the sequence does this subsequence start? How many Gly-X-Y repeats are in this motif? (Answer these questions by looking at the sequence. Do not write any code for this part of the question.)
Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the sequence of each Gly-X-Y repeat motif, and also prints out the number of repeats in each motif.
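A Gly-X-Y repeat motif maps naturally onto a regular expression with a counted quantifier. The sketch below shows one possible approach on a made-up sequence; the subroutine name gxy_motifs and the example string are hypothetical, and reading the sequence from a FASTA or UniProt file is left out. It assumes the sequence is held in one string in one-letter code with no spaces or newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find runs of at least $min_repeats consecutive Gly-X-Y triplets.
# Returns a list of [start, repeats, motif], with start counted from 1.
sub gxy_motifs {
    my ($seq, $min_repeats) = @_;
    my @hits;
    # "G.." is one repeat: G followed by any two residues.
    while ($seq =~ /((?:G..){$min_repeats,})/g) {
        # $-[1] is the 0-based offset where capture group 1 begins.
        push @hits, [$-[1] + 1, length($1) / 3, $1];
    }
    return @hits;
}

# Made-up sequence for illustration only.
for my $hit (gxy_motifs("MKGPAGEPGAAWLGSTGNN", 2)) {
    my ($start, $repeats, $motif) = @$hit;
    print "$motif starts at $start with $repeats repeats\n";
}
```

Calling gxy_motifs with a minimum of 12 and testing whether the returned list is empty answers the yes/no form of the question. Note that the regular expression engine reports the leftmost match and then continues after it, so a run that could be read in a different frame starting one or two residues later is not reported separately.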
A five-residue motif, "GPGXX", is believed to be important for the elastic properties of spider silk. This motif occurs 64 times in UniProt entry SPD2_NEPCL (see /chalmers/users/kemp/UMF018/practical3/SPD2_NEPCL).
Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the number of occurrences of this motif in the UniProt entry.
Suppose we want a program that finds what actual amino acid residues correspond to the "XX" positions in this motif (e.g. "QQ", "GY", etc.), counts how often each "XX" pair occurs, and prints the most common pair. The output of the program should look as follows:
QQ 25
RY 1
SA 12
SQ 1
GY 24
IA 1
Most common pair is QQ
Write a Perl program that performs this task.
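The counting step can be sketched with a capturing regular expression and a hash of tallies. This is an illustration under stated assumptions, not a full solution: the subroutine name count_xx_pairs and the example string are hypothetical, and the sequence is assumed to have been extracted from the UniProt entry into one string already.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally the two residues that follow each "GPG" in $seq.
# Matches are non-overlapping: scanning resumes after each GPGXX.
sub count_xx_pairs {
    my ($seq) = @_;
    my %count;
    $count{$1}++ while $seq =~ /GPG(..)/g;
    return %count;
}

# Made-up sequence for illustration only.
my %count = count_xx_pairs("GPGQQGPGGYGPGQQAAGPGSA");
print "$_ $count{$_}\n" for sort keys %count;

# Highest count first; ties are broken alphabetically.
my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
print "Most common pair is $best\n";
```

The pairs are printed here in alphabetical order, whereas iterating over a Perl hash directly gives an arbitrary order. Whether overlapping occurrences should be counted is a design choice; a lookahead such as /GPG(?=(..))/g would count them, and it is worth checking which variant reproduces the stated total of 64.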
Questions 2 and 3 relate to section 3.1 of Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. "Biological sequence analysis - Probabilistic models of proteins and nucleic acids", Cambridge University Press, 1998.
Question 5 relates to the article: Pyagay, P., Heroult, M., Wang, Q., Lehnert, W., Belden, J., Liaw, L., Friesel, R.E. and Lindner, V. (2005) "Collagen triple helix repeat containing 1, a novel secreted protein in injured and diseased arteries, inhibits collagen expression and promotes cell migration", Circ Res., 96, 261-268 (PubMed)
Question 6 relates to the article: Liu, Y., Sponner, A., Porter, D. and Vollrath, F. (2008) "Proline and processing of spider silks", Biomacromolecules, 9, 116-121 (PubMed)
Either:
Ensure that your names are included in a comment in your program.