Practical 4

Perl (3)

Aims

To give practice in using and writing simple Perl scripts.

Objectives

After this practical you will:

be able to use and write simple Perl scripts.

Exercises

This question is based on the last slide from the lecture on 2013-02-12.

Write a Perl program that reads a set of aligned RNA sequences from a file specified on the command line and finds positions that are covarying to maintain Watson-Crick complementarity. You can assume that there is one sequence per line, and that the sequences contain only the characters a, c, g and u (no white space, gaps or ambiguity codes).

You can test your program with the multiple alignment files 'ma1' and 'ma2' in the directory /chalmers/users/kemp/MVE360/practical4.

(Hint: One approach is to work with an array whose elements are the sequences that you have read in from the file. An alternative approach is to construct an array whose elements are the columns in the alignment. This second approach requires a little more work at the start, but the main task - checking for covariance - is then much easier. I don't mind which approach you take, but I recommend the second approach, since I believe it's less work overall.)
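
The second approach in the hint can be sketched as follows. This is only a minimal illustration, not a model answer: the subroutine names alignment_columns and covarying are hypothetical, and the test used here (every row forms a Watson-Crick pair, and at least two different pairs occur) is one possible reading of "covarying" that ignores G-U wobble pairs.

```perl
use strict;
use warnings;

# Build an array of column strings from equal-length aligned sequences.
sub alignment_columns {
    my @seqs = @_;
    my @columns;
    for my $seq (@seqs) {
        my @residues = split //, $seq;
        for my $i (0 .. $#residues) {
            $columns[$i] .= $residues[$i];
        }
    }
    return @columns;
}

# Assumed definition: two columns covary if every row pairs them as a
# Watson-Crick pair and more than one distinct pair is observed.
sub covarying {
    my ($x, $y) = @_;
    my %wc = (a => 'u', u => 'a', c => 'g', g => 'c');
    my %seen;
    for my $i (0 .. length($x) - 1) {
        my ($p, $q) = (substr($x, $i, 1), substr($y, $i, 1));
        return 0 unless $wc{$p} eq $q;
        $seen{"$p$q"} = 1;
    }
    return keys(%seen) > 1;
}

# Small made-up alignment, one sequence per element.
my @columns = alignment_columns(qw(gac cag uaa));
for my $i (0 .. $#columns - 1) {
    for my $j ($i + 1 .. $#columns) {
        printf "columns %d and %d covary\n", $i + 1, $j + 1
            if covarying($columns[$i], $columns[$j]);
    }
}
```

Once the columns are available as strings, comparing two positions only needs a single loop over the rows, which is why the extra work at the start pays off.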

Write a Perl program that reads a nucleotide sequence from an EMBL databank file, and finds the longest subsequence whose reverse complement is also present in the sequence.

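
One way to think about this task: a subsequence has its reverse complement in the sequence exactly when the subsequence itself occurs in the reverse complement of the whole sequence. The sketch below uses that observation with a simple brute-force search. It assumes the sequence has already been extracted from the file into a lowercase string of a, c, g and t; the subroutine names are hypothetical, and the search may be too slow for long entries.

```perl
use strict;
use warnings;

# Reverse the string, then swap each base for its complement.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/acgt/tgca/;
    return $rc;
}

# Try candidate lengths from longest to shortest and return the first
# substring that also occurs in the reverse complement of the sequence.
sub longest_rc_subsequence {
    my ($seq) = @_;
    my $rc = reverse_complement($seq);
    for my $len (reverse 1 .. length $seq) {
        for my $start (0 .. length($seq) - $len) {
            my $sub = substr($seq, $start, $len);
            return $sub if index($rc, $sub) >= 0;
        }
    }
    return '';
}

# Made-up example sequence.
print longest_rc_subsequence('gaattcaa'), "\n";
```
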
According to Pyagay et al. (2005):

"A short collagen motif with 12 Gly-X-Y repeats appears to be responsible for trimerization of the protein and this renders the molecule susceptible to cleavage by collagenase."

Here, X and Y represent arbitrary amino acid residues. This motif is present in the sequence in the FASTA format file /chalmers/users/kemp/MVE360/practical4/NP_612464.fasta, and the UniProt file /chalmers/users/kemp/MVE360/practical4/CO1A1_HUMAN.uniprot contains several occurrences of this motif.
1. Write a Perl program that reads a FASTA format file (e.g. NP_612464.fasta) whose name is specified on the command line, and finds whether the protein sequence in that file contains a motif consisting of 12 (or more) Gly-X-Y repeats.
  
  (Sequence information in a FASTA format file is found on lines that do not begin with the character '>'.)
2. UniProt entry CO1A1_HUMAN contains several motifs consisting of two or more Gly-X-Y repeats. Look at the sequence in file CO1A1_HUMAN.uniprot and find the first subsequence that matches this motif. At what position in the sequence does this subsequence start? How many Gly-X-Y repeats are in this motif? (Answer these questions by looking at the sequence. Do not write any code for this part of the question.)
3. Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the sequence of each Gly-X-Y repeat motif, and also prints out the number of repeats in each motif.
  
  (As a check, you should find four occurrences of this motif in CO1A1_HUMAN.uniprot, and the 2nd, 3rd and 4th have lengths of 338, 2 and 2 repeats respectively.)
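
The pattern matching in parts 1 and 3 can be sketched with a single regular expression, since glycine is written G in the one-letter code and X and Y can be any residue. The subroutine names below are hypothetical, and the UniProt reader assumes the usual flat-file layout in which the sequence lines lie between the line starting with SQ and the terminating // line. Treat this as an illustration rather than a model answer.

```perl
use strict;
use warnings;

# FASTA: sequence is on the lines that do not begin with '>'.
sub fasta_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;
        $line =~ s/\s+//g;
        $seq .= uc $line;
    }
    return $seq;
}

# UniProt (assumed layout): sequence lines follow the SQ line, up to '//'.
sub uniprot_sequence {
    my ($fh) = @_;
    my ($seq, $in_seq) = ('', 0);
    while (my $line = <$fh>) {
        last if $line =~ m{^//};
        if ($in_seq) {
            $line =~ s/\s+//g;
            $seq .= uc $line;
        }
        $in_seq = 1 if $line =~ /^SQ /;
    }
    return $seq;
}

# Return one [start position (1-based), repeat count, matched text] entry
# per non-overlapping run of at least $min_repeats Gly-X-Y repeats.
sub gxy_motifs {
    my ($seq, $min_repeats) = @_;
    my @motifs;
    while ($seq =~ /((?:G..){$min_repeats,})/g) {
        push @motifs, [ $-[1] + 1, length($1) / 3, $1 ];
    }
    return @motifs;
}

# Made-up example sequence, not taken from either practical file.
for my $motif (gxy_motifs('MAGPPGAPGLLKGAAGEKW', 2)) {
    my ($start, $repeats, $text) = @$motif;
    print "$text starts at $start with $repeats repeats\n";
}
```

For part 1 the same subroutine can be called with a minimum of 12 and the result tested for being non-empty.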

What to demonstrate or hand in

Demonstrate your solutions to exercise 1.

Ensure that your names are included in a comment in your program.

Last Modified: 11 February 2013 by Graham Kemp