Sequence bioinformatics (2010/2011) | Graham Kemp's classes

Practical: Perl 3

Aims

Objectives

After this practical you will:

Exercises

  1. Try out some of the scripts from the lecture. Some of these are in directory /chalmers/users/kemp/UMF018/lecture4/.

  2. Write a Perl program that predicts whether a short stretch of genomic sequence comes from a CpG island by summing the log likelihood ratios of transition probabilities between every pair of consecutive nucleotides in the sequence. Test your program with the sequences in files /chalmers/users/kemp/UMF018/practical3/test_seq1 and /chalmers/users/kemp/UMF018/practical3/test_seq2.
    Your program should give the values -12.866 and 49.275 for these data files.

    For your convenience, the file /chalmers/users/kemp/UMF018/practical3/cpg_islands.pl defines an associative array containing log likelihood ratios of transition probabilities for dinucleotides in putative human CpG islands, and has code for reading a FASTA format sequence into a string variable.

  3.
    a. Modify the Perl program written in Question 2 so that it can be used to predict CpG islands in long genomic sequences by summing the log likelihood ratios of transition probabilities between every pair of consecutive nucleotides in a sliding window. For each window position, print out the position and the score, so that the scores can be plotted on a graph using the gnuplot program.
    b. The human alpha globin gene cluster located on chromosome 16 contains five putative CpG islands (NCBI entry). Download the sequence of this region in FASTA format, and use the program written in part (a) to predict the locations of CpG islands. Plot a graph showing the log-odds score for a sliding window. Experiment with different window sizes. Compare the results obtained using your program with the annotations in the NCBI data file.

  4. In answering this question you can reuse your solution to Question 4 in the Perl 1 practical.

    a. Write a Perl program that reads a UniProt file (Swiss-Prot format) and writes out the sequences of the alpha helices. There should be one line of output for each alpha helix in the protein. The accession code of the UniProt entry (e.g. P00784) should be given as an argument on the command line, and your Perl program should retrieve the UniProt entry from the ExPASy web site using the lynx program.

      unix> ./uniprot_helices.pl P00784

    b. Write a Perl program that generalises your solution to part (a) by taking the name of the feature type of interest as the second command line argument, e.g.

      unix> ./uniprot_features.pl P00784 HELIX
      unix> ./uniprot_features.pl Q9NS75 TRANSMEM

  5. According to Pyagay et al. (2005):

    "A short collagen motif with 12 Gly-X-Y repeats appears to be responsible for trimerization of the protein and this renders the molecule susceptible to cleavage by collagenase."

    Here, X and Y represent arbitrary amino acid residues. This motif is present in the FASTA format file /chalmers/users/kemp/UMF018/practical3/NP_612464.fasta

    a. Write a Perl program that reads a FASTA format file whose name is specified on the command line, and reports whether the protein sequence in that file contains a motif consisting of 12 (or more) Gly-X-Y repeats.

    b. UniProt entry CO1A1_HUMAN contains several motifs consisting of two or more Gly-X-Y repeats. Look at the sequence in file /chalmers/users/kemp/UMF018/practical3/CO1A1_HUMAN and find the first subsequence that matches this motif. At what position in the sequence does this subsequence start? How many Gly-X-Y repeats are in this motif? (Answer these questions by looking at the sequence. Do not write any code for this part of the question.)

    c. Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the sequence of each Gly-X-Y repeat motif, and also prints out the number of repeats in each motif.

  6. A five-residue motif, "GPGXX", is believed to be important for the elastic properties of spider silk. This motif occurs 64 times in UniProt entry SPD2_NEPCL (see /chalmers/users/kemp/UMF018/practical3/SPD2_NEPCL).

    a. Write a Perl program that reads a UniProt file whose name is specified on the command line and prints out the number of occurrences of this motif in the UniProt entry.

    b. Suppose we want a program that finds what actual amino acid residues correspond to the "XX" positions in this motif (e.g. "QQ", "GY", etc.), counts how often each "XX" pair occurs, and prints the most common pair. The output of the program should look as follows:

              QQ 25
              RY 1
              SA 12
              SQ 1
              GY 24
              IA 1
              Most common pair is QQ
      

      Write a Perl program that performs this task.

Supplementary Material

Questions 2 and 3 relate to section 3.1 of Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. "Biological sequence analysis - Probabilistic models of proteins and nucleic acids", Cambridge University Press, 1998.
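
As a rough illustration of the log-odds scoring used in the CpG island exercises, the sketch below sums a score for every overlapping dinucleotide in a sequence and then repeats this for each position of a sliding window. The values in %log_odds are invented placeholders, not the table from cpg_islands.pl, and the subroutine names score_sequence and window_scores are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder log-odds values for illustration only (an assumption);
# the real table is the associative array defined in cpg_islands.pl.
my %log_odds = (
    AA => -0.7, AC => 0.4, AG => 0.6, AT => -0.8,
    CA => -0.9, CC => 0.3, CG => 1.8, CT => -0.7,
    GA => -0.6, GC => 0.5, GG => 0.3, GT => -0.7,
    TA => -1.2, TC => 0.6, TG => 0.4, TT => -0.7,
);

# Sum the log-odds scores of every overlapping dinucleotide in $seq.
# Pairs that are not in the table (e.g. containing N) are skipped.
sub score_sequence {
    my ($seq) = @_;
    my $score = 0;
    for my $i (0 .. length($seq) - 2) {
        my $pair = substr($seq, $i, 2);
        $score += $log_odds{$pair} if exists $log_odds{$pair};
    }
    return $score;
}

# Score every window of $window nucleotides and return a list of
# [position, score] pairs, where position is the 1-based window start.
sub window_scores {
    my ($seq, $window) = @_;
    my @result;
    for my $start (0 .. length($seq) - $window) {
        my $score = score_sequence(substr($seq, $start, $window));
        push @result, [ $start + 1, $score ];
    }
    return @result;
}

my $seq = "ACGCGT";    # toy sequence
printf "%.3f\n", score_sequence($seq);
printf "%d %.3f\n", @$_ for window_scores($seq, 4);
```

Printing one "position score" pair per line gives a two-column file that gnuplot can plot directly. Rescoring each window from scratch is simple but repeats work; for long sequences it may be worth updating the score incrementally as the window moves.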

Question 5 relates to the article: Pyagay, P., Heroult, M., Wang, Q., Lehnert, W., Belden, J., Liaw, L., Friesel, R.E. and Lindner, V. (2005) "Collagen triple helix repeat containing 1, a novel secreted protein in injured and diseased arteries, inhibits collagen expression and promotes cell migration", Circ Res., 96, 261-268 (PubMed)
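
One possible way to look for runs of Gly-X-Y repeats is a regular expression with a counted quantifier, as in the sketch below. The subroutine name gxy_motifs and the toy sequence are invented for illustration, and the sequence is assumed to be a single string with line breaks already removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [start, motif, repeats] for each run of at least $min
# consecutive Gly-X-Y repeats in $seq. start is a 1-based position.
sub gxy_motifs {
    my ($seq, $min) = @_;
    my @motifs;
    # (?:G..) is one repeat: G followed by any two residues.
    # {$min,} asks for $min or more repeats in a row.
    while ($seq =~ /((?:G..){$min,})/g) {
        my $motif = $1;
        my $start = pos($seq) - length($motif) + 1;
        push @motifs, [ $start, $motif, length($motif) / 3 ];
    }
    return @motifs;
}

my $toy = "MAGPPGAKGEPLLGSSW";    # invented toy sequence, not a real protein
for my $m (gxy_motifs($toy, 2)) {
    printf "%d %s %d\n", @$m;
}
```

Because the match is leftmost and greedy, each run is reported starting at the first G that can begin a run of the required length, and successive matches do not overlap.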

Question 6 relates to the article: Liu, Y., Sponner, A., Porter, D. and Vollrath, F. (2008) "Proline and processing of spider silks", Biomacromolecules, 9, 116-121 (PubMed)
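
Counting how often each "XX" pair follows "GPG" is a natural use of an associative array keyed by the pair. The sketch below is one way this might be done; the subroutine name count_gpgxx_pairs and the toy sequence are invented, and it counts non-overlapping matches only, which is an assumption about how occurrences should be counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally the two residues that follow each non-overlapping "GPG" in $seq.
sub count_gpgxx_pairs {
    my ($seq) = @_;
    my %count;
    # In scalar context, /g resumes after the previous match each time.
    $count{$1}++ while $seq =~ /GPG(..)/g;
    return %count;
}

my $toy   = "GPGQQGPGGYAAGPGQQ";    # invented toy sequence
my %count = count_gpgxx_pairs($toy);

# Highest count first; ties are broken alphabetically.
my ($best) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

print "$_ $count{$_}\n" for sort keys %count;
print "Most common pair is $best\n";
```

Here the pairs are printed in alphabetical order; iterating over the hash without sorting gives an arbitrary order instead.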

Work to be handed in

Either:

  1. Show your solution to Question 3a to me during the practical session on Friday 26 November 2010, or
  2. If this is not possible, you should print the program that is your solution to Question 3a and the graph that is produced when you run the program with the sequence of the human alpha globin gene cluster as input (Question 3b). There is an envelope marked "Sequence Bioinformatics" in the tray outside my office (room 6475, EDIT building). Put your solution into this envelope no later than 17:00 on Monday 6 December 2010.

Ensure that your names are included in a comment in your program.


Last Modified: 19 November 2010 by Graham Kemp