Bioinformatics (2011/2012)

Practical 5

Perl (4)



After this practical you will:


  1. Try out some of the scripts from the lecture. Some of these are in directory /chalmers/users/kemp/MVE360/lecture6/.

  2. Write a Perl program that predicts whether a short stretch of genomic sequence comes from a CpG island by summing the log likelyhood ratios of transition probabilities between every pair of consecutive nucleotides in the sequence. Test your program with the sequences in files /chalmers/users/kemp/MVE360/practical5/test_seq1 and /chalmers/users/kemp/MVE360/practical5/test_seq2.
    Your program should give the values -12.866 and 49.275 for these data files.

    For your convenience, the file /chalmers/users/kemp/UMF018/practical3/ defines an associative array containing log likelyhood ratios of transition probabilities for dinucleotides in putative human CpG islands, and has code for reading a FASTA format sequence into a string variable.

    1. Modify the Perl program written in Question 2 so that it can be used to predict CpG islands in long genomic sequences by summing the log likelyhood ratios of transition probabilities between every pair of consecutive nucleotides in a sliding window. For each window position, print out the position and the score, so that the scores can be plotted on a graph using the gnuplot program.
    2. The human alpha globin gene cluster located on chromosome 16 contains five putative CpG islands (NCBI entry). Download the sequence of this region in FASTA format, and use the program written in part (a) to predict the locations of CpG islands.

      Plot a graph showing the log-odds score for a sliding window (use the gnuplot program for this).

      unix> ./ file.fasta > outfile
      unix> gnuplot
      gnuplot> plot "outfile" with lines
      gnuplot> exit

      Experiment with different window sizes (e.g. 500 nucleotides).

      Compare the results obtained using your program with the annotations in the NCBI data file (search for "CpG" within the NCBI entry).

  3. In answering this question you can reuse your solution to Question 3 in Practical 3.

    1. Write a Perl program that reads a UniProt file (Swiss-Prot format) and writes out the sequences of the alpha helices. There should be one line of output for each alpha helix in the protein. The accession code of the UniProt entry (e.g. P00784) should be given as an argument on the command line, and your Perl program should retrieve the UniProt entry from the ExPASy web site using the lynx program.

      unix> ./ P00784
    2. Write a Perl program that generalises your solution to part (a) by taking the name of the feature type of interest as the second command line argument, e.g.

      unix> ./ P00784 HELIX
      unix> ./ Q9NS75 TRANSMEM

What to demonstrate or hand in

Demonstrate your solutions to exercise 3.

Ensure that your names are included in a comment in your program.

Last Modified: 14 February 2012 by Graham Kemp