Sequence bioinformatics (2010/2011) | Graham Kemp's classes

Practical: Perl 2

Aims

Objectives

After this practical you will:

Exercises

  1. Try out some of the scripts from the lecture. Some of these are in directory /chalmers/users/kemp/UMF018/lecture3/.

  2. Modify the program reverse_complement.pl so that it can print the reverse complement of DNA sequences that contain nucleotide ambiguity codes (see Tables 1 and 2 of "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences").
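
    As a starting point, the complement of each ambiguity code is the code for the complementary set of bases (for example R = A or G pairs with Y = C or T). The following is a minimal sketch, not the contents of reverse_complement.pl; the subroutine name is hypothetical, and U is assumed not to occur because the input is DNA.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not taken from reverse_complement.pl.
# Complement pairs: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H.
# S, W and N are their own complements, so tr/// leaves them unchanged.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;    # scalar context: reverses the string
    $rc =~ tr/ACGTRYKMBVDHacgtrykmbvdh/TGCAYRMKVBHDtgcayrmkvbhd/;
    return $rc;
}

print reverse_complement("AAGRTN"), "\n";    # prints NAYCTT
```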

  3. Study the program /chalmers/users/kemp/UMF018/lesk/translate.pl

    Generalise this program so that it prints the translations of a DNA sequence in all six possible reading frames (sequence as read in, in three phases; reverse complement, in three phases).
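
    One way to organise the six-frame translation is sketched below. It is written independently of translate.pl, so the subroutine names and the way the codon table is built are assumptions made for illustration. It uses the standard genetic code, prints '*' for stop codons and 'X' for any codon that is not in the table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in TCAG order.
my @bases = qw(T C A G);
my $aas = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %code;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $code{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Translate one strand, starting at offset 0, 1 or 2.
sub translate_frame {
    my ($dna, $offset) = @_;
    my $protein = '';
    for (my $pos = $offset; $pos + 3 <= length($dna); $pos += 3) {
        my $codon = uc substr($dna, $pos, 3);
        $protein .= exists $code{$codon} ? $code{$codon} : 'X';
    }
    return $protein;
}

# Return the translations in the order +1, +2, +3, -1, -2, -3.
sub six_frames {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    my @frames;
    for my $strand ($dna, $rc) {
        push @frames, translate_frame($strand, $_) for 0 .. 2;
    }
    return @frames;
}

my @labels = qw(+1 +2 +3 -1 -2 -3);
my @frames = six_frames("ATGGCC");
print "Frame $labels[$_]: $frames[$_]\n" for 0 .. 5;
```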

  4. This question is based on the lecture on RNA bioinformatics.

    Write a Perl program that reads a DNA sequence from a file and determines whether it contains a subsequence that matches the following PatScan pattern:

    p1=3...3
    YYAGG
    p2=3...3
    GNRA
    ~p2
    AGCAG
    ~p1
    

    Your program should print the subsequence that matches. Test your program with the sequence in file /chalmers/users/kemp/UMF018/practical2/dna
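
    A Perl regular expression cannot express "reverse complement of an earlier group" directly, so one possible approach is to capture the four variable 3-base regions and compare them afterwards. The sketch below assumes that ~p1 and ~p2 must be exact reverse complements (no mismatches), that the sequence uses T rather than U, and that Y = C or T, R = A or G and N = any base. The subroutine names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub revcomp {
    my ($s) = @_;
    my $rc = reverse $s;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Return the first subsequence matching the pattern, or nothing if none.
sub find_pattern {
    my ($seq) = @_;
    $seq = uc $seq;
    $seq =~ s/[^A-Z]//g;    # drop newlines, spaces and digits
    # The lookahead makes each match zero-length, so every start
    # position is tried, including overlapping candidates.
    while ($seq =~ /(?=((...)[CT][CT]AGG(...)G[ACGT][AG]A(...)AGCAG(...)))/g) {
        my ($whole, $p1, $p2, $rc_p2, $rc_p1) = ($1, $2, $3, $4, $5);
        return $whole
            if $rc_p2 eq revcomp($p2) && $rc_p1 eq revcomp($p1);
    }
    return;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $dna = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my $hit = find_pattern($dna);
    print defined $hit ? "$hit\n" : "No match\n";
}
```

    If the file contains more than the bare sequence (for example a header line), that extra text would have to be removed before calling find_pattern.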

  5. This question is based on the lecture on RNA bioinformatics.

    Write a Perl program that reads a set of aligned RNA sequences from a file specified on the command line and finds positions that are covarying to maintain Watson-Crick complementarity. You can assume that there is one sequence per line, and that the sequences contain only the characters a, c, g and u (no white space, gaps or ambiguity codes).

    You can test your program with multiple alignment files 'ma1' and 'ma2' in directory /chalmers/users/kemp/UMF018/practical2.
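
    The question leaves the exact definition of "covarying" open. The sketch below assumes one reasonable reading: columns i and j covary if every sequence has a Watson-Crick pair (a-u, u-a, g-c or c-g) at those two positions and at least two different pairs occur across the alignment. G-U wobble pairs are not counted, and positions are reported starting from 1. The subroutine name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %watson_crick = (au => 1, ua => 1, gc => 1, cg => 1);

# Return a list of [i, j] column pairs (1-based) that covary.
sub covarying_pairs {
    my @seqs = @_;
    return unless @seqs;
    my $len = length $seqs[0];
    my @pairs;
    for my $i (0 .. $len - 2) {
        for my $j ($i + 1 .. $len - 1) {
            my %seen;
            my $all_paired = 1;
            for my $s (@seqs) {
                my $pair = substr($s, $i, 1) . substr($s, $j, 1);
                if (!$watson_crick{$pair}) {
                    $all_paired = 0;
                    last;
                }
                $seen{$pair}++;
            }
            # Require variation: a fully conserved pair is not covariation.
            push @pairs, [$i + 1, $j + 1]
                if $all_paired && keys(%seen) > 1;
        }
    }
    return @pairs;
}

# Only runs when an alignment file is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @seqs = grep { length } map { chomp; $_ } <$fh>;
    close $fh;
    print "Positions $_->[0] and $_->[1] covary\n"
        for covarying_pairs(@seqs);
}
```

    Testing every pair of columns takes time proportional to the square of the alignment length, which should be acceptable for short alignments.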

  6. Write a Perl program that reads a nucleotide sequence from an EMBL databank file, and finds the longest subsequence whose reverse complement is also present in the sequence.
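
    A possible approach is sketched below, under these assumptions: the sequence lines of the EMBL entry lie between the line starting with SQ and the terminating // line; a subsequence that is its own reverse complement counts as a hit; and the subroutine names are hypothetical. Because every prefix of a hit is itself a hit, the search can use bisection on the length instead of trying every length in turn.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub revcomp {
    my ($s) = @_;
    my $rc = reverse $s;
    $rc =~ tr/acgt/tgca/;
    return $rc;
}

# Collect the sequence between the SQ line and the // terminator.
sub read_embl_sequence {
    my ($fh) = @_;
    my ($in_seq, $seq) = (0, '');
    while (my $line = <$fh>) {
        last if $in_seq && $line =~ m{^//};
        if ($line =~ /^SQ/) { $in_seq = 1; next; }
        next unless $in_seq;
        $line =~ s/[^A-Za-z]//g;    # drop spaces and position counts
        $seq .= lc $line;
    }
    return $seq;
}

# Return the leftmost subsequence of the given length whose reverse
# complement also occurs in the sequence, or nothing if there is none.
sub find_with_revcomp {
    my ($seq, $len) = @_;
    my %kmers;
    $kmers{ substr($seq, $_, $len) } = 1 for 0 .. length($seq) - $len;
    for my $i (0 .. length($seq) - $len) {
        my $kmer = substr($seq, $i, $len);
        return $kmer if $kmers{ revcomp($kmer) };
    }
    return;
}

# Bisection on the length: hits of length L imply hits of length L-1.
sub longest_with_revcomp {
    my ($seq) = @_;
    my ($lo, $hi, $best) = (1, length($seq), '');
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $hit = find_with_revcomp($seq, $mid);
        if (defined $hit) { $best = $hit; $lo = $mid + 1; }
        else              { $hi = $mid - 1; }
    }
    return $best;
}

# Only runs when an EMBL file is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print longest_with_revcomp(read_embl_sequence($fh)), "\n";
    close $fh;
}
```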

  7. Modify the program embl_orf.pl so that it prints out the translated sequence of the longest open reading frame. The output should use one-letter amino-acid residue codes, and the output should have 10 characters per line.
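
    The output formatting part of this question can be handled separately from finding the open reading frame. The sketch below shows only that part; the subroutine name is hypothetical and nothing here is taken from embl_orf.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a string into lines of at most $width characters (default 10).
sub format_lines {
    my ($seq, $width) = @_;
    $width ||= 10;
    my $out = '';
    for (my $i = 0; $i < length($seq); $i += $width) {
        $out .= substr($seq, $i, $width) . "\n";
    }
    return $out;
}

print format_lines("MKTAYIAKQRQISFVK");
# prints:
# MKTAYIAKQR
# QISFVK
```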

  8. Study the program /chalmers/users/kemp/UMF018/lesk/assemble.pl
    Try running this program with different input strings.

    Try this program with the following input fragments:

    rs International Mas
    onal Mas
    ernational Masters Prog
    me in Bio
    Bioinformatics
    Chalmers Interna
    rs Programme in Bio
    

    Now try the program with the same fragments in a different order:

    Chalmers Interna
    rs International Mas
    onal Mas
    ernational Masters Prog
    rs Programme in Bio
    me in Bio
    Bioinformatics
    

    Can you explain the difference in the program's output? Try to modify the program so that it assembles fragments correctly regardless of the order in which they appear in the input stream.

Work to be handed in

Either:

  1. Show your solution to Question 5 to me during the practical session on Monday 15 November 2010, or
  2. If this is not possible, print your solution program for Question 5 together with the output it produces when run with the file /chalmers/users/kemp/UMF018/practical2/ma2 as input. Put the printout into the envelope marked "Sequence Bioinformatics" in the tray outside my office (room 6475, EDIT building) no later than 17:00 on Monday 22 November 2010.

Ensure that your names are included in a comment in your program.


Last Modified: 12 November 2010 by Graham Kemp