Computational methods in bioinformatics (2013-2014)

Practical 2

Domain assignment

Aims

Objectives

After this practical you will be able to:

Exercises

In this practical you will implement protein domain assignment programs that are based on DOMAK and STRUDL, but which use a simpler method for determining whether two amino acid residues are interacting (or in contact) with each other. In these exercises two residues are determined to be interacting if the centres of their alpha-carbon atoms ("CA") are within a threshold distance (e.g. 7Å). (2013-11-12: Just to confirm, you only need to consider the alpha-carbon atoms, and all other atoms can be ignored.)
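
The contact definition above can be sketched in C as follows. This is only an illustration, not the code in the example programs: the type CaAtom and the functions parse_ca() and in_contact() are hypothetical names introduced here, and the column positions assume the standard fixed-width layout of PDB ATOM records (atom name in columns 13-16, x, y and z in columns 31-54).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type for this sketch: the centre of one alpha-carbon atom. */
typedef struct {
    double x, y, z;
} CaAtom;

/* Reads one fixed-width numeric field, line[start .. start+width-1]. */
static double field(const char *line, int start, int width)
{
    char buf[16];
    assert(width < (int)sizeof buf);
    memcpy(buf, line + start, (size_t)width);
    buf[width] = '\0';
    return strtod(buf, NULL);
}

/* If line is an ATOM record whose atom name is " CA ", stores its
   coordinates in *out and returns true. Otherwise returns false and
   leaves *out unchanged. */
bool parse_ca(const char *line, CaAtom *out)
{
    if (strlen(line) < 54) return false;               /* too short for x, y, z */
    if (strncmp(line, "ATOM  ", 6) != 0) return false; /* not an ATOM record */
    if (strncmp(line + 12, " CA ", 4) != 0) return false;
    out->x = field(line, 30, 8);
    out->y = field(line, 38, 8);
    out->z = field(line, 46, 8);
    return true;
}

/* Returns true if the two CA centres are less than threshold angstroms
   apart. Comparing squared distances avoids a call to sqrt(). */
bool in_contact(const CaAtom *a, const CaAtom *b, double threshold)
{
    assert(threshold >= 0.0);
    double dx = a->x - b->x;
    double dy = a->y - b->y;
    double dz = a->z - b->z;
    return dx * dx + dy * dy + dz * dz < threshold * threshold;
}
```

A program would call parse_ca() on each line of the PDB file, keep the atoms for which it returns true, and then call in_contact() with a threshold such as 7.0 on each pair of stored atoms.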

You are welcome to use any of the programs in directory /chalmers/users/kemp/TDA507/example_programs/reading_pdb_files as a starting point.

  1. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and generate a distance map.

    A simple approach is to make use of the "dotplot.tcl" program in directory /chalmers/users/kemp/TDA507/practical2. This program reads a file with pairs of numbers (one pair of numbers per line), and plots a point for each pair of numbers.

    You can write a program that reads the coordinates of the "CA" atoms in a PDB file and then, for each pair of CA atoms (the CA atom in residue i and the CA atom in residue j), writes out the pair of numbers i and j if the distance between the two atoms is less than a threshold.

    If you plan to write your program in C, I recommend using the program atom_array.c as a starting point: copy this file to your own filespace, give it a suitable name (e.g. make_distance_map.c), and modify the appropriate lines in a Makefile in the same directory to refer to this new program. Modify the function read_data() so that atom records are only stored in the atom array if the atom name is " CA ". After all CA atoms have been read into the atom array, find all pairs whose separation is less than a threshold distance (e.g. 7Å).

    Run this program and redirect the output to a file, e.g.

    ./make_distance_map 2CSN.pdb > 2CSN.pairs
    

    The program /chalmers/users/kemp/TDA507/practical2/dotplot.tcl can then be used to plot the points as a distance map. To run this program, type its name and give the name of your file containing pairs of numbers as a command line argument, e.g.

    /chalmers/users/kemp/TDA507/practical2/dotplot.tcl 2CSN.pairs
    
  2. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and identify the residue at which the chain is most clearly partitioned into two parts/domains. Your program should use the simple scoring function that is used in the DOMAK program: (intA/extAB)*(intB/extAB)

  3. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and identify the two sets of residues in a good partitioning of the residues using a graph partitioning algorithm (e.g. the Kernighan-Lin algorithm, which is used in STRUDL). When deciding which residue u in set U should be swapped next, use a simple count of the number of residues in set V that interact with u as a measure of the extent of the interaction between residue u and set V.

    Describe any similarities or differences that you observe between the results of the programs in questions 2 and 3.
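
The scoring function in question 2 and the interaction count in question 3 can be sketched in C as follows. This is a minimal sketch under stated assumptions, not a reference solution. The function names and the flat contact matrix are hypothetical. intA and intB are taken to be the numbers of contacts within parts A and B, and extAB the number of contacts between them. A split with extAB equal to zero is given a score of -1 here so that it is skipped. Recounting the contacts for every split point is simple but slow, and a full Kernighan-Lin implementation needs more than the count shown.

```c
#include <assert.h>
#include <stdbool.h>

/* In all functions, contact is an n*n matrix stored row by row:
   contact[i * n + j] is true if residues i and j (numbered from 0)
   are in contact. */

/* Score for splitting the chain into A = residues 0..k-1 and
   B = residues k..n-1, using (intA/extAB)*(intB/extAB).
   Assumption of this sketch: returns -1.0 if extAB is zero. */
double split_score(const bool *contact, int n, int k)
{
    assert(k > 0 && k < n);
    int intA = 0, intB = 0, extAB = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (!contact[i * n + j]) continue;
            if (j < k) intA++;        /* i < j < k: both residues in A */
            else if (i >= k) intB++;  /* k <= i < j: both residues in B */
            else extAB++;             /* i in A, j in B */
        }
    }
    if (extAB == 0) return -1.0;
    return ((double)intA / extAB) * ((double)intB / extAB);
}

/* Returns the k with the highest split score, i.e. the number of
   residues in the first part, or -1 if no split has a defined score. */
int best_split(const bool *contact, int n)
{
    int best_k = -1;
    double best = -1.0;
    for (int k = 1; k < n; k++) {
        double s = split_score(contact, n, k);
        if (s > best) {
            best = s;
            best_k = k;
        }
    }
    return best_k;
}

/* Number of residues v with in_set[v] true that are in contact with
   residue u: the measure of interaction between u and set V. */
int interaction_count(const bool *contact, int n, int u, const bool *in_set)
{
    int count = 0;
    for (int v = 0; v < n; v++) {
        if (in_set[v] && contact[u * n + v]) count++;
    }
    return count;
}
```

The contact matrix can be filled from the same pairs of residue numbers that are written out for the distance map in question 1.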

What to submit

You should submit your solutions via the Fire system before 23:59 on Thursday 14 November 2013.

Everyone should do questions 1 and 2. Those aiming for a higher grade should also attempt question 3.

Test your programs with suitable test data. Upload the following to the Fire system:


Last Modified: 12 November 2013 by Graham Kemp