Computational methods in bioinformatics (2015-2016)

Practical 2

Domain assignment

Aims

Objectives

After this practical you will be able to:

Exercises

In this practical you will implement protein domain assignment programs that are based on DOMAK and STRUDL, but which use a simpler method for deciding whether two amino acid residues are interacting (or in contact) with each other. In these exercises two residues are considered to be interacting if the centres of their alpha-carbon atoms ("CA") are within a threshold distance (e.g. 7 Å or 8 Å). (Just to confirm, you only need to consider the alpha-carbon atoms, and all other atoms can be ignored.)
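
As a minimal sketch of this contact test in C, squared distances can be compared against the squared threshold so that no square root is needed. The `CaAtom` struct and the `in_contact` function are hypothetical names chosen for illustration, and the strict less-than comparison is an assumption about how a distance exactly equal to the threshold is treated:

```c
#include <assert.h>

/* Hypothetical minimal record: only what the contact test needs. */
typedef struct {
    int    resnum;   /* residue number as given in the PDB file */
    double x, y, z;  /* CA coordinates in angstroms */
} CaAtom;

/* Returns 1 if the two CA atoms are closer than `threshold` angstroms,
   otherwise 0. Squared distances are compared, so sqrt() is not needed.
   Assumption: a distance exactly equal to the threshold is not a contact. */
int in_contact(const CaAtom *a, const CaAtom *b, double threshold)
{
    double dx = a->x - b->x;
    double dy = a->y - b->y;
    double dz = a->z - b->z;

    assert(threshold >= 0.0);
    return dx * dx + dy * dy + dz * dz < threshold * threshold;
}
```

Calling `in_contact(&ca[i], &ca[j], 7.0)` for every pair i < j of stored CA atoms would then give the set of interacting residue pairs.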

You are welcome to use any of the programs in directory /chalmers/users/kemp/TDA507/example_programs/reading_pdb_files (also available online) as a starting point.

  1. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and generate a distance map.

    A simple approach is to make use of the "dotplot.tcl" program in directory /chalmers/users/kemp/TDA507/practical2 (also available online). This program reads a file with pairs of numbers (one pair of numbers per line), and plots a point for each pair of numbers.

    You can write a program that reads the coordinates of "CA" atoms in a PDB file and then, for each pair of CA atoms (the CA atom in residue i and the CA atom in residue j), writes out a pair of numbers (i and j) if the distance between the CA atoms is less than a threshold. If you plan to write your program in C, I recommend using the program atom_array.c as a starting point (copy this file to your own filespace, give it a suitable name, e.g. make_distance_map.c, and modify the appropriate lines in a Makefile in the same directory to refer to this new program). Modify the function read_data() so that atom records are only stored in the atom array if the atom name is " CA ". After all CA atoms have been read into the atom array, find all pairs whose separation is less than a threshold distance (e.g. 7 Å or 8 Å).

    Run this program and redirect the output to a file, e.g.

    ./make_distance_map 2CSN.pdb > 2CSN.pairs
    

    The program /chalmers/users/kemp/TDA507/practical2/dotplot.tcl can then be used to plot the points as a distance map. To run this program, type its name and give the name of your file containing pairs of numbers as a command line argument, e.g.

    /chalmers/users/kemp/TDA507/practical2/dotplot.tcl 2CSN.pairs
    
  2. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and identify the residue at which the chain is most clearly partitioned into two parts/domains. Your program should use the simple scoring function that is used in the DOMAK program: (intA/extAB) * (intB/extAB), where intA and intB are the numbers of contacts within parts A and B, and extAB is the number of contacts between A and B.

  3. Write a program that can read a Protein Data Bank file (assume that it contains only one chain, like Protein Data Bank entries 1CDH or 2CSN) and identify the two sets of residues in a good partitioning of the residues using a graph partitioning algorithm (e.g. the Kernighan-Lin algorithm, which is used in STRUDL). When deciding which residue u in set U should be swapped next, use a simple count of the number of residues in set V that interact with u as a measure of the extent of the interaction between residue u and set V.

    The output from your program should consist of two lists of residue numbers.

    Describe any similarities or differences that you observe between the results of the programs in questions 2 and 3.

  4. Some protein chains contain more than two domains. Modify your solution to question 2 or question 3 so that your program finds all domains in a protein chain.

    The output from your program should consist of integer N (the number of domains found) and either:

    Include a comment to explain how your program decides on the value for N.

    Test your program with files 4GAF_B.pdb and 1HZH_H.pdb (online).
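
The two contact counts used in questions 2 and 3 can be sketched in C as follows. This is an illustration under stated assumptions, not a reference solution: the flat `contact` matrix layout, the function names, the convention that part A is residues [0, k) and part B is residues [k, n), and the -1.0 return value when extAB is zero are all hypothetical choices. intA, intB and extAB are taken here to be the numbers of contacts within A, within B, and between A and B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: contact[i * n + j] is nonzero when residues i and j
   (0-based indices into the array of CA atoms) are in contact. */

/* Count contacts within A = [0, k), within B = [k, n), and between A and B.
   Each unordered pair (i, j) with i < j is counted once. */
void split_counts(const unsigned char *contact, int n, int k,
                  int *intA, int *intB, int *extAB)
{
    int i, j;

    assert(n > 0 && k >= 0 && k <= n);
    *intA = *intB = *extAB = 0;
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            if (!contact[(size_t)i * n + j])
                continue;
            if (j < k)
                (*intA)++;      /* both residues in A */
            else if (i >= k)
                (*intB)++;      /* both residues in B */
            else
                (*extAB)++;     /* i in A, j in B */
        }
    }
}

/* Evaluate (intA/extAB) * (intB/extAB) for a split before residue k.
   Assumption of this sketch: returns -1.0 when extAB is zero, so that the
   caller can decide how to treat a split with no contacts across it. */
double split_value(const unsigned char *contact, int n, int k)
{
    int a, b, e;

    split_counts(contact, n, k, &a, &b, &e);
    if (e == 0)
        return -1.0;
    return ((double)a / e) * ((double)b / e);
}

/* Number of residues in set V that are in contact with residue u.
   in_V[j] is nonzero when residue j currently belongs to V. */
int contacts_with_set(const unsigned char *contact, int n, int u,
                      const unsigned char *in_V)
{
    int j, count = 0;

    assert(u >= 0 && u < n);
    for (j = 0; j < n; j++)
        if (j != u && in_V[j] && contact[(size_t)u * n + j])
            count++;
    return count;
}
```

For question 2, `split_value` could be evaluated for every k and the largest value kept; for question 3, `contacts_with_set` gives the simple interaction count between a residue u and set V. How to handle splits that leave very small parts is left open here.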

What to submit

You should submit your solutions via the Fire system before 23:59 on Wednesday 18 November 2015.

Everyone should do questions 1 and 2. Those aiming for a higher grade should also attempt questions 3 and 4. (Note: You can do question 4 even if you don't try question 3.)

Test your programs with suitable test data. Upload the following to the Fire system:

Ensure that your name is included in all files submitted (e.g. in a comment in your source code).


Last Modified: 11 November 2015 by Graham Kemp