Protein Folding: Combinatorial Problems and Algorithms

Sorin Istrail

Sandia National Laboratories
Massively Parallel Computing Research Laboratory
Albuquerque, NM 87185
http://www.cs.sandia.gov/~scistra


Protein folding is a Grand Challenge area of research at the crossroads of biology, chemistry, physics and mathematics/computer science. The central problem is the protein structure prediction which asks to determine from the amino acid sequence of the protein its unique, lowest energy three-dimensional structure. Although with over 20 years of research, only recently the literature included the first rigorous computational complexity results : the first NP-complete results, Ngo and Marks 1992, Frankel 1993, Unger and Molt 1993, for new protein-like models ; the first constant approximation algorithm , Hart and Istrail 1995, for the Dill 1985 model, one of the most studied protein folding moldels; and the first review of the area devoted to the computational complexity of protein folding, Ngo, Marks and Karplus 1994, that attempts to relate the physical process of folding and its informal algorithmic paradoxes to the theory of NP-completeness as a basis for rigorous analysis and insight.

In this talk I will focus mainly on three themes:

(1) What is the significance of NP-complete results and approximation algorithms, and the possibility of being just side-effects of model simplifications ? I will present results from Hart and Istrail 1996 (a & b) showing that both NP-completeness and the existence of feasible approximation algorithms transcend the details of modeling discretizations, such as lattices and potential energy formulas.

(2) What is the trade-off between physical accuracy of the models and their computational tractability ? Apparently, more modeling detail is regarded as an automatic increase in computational complexity. However, in Hart and Istrail 1996c it is shown that explicit modeling of side chains, as a generalization of the Dill model, provides better approximation algorithms: linear time algorithms that have worst-case performance better than 86% of optimal. A notoriously difficult problem in this area is the analysis of off-lattice (unconstrained continuos space) models. We introduce a Tangent Spheres Protein Folding Model that generalizes all lattice Dill models. We show that in this model our side chain folding algorithm has off-lattice worst-case performance better than 70% of optimal.

(3) Can we detect structural similarity in a robust way ? Multiple sequence alignment methods are fundamental for understanding folding principles. As "similarity" is a subjective concept, multiple alignment techniques propose scoring schemes that attempt at revealing in a fair way the shared structure. The famous Arrow Paradox in mathematical economics highlights the impossibility of designing scoring schemes, i.e., voting methods that in a fair way derive social choice from individual values. I will present some results from Istrail and Ravi 1996 that attempt at identifying intrinsic biases and fundamental limitations associated with structural alignment.

In the talk I will argue the obvious: that we need more mathematics and computer science research in protein folding before concluding prematurely, as an article in the October 1996 issue of Scientific American that Protein Folding is one of the problems that are "Confronting Science's Logical Limits."


Joint work on protein folding with Bill Hart (Sandia National Laboratories) and on multiple alignments with R. Ravi (Carnegie Mellon)