Statistical methods for protein sequences

Regression with protein sequence covariates

The R package krm fits regression models with protein sequence covariates through a kernel-based random effects model.


  • Fong, Y.‡, Datta, S.‡, Georgiev, I., Kwong, P., Tomaras, G. (2014) Mutual information kernel logistic models with application in HIV vaccine studies, Biostatistics, in press. (‡ equal contribution)
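
As a rough illustration of what a kernel over protein sequences looks like, the Python sketch below builds a small kernel (similarity) matrix. It uses a simple normalized k-mer spectrum kernel as a stand-in, not the mutual information kernel of the paper above, and none of the function names or the toy sequences come from krm; they are hypothetical and chosen only for this example. In a kernel-based random effects model, a matrix of this kind typically plays the role of the covariance structure of the random effects.

```python
from collections import Counter
import math


def kmer_counts(seq, k=2):
    """Count the overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Normalized k-mer spectrum kernel (cosine similarity of k-mer counts).

    This is a stand-in kernel for illustration, not the mutual
    information kernel used in the cited paper.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(ca[m] * cb[m] for m in ca)  # missing k-mers count as 0
    norm = math.sqrt(sum(v * v for v in ca.values())
                     * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def kernel_matrix(seqs, k=2):
    """Pairwise kernel matrix for a list of sequences."""
    return [[spectrum_kernel(s, t, k) for t in seqs] for s in seqs]


# Made-up toy sequences, for illustration only.
seqs = ["MKTAYIAK", "MKTAYLAK", "GSHMLEDP"]
K = kernel_matrix(seqs)
for row in K:
    print(" ".join(f"{x:.3f}" for x in row))
```

Sequences that share many k-mers get a kernel value near 1, and sequences with no k-mers in common get 0, so the matrix encodes how similar each pair of subjects is at the sequence level.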



Clustering protein sequences into subfamilies

rBHP is a general clustering/mixture modeling algorithm based on randomized Bottom-up Hierarchical clustering Pruned (rBHP) splits.

cHMM is a mixture-model-based clustering method for identifying protein subfamilies; it uses rBHP as part of its inference machinery.

Brief Guide

  • Type cbclust.exe with no arguments to see the help message
  • To run cHMM, use "cbclust.exe -m ProteinSequence ..."