Saturday, June 27, 2015

Hierarchical clustering of a pairwise matrix of precomputed distances

I have a pairwise distance dataframe that I've made with pandas:

# Get files
import glob
import itertools

import pandas

# get_rmsd(pdb_1, pdb_2) is defined elsewhere (not shown here)
one_dimension = glob.glob('*.pdb')

# Build one row per unordered pair of files
rows = []
for pdb_1, pdb_2 in itertools.combinations(one_dimension, 2):
    entry = {'pdb_1': pdb_1, 'pdb_2': pdb_2, 'rmsd': get_rmsd(pdb_1, pdb_2)}
    rows.append(entry)

dataframe = pandas.DataFrame(rows)
dataframe

All I want to do is cluster the dataframe so that every cluster contains only pdbs whose pairwise RMSD is below some cutoff (let's say less than 2). I have read that complete linkage is the way to go.

For instance:

  1. pdb_1, pdb_2 have an rmsd of 1.56
  2. pdb_3, pdb_2 have an rmsd of 1.03
  3. pdb_3, pdb_1 have an rmsd of 1.60

So they can all appear in a cluster together. But if a new pdb is a candidate for the cluster and its rmsd to any member already in the cluster is > 2, it will be rejected.

I understand that this is a complete linkage with a cutoff.
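To make the idea concrete, here is a minimal sketch of complete linkage with a distance cutoff using scipy. The matrix reuses the three rmsd values from the example, assuming 1.60 is the pdb_1/pdb_3 distance. pdb_4 and its distances are made up for illustration. Note that fcluster with criterion="distance" keeps merges at or below the threshold, so a strict "less than 2" may need a slightly smaller t.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

names = ["pdb_1", "pdb_2", "pdb_3", "pdb_4"]

# Symmetric rmsd matrix with a zero diagonal.
# The pdb_4 row/column is hypothetical, added so one structure falls outside the cutoff.
rmsd = np.array([
    [0.00, 1.56, 1.60, 3.10],
    [1.56, 0.00, 1.03, 2.80],
    [1.60, 1.03, 0.00, 1.90],
    [3.10, 2.80, 1.90, 0.00],
])

# linkage expects the condensed (upper-triangle) form of the matrix
condensed = squareform(rmsd)

# Complete linkage: the distance between two clusters is their largest pairwise distance
Z = linkage(condensed, method="complete")

# Cut the tree so that no cluster was merged at a height above 2.0
labels = fcluster(Z, t=2.0, criterion="distance")

for name, label in zip(names, labels):
    print(name, label)
```

With these numbers pdb_1, pdb_2 and pdb_3 share a label, while pdb_4 is left on its own because it is more than 2 away from pdb_1 and pdb_2, even though it is within 2 of pdb_3.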

I have looked into scipy.cluster.hierarchy.linkage, but I'm having an extremely hard time getting my data into the array format that linkage expects.

  • What is the best way to complete this task?

  • How do I go from my dataframe to something usable by scipy.cluster?

  • Should I turn it into an R dataframe?

  • How do I find out which members are in each cluster if I transform the pairwise distances into an array?
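One possible route for the dataframe-to-scipy step and the membership lookup, sketched under assumptions: the dataframe has the columns pdb_1, pdb_2 and rmsd as built above, and the file names and rmsd values below are made up so the example runs without real pdb files. The idea is to build a labelled square matrix, convert it to condensed form, and then zip the cluster labels back onto the file names.

```python
import itertools

import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in for the real dataframe (file names and values are invented)
fake_rmsd = {
    ("a.pdb", "b.pdb"): 1.56, ("a.pdb", "c.pdb"): 1.60, ("a.pdb", "d.pdb"): 3.10,
    ("b.pdb", "c.pdb"): 1.03, ("b.pdb", "d.pdb"): 2.80, ("c.pdb", "d.pdb"): 1.90,
}
files = ["a.pdb", "b.pdb", "c.pdb", "d.pdb"]
df = pd.DataFrame(
    [{"pdb_1": p, "pdb_2": q, "rmsd": fake_rmsd[(p, q)]}
     for p, q in itertools.combinations(files, 2)]
)

# Build a labelled, symmetric square matrix from the long-form pairs
names = sorted(set(df["pdb_1"]) | set(df["pdb_2"]))
square = pd.DataFrame(0.0, index=names, columns=names)
for p, q, d in df.itertuples(index=False):
    square.loc[p, q] = d
    square.loc[q, p] = d

# Condensed form -> complete linkage -> flat clusters at the cutoff
condensed = squareform(square.to_numpy())
Z = linkage(condensed, method="complete")
cluster_ids = fcluster(Z, t=2.0, criterion="distance")

# cluster_ids[i] belongs to names[i], so membership is a simple grouping
members = {}
for name, cid in zip(names, cluster_ids):
    members.setdefault(int(cid), []).append(name)

print(members)
```

Because the labels come back in the same order as the rows of the square matrix, keeping the names list alongside the array is enough to recover which pdb files ended up in which cluster.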

I have found this, this, and this question to be similar, and I also found this tutorial.
