Friday, April 25, 2014

Big Data (Alone) Won't Save You


One of the pitfalls of the big data "revolution" came back to haunt me in one of my current research projects.  I'm trying to develop new approaches for identifying insect chemosensory receptors.  Insects use chemosensory receptors to find food (e.g., mosquitoes searching for human hosts) and mates, and even to avoid insecticides.  Insect chemosensory receptors tend to vary quite a bit from species to species, depending on lifestyle -- ants have about 350 olfactory receptors, while the body louse has only about 10.

As part of my work, I'm evaluating standard prediction approaches such as Hidden Markov Models (HMMs).  Without going into gory details, HMMs are built from an alignment of training sequences.  The resulting HMMs can then be run on unclassified sequences to estimate the probability that each query sequence matches the training sequences.  The quality of the HMM results is highly dependent on the quality and similarity of the training sequences, as I quickly discovered.
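
For readers who haven't worked with profile HMMs, the mechanics look roughly like the sketch below, using the HMMER suite's hmmbuild and hmmsearch tools.  The file names are placeholders, not my actual data.

```python
# A minimal sketch of the profile-HMM workflow: build a model from an
# alignment of training sequences, then search it against unclassified
# sequences.  File names are placeholders.
import subprocess

# Build a profile HMM from a multiple sequence alignment of training ORs.
subprocess.run(["hmmbuild", "or_model.hmm", "or_training_alignment.sto"], check=True)

# Search the profile against unclassified proteins; --tblout writes a
# per-sequence table of hits with E-values and bit scores.
subprocess.run(
    ["hmmsearch", "--tblout", "or_hits.tbl", "or_model.hmm", "proteome.fasta"],
    check=True,
)
```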

I compared the performance of four HMMs on olfactory receptors (ORs) and gustatory receptors (GRs) from 15 species (3 mosquitoes and 12 flies).  I had 251,890 total sequences, of which 1,149 are thought to be ORs and 921 are thought to be GRs.  The first HMM was downloaded from Pfam, a database of protein families, and was trained on both ORs and GRs.  I trained a second HMM on 930 ORs given to me by a post-doc.  The Pfam HMM and the first OR HMM were then used to identify GRs and additional ORs in the dataset that we had missed the first time, yielding about 200 additional ORs and the 921 GRs.  I then trained two more HMMs, one on the expanded set of 1,149 ORs and one on all 921 GRs.  Afterwards, I ran all of the HMMs against the proteomes and compared their sensitivity and accuracy.
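
The comparison itself is mostly bookkeeping: take each HMM's hit list and check it against the curated OR and GR sets.  A simplified sketch of that step is below; the file names and the curated ID list are placeholders, and the real analysis also weighs scores and E-values.

```python
# Rough sketch of scoring one HMM's hits against a curated list of known ORs.
# File names are placeholders.

def read_tblout_hits(path):
    """Return the set of target sequence names from an hmmsearch --tblout file."""
    hits = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            hits.add(line.split()[0])  # first column is the target sequence name
    return hits

known_ors = {line.strip() for line in open("curated_or_ids.txt") if line.strip()}
hits = read_tblout_hits("or_hits.tbl")

true_positives = hits & known_ors
false_positives = hits - known_ors

sensitivity = len(true_positives) / len(known_ors)
print(f"Sensitivity: {sensitivity:.3f} ({len(true_positives)}/{len(known_ors)})")
print(f"Sequences pulled in that aren't curated ORs: {len(false_positives)}")
```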

Surprisingly, the HMM trained on the final list of ORs did WORSE than the original HMM!  The original OR HMM found 1,126 of the 1,149 ORs while filtering out the GRs and other sequences.  The final OR HMM identified 15 fewer ORs and many more GRs and other sequences, resulting in a higher false positive rate.  In this case, more data did not result in better performance -- the quality of the training data proved to be much more important than the quantity.

Now, I need to go back and find a way to distinguish between "good" and "bad" training sequences.  I have a few empirical approaches in mind, but my most valuable asset will be consultations with domain experts.  I'll need to repeat the process on another dataset in the future, so it's more important to find out why certain sequences are "bad" than it is to identify the bad sequences in this dataset.
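
To make that concrete, one simple empirical check I might try -- purely an illustration, with a made-up cutoff and placeholder file names -- is to search the training sequences against the profile built from the full training set and flag the lowest-scoring ones for expert review.

```python
# Illustration only: flag training sequences that score poorly against the
# profile built from the whole training set, as candidates for expert review.
# Assumes the training set was already searched with hmmsearch --tblout;
# column 6 of that table is the full-sequence bit score.

def read_scores(tblout_path):
    """Map target sequence name -> full-sequence bit score from --tblout output."""
    scores = {}
    with open(tblout_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.split()
            scores[fields[0]] = float(fields[5])
    return scores

scores = read_scores("or_training_selfsearch.tbl")
cutoff = sorted(scores.values())[len(scores) // 10]  # roughly the bottom 10%
suspects = [name for name, score in scores.items() if score <= cutoff]
print(f"{len(suspects)} training sequences scored at or below {cutoff:.1f} bits")
```

Training sequences that don't show up in the table at all (no reported hit) would be the most suspicious of the bunch, and those are exactly the cases I'd want a domain expert to look at.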

I learned a simple lesson: big data won't save you from the blind application of machine learning approaches.