[FRIAM] More on levels of sequence organization

Marcus Daniels marcus at snoutfarm.com
Thu May 2 15:24:53 EDT 2019


I can imagine Facebook friends sharing their Ancestry.com data.  Facebook compiles all that and sells services to insurance companies so that they can anticipate risk.
There’s no bound on the stupidity of Facebook users.

From: Friam <friam-bounces at redfish.com> on behalf of Roger Critchlow <rec at elf.org>
Reply-To: The Friday Morning Applied Complexity Coffee Group <friam at redfish.com>
Date: Thursday, May 2, 2019 at 1:02 PM
To: The Friday Morning Applied Complexity Coffee Group <friam at redfish.com>
Subject: Re: [FRIAM] More on levels of sequence organization

I did have some energy and it was a pretty entertaining read.

So 7/8ths of the authors for this paper are at Facebook's AI group, though one gives an email address @gmail.com.  The group that won the CASP13 (Critical Assessment of Structure Prediction) competition in December was from Google/DeepMind, as memorialized by https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/.  The DeepMind model, called AlphaFold, was supervised learning of 3D structure coordinates from amino acid sequences.  DeepMind has yet to publish a paper detailing the methods used by AlphaFold.

This model uses unsupervised learning to predict a missing amino acid given the rest of the sequence, so you plug in a new protein sequence of N amino acids and it spits out an amino acid probability distribution for each of the N positions, an N*25 dimensional vector that represents everything it learned from the training set.  They report a series of tests that appear to support their claims, and there doesn't appear to be any major cherry picking or data censoring involved in the tests.  I'm not sure how they're encoding 25 amino acids, since Wikipedia is pretty sure that 22 is all there are in proteins.
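
To make the shape of that output concrete, here is a rough sketch of masked prediction, not the paper's model.  Everything in it is an assumption for illustration: the 25-symbol alphabet is only a guess (20 standard residues plus U, O, B, Z, X), the function name is made up, and simple smoothed residue counts stand in for the trained network.

```python
from collections import Counter

# Hypothetical 25-symbol alphabet: 20 standard residues plus U, O, B, Z, X.
# This is a guess for illustration, not the paper's confirmed vocabulary.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYUOBZX"


def masked_distributions(sequence):
    """Return one probability distribution per position (N rows of 25)."""
    rows = []
    for i in range(len(sequence)):
        # Hide position i and look only at the rest of the sequence.
        context = sequence[:i] + sequence[i + 1:]
        counts = Counter(context)
        # Add-one smoothed counts stand in for the learned network.
        weights = [counts[a] + 1 for a in ALPHABET]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


# Made-up 10-residue sequence, just to show the output shape.
probs = masked_distributions("MKTAYIAKQR")
print(len(probs), len(probs[0]))  # prints "10 25"
```

The point is only the bookkeeping: a sequence of N residues yields N rows of 25 probabilities each, which flattens to the N*25 vector.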

But they don't actually extract the levels of organization from the model.  They take the levels of organization as known facts and construct observations of the model that make predictions consistent with the levels.  So if there are levels of organization as yet unidentified, they are at least as obscure in the model as they are in reality.   And to claim that the levels of organization emerge from the model sort of ignores how much work went into constructing the observations.

On the other hand, one might be surprised that all these levels are implicit in the amino acid sequences, but life knew that already, that's why it only remembers the sequences.

The most complex model they fit learned 700 million parameters, and it wasn't overfit, so they're presumably gearing up to fit a series of bigger models to that exponentially growing database of known protein sequences.   AlphaFold, meanwhile, is stuck working with the more slowly growing database of known protein 3D structures.

-- rec --

On Tue, Apr 30, 2019 at 9:40 PM Marcus Daniels <marcus at snoutfarm.com> wrote:
Cool!

“For synthetic biology, iteratively querying a model of the mutational fitness landscape could help efficiently guide the introduction of mutations to enhance protein function (Romero & Arnold, 2009), inform protein design using a combination of activating mutants (Hu et al., 2018), and make rational substitutions to optimize protein properties such as substrate specificity (Packer et al., 2017), stability (Tan et al., 2014), and binding (Ricatti et al., 2019).”

Get a few billion people to get full genome sequencing, and let the TPUs discover how we work!    Everyone gets a custom cocktail to improve stamina, fight off cancer, etc. etc.

Marcus

From: Friam <friam-bounces at redfish.com> on behalf of Roger Critchlow <rec at elf.org>
Reply-To: The Friday Morning Applied Complexity Coffee Group <Friam at redfish.com>
Date: Tuesday, April 30, 2019 at 8:49 PM
To: The Friday Morning Applied Complexity Coffee Group <Friam at redfish.com>
Subject: [FRIAM] More on levels of sequence organization

This just turned up on hacker news:

   https://www.biorxiv.org/content/10.1101/622803v1

[...] To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. [...]

Don't know if I have the energy to plow through the text.

-- rec --
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com
archives back to 2003: http://friam.471366.n2.nabble.com/
FRIAM-COMIC http://friam-comic.blogspot.com/ by Dr. Strangelove