[FRIAM] More on levels of sequence organization

Thu May 2 15:02:21 EDT 2019

I did have some energy and it was a pretty entertaining read.

So 7/8ths of the authors for this paper are at Facebook's AI group, though
one gives an email address @gmail.com.  The group that won the CASP13
(Critical Assessment of protein Structure Prediction) competition in December was
from Google/DeepMind, as memorialized by
https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/.
The DeepMind model, called AlphaFold, used supervised learning to predict
3D structure coordinates from amino acid sequences.  DeepMind has yet to
publish a paper detailing the methods used by AlphaFold.

This model uses unsupervised learning to predict a missing amino acid given
the rest of the sequence, so you plug in a new protein sequence of N amino
acids and it spits out an amino acid probability distribution for each of
the N positions, an N*25 dimensional vector that represents everything it
learned from the training set.  They report a series of tests that appear
to support their claims, and there doesn't appear to be any major cherry
picking or data censoring involved in the tests.  I'm not sure how they're
encoding 25 amino acids, since Wikipedia is pretty sure that 22 is all
there are in proteins.
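
To make the shape of that output concrete, here is a minimal sketch in
Python.  It is not the paper's model: toy_masked_model is a hypothetical
stand-in that returns random scores, and masking one position at a time is
my assumption about how the per-position distributions could be gathered.
The 25-symbol alphabet is also a guess (the 20 standard amino acids, plus U
and O, plus the ambiguity codes B, Z and X); this thread doesn't confirm how
the authors arrive at 25.

```python
import math
import random

# Hypothetical 25-symbol alphabet (an assumption, not taken from the paper):
# 20 standard amino acids, U (selenocysteine), O (pyrrolysine),
# and the ambiguity codes B, Z, X.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "UO" + "BZX"


def softmax(logits):
    """Turn a list of raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_masked_model(sequence, position, rng):
    """Hypothetical stand-in for a trained network.

    A real model would score each symbol for the masked position from the
    surrounding context; this toy ignores the context and returns random
    scores, so only the output shape is meaningful.
    """
    assert 0 <= position < len(sequence)
    return [rng.gauss(0.0, 1.0) for _ in ALPHABET]


def predict_all_positions(sequence, seed=0):
    """Return one distribution over ALPHABET per position: N rows of 25."""
    rng = random.Random(seed)
    return [softmax(toy_masked_model(sequence, i, rng))
            for i in range(len(sequence))]


probs = predict_all_positions("MKTAYIAKQR")
print(len(probs), len(probs[0]))  # prints: 10 25
```

Each of the N rows sums to 1, so the N*25 numbers are N separate
distributions rather than one big one.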

But they don't actually extract the levels of organization from the model.
They take the levels of organization as known facts and construct
observations of the model that make predictions consistent with the
levels.  So if there are levels of organization as yet unidentified, they
are at least as obscure in the model as they are in reality.  And to claim
that the levels of organization emerge from the model sort of ignores how
much work went into constructing the observations.

On the other hand, one might be surprised that all these levels are
implicit in the amino acid sequences, but life knew that already; that's
why it only remembers the sequences.

The most complex model they fit learned 700 million parameters, and it
wasn't overfit, so they're presumably gearing up to fit a series of bigger
models to that exponentially growing database of known protein sequences.
AlphaFold, meanwhile, is stuck working with the more slowly growing
database of known protein 3D structures.

-- rec --

On Tue, Apr 30, 2019 at 9:40 PM Marcus Daniels <marcus at snoutfarm.com> wrote:

> Cool!
>
>
>
> “For synthetic biology, iteratively querying a model of the mutational
> fitness landscape could help efficiently guide the introduction of
> mutations to enhance protein function (Romero & Arnold, 2009), inform
> protein design using a combination of activating mutants (Hu et al., 2018),
> and make rational substitutions to optimize protein properties such as
> substrate specificity (Packer et al., 2017), stability (Tan et al., 2014),
> and binding (Ricatti et al., 2019).”
>
>
>
> Get a few billion people to get full genome sequencing, and let the TPUs
> discover how we work!    Everyone gets a custom cocktail to improve
> stamina, fight off cancer, etc. etc.
>
>
>
> Marcus
>
>
>
> *From: *Friam <friam-bounces at redfish.com> on behalf of Roger Critchlow <
> rec at elf.org>
> *Reply-To: *The Friday Morning Applied Complexity Coffee Group <
> Friam at redfish.com>
> *Date: *Tuesday, April 30, 2019 at 8:49 PM
> *To: *The Friday Morning Applied Complexity Coffee Group <
> Friam at redfish.com>
> *Subject: *[FRIAM] More on levels of sequence organization
>
>
>
> This just turned up on hacker news:
>
>
>
>    https://www.biorxiv.org/content/10.1101/622803v1
>
>
>
> [...] To this end we use unsupervised learning to train a deep contextual
> language model on 86 billion amino acids across 250 million sequences
> spanning evolutionary diversity. The resulting model maps raw sequences to
> representations of biological properties without labels or prior domain
> knowledge. The learned representation space organizes sequences at multiple
> levels of biological granularity from the biochemical to proteomic levels.
> [...]
>
>
>
> Don't know if I have the energy to plow through the text.
>
>
>
> -- rec --
> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com
> archives back to 2003: http://friam.471366.n2.nabble.com/
> FRIAM-COMIC http://friam-comic.blogspot.com/ by Dr. Strangelove
>