[FRIAM] Grokking Mechanistic Interpretability
Jon Zingale
jonzingale at gmail.com
Wed Aug 14 23:19:03 EDT 2024
I wouldn't be surprised if this was already discussed here, but I found
a pretty great interview with DeepMind's Neel Nanda on his mechanistic
interpretability research. Mechanistic interpretability is a
reverse-engineering approach to understanding what transformer models are
conceptually doing. I figured it would be a good thing to add here because
he describes, at a fairly high level, what researchers in the field are
actually doing to study things like LLMs.
I have the video posted below queued to a discussion of a phenomenon he
calls grokking. It reminds me of Nick's description of Tolman's rat-maze
research and latent learning, but with some additional twists. Neel
describes three phases: memorization, circuit formation, and then
generalization. What looks like sudden generalization turns out, on closer
inspection, to be a gradual and systematic transition to generalization
followed by a sudden clean-up of the parameters. Unfortunately I don't know
the history of learning theory, so I can't comment on whether there is
anything truly new here. Interesting stuff.
https://www.youtube.com/watch?v=_Ygf0GnlwmY&t=1945s
For those who prefer an article:
https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking
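
If it helps to make the phenomenon concrete, here is a minimal toy sketch
(mine, not Nanda's code) of the kind of setup where grokking is usually
demonstrated: a small network trained on modular addition with heavy weight
decay, where training accuracy saturates early while test accuracy only jumps
much later. The hyperparameters are illustrative guesses, not anything from
the interview.

# Toy grokking sketch: small network on a + b (mod P) with weight decay.
# Illustrative only; settings are guesses, not Nanda's exact experiment.
import torch
import torch.nn as nn

P = 97
torch.manual_seed(0)

# All (a, b) pairs and their sums mod P; a small training fraction
# encourages the memorize-first, generalize-later behavior.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

class ModAddNet(nn.Module):
    """Embed each operand, concatenate, classify the sum mod P."""
    def __init__(self, d=128):
        super().__init__()
        self.embed = nn.Embedding(P, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(),
                                 nn.Linear(256, P))
    def forward(self, ab):
        e = self.embed(ab)            # (batch, 2, d)
        return self.mlp(e.flatten(1)) # (batch, P) logits

model = ModAddNet()
# Weight decay is the ingredient usually credited with the late
# "clean-up" of parameters that Neel describes.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        preds = model(pairs[idx]).argmax(dim=1)
        return (preds == labels[idx]).float().mean().item()

for step in range(20_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")

Watching the printed train and test accuracies drift apart and then snap
together is roughly the memorization-then-generalization curve he describes
in the talk.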