[FRIAM] Big data forensics

David Eric Smith desmith at santafe.edu
Thu Jun 24 17:08:40 EDT 2021


I agree about gumshoes, Glen, but I think maybe the line in this that draws my interest is a sense that there is an enormous gap in scale that requires some other kind of design to fill it.  It would be of a piece with things that bugged me in past transitions.

I think for me it started the first time when public-key cryptography came out and Phil Zimmermann made the assertion “the Web of Trust will replace central authenticators”.  (No.)  I spent years trying to make a kind of sum-over-paths formulation of how confidence degrades in a web of trust without some kind of organizational algorithm, and of how parallelism could nonetheless be used to draw a more robust confidence from path combinatorics than from single central entities.  (My thinking at the time was completely naive, but I won’t digress further here.)  That veered off into every possible conceptual obscurity, including what one means by “identity”, how practices like good key security should be taken to relate to judgment in public-key endorsements, how to use real-world proof-of-stake to make costly signals, and yada yada yada.  When PageRank came out, and I tried to make a pitch to Pierre Omidyar on the design of insurance schemes surrounding proof-of-value-at-risk and some kind of network feedback, in a conversation on the SFI patio on a sunny, lovely afternoon, Pierre was kind enough not to say flatly “You don’t understand _anything_ about how companies work”, a learning curve he himself was still steep on at the time.  But that never came to anything, except that the conceptual questions interested Martin Shubik enough that they began our collaboration (within which that was never a research question).
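(A toy sketch of the flavor of that path-combinatorics idea, in Python, with invented edge confidences and the naive assumption that endorsement paths are independent; it is only the arithmetic, not any real design.)

# Toy sketch: confidence in a key reached over a web of trust.
# Assumes each endorsement edge carries an independent confidence in (0, 1],
# that confidence along a path multiplies (so it degrades with path length),
# and that edge-disjoint paths combine as independent pieces of evidence.
# All numbers are invented.

def path_confidence(path_edges):
    """Confidence contributed by a single endorsement path."""
    conf = 1.0
    for edge_conf in path_edges:
        conf *= edge_conf
    return conf

def combined_confidence(paths):
    """Combine edge-disjoint paths as independent evidence: 1 - prod(1 - c_i)."""
    p_all_fail = 1.0
    for path in paths:
        p_all_fail *= 1.0 - path_confidence(path)
    return 1.0 - p_all_fail

# Three invented paths from me to a target key, each a list of edge confidences.
paths = [
    [0.9, 0.8],        # short path through a careful endorser
    [0.7, 0.7, 0.6],   # longer, weaker path
    [0.95],            # direct endorsement
]
print(combined_confidence(paths))   # higher than any single path alone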

When the arXiv first came out and said “I will replace journals”, I thought that, once again, there is a bottleneck of human time and attention, and that simply flooding everyone with everything all the time, while it will open new opportunities that gatekeepers keep closed off, will cause random sampling error to replace systemic bias as the main cause of really suboptimal solutions.  It’s not that the need to take responsibility for understanding something oneself can ever be got around — that is a fundamental constraint — only that the concept of a “fiduciary” proposes that a good vetting and recommendation design can let you do the least badly possible at balancing awareness and understanding, given the techniques of the time and your own preferences for how to split the two.  We are still stuck with journals because, bad as they are, we haven’t really designed and established alternatives that are enough better to displace the journals’ role.  F1000 and others were efforts in this direction, but they remain distantly on the margins.

Then for languages, I spent years fighting the linguists, who wanted to extract single, microscopic, very strong features of language that they could analyze to death, but which remain totally silent about almost all questions of interest, because the strong signatures are few and the questions many.  I wanted probability methods to get distributional evidence about weak and distributed, but numerous and reinforcing, patterns of concordance in language.  That was the easiest idea, because we already know how to do it and it was just a matter of fighting a reactionary culture.  I think the change of generations alone is already well on the way toward winning that battle, quite independently of any tiny contribution (if any at all) that we made.
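(Just to indicate the flavor in code: many individually weak, inconclusive features of concordance accumulated into one log-likelihood ratio.  The features and probabilities below are invented purely for illustration; only the accumulation is the point.)

import math

# Sketch: accumulate weak, individually inconclusive features of concordance
# between two languages into a log-likelihood ratio for "related" vs "unrelated".
# Each tuple is (feature observed?, P(feature | related), P(feature | unrelated)).
features = [
    (True,  0.30, 0.20),   # shared sound-correspondence pattern (invented)
    (False, 0.25, 0.22),   # shared grammatical marker, absent here (invented)
    (True,  0.15, 0.05),   # concordant basic-vocabulary item (invented)
    (True,  0.40, 0.35),   # weakly informative typological trait (invented)
    # ... in practice, hundreds of such weak features
]

log_lr = 0.0
for observed, p_rel, p_unrel in features:
    if observed:
        log_lr += math.log(p_rel / p_unrel)
    else:
        log_lr += math.log((1 - p_rel) / (1 - p_unrel))

print(f"accumulated log-likelihood ratio: {log_lr:.2f}")
# No single term settles anything, but many small terms can add up to strong
# distributional evidence (granting a rough independence assumption).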

For this one (the genomic forensics), I feel like we know who many of the organized actors are.  Governments will try to “control the narrative”, to the extent that they perceive doing so to be in their interest, and to the extent that the norms and institutions of the society give them cover in doing it.  Governments that are both authoritarian and dependent on promulgating an ideology are probably the most committed to doing this comprehensively.  There are mid-level skirmishes, like between the US and Wikileaks, but I think because of the new horizons in big computing, together with being able to seal off borders, China is pioneering a new frontier in this, which could be a “more is different” moment.  I can’t think of a counterpart to them anywhere else in the world just now.  The Russian model is quite different (I like things that Masha Gessen and Garry Kasparov say about that approach, granting that each of them has a POV); I have wondered how much confidence to attach to public health data coming out of Vietnam, which is by many measures kind of an okay functioning society, but in which any building or billboard made of durable materials is still plastered with official slogans and propaganda.  (That last one is not a case about which I know almost anything, so my cautions there are nearly empty.)

Against these actors, we have other big actors, like intelligence agencies, and that is probably okay to produce some balance of power.  But they are all monoliths.

The few cases where we have interesting data for the viral question, from Yuri Deigin’s collaborators and now Jesse, are tiny data points acquired at large personal cost in time and effort, guided by insight about particular questions.  I do feel like early sequence data are particularly high-value, because with what we currently can estimate about mutation rates, we could plug sequence-diversity data into back-of-the-envelope epidemiological models and try to get a sense of how much circulation there was in any community at any time, and try to back out timelines for founder infections, sort of like the LANL group did for HIV in the 1990s (?).  (That was Bette Korber, Tanmoy Bhattacharya, Alan Perelson, and their cohort, plus I am sure other groups that I don’t know.)
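(To make the back-of-the-envelope concrete: something like the following, where the mutation rate, genome length, diversity figure, and doubling time are placeholders to show the arithmetic, not estimates for this outbreak.)

# Back-of-the-envelope: from observed diversity among early sampled genomes,
# roughly how long had the virus been circulating before sampling?
# All numbers below are placeholders, not measurements.
GENOME_LENGTH = 30_000        # sites, roughly coronavirus scale
MUT_RATE = 1e-3               # substitutions per site per year (assumed)
mean_pairwise_diffs = 6.0     # mean pairwise differences among early genomes (placeholder)

# In a simple star-like picture, two lineages descending from a common founder
# accumulate differences at ~2 * rate * genome_length per year, so:
years_since_founder = mean_pairwise_diffs / (2 * MUT_RATE * GENOME_LENGTH)
print(f"~{years_since_founder * 12:.1f} months since the founder infection")

# A crude growth model then hints at how much circulation that implies
# (doubling time assumed, again purely illustrative):
doubling_time_days = 7.0
cases = 2 ** (years_since_founder * 365 / doubling_time_days)
print(f"~{cases:.0f} infections by sampling time, if growth went unchecked")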

Yet we must be swimming in genome data of an incidental nature, like stray reads that end up in repositories one would only look in by accident.  I continue to wonder whether there is some “design” of a sieve that could automate or crowdsource some of these questions, so there could be a “public option” to go alongside the NSA/CIA vs. governments dyad.  Viral genomics seems to be a problem whose structure is well-matched to distributed, public-data surveillance.
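(The sieve I imagine is not exotic.  A real pipeline would walk public repositories like the SRA and use proper tools, but the core matching step can be as simple as flagging reads that share enough k-mers with a viral reference; a toy version, with invented sequences:)

K = 21  # k-mer length typical of sketching/screening tools

def kmers(seq, k=K):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_reads(reference, reads, min_shared=3):
    """Return (read_id, shared_kmer_count) for reads matching the reference."""
    ref_kmers = kmers(reference)
    hits = []
    for read_id, read in reads:
        shared = len(kmers(read) & ref_kmers)
        if shared >= min_shared:
            hits.append((read_id, shared))
    return hits

# Invented example data: a fake "viral reference" and two stray reads.
reference = "ATGCGTACGTTAGCATCGATCGTACGATCGGATCCTAGCTAGGCTAACGT" * 10
reads = [
    ("read_001", reference[100:250]),                              # clearly viral
    ("read_002", "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCC"),  # unrelated
]
print(screen_reads(reference, reads))   # only read_001 gets flagged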

I worry that, as more of these discoveries come out, the government intrusion, micromanagement, and punitiveness toward academic and institute researchers in China are going to become just miserable.  Even if it was a wild outbreak, the essentially adversarial stance the CCP takes toward the rest of the world would cause them to suppress information, because they don’t trust the rest of the world not to draw motivated conclusions for the sake of working against them (and that is not an unreasonable fear; it’s where Miranda rights come from).  So to the degree that an accurate and reasonably confident story could be put together for this problem, perhaps it would shorten the time over which this particular pain gets drawn out.  I would also like to think it could contribute to a sense that there are limits to what countries can expect to get away with.

Anyway, sorry, rambling.

Eric


> On Jun 24, 2021, at 11:07 PM, uǝlƃ ☤>$ <gepropella at gmail.com> wrote:
> 
> It's a wonderful example of careful science. I only have 1 criticism. "There is no plausible scientific reason for the deletion: the sequences are perfectly concordant with the samples described in Wang et al.(2020a,b), there are no corrections to the paper, the paper states human subjects approval was obtained, and the sequencing shows no evidence of plasmid or sample-to-sample contamination."
> 
> There's never *any* scientific reason to delete anything. So, the 1st clause in the sentence is *merely* an attempt to rouse the rabble. 8^D Otherwise known as "trolling". But buried under all the excellent, and excellently hygienic, sentences in the paper, it makes that trawl more poignant and well done.
> 
> Writ large, though, the phrase "systematic forensis" seems like a paradox. The approach I take, inspired by systems engineering, is to *log* absolutely everything, under version control, persistently. Rather than being a part of systematic forensis, it *facilitates* forensis. In light of our conversation on the myth of the objective, forensics imputes causality into a mesh of events ... hunts down *the* criminal, *the* offending "$ shred -u" command. Nothing brings that to the public forum quite like the gumshoe's pavement-pounding response to her *hunch*.
> 
> It doesn't sound quite right to talk of systematic forensics. It sounds more right to say systematic bookkeeping for the sake of more publicizing to the forum.
> 
> On 6/23/21 9:42 PM, David Eric Smith wrote:
>> Speaking of big data forensics (which no-one was):
>> https://www.biorxiv.org/content/10.1101/2021.06.18.449051v1.full.pdf
>> 
>> [...]
>> I post because (apart from general interest), in the last paragraph of his introduction, he makes a call for data forensics to be done more systematically.
> 
> 
> -- 
> ☤>$ uǝlƃ
> 
> - .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
> FRIAM Applied Complexity Group listserv
> Zoom Fridays 9:30a-12p Mtn GMT-6  bit.ly/virtualfriam
> un/subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives: http://friam.471366.n2.nabble.com/



