[FRIAM] Big data forensics

Fri Jun 25 14:15:42 EDT 2021

If it is important for a checklist, it is important enough to automate.

-----Original Message-----
From: Friam <friam-bounces at redfish.com> On Behalf Of u?l? ?>$
Sent: Friday, June 25, 2021 11:05 AM
To: friam at redfish.com
Subject: Re: [FRIAM] Big data forensics

I know, right? That's how we all think ... for the most part. But that's not how we *should* be thinking. That's not how Special Agent Smith thinks when he's reading all my emails, some of which are about how much I hate cottage cheese or my trypophobia, and only a tiny few are about how I might be accessing paywalled articles without an accompanying credit card transaction to the journal.

We're all so egocentric and conceited as to think that what we *think* is important enough to log in the repository is *actually* important. It's like when my cat watches my feet instead of my eyes when she's worried about whether I'll step on her tail ... or when you're doing something repetitively that leads to some stress disorder. It's the things you're *not* paying attention to that are important.

On 6/25/21 10:49 AM, Marcus Daniels wrote:
> There are a small number of occasions I can remember that I wish I could remember how a situation arose.   The world goes on roaring around me and mostly I don't have any influence over it.   Nothing is held constant, really.  To remember or communicate the things that work they have to be reproducible and have some story behind them about why they ought to work, or a set of tried and true practices where they have worked.   Having all the .history and keystrokes of everything I have ever typed is not particularly informative about that.   If it is important enough to track, I'll put it in revision control.  The rest is noise and idiosyncrasies of consciousness.   I could easily imagine someone not putting a set of sequences into the SRA because they were not comfortable with the provenance, because they didn't run the equipment themselves, or something like that.   That they just didn't want to pollute the public databases.
> 
> -----Original Message-----
> From: Friam <friam-bounces at redfish.com> On Behalf Of glen ep ropella
> Sent: Friday, June 25, 2021 8:41 AM
> To: friam at redfish.com
> Subject: Re: [FRIAM] Big data forensics
> 
> The NIH launched a project that, I think, lobs something at your target: https://commonfund.nih.gov/bridge2ai It's a weird program with mantras like "hypothesis-agnostic data generation". And proposals will be evaluated without a "study section". I enthusiastically support it, despite being a dyed in the wool skeptic.
> 
> In our own (smaller) work, there's a healthy tension between the triad of algorithmic, data-oriented, and task-oriented modeling. The 1st seems mostly axiomatic. The latter 2 more gumshoey. But as I think Bloom shows well, there's a competence with all 3 that facilitates productive use. 
> 
> So, my sense is not that there's a categorical leap brought on by *scale* so much as a categorical leap caused by some sort of inter-disciplinary facility. It's similar to the idea that robust reasoning is an interwoven combination of in-, ab-, and de-duction. What I find disheartening is a kind of "moralism", for lack of a better term. People tend to invest too much faith in what they know, what's succeeded in the past, whatever the cool kids are doing these days, etc. And what I think Bloom shows nicely is the required kind of *agnosticism*, especially to where clues may lie, what methods may lead to good product, etc.
> 
> It's the ability to commit to surveillance logging (e.g. sequencing every strand that comes down the pipe, every modification to some R script, every detail of every machine, etc.), ubiquitous induction and semi-automated selection of induced artifacts, and a willingness to dive into that chaotic ocean "on a mission". *That* ability/willingness is the categorical disjunction.
> 
> On 6/24/21 2:08 PM, David Eric Smith wrote:
>> I agree about gumshoes, Glen, but I think maybe the line in this that draws my interest is a sense that there is an enormous gap in scale that requires some other kind of design to fill it.  It would be of a piece with things that bugged me in past transitions.
>>
>> I think for me it started the first time when public-key cryptography came out and Phil Zimmerman made the assertion “the Web of Trust will replace central authenticators”.  (No.)  I spent years trying to make a kind of sum-over-paths formulation of how confidence degrades in a web of trust without some kind of organizational algorithm, but how parallelism could be used to draw a more robust confidence from path combinatorics than from single central entities.  (My thinking at the time was completely naive, but I won’t digress further here.)  That veered off into every possible conceptual obscurity, including what one means by “identity”, how practices like good key security should be taken to relate to judgment in public-key endorsements, how to use real world proof-of-stake to make costly signals, and yada yada yada.  When PageRank came out, and I tried to make a pitch to Pierre Omidyar on design of insurance schemes surrounding proof-of-value-at-risk and some kind of network-feedback, in a conversation on the SFI patio on a sunny lovely afternoon, Pierre was kind enough not to say flatly “You don’t understand _anything_ about how companies work”, which he was still steep on the learning curve about at the time.  But that never came to anything, except that the conceptual questions interested Martin Shubik enough that they began our collaboration (within which that was never a research question).
>>
>> When the arXiv first came out and said "I will replace journals”, I thought that, once again, there is a bottleneck of human time and attention, and simply flooding everyone with everything all the time, while it will open new opportunities that gatekeepers keep closed off, it will cause random sampling error to replace systemic bias as the main cause of really suboptimal solutions.  It’s not that the need to take responsibility for understanding something oneself can ever be got around — that is a fundamental constraint — only that the concept of a “fiduciary” proposes that a good vetting and recommendation design can cause you to do the least-badly at balancing awareness and understanding possible, given the techniques of the time and your own preferences for how to split the two.  We are still stuck with journals because, bad as they are, we haven’t really designed and established alternatives that are enough better to displace the journals’ role.  F1000 and others were efforts in this direction, but they are distantly on the margins.
>>
>> Then for languages, I fought the linguists for years, who wanted to extract single, microscopic, very-strong features of language that they could analyze to death, but which remain totally silent about almost-all questions of interest, because the strong signatures are few and the questions many.  I wanted probability methods to get distributional evidence about weak and distributed, but numerous and reinforcing, patterns of concordance in language.  That was the easiest idea, because we already know how to do it and it was just a matter of fighting a reactionary culture.  I think just the change of generations is already well on the way toward winning that battle, quite independently of any tiny contribution (if any at all) that we made.
>>
>> For this one (the genomic forensics), I feel like we know who many of the organized actors are.  Governments will try to “control the narrative”, to the extent that they perceive doing so to be in their interest, and to the extent that the norms and institutions of the society give them cover in doing it.  Governments that both are authoritarian and that depend on promulgating an ideology are probably the most committed to doing this comprehensively.  There are mid-level skirmishes, like between the US and Wikileaks, but I think because of the new horizons in big computing together with being able to seal off borders, China is pioneering a new frontier in this, which could be a “more is different” moment.  I can’t think of a counterpart to them anywhere else in the world just now.  The Russian model is quite different (I like things that Masha Gessen and Gary Kasparov say about that approach, granting that each of them has a POV); I have wondered how much confidence to attach to public health data coming-out of Vietnam, which is by many measures kind of an okay functioning society, but in which any building or billboard made of durable materials is still plastered with official slogans and propaganda.  (That one is not a case about which I know almost anything, so my cautions there are nearly empty.). 
>>
>> Against these actors, we have other big actors, like intelligence agencies, and that is probably okay to produce some balance of power.  But they are all monoliths.
>>
>> The few cases where we have interesting data for the viral question, from Yuri Deigin’s collaborators and now Jesse, are tiny data points acquired at large personal time and effort, guided by insight about particular questions.  I do feel like early sequence data are particularly high-value, because with what we currently can estimate about mutation rates, we could plug sequence-diversity data into back-of-the-envelop epidemiological models and try to get a sense of how much circulation there was in any community at any time, and try to back out timelines for founder infections, sort of like the LANL group did for HIV in the 1990s (?).  (That was Bette Korber, Tanmoy Bhattacharya, Alan Perelson, and their cohort, plus I am sure other groups that I don’t know.). 
>>
>> Yet we must bs swimming in genome data of an incidental nature, like stray reads that end up in repositories that only by accident one would look for.  I continue to wonder if there is some “design” of a sieve that could automate or crowdsource some of these questions, so there could be a “public option” to go alongside the NSA/CIA vs. Governments dyad.  Viral genomics seems to be a problem whose structure is well-matched to distributed, public-data surveillance.
>>
>> I worry that, as more of these discoveries come out, the government intrusion, micromanagement, and punitiveness toward academic and institute researchers in China is going to become just miserable.  Even it it was a wild outbreak, the essentially adversarial stance the CCP takes toward the rest of the world would cause them to suppress information, because they don’t trust the rest of the world not to draw motivated conclusions for the sake of working against them (and that is not an unreasonable fear; it’s where Miranda rights come from).  So to the degree that an accurate and reasonably confident story could be put together for this problem, perhaps it would reduce the time this particular pain will be drawn out.  I would also like to think it could contribute to a sense that there are limits to what countries can expect to get away with.
> 
> 
> --
> glen ep ropella 971-599-3737
> - .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
> FRIAM Applied Complexity Group listserv Zoom Fridays 9:30a-12p Mtn 
> GMT-6  bit.ly/virtualfriam un/subscribe 
> http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives: http://friam.471366.n2.nabble.com/
> - .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
> FRIAM Applied Complexity Group listserv Zoom Fridays 9:30a-12p Mtn 
> GMT-6  bit.ly/virtualfriam un/subscribe 
> http://redfish.com/mailman/listinfo/friam_redfish.com
> FRIAM-COMIC http://friam-comic.blogspot.com/
> archives: http://friam.471366.n2.nabble.com/
> 

--
☤>$ uǝlƃ
- .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
FRIAM Applied Complexity Group listserv
Zoom Fridays 9:30a-12p Mtn GMT-6  bit.ly/virtualfriam un/subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
FRIAM-COMIC http://friam-comic.blogspot.com/
archives: http://friam.471366.n2.nabble.com/