“Succumber Bias” in AIML, American warplanes in World War II, and how the lack of a robust data collection strategy may have failed us vis-à-vis COVID-19
Reading the title of this article, you might be wondering what “succumber bias” is, what on earth it has to do with COVID-19, and how it relates to American aircraft in the Second World War. This is a tale with a twist, in which a little knowledge of AIML could have helped improve the COVID-19 data collection efforts of agencies and hospitals. But first, what is “succumber bias”? This may well be the first time the term is being coined, so as a reader you may rightly feel privileged to be part of history in the making: surely not the first of mankind's many attempts at the merry and often understated art of neologism.
Wartime mathematics and mathematicians were immortalized by the brilliant Benedict Cumberbatch, who played Alan Turing in the much-acclaimed movie The Imitation Game. Like the British, the USA had its own group of brilliant mathematicians and statisticians, who formed what was called the “Statistical Research Group” (SRG). This was a classified program whose contribution to the American war effort was no less than that of the Manhattan Project, and it is where quite a few American wartime generals believed the war was won: through mathematics, not arms or ammunition. One example is the problem solved by the great mathematician Abraham Wald, born in Kolozsvár (now Cluj-Napoca, Romania), which is a classic illustration of what is known as “survivorship bias”.
The problem before Wald was to examine the damage done to aircraft that had returned from missions and recommend where to add armor, so that more planes and pilots could survive their missions and help the US win the battles and, eventually, the war. The US military's conclusion was that the most-hit areas of the plane needed additional armor. After thoroughly examining the data and putting it through a mathematical framework, Wald suggested otherwise.
Wald observed that the military had only collected data from aircraft that survived their missions; any aircraft that had been shot down was missing from the data. The bullet holes in the surviving aircraft thus marked areas where a plane could take hits and still fly back safely. Wald therefore proposed that the Navy add armor to the areas where the returning aircraft had no hits. This may sound trivial, but it was a monumental observation that perhaps changed the fate of the American war effort, as more American warplanes survived German and Japanese anti-aircraft guns once action was taken on this recommendation.
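To make Wald's reasoning concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the plane sections, the per-hit lethality numbers, and the function name simulate are invented and are not historical data. Hits land uniformly across all sections, yet the record built only from returning planes shows far fewer hits on the most lethal sections.

```python
import random

# Hypothetical plane sections and the chance that a single hit there downs the plane.
# These numbers are invented for illustration; they are not historical data.
LETHALITY = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "wings": 0.05, "tail": 0.1}

def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    rng = random.Random(seed)
    sections = list(LETHALITY)
    all_hits = dict.fromkeys(sections, 0)       # hits on every plane (unobservable in practice)
    survivor_hits = dict.fromkeys(sections, 0)  # hits on planes that returned (what gets recorded)
    for _ in range(n_planes):
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # A plane returns only if none of its hits turns out to be fatal.
        survived = all(rng.random() > LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits

all_hits, survivor_hits = simulate()
total_all = sum(all_hits.values())
total_seen = sum(survivor_hits.values())
for s in LETHALITY:
    print(f"{s:9s} true share of hits: {all_hits[s] / total_all:.2f}   "
          f"share seen on returning planes: {survivor_hits[s] / total_seen:.2f}")
```

Under these made-up numbers, the sections with the fewest recorded holes are exactly the ones where a hit is most dangerous, which is the pattern Wald read out of the data.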
Now let us fast forward to 2020.
At the peak of COVID-19, some attempts were made to crowdsource the might of the AIML community at large by releasing data for it to work on and answer a host of questions. These were great initiatives.
One such initiative was the “COVID-19 Open Research Dataset Challenge (CORD-19)”. CORD-19 is a resource of over 400,000 scholarly articles, including over 150,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.
Another one, on Kaggle, was the “UNCOVER COVID-19 Challenge” by the Roche Data Science Coalition (RDSC), which requested the collaborative effort of the AI community to fight COVID-19. This challenge presented a curated collection of datasets from 20 global sources and asked participants to model solutions to key questions that were developed and evaluated by a global frontline of healthcare providers, hospitals, suppliers, and policy makers.
While most of these contests had an unquestionably noble purpose and deserve to be lauded, the data in them were not in a format that could be used effectively to answer many pressing questions on COVID-19, including the very fundamental question of “Why do some people survive the virus, while others don’t?”. Many of the data sets provided in these contests were heavily skewed towards information collected on those “who succumbed to the virus” as against “those who survived”. Hence the term “succumber bias”.
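Why does this skew matter? A small sketch on synthetic patients may help. All names and probabilities here (factor_a, factor_b, make_cohort, prevalence) are hypothetical and invented for illustration; none come from the contest data sets. Looking only at those who succumbed, a common but harmless factor appears more prominent than a rare factor that truly raises risk. The picture only corrects itself once survivors are placed alongside.

```python
import random

def make_cohort(n=20_000, seed=1):
    """Synthetic patients; every probability here is invented for illustration."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        factor_a = rng.random() < 0.05   # rare factor that truly raises the risk of death
        factor_b = rng.random() < 0.60   # common factor with no effect on risk
        p_death = 0.20 if factor_a else 0.02
        cohort.append({"factor_a": factor_a, "factor_b": factor_b,
                       "died": rng.random() < p_death})
    return cohort

def prevalence(rows, key):
    """Fraction of the given patients who have the factor."""
    return sum(r[key] for r in rows) / len(rows)

cohort = make_cohort()
deceased = [r for r in cohort if r["died"]]
survivors = [r for r in cohort if not r["died"]]

for key in ("factor_a", "factor_b"):
    print(f"{key}: prevalence among deceased {prevalence(deceased, key):.2f}, "
          f"among survivors {prevalence(survivors, key):.2f}")
```

With the deceased-only column, factor_b looks like the bigger story simply because it is common. Only the comparison against survivors shows that factor_a is the one that separates the two groups.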
Now, one must acknowledge that it is not easy to compel or persuade hospitals to collect data in a prescribed format, as their main aim is to cure the patients who are with them at that point in time, rather than to collect data that could help future patients or, better still, lead to a fundamental cure for the disease altogether (remember how Captain James Cook helped solve the mystery of scurvy: https://www.captaincooksociety.com/home/detail/scurvy-how-a-surgeon-a-mariner-and-a-gentleman-solved-the-greatest-medical-mystery-of-the-age-of-sail-bown-stephen-r-2003?). But the fundamental question is: has a group of medical practitioners been mandated by the WHO or by governments to come up with a robust data collection mechanism that can easily be made available to hospitals, so that they can collect the data needed to answer the question “Why do some folks survive, while others don’t?”.
The intent of this article is not to recommend what kind of data should be collected and in what format, which is a topic for another day, but to drive home the point that we need to collect data on patients who survived as much as on patients who did not. I would be happy to be proven wrong, and some folks somewhere may be doing this already, but very robust data sets with granular patient-level data (with the right amount of masking to hide identities and PII) could be as valuable as vaccines in fighting COVID-19.
Secondly, the focus of many of these analyses and studies seems to be on the less important objective of predicting “how many deaths will occur”, “the number of infections”, “the rate of spread” and so on. The real focus should perhaps be on answering the question I outlined earlier: “What are the reasons responsible for the survival/recovery of patients?”. So, in the models, the focus should be on identifying which variables/features are strongly associated with “survival”, rather than on improving the predictive power/accuracy of the models. In fact, in one very detailed study, a good amount of effort was invested in creating better features, but eventually the authors too fell to the curse of data scientists, going to great lengths to describe how they tweaked the gradient boosting algorithm to get better accuracy. The focus could instead have been on the variables that came out significant in the model, followed by hypothesis-driven studies/analyses designed around them to identify the key factors responsible for survival.
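A rough sketch of what this shift in focus could look like follows. It deliberately swaps in plain odds ratios for the feature importances of a gradient boosting model, to keep the example small and dependency-free. The feature names, prevalences, and effect sizes are all hypothetical and invented for illustration. Note how a "model" that always predicts survival already scores very high on accuracy while saying nothing about why patients survive.

```python
import math
import random

# Hypothetical binary features: (prevalence, multiplier on the baseline odds of death).
# All values are invented for illustration; none come from real COVID-19 data.
FEATURES = {"age_over_65": (0.20, 6.0), "diabetes": (0.10, 3.0), "left_handed": (0.10, 1.0)}
BASE_DEATH_ODDS = 0.01

def make_patients(n=50_000, seed=2):
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        row = {name: rng.random() < prev for name, (prev, _) in FEATURES.items()}
        odds = BASE_DEATH_ODDS
        for name, (_, mult) in FEATURES.items():
            if row[name]:
                odds *= mult
        row["survived"] = rng.random() >= odds / (1 + odds)  # convert odds to probability
        patients.append(row)
    return patients

def survival_odds_ratio(patients, feature):
    """Odds of survival with the feature divided by odds without it.
    0.5 is added to each cell of the 2x2 table to avoid division by zero."""
    counts = {(f, s): 0.5 for f in (True, False) for s in (True, False)}
    for p in patients:
        counts[(p[feature], p["survived"])] += 1
    with_f = counts[(True, True)] / counts[(True, False)]
    without_f = counts[(False, True)] / counts[(False, False)]
    return with_f / without_f

patients = make_patients()
baseline_accuracy = sum(p["survived"] for p in patients) / len(patients)
print(f"Accuracy of always predicting 'survived': {baseline_accuracy:.3f}")

# Rank features by how far their odds ratio is from 1 (no association).
ranked = sorted(FEATURES, key=lambda f: abs(math.log(survival_odds_ratio(patients, f))),
                reverse=True)
for f in ranked:
    print(f"{f:12s} survival odds ratio: {survival_odds_ratio(patients, f):.2f}")
```

An odds ratio well below 1 flags a feature worth a hypothesis-driven follow-up study; a ratio near 1 suggests the feature tells us little about survival, however much it might nudge an accuracy score.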
We need data scientists sitting down with doctors to painstakingly design what could be very powerful features/variables, ones that can be strong predictors of “survival” in the model, and then to collect the right data for them.
Only then can AIML algorithms work their full magic; otherwise they too will keep succumbing to the “succumber bias”.