Natural History Magazine
Published under the title “Belly Up to the Error Bar.”
An implicit goal of the scientific method is to minimize human bias—one of the great sources of experimental blunder.
Formal accounts of the scientific method typically describe an hypothesis-posing, experiment-conducting process. You might see words such as induction, deduction, cause, and effect. What’s missing is that science can be a creative process in which practically anything goes—from middle-of-the-night hunches to mathematical formulations driven by scientific aesthetics—so long as the results accurately describe and predict phenomena in the real world.
When conducting an experiment on the frontier of human understanding of the universe, you never know what the “right” answer is supposed to be. Sometimes you don’t even know the right question! Often, guided by a particular vision of how the universe works, all you can do is make a series of measurements that you hope will lead you to the right answer. Answers to questions such as, How far away is the Moon? or What is the mass of the Sun? lend themselves to standard statistical analysis. But a question like, What kind of cheese is the Moon made of? does not, because it starts with the false assumption that the Moon is a cheesy place, which will most likely inhibit your acquisition of relevant data. In most experiments, some data points will come out above the true value while some will come out below. These are ordinary fluctuations—a bar chart of these measurements would look like the statistician’s beloved bell curve. The history of science has shown that if an experiment is well designed, then most of the data will cluster around some value, presumably the right value.
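You can watch this clustering happen in a few lines of simulation. The sketch below (in Python; the true value and spread are invented for illustration) draws a thousand noisy measurements and prints a crude text histogram. The counts trace out the bell curve, and the average lands near the true value:

```python
import random
import statistics

# Hypothetical experiment: 1,000 measurements of a quantity whose true
# value is 10.0, each corrupted by ordinary (Gaussian) fluctuations.
true_value = 10.0
measurements = [random.gauss(true_value, 0.5) for _ in range(1000)]

# A crude text histogram: the counts trace out the familiar bell curve.
for lo in [8.5 + 0.25 * i for i in range(12)]:
    count = sum(lo <= m < lo + 0.25 for m in measurements)
    print(f"{lo:5.2f} {'#' * (count // 5)}")

print("mean of all measurements:", round(statistics.mean(measurements), 3))
```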
Unfortunately, this value may bear little correspondence to the real world if human bias is involved. An implicit goal of the scientific method is to minimize human bias, for therein lie some of the greatest sources of experimental blunder. When making multiple measurements, scientists can unwittingly discard values that deviate strongly from their expectations. This selective editing of experimental results can skew data and fatally compromise the experiment. Once results are published, sometimes only the experimenter knows which data were included and which were discarded.
In all fairness to the experimenter, some raw data do deserve to be discarded because of unavoidable experimental glitches. You just shouldn’t get carried away and let your preconceived notions drive your data selection. In any case, you must be honest about the breadth of measurements above and below the average value, and report this uncertainty.
As introduced in Part I of this two-part essay, the general public holds no greater scientific misconception than over the meaning of experimental uncertainty. Scientists are partly to blame, because uncertainties are formally called “errors” in research parlance. Tell someone that your experiment had errors, and nobody will believe your result. Tell someone that your experiment had quantifiable uncertainties, and the entire scientific enterprise is salvaged.
The media hardly ever report the inherent errors in scientific discoveries. Instead of a range of uncertainty, reporters typically (and sensibly) write about the meaning of the experiment’s average value as given in scientific publications or press releases. Headline writers and general readers, however, are left vulnerable to drawing spurious conclusions. In one example several years ago, new and improved data led to the announcement that the oldest stars in the galaxy were born about 14 billion years ago, and that the age of the universe is about 12 billion years. The press invoked the you-can’t-be-older-than-your-mother principle and portrayed the news as a cosmic controversy of the first rank. But when the numbers are accompanied by their published uncertainties—known as “error bars”—a sensible picture emerges that is quite undeserving of headlines. The age of the universe was 12±3 billion years. The age of the oldest stars was 14±2 billion years. The error bars comfortably overlapped around 13 billion years, the value toward which our estimates of the age of the universe and the ages of the oldest stars are converging.
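Checking whether two error bars overlap takes one line of arithmetic. A minimal sketch, using the two published values quoted above:

```python
def overlap(center_a, err_a, center_b, err_b):
    """Return the range where two error bars overlap, or None."""
    lo = max(center_a - err_a, center_b - err_b)
    hi = min(center_a + err_a, center_b + err_b)
    return (lo, hi) if lo <= hi else None

# Age of the universe: 12 +/- 3; age of the oldest stars: 14 +/- 2.
print(overlap(12, 3, 14, 2))   # -> (12, 15): the two measurements agree
```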
After decades of ignoring the significance of measurement errors, the reported results of public opinion polls are now accompanied by “margins of error”—the sociologist’s error bar. If opinion pollsters query only 100 of the quarter billion people in the United States, they will (or had better) frame their claims with fat uncertainties. So when a news station reports, “the incumbent leads the challenger 54 percent to 46 percent, with a ±5 percent margin of error,”
then you have my permission to ignore all subsequent discussion and analysis of the significance of the incumbent’s lead. Fortunately, there are well-tested statistical methods that account for the size of your polled sample compared with the size of the total population. If all quarter-billion people were polled, there would be no uncertainties except for the unavoidable fact that some people change their minds with the breeze.
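The well-tested method here is the standard margin-of-error formula for a simple random sample, which shrinks only as the square root of the sample size grows. A quick sketch (assuming a 95 percent confidence level and an evenly split question):

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a polled percentage
    (standard formula for a simple random sample)."""
    return z * math.sqrt(p * (1 - p) / sample_size)

for n in (100, 400, 1000, 250_000_000):
    print(f"n = {n:>11,}: +/- {100 * margin_of_error(n):.1f} points")
# n = 100 buys you only about +/- 10 points; roughly 400 people are
# needed before a +/- 5 percent claim is honest; polling everyone
# drives the statistical uncertainty to essentially zero.
```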
Often the most heated scientific controversies are conducted within the noise and confusion of messy data. Perhaps the most famous astronomical controversy of the second half of the twentieth century was over the numerical value of the famous Hubble constant, a measure of the expansion rate of the universe. Poor data allowed two warring factions to arise on opposite sides of the error bars. One group supported H = 100±10. Another group supported H = 50±5. What’s a factor of two between friends in a universe where factors of thousands and millions are common?
The problem here is that somebody’s error bars are way too small. Whose? Perhaps they are both too small. Regardless, experimental bias operating on messy data was at work. After the world’s most aggressive supporter of H = 100, University of Texas astronomer Gerard de Vaucouleurs, died several years ago, and after more and better data became available, the consensus converged on H = 70±10, falling comfortably between the extremes. Do we credit the new and improved value for the Hubble constant to better data? That would be good. Or did scientists continually re-interpret the data until things agreed, and then stop trying? That would be bad. For the Hubble constant, the former is more likely than the latter, but the latter remains a haunting specter over nearly all research questions whose answer you presume to know in advance of the experiment.
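A rough rule of thumb for deciding whose error bars are too small: divide the gap between two measurements by their combined uncertainty. Two results separated by more than about three of these combined standard errors cannot both be right. A sketch, using the two warring camps’ numbers:

```python
import math

def tension_in_sigmas(a, err_a, b, err_b):
    """How many combined standard errors separate two measurements?"""
    return abs(a - b) / math.hypot(err_a, err_b)

print(round(tension_in_sigmas(100, 10, 50, 5), 1))   # ~4.5: irreconcilable
print(round(tension_in_sigmas(100, 10, 70, 10), 1))  # ~2.1
print(round(tension_in_sigmas(50, 5, 70, 10), 1))    # ~1.8
```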
Speaking of scientists who take extreme views to their graves, the late, distinguished MIT mathematician Irving Segal remained to the end a spirited opponent of the Big Bang model for the origin of the universe. He even offered a theory of his own to replace it. Whenever a new discovery in support of the Big Bang was reported in the news, he would write a letter to the editor explaining why the Big Bang was all wrong. The problem was that readers got the (false) impression that Segal represented the other “half” of responsible scientific opinion, when in fact his views were those of a vanishingly small minority. The German physicist Max Planck (father of the counterintuitive, but very real, branch of physics known as quantum mechanics) perceptively penned in 1936:
An important scientific innovation rarely makes its way by gradually winning over and converting its opponents… What does happen is that its opponents gradually die out and that the growing generation is familiarized with the idea from the beginning.
Unpopular scientific ideas have a certain appeal. Perhaps people just like to root for underdogs. But for each correct idea in science, a hundred (or a thousand) respectable ideas failed before it. Possible reasons for failure? New data didn’t support the claims; logical inconsistencies emerged on further analysis; predictions of natural phenomena were proven to be false. You would think this route to reliable conclusions would be valued in society’s courts. Having never, until recently, lived more than a few years in the same place during my adult life, I had never been called for jury duty, which typically requires a minimum duration of residency. When I was finally called to serve, I went willingly and patriotically. After the classic long wait in the waiting room, I was finally called for selection and was questioned by an attorney:
What is your profession?
Astrophysicist.
What is an astrophysicist?
An astrophysicist studies the universe and the laws of physics that describe and predict its behavior.
What sorts of things do you do?
Research, teach, administer.
What courses do you teach?
This semester I happen to be teaching a seminar at Princeton on the critical evaluation of scientific evidence and the relative unreliability of human testimony.
No further questions, Dr. Tyson. Thank you.
I was on my way home twenty minutes later.
In courts of law, yes/no questions and multiple-choice questions are common. But science does not lend itself to such responses without incurring a major misrepresentation of reality. I was once called by a lawyer who wanted to know what time the Sun set on the date of a particular car accident at a particular location. This question can be answered precisely, but later in the conversation I learned that what the lawyer really wanted to know was what time it gets dark outside. He was going to compare the time of sunset with the time of the car accident, and had been assuming that everything gets dark the instant the Sun sets. His question was poorly formed for the information he was seeking. A better question might have been, What time do the dark-sensitive street lights turn on? But even for that question, the presence or absence of clouds and the shadows of nearby buildings can affect the “right” answer.
Although I was tainted goods in the jury selection box, I once managed to help convict a person who was charged with a fatal hit-and-run. The driver of the vehicle had a photograph of himself, claiming it was taken at the time of the incident and showed he was nowhere near the scene of the crime. The defense attorney asked me if I could verify the claimed time of the image from the lengths of shadows cast by cars and people in the photo. I said sure. Given the location and the date, I provided him with the time of the photo, plus or minus twelve minutes. The time disagreed with the alibi by several hours.
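The geometry behind such an estimate is straightforward, even if the forensic details are not. Here is a simplified sketch: the Sun’s altitude follows from the shadow length, and the altitude plus latitude and date yield the hour angle, hence the time. The declination formula is a crude approximation, all the numbers are invented, and the result is local solar time rather than clock time:

```python
import math

def solar_declination_deg(day_of_year):
    # Crude approximation of the Sun's declination (good to ~1 degree)
    return -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

def time_from_shadow(height_m, shadow_m, latitude_deg, day_of_year, afternoon=True):
    """Estimate local solar time from a shadow length (a rough sketch)."""
    alt = math.atan2(height_m, shadow_m)            # Sun's altitude angle
    lat = math.radians(latitude_deg)
    dec = math.radians(solar_declination_deg(day_of_year))
    cos_h = (math.sin(alt) - math.sin(lat) * math.sin(dec)) / (
        math.cos(lat) * math.cos(dec))
    h = math.degrees(math.acos(max(-1.0, min(1.0, cos_h))))  # hour angle
    hours_from_noon = h / 15.0                      # Earth turns 15 deg/hour
    # Shadow direction resolves the morning/afternoon ambiguity.
    return 12.0 + hours_from_noon if afternoon else 12.0 - hours_from_noon

# A 1.8 m person casting a 3.1 m shadow at latitude 40.7 N on day 100:
print(round(time_from_shadow(1.8, 3.1, 40.7, 100), 2), "hours, local solar time")
```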
How certain can we be of a scientific measurement? Confirmation matters. Only rarely is the importance of this fact captured in the media or the movies. An exception was the 1997 film Contact (based on the 1985 novel of the same name by the celebrated astronomer Carl Sagan), which portrayed what might happen—scientifically, socially, and politically—if we one day make radio-wave contact with extraterrestrial intelligence. When a signal from the star Vega is discovered that rises above the din of cosmic noise, Jodie Foster (who plays an astrophysicist) alerts observers in Australia, who can track the signal long after the stars in that region of the sky have set for Americans. Only when the Australians confirm her measurements does she go public with the discovery. Her original signal could have been a systematic glitch in the telescope’s electronics. It could have been a local prankster beaming signals into the telescope from across the street. It could have been a local collective delusion. Her confidence was boosted only when somebody else, on another telescope, with different electronics driving an independent computer system, got the same results.
Null results matter too. Occasionally a scientist will test for an effect that does not exist, or that fails to reveal itself through the chosen methods, and may elect not to publish the non-result. Another scientist may conduct the same experiment, find a statistical glitch that mimics a real effect, and elect to publish. The research literature now contains an innocent yet misleading bias toward finding an effect when, in fact, none is present. Unfortunately, one of the cheapest, but most blunder-prone, ways of doing science is to conduct a survey of published surveys rather than to design and conduct one’s own experiment.
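A toy simulation makes the danger vivid. Below, a thousand hypothetical labs each measure an effect whose true size is exactly zero, and only the labs that stumble on a “significant” result publish (all numbers invented for illustration):

```python
import random
import statistics

random.seed(1)
published = []
for lab in range(1000):
    # Each lab measures an effect whose true size is exactly zero.
    data = [random.gauss(0.0, 1.0) for _ in range(30)]
    mean = statistics.mean(data)
    stderr = statistics.stdev(data) / 30 ** 0.5
    # Labs that see a "significant" positive effect publish;
    # the rest file their null results in a drawer.
    if mean / stderr > 1.96:
        published.append(mean)

print(len(published), "of 1000 labs published")
print("average published effect:", round(statistics.mean(published), 2))
# The literature reports a positive effect even though none exists.
```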
A seminal, but little-known, null result draws from the history of attempts to measure the speed of light. In the early 1600s, Galileo sent an assistant to a distant hill to flash the light of a lantern. Galileo responded immediately with flashes from a lantern of his own. His attempt to time the delay proved futile; human reflexes were inadequate for such a task. Of the speed of light, Galileo noted that “if not instantaneous, it is extraordinarily rapid.”
Now those are some large error bars.
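How large? A back-of-the-envelope check, assuming hills about a mile apart and a typical human reaction time (both numbers are my assumptions, not Galileo’s), shows why the experiment never stood a chance:

```python
SPEED_OF_LIGHT = 299_792_458        # meters per second
HUMAN_REACTION = 0.2                # seconds, a typical reflex (assumed)

distance = 2 * 1600                 # round trip between hills ~1 mile apart
light_time = distance / SPEED_OF_LIGHT
print(f"light travel time: {light_time * 1e6:.0f} microseconds")
print(f"reflexes are {HUMAN_REACTION / light_time:,.0f} times too slow")
```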
Galileo could never have predicted that more than three centuries later, the International Committee on Data for Science and Technology would, in an unprecedented decision, define the speed of light to be the best experimental value of the day: 299,792,458 meters per second—exactly. The measured speed had such small error bars that it could be fixed by fiat; from then on, any improvement in measurement precision would translate directly into a refinement of the length of the meter. The meter is now defined to be the distance traveled by a beam of light in a vacuum during 1⁄299,792,458 of a second. The speed of light went from having big error bars to having no error bars at all. Galileo would be proud.
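The logic of the definition fits in a line of arithmetic: light travels exactly one meter in 1⁄299,792,458 of a second, by construction, so sharper timing now sharpens the meter rather than the speed. A quick check:

```python
from fractions import Fraction

C = 299_792_458                     # meters per second, exact by definition

# The meter is whatever distance light covers in 1/299,792,458 of a second:
print(float(C * Fraction(1, 299_792_458)), "meter")   # 1.0, by construction

# A handy consequence: light covers about thirty centimeters per nanosecond.
print(f"one light-nanosecond = {C * 1e-9 * 100:.1f} cm")
```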
Occasionally, when you have a good theory and good observations and you understand the ways of the universe, then magic can happen. When comet Shoemaker-Levy 9 was discovered in 1993, its orbit was computed after a sufficient baseline of observations had accumulated. Laws of physics known in the days of Isaac Newton enabled us to predict that the comet would come so close to Jupiter on its next pass that the most likely trajectory would intersect the planet’s gaseous surface. How about the error bars? All paths within the uncertainties fell within the body of Jupiter. The inescapable conclusion: a titanic Jupiter-comet impact was imminent. We were certain enough to alert the media and mobilize all available telescopes to observe the event. Sure enough, the following year, on the comet’s next pass, it slammed into Jupiter’s atmosphere with the destructive force of five billion atom bombs.
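“All paths within the uncertainties fell within the body of Jupiter” is exactly the kind of claim a Monte Carlo check makes concrete: sample trajectories within the quoted error bars and count how many hit. A toy version, with an invented miss distance and uncertainty (only Jupiter’s radius is real):

```python
import random

# Toy Monte Carlo: did every trajectory within the error bars hit Jupiter?
# The miss distance and its uncertainty are invented for illustration;
# Jupiter's radius is roughly 71,500 km.
JUPITER_RADIUS_KM = 71_500
predicted_miss_km = 20_000     # best-estimate closest approach to center
uncertainty_km = 15_000        # one-sigma spread from the orbit fit

hits = sum(
    abs(random.gauss(predicted_miss_km, uncertainty_km)) < JUPITER_RADIUS_KM
    for _ in range(100_000)
)
print(f"{100 * hits / 100_000:.1f}% of sampled trajectories hit the planet")
```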
Ordinary uncertainties almost always lead to correct (and meaningful) answers, but systematic uncertainties are insidious. Your measurements may be statistically sound, but what if you measured the wrong object? Or the wrong phenomenon? What if the electrical current that fed your apparatus changed during the measurement? I was once observing our Milky Way galaxy from the Cerro Tololo Inter-American Observatory in the Chilean Andes. I was getting spectra of stars located near the galactic center when a magnitude 6.2 earthquake rolled through (not uncommon in such a geologically active area). As expected, the observatory lost electricity, but the data-acquisition system and the computers were connected to an uninterruptible power supply that seamlessly kicked in during the power failure. I was able to save open files and back up my data. When power was restored forty minutes later, I resumed taking data, only to discover that the telescope’s spectrograph had shifted and was now sensitive to an entirely different part of the spectrum. Had I mindlessly kept working and then averaged the night’s data, the results would not only have been wrong but meaningless, and I would not have known it for some time.
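Here is what that near-blunder looks like in miniature. Two batches of perfectly well-behaved data, offset by an undetected instrumental shift, average to a number that is statistically crisp and physically meaningless (all values invented):

```python
import random
import statistics

random.seed(0)
# First half of the night: the instrument reads around the true value, 100.0.
before = [random.gauss(100.0, 1.0) for _ in range(50)]
# After the glitch the spectrograph shifts; same sky, new systematic offset.
after = [random.gauss(107.0, 1.0) for _ in range(50)]

combined = before + after
print(round(statistics.mean(combined), 1))   # ~103.5: corresponds to nothing
print(round(statistics.stdev(combined) / 100 ** 0.5, 2))  # yet the formal
# error bar is tiny, so nothing in the statistics warns you it's wrong.
```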
For some averages, you know right away they are meaningless. Try this one: If half the time I sit on the left side of the train and the other half of the time I sit on the right side, then on average, I sit in the aisle. Or how about my all-time favorite, which I credit to the mathematician and occasional humorist John Allen Paulos: The average American has one breast and one testicle.