In 2017, Roger Guimerà and Marta Sales-Pardo discovered a cause of cell division, the process driving the growth of living beings. But they couldn’t immediately reveal how they learned the answer. The researchers hadn’t spotted the crucial pattern in their data themselves. Rather, an unpublished invention of theirs — a digital assistant they called the “machine scientist” — had handed it to them. When writing up the result, Guimerà recalls thinking, “We can’t just say we fed it to an algorithm and this is the answer. No reviewer is going to accept that.”
The duo, who are partners in life as well as research, had teamed up with the biophysicist Xavier Trepat of the Institute for Bioengineering of Catalonia, a former classmate, to identify which factors might trigger cell division. Many biologists believed that division ensues when a cell simply exceeds a certain size, but Trepat suspected there was more to the story. His group specialized in deciphering the nanoscale imprints that herds of cells leave on a soft surface as they jostle for position. Trepat’s team had amassed an exhaustive data set chronicling shapes, forces, and a dozen other cellular characteristics. But testing all the ways these attributes might influence cell division would have taken a lifetime.
Instead, they collaborated with Guimerà and Sales-Pardo to feed the data to the machine scientist. Within minutes it returned a concise equation that was 10 times more accurate at predicting when a cell would divide than an equation that used only a cell’s size or any other single characteristic. What matters, according to the machine scientist, is the size multiplied by how hard a cell is getting squeezed by its neighbors, a quantity that has units of energy.
“It was able to pick up something that we were not,” said Trepat, who, along with Guimerà, is a member of ICREA, the Catalan Institution for Research and Advanced Studies.
Because the researchers hadn’t yet published anything about the machine scientist, they did a second analysis to cover its tracks. They manually tested hundreds of pairs of variables, “irrespective of … their physical or biological meaning,” as they would later write. By design, this recovered the machine scientist’s answer, which they reported in 2018 in Nature Cell Biology.
Four years later, letting an algorithm hand over the answer is quickly becoming an accepted method of scientific discovery. Sales-Pardo and Guimerà are among a handful of researchers developing the latest generation of tools capable of a process known as symbolic regression.
Symbolic regression algorithms are distinct from deep neural networks, the famous artificial intelligence algorithms that may take in thousands of pixels, let them percolate through a labyrinth of millions of nodes, and output the word “dog” through opaque mechanisms. Symbolic regression similarly identifies relationships in complicated data sets, but it reports the findings in a format human researchers can understand: a short equation. These algorithms resemble supercharged versions of Excel’s curve-fitting function, except they look not just for lines or parabolas to fit a set of data points, but billions of formulas of all sorts. In this way, the machine scientist could give the humans insight into why cells divide, whereas a neural network could only predict when they do.
Researchers have tinkered with such machine scientists for decades, carefully coaxing them into rediscovering textbook laws of nature from crisp data sets arranged to make the patterns pop out. But in recent years the algorithms have grown mature enough to ferret out undiscovered relationships in real data — from how turbulence affects the atmosphere to how dark matter clusters. “No doubt about it,” said Hod Lipson, a roboticist at Columbia University who jump-started the study of symbolic regression 13 years ago. “The whole field is moving forward.”
Rise of the Machine Scientists
Occasionally physicists arrive at grand truths through pure reasoning, as when Albert Einstein intuited the pliability of space and time by imagining a light beam from another light beam’s perspective. More often, though, theories are born from marathon data-crunching sessions. After the 16th-century astronomer Tycho Brahe passed away, Johannes Kepler got his hands on the celestial observations in Brahe’s notebooks. It took Kepler four years to determine that Mars traces an ellipse through the sky rather than the dozens of other egglike shapes he considered. He followed up this “first law” with two more relationships uncovered through brute-force calculations. These regularities would later point Isaac Newton toward his law of universal gravitation.
The goal of symbolic regression is to speed up such Keplerian trial and error, scanning the countless ways of linking variables with basic mathematical operations to find the equation that most accurately predicts a system’s behavior.
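To make that idea concrete, here is a toy sketch in Python, not the machine scientist itself or any published algorithm. It assumes the simplest possible strategy, exhaustive enumeration: build every formula that can be made from a planet's orbital size a (in astronomical units), the constant 1, the four arithmetic operations and a square root, nested up to two levels deep, then keep the formula whose predicted orbital period best matches the observed one. The planetary values are standard rounded figures; the function names (grow, error, search) and the scoring rule are illustrative choices made for this example.

```python
import itertools
import math

# (semi-major axis in AU, orbital period in years), rounded textbook values
PLANETS = [
    (0.387, 0.241),    # Mercury
    (0.723, 0.615),    # Venus
    (1.000, 1.000),    # Earth
    (1.524, 1.881),    # Mars
    (5.203, 11.862),   # Jupiter
    (9.537, 29.457),   # Saturn
]

BINARY_OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y,
}


def grow(exprs):
    """Return exprs plus every formula that is one operation larger."""
    out = dict(exprs)
    for name, f in exprs.items():
        out.setdefault(f"sqrt({name})", lambda a, f=f: math.sqrt(f(a)))
    for (n1, f1), (n2, f2) in itertools.product(exprs.items(), repeat=2):
        for sym, op in BINARY_OPS.items():
            out.setdefault(
                f"({n1} {sym} {n2})",
                lambda a, f1=f1, f2=f2, op=op: op(f1(a), f2(a)),
            )
    return out


def error(f):
    """Sum of squared relative errors of f(a) against the observed periods."""
    try:
        return sum(((f(a) - t) / t) ** 2 for a, t in PLANETS)
    except (ValueError, ZeroDivisionError):
        return math.inf  # formula is undefined somewhere on the data


def search(depth=2):
    """Enumerate all formulas up to the given nesting depth; return the best."""
    exprs = {"a": lambda a: a, "1": lambda a: 1.0}
    for _ in range(depth):
        exprs = grow(exprs)

    def score(item):
        name, f = item
        err = error(f)
        # Round so algebraically equal formulas tie; the shorter name then wins.
        return (round(err, 9) if math.isfinite(err) else err, len(name))

    return min(exprs.items(), key=score)


if __name__ == "__main__":
    name, f = search()
    print("best formula for the period:", name)
    print("relative squared error:", error(f))
```

In these units Kepler's third law reads T = a × sqrt(a), and a search of this kind should land on that formula or an algebraically equivalent one. Even this tiny vocabulary produces well over a thousand candidates at depth two, and the count explodes with each added variable or level of nesting, which is why practical symbolic regression tools cannot rely on plain enumeration.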