Thursday, April 30, 2026

High-Quality Data is Worth a Thousand LLMs in Resolving Ambiguities About UFOs



It only takes a single high-quality data point, observed from multiple directions, to produce secure information.


This is a serious effort to winkle quality out of what are at least two hundred thousand lights in the sky. It may work.

It is also why I am sceptical regarding our interpretation of objects outside our galaxy. There is no second angle to confirm anything.

High-Quality Data is Worth a Thousand LLMs in Resolving Ambiguities About UFOs



https://avi-loeb.medium.com/high-quality-data-is-worth-a-thousand-llms-in-resolving-ambiguities-about-ufos-dab9bc74c7c0

Could artificial intelligence (AI), machine learning (ML), large language models (LLMs) or natural language processing (NLP) help us figure out the nature of Unidentified Flying Objects (UFOs) or Unidentified Anomalous Phenomena (UAP), by analyzing verbal reports from humans?

Today, I received an email from a group of researchers who stated: “We’ve been working on a machine learning project that classifies reports from the National UFO Reporting Center by narrative ‘dramaticness,’ essentially modeling the language and content of witness reports to distinguish brief, ambiguous observations from highly detailed extraordinary accounts. The pipeline combines structured features, free-text NLP, gradient-boosted models, and an LLM baseline, with explainability built in. We see this as a content-side complement to instrument-side efforts like the Galileo Project: the witness reports are noisy and selection-biased, but they’re also the longest continuous record of public UAP reporting we have, and the language inside them turns out to carry a lot of structure.”
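To make the email's description concrete, here is a toy sketch of scoring a witness report by "dramaticness." Everything in it is invented for illustration: the vocabulary list, feature names, and weights are assumptions, and the hand-rolled scorer stands in for the gradient-boosted models over NLP features that the researchers actually describe.

```python
# Toy "dramaticness" scorer. Assumptions: the DRAMATIC_TERMS vocabulary,
# the feature set, and the weights are all made up for illustration; the
# pipeline described in the email uses gradient-boosted models, not this.
import re

DRAMATIC_TERMS = {"hovered", "silent", "craft", "beings", "beam",
                  "accelerated", "impossible", "metallic"}

def report_features(text: str) -> dict:
    """Extract simple structured and lexical features from a report."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "n_words": len(words),
        "n_dramatic": sum(w in DRAMATIC_TERMS for w in words),
        "has_duration": bool(re.search(r"\d+\s*(second|minute|hour)",
                                       text.lower())),
    }

def dramaticness(text: str) -> float:
    """Crude score in [0, 1]: longer, detail-rich reports score higher."""
    f = report_features(text)
    score = (0.1 * f["n_dramatic"] + 0.001 * f["n_words"]
             + 0.2 * f["has_duration"])
    return min(score, 1.0)

brief = "Saw a light in the sky for a moment."
vivid = ("A silent metallic craft hovered for 5 minutes, then accelerated "
         "at an impossible rate, emitting a beam of light.")
print(dramaticness(brief) < dramaticness(vivid))  # True
```

Note that such a classifier can only rank narratives; as the essay argues below, no amount of modeling turns a vivid narrative into a measurement.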

My response clarified the following fundamental points.

In scientific research, low-significance data is the most abundant but is of little use, because it is often swamped by noise. UFOs or UAP are a mixed bag, with many reports triggered by human-made or natural phenomena. Humans cannot be trusted as scientific detectors. We need instruments to document the evidence.

This is evident from the legal system, where convicts who were put on death row based on eyewitness testimonies under oath were later exonerated based on DNA tests. Among 51 cases of death row exonerations, a study posted here found that 45.9% involved informants, while 25.2% involved erroneous eyewitness identification. The same level of misinformation is evident in common reports on car accidents, where testimonies are often full of imagined narratives and wishful thinking. Stories told by different people about the same car accident differ and sometimes contradict each other. Given that there is only one physical reality, they cannot all be correct. Ambiguities are best resolved not by AI/ML/LLM/NLP systems analyzing verbal testimonies, but rather by multiple video cameras observing the car accident.

Since humans know about each other’s stories, their narratives are often interwoven and correlated. The fundamental question is whether any of them is right. This is well known to FIFA (Fédération Internationale de Football Association), the worldwide soccer organization. Instead of consulting the goalkeeper or the numerous fans in the audience and using AI/ML/LLM/NLP to sort through their narratives, FIFA uses advanced camera-based technologies, including Goal-Line Technology (GLT) and Video Assistant Referee (VAR), to confirm goals, offsides, and fouls. GLT uses 14 high-speed cameras to determine whether the ball crossed the line and sends a signal to the referee within one second, while VAR reviews video footage for overall accuracy.

We can spend a lifetime chasing ghosts based on verbal reports or low-quality data. The Galileo Project under my leadership is focused on getting high-quality data from multiple observing directions, allowing us to infer the distance, velocity, and acceleration of objects in the sky. Without a distance measurement, it is difficult to assess how anomalous a moving object is. Having a lot of uncertain information is not of interest to the Galileo Project, irrespective of how advanced the AI/ML system that analyzes it is.
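The reason multiple observing directions fix the distance is simple geometry: two sight lines from stations at known positions intersect at the object. A minimal 2D sketch, with made-up station positions and bearings (the real reconstruction is 3D and instrument-calibrated), could look like this:

```python
# Minimal 2D triangulation sketch. Assumptions: station positions and
# bearings are invented numbers; real multi-station reconstruction is 3D.
import math

def triangulate(p1, bearing1, p2, bearing2):
    """Intersect two sight lines. Stations p = (x, y) in km; bearings in
    degrees counterclockwise from the +x axis. Returns the object (x, y)."""
    d1 = (math.cos(math.radians(bearing1)), math.sin(math.radians(bearing1)))
    d2 = (math.cos(math.radians(bearing2)), math.sin(math.radians(bearing2)))
    # Solve p1 + t*d1 = p2 + s*d2 for t via the 2x2 determinant.
    det = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(det) < 1e-9:
        raise ValueError("sight lines are parallel: a single viewing "
                         "angle cannot fix the distance")
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / det
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# Two stations 10 km apart see the same object at 45 and 135 degrees;
# the intersection gives its position, hence its range. Repeating this
# frame by frame yields velocity and acceleration.
x, y = triangulate((0.0, 0.0), 45.0, (10.0, 0.0), 135.0)
print(round(x, 3), round(y, 3))  # 5.0 5.0
```

A single camera provides only a bearing, leaving the object anywhere along that ray, which is exactly why one blurry video cannot constrain distance, velocity, or acceleration.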

On April 17, 2026, President Trump announced in a speech, accessible here, that the first release of classified UFO files will be coming out very soon. As I discussed in a previous essay, posted here, the question is whether the released videos will be the most intriguing ones. Being flooded by blurry videos with no information about the distance of UFOs from the camera will not resolve ambiguities about whether they deviate from the performance envelope of human-made technologies.

When information is limited, intelligence has limited powers. It matters less how advanced the AI/ML/LLM/NLP being used is. What matters most is the quality of the data. A picture is worth a thousand words. For the same reason, high-quality data is worth a thousand LLMs.