80% of the information we generate becomes ‘dark data.’ This is how to bring it to light

According to NASA, “matter” is any substance that has mass and occupies space. But there’s more to the universe than the matter we can see. Dark matter and dark energy are mysterious substances that affect and shape the cosmos, and scientists are still trying to figure them out. 

What if we looked at the data created over the last two decades or more in the same way? Dark matter makes up about 85% of the matter in the universe. In the earthly world of business intelligence and analytics, only about 20% of information is numeric and easily studied using statistical techniques. The other 80% is largely invisible, like dark matter, silently influencing many outcomes in business and the larger world without being subject to scientific, objective, scaled study. 

Now, with the capabilities of generative AI (GenAI), and specifically large language models (LLMs), scientists can examine this unstructured dark data in new and exciting ways, leading to powerful analytical capabilities that can unlock new meaning in all the world’s information. For leaders, this capability heralds a sea change and presents early AI adopters with a rare chance for true competitive advantage. 

Where the dark data lives now

The hunt to civilize and harness the insights contained in dark data is well underway. In the modern digital world, a barrage of text data is continually created through news and social posts. But this dark data can’t be processed at scale with traditional means.

A recent study by researchers in the legal domain hypothesized that evidence of legal violations could be found hidden in most information. The researchers used various LLM and other AI approaches to dissect samples of the data and validated the usefulness of these tools for identifying violations. Interestingly, they could even associate those violations with specific victims.  

Other researchers have shown that LLMs can be used to code qualitative data. Coding involves assigning a designation to text and is historically done by human raters. This takes substantial time and often involves sampling the data rather than coding all of it (not to mention being excruciatingly boring and difficult to carry out at high accuracy levels).

Once the data is coded, it can be further subjected to statistical analysis. Similarly, it has been shown that ChatGPT can be used to cheaply and efficiently code tweets with results superior to those of human coders. These researchers calculated that it costs $0.003 per annotation, which is about twenty times cheaper than using human coders through a Mechanical Turk-type process.  
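To put those figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted above ($0.003 per annotation and "about twenty times cheaper"); the corpus size of 100,000 items is a hypothetical value chosen for illustration, and the human cost is merely what the twenty-fold ratio implies, not a figure from the study.

```python
from decimal import Decimal

# Figure quoted in the text: cost of one LLM annotation, in US dollars.
LLM_COST_PER_ANNOTATION = Decimal("0.003")
# "About twenty times cheaper" implies human coding costs roughly 20x as much.
HUMAN_COST_MULTIPLIER = 20


def annotation_costs(n_items: int) -> tuple[Decimal, Decimal]:
    """Return (llm_cost, implied_human_cost) in dollars for n_items annotations."""
    llm_cost = LLM_COST_PER_ANNOTATION * n_items
    human_cost = llm_cost * HUMAN_COST_MULTIPLIER
    return llm_cost, human_cost


# Hypothetical corpus size, used only for illustration.
llm_cost, human_cost = annotation_costs(100_000)
print(f"LLM: ${llm_cost:,.2f}  human (implied): ${human_cost:,.2f}")
```

At that scale, the difference is roughly $300 versus $6,000, which is why coding an entire dataset rather than a sample becomes practical.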

Turning to healthcare, consider that medical science progresses through careful analysis of highly specific numeric datasets, yet there is a tremendous cache of information to be found in images, physician notes, test results, and scientific study descriptions.

For example, researchers note there are many promising applications of LLMs, including analyzing medical studies at scale. While there are ethical considerations here (e.g., ensuring that a poorly trained AI doesn’t recommend the wrong treatment), there is also great potential to advance medical care by better utilizing patient and scientific data for things like early detection of conditions or predicting reactions to specific medication based on an individual’s unique profile.  

Tapping the business opportunity 

Many companies fail to adequately analyze the available numeric data, and bad data itself has been estimated to cause losses of $3.1 trillion per year to the U.S. economy alone. If numeric data is only 20% of the total information available, then the opportunity to understand and utilize dark data at scale is transformational. While GenAI can help you to summarize long documents, the real benefit of LLMs is understanding all information and using this insight to inform business decisions.  

Consider customer and employer survey data, and all the open-ended comments that are largely unreviewed, or the many other types of unstructured information that are languishing in databases such as product reviews, customer feedback, job candidate profile data and resumes, expert financial analyses, corporate policies, technical manuals, legal contracts and opinions, and on and on. This dark data can now be quantified and studied, and not just once as a typical manual analysis or audit would do, but in a continual, scaled manner.  

Finnish researchers recently described their attempt to determine the value of using LLMs in qualitative data analysis. In their multi-agent approach, they broke down AI tasks into several discrete steps including thematic, content, narrative, and discourse analysis, plus a step that even creates theories from the analysis. After implementing their approach with a variety of datasets, the researchers found that practitioner experts rated their automated results very highly. While this and other approaches are quite new, there is tremendous potential to leverage LLMs to make sense of unstructured datasets.  

Generation vs. intelligence 

Most of the tech world is stuck on the generative component of GenAI: fun images of fanciful ideas, summaries of long documents, ideas for activities at your nephew’s fifth birthday party, etc. But while these uses are entertaining and certainly labor-saving for individuals, corporate utilization, despite all the hype, is surprisingly low, and many companies creating AI tools have yet to successfully monetize their investments.

The key to understanding new efforts to civilize dark data is not the generative aspect of LLMs, but their ability to intelligently understand human commands and carry out instructions. Using a retrieval-augmented generation (RAG) approach, a user can feed documents into an LLM, and then ask the LLM questions about that information or even ask it to evaluate that information in a specific way.

Let’s say you have a thousand-page contract to review and you need to evaluate it on various compliance standards required by your employer. You can do this the old-fashioned way, which is an excruciatingly slow process prone to errors, or you can feed it into a RAG system and score it on your organization’s standards. To be clear, you have to write some code to do this and tweak it to ensure it works properly, but once done, you can use it continuously.  
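As a minimal sketch of what "some code" might look like, the example below splits a contract into chunks, retrieves the chunks most relevant to each compliance standard, and hands them to a rating function. Everything here is an illustrative assumption: the names `split_into_chunks`, `retrieve`, `review_contract`, and `toy_rater` are hypothetical, retrieval uses simple word overlap as a stand-in for the embedding search a real RAG system would typically use, and `toy_rater` is an offline placeholder where a real LLM call would go.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    standard: str
    passages: list[str]
    score: int  # 1 (no supporting evidence) to 5 (clearly addressed)


# A rater takes a standard plus retrieved passages and returns a 1-5 score.
Rater = Callable[[str, list[str]], int]


def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Split a long document into fixed-size word chunks for retrieval."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:top_k]


def review_contract(contract: str, standards: list[str], rate: Rater,
                    max_words: int = 40) -> list[Finding]:
    """Retrieve supporting passages for each standard and score them."""
    chunks = split_into_chunks(contract, max_words)
    findings = []
    for standard in standards:
        passages = retrieve(chunks, standard)
        findings.append(Finding(standard, passages, rate(standard, passages)))
    return findings


def toy_rater(standard: str, passages: list[str]) -> int:
    """Offline placeholder for an LLM call (hypothetical scoring rule).

    Returns 5 if the passages share at least two words with the standard,
    otherwise 1. A real rater would prompt a model and parse its reply.
    """
    shared = tokenize(standard) & tokenize(" ".join(passages))
    return 5 if len(shared) >= 2 else 1


# Hypothetical two-sentence "contract" and standards, for illustration only.
contract = (
    "Either party may terminate this agreement with thirty days written notice. "
    "The supplier shall encrypt all customer data at rest and in transit."
)
standards = [
    "Customer data must be encrypted at rest",
    "Disputes must go to arbitration",
]
findings = review_contract(contract, standards, toy_rater, max_words=12)
for finding in findings:
    print(finding.score, "-", finding.standard)
```

Passing the rater in as a function keeps the retrieval logic separate from the model, so the placeholder can be swapped for a real LLM call without touching the rest of the pipeline.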

Up until now, AI and data analytics have worked best using structured and organized numerical data, and while there are many legacy techniques for exploring unstructured data such as sentiment analysis, topic modeling, and keyword extraction, LLMs are uniquely capable of parsing and manipulating non-numeric data.  

The key aspect of LLMs that makes them so good at processing text, in particular, is that they understand and can carry out human commands to a degree never before possible. Users can instruct them to analyze a corpus of text, looking for answers to questions or specific information, and can further ask them to rate the returned findings on an anchored rating scale. The AI thus converts qualitative data, which is not subject to easy statistical analysis, to meaningful quantitative data that can be combined with native numerical data and crunched using common statistical tools.  
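The conversion step can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed method: the scale anchors, the prompt wording, the sample model replies, and the repeat-purchase figures are all hypothetical, and no model is actually called. The sketch shows how an anchored scale can be written into a prompt, how a numeric rating can be parsed from a reply, and how the resulting numbers can be correlated with native numeric data.

```python
from statistics import mean

# Anchored scale: each number is tied to a concrete description (hypothetical anchors).
ANCHORS = {
    1: "Strongly negative: describes a serious, unresolved problem",
    3: "Neutral or mixed: no clear positive or negative judgment",
    5: "Strongly positive: explicit praise or recommendation",
}


def build_rating_prompt(comment: str) -> str:
    """Assemble instructions asking a model to rate one comment on the anchored scale."""
    scale = "\n".join(f"{score} = {label}" for score, label in sorted(ANCHORS.items()))
    return (
        "Rate the comment on this scale. Reply with a single integer from 1 to 5.\n"
        f"{scale}\nComment: {comment}"
    )


def parse_rating(reply: str) -> int:
    """Return the first digit between 1 and 5 found in a model reply."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"no rating found in reply: {reply!r}")


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical model replies for four customer comments (no model is called here).
replies = ["5", "Rating: 2", "4", "1"]
ratings = [parse_rating(r) for r in replies]

# Hypothetical native numeric data for the same four customers.
repeat_purchases = [6, 1, 4, 0]

r = pearson(ratings, repeat_purchases)
print(f"mean rating: {mean(ratings)}, correlation with repeat purchases: {r:.2f}")
```

Once the comments are expressed as numbers on a shared scale, they can sit in the same table as any other metric and be analyzed with ordinary statistical tools.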

Physicists might not be close to understanding dark matter, but businesses and researchers can now make real inroads to civilize dark data.  

https://www.fastcompany.com/91186993/80-percent-of-the-information-we-generate-becomes-dark-data-this-is-how-to-bring-it-to-light?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 10mo | 10.09.2024, 11:30:04

