AI’s Top Sources: Where AI Gets Its Data
Table of Contents
- 1. AI’s Top Sources: Where AI Gets Its Data
- 2. What Are the Primary Data Sources Used to Train Large Language Models (LLMs)?
- 3. Top Sources Used by Leading Artificial Intelligence Models: Insights from an Infographic
- 4. The Data Landscape Fueling AI Innovation
- 5. Core Data Sources: A Breakdown
- 6. DeepSeek’s Approach: Deep Thinking vs. Net Search
- 7. Infographic Insights: Data Composition & Model Performance
- 8. The Impact of Data Quality on AI Output
Artificial intelligence models are increasingly relying on a select few sources of information, shaping how they respond to our queries and potentially concentrating the information ecosystem around a handful of platforms. A recent analysis by Semrush reveals which websites are most frequently cited by these AI systems, highlighting a shift in the landscape of information gathering.
The study, based on over 150,000 LLM (Large Language Model) citations, reveals a clear dominance by platforms like Reddit, Wikipedia, YouTube, Google, and Yelp. These sources consistently top the list as the most “frequented” by AI.
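To illustrate how such a domain-frequency ranking can be computed, here is a minimal Python sketch; the URLs are invented stand-ins, since the raw citation data behind the Semrush study is not public:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of URLs cited in LLM answers; these are
# invented stand-ins, not data from the actual study.
cited_urls = [
    "https://www.reddit.com/r/AskHistorians/comments/a1",
    "https://en.wikipedia.org/wiki/Photosynthesis",
    "https://www.reddit.com/r/science/comments/b2",
    "https://www.youtube.com/watch?v=c3",
]

# Count citations per domain rather than per page.
domain_counts = Counter(urlparse(url).netloc for url in cited_urls)

for domain, count in domain_counts.most_common():
    print(f"{domain}: {count}")
# www.reddit.com: 2  -- tops this toy ranking, echoing the study's finding
```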
This reliance on open and accessible platforms like Reddit and Wikipedia is likely because many news publications and large publishers have blocked AI chatbots from accessing their content. The presence of YouTube, Google, and Yelp suggests AI models also draw on multimedia content, geographical data, and user reviews. Notably, Mapbox.com and openstreetmap.org, platforms for creating interactive maps, also rank within the top 10.
Google’s recent agreement with Reddit is expected to further solidify both platforms’ position as key sources for AI, with Reddit’s visibility likely to increase as AI adoption grows.
Interestingly, the study found a strong correlation between domains that rank well in Google’s organic search results and those cited by LLMs. However, it’s not simply a case of directly quoting the top 10 URLs. LLMs often pull information from diverse pages within the same authoritative domain, resulting in a strong correlation at the domain level but far less direct overlap at the URL level.
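To make the domain-versus-URL distinction concrete, here is a minimal Python sketch that measures overlap at both levels; the URL sets are hypothetical examples, not the study’s data:

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Extract the host from a URL (naive: keeps subdomains)."""
    return urlparse(url).netloc.lower()

# Hypothetical example data: top organic-search URLs vs. URLs cited by an LLM.
search_urls = {
    "https://en.wikipedia.org/wiki/Large_language_model",
    "https://www.reddit.com/r/MachineLearning/comments/abc123",
    "https://www.youtube.com/watch?v=xyz",
}
llm_cited_urls = {
    "https://en.wikipedia.org/wiki/Transformer_(deep_learning)",  # same domain, different page
    "https://www.reddit.com/r/LocalLLaMA/comments/def456",
    "https://www.yelp.com/biz/some-restaurant",
}

url_overlap = search_urls & llm_cited_urls
domain_overlap = {domain_of(u) for u in search_urls} & {domain_of(u) for u in llm_cited_urls}

print(f"URL-level overlap:    {len(url_overlap)}")      # 0 -- no identical pages
print(f"Domain-level overlap: {sorted(domain_overlap)}")  # Wikipedia and Reddit hosts match
```

The two Wikipedia and two Reddit URLs never match page-for-page, yet their domains do, which is exactly the pattern the study describes.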
This raises the question: do these rankings reflect genuine reliability, or simply visibility? Reddit, while providing diverse perspectives, comes with inherent risks of misinformation. Wikipedia offers accuracy but can be overly standardized. AI models, it seems, prioritize information density and accessibility, operating on a Darwinian principle of “most reported information.” Like planets in a solar system, these sources attract and disseminate information, but true quality reporting often exists elsewhere, beyond the reach of AI scraping.
What Are the Primary Data Sources Used to Train Large Language Models (LLMs)?
Top Sources Used by Leading Artificial Intelligence Models: Insights from an Infographic
The Data Landscape Fueling AI Innovation
Artificial Intelligence (AI) models aren’t born smart; they learn. And that learning is heavily reliant on the quality and diversity of the data they’re trained on. Understanding the primary sources powering these models – from large language models (LLMs) like GPT-4 to image generation tools like DALL-E 2 – is crucial for anyone interested in the future of AI, data science, or machine learning. This article breaks down the key data sources, offering insights gleaned from recent analyses and infographics detailing the composition of training datasets. We’ll cover everything from Common Crawl data to curated datasets and the implications for AI performance.
Core Data Sources: A Breakdown
Here’s a look at the major categories of data used to train today’s leading AI models:
The Common Crawl: This is arguably the largest publicly available dataset, a massive scrape of the internet. It provides petabytes of text data, forming a foundational layer for many LLMs. Think of it as the raw, unfiltered internet – a starting point for AI learning. (A minimal reading sketch appears after this list.)
WebText & WebText2: Developed by OpenAI, these datasets are curated subsets of the internet, focusing on high-quality content identified through outbound links from Reddit. This filtering process aims to improve the quality of training data.
Books3: A collection of over 196,000 books, Books3 provides a rich source of long-form text, crucial for developing AI’s understanding of narrative structure and complex reasoning. Its use has been subject to copyright debate.
Wikipedia: A cornerstone of knowledge, Wikipedia provides a structured and relatively clean dataset for AI training. Its collaborative nature and broad coverage make it invaluable.
Academic Papers & Research Datasets: Platforms like arXiv and datasets specifically designed for machine learning tasks (like ImageNet for image recognition) contribute specialized knowledge and benchmarks.
Code Repositories (GitHub): Essential for training AI models capable of code generation and understanding, GitHub provides a vast library of source code in various programming languages.
Social Media Data (Reddit, Twitter): While often noisy, social media data offers insights into current trends, colloquial language, and real-world opinions. However, ethical considerations and bias mitigation are paramount when using this data.
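As a concrete example for the Common Crawl entry above, here is a minimal Python sketch for reading pages out of a downloaded WARC segment using the warcio library; the file path is a hypothetical placeholder:

```python
# pip install warcio  -- a commonly used reader for Common Crawl's WARC archives
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path: str):
    """Yield (url, raw_html) pairs from a locally downloaded WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # 'response' records hold fetched pages; skip request/metadata records.
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            yield url, html

# Hypothetical local segment path; real segment listings are at commoncrawl.org.
for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
    print(url, len(html))
    break
```

Real training pipelines layer heavy filtering, deduplication, and language detection on top of this raw iteration step.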
DeepSeek’s Approach: Deep Thinking vs. Net Search
Recent advancements, like those offered by DeepSeek, highlight the evolving strategies for leveraging data in AI. DeepSeek’s “Deep Thinking” mode focuses on complex problem-solving through internal analysis, while its “Net Search” function taps into real-time information. This duality reflects the need for both pre-trained knowledge and the ability to access and process current data. (Source: https://www.zhihu.com/question/11321181970). This demonstrates a shift towards AI systems that aren’t solely reliant on static datasets.
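The cited source doesn’t document DeepSeek’s internals, so purely as a conceptual sketch, routing between a “deep thinking” path and a “net search” path might look like the following Python; every function here is a hypothetical stand-in, not DeepSeek’s actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Answer:
    text: str
    source: str  # "internal" or "web"

def answer_query(query: str,
                 reason: Callable[[str], str],
                 web_search: Callable[[str], str],
                 needs_fresh_data: Callable[[str], bool]) -> Answer:
    """Route a query either to internal multi-step reasoning over
    pre-trained knowledge, or to a live web search for current facts.
    All callables are hypothetical stand-ins for illustration only."""
    if needs_fresh_data(query):
        return Answer(web_search(query), source="web")
    return Answer(reason(query), source="internal")

# Toy heuristic: time-sensitive words trigger the search path.
is_fresh = lambda q: any(w in q.lower() for w in ("today", "latest", "current"))

demo = answer_query(
    "What is the latest LLM benchmark result?",
    reason=lambda q: f"[deep thinking over pre-trained knowledge] {q}",
    web_search=lambda q: f"[net search results] {q}",
    needs_fresh_data=is_fresh,
)
print(demo.source, "->", demo.text)  # web -> [net search results] ...
```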
Infographic Insights: Data Composition & Model Performance
Analyzing infographics detailing the composition of training datasets reveals several key trends:
Dominance of Text Data: The vast majority of data used to train LLMs remains text-based, with the Common Crawl and WebText consistently appearing as significant contributors.
Increasing Emphasis on Code: As AI’s coding capabilities grow, the proportion of code data in training sets is steadily increasing.
The Rise of Synthetic Data: To address data scarcity and bias, researchers are increasingly using synthetic data – artificially generated data that mimics real-world patterns (see the sketch after this list).
Multimodal Learning: Models like DALL-E 2 and Google’s Gemini are trained on multimodal datasets, combining text with images, audio, and video. This allows them to understand and generate content across different modalities.
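As a toy illustration of the synthetic-data point above, here is a minimal template-based generator in Python; production pipelines more often use a strong LLM as the generator plus quality filtering, and all names here are invented:

```python
import random

# Hypothetical templates and slot values; a template sketch shows the idea
# even though real pipelines are far more sophisticated.
TEMPLATES = [
    "How do I {action} a {object} in {language}?",
    "Write a function that {action}s a {object} using {language}.",
]
SLOTS = {
    "action": ["sort", "reverse", "serialize", "validate"],
    "object": ["linked list", "JSON payload", "binary tree"],
    "language": ["Python", "Rust", "Go"],
}

def synth_examples(n: int, seed: int = 0) -> list[str]:
    """Generate n synthetic training prompts by filling template slots."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        out.append(template.format(**{k: rng.choice(v) for k, v in SLOTS.items()}))
    return out

for example in synth_examples(3):
    print(example)
```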
The Impact of Data Quality on AI Output
It’s not just the quantity of data that matters; the quality of that data directly shapes an AI model’s accuracy, bias, and reliability.