Breaking: Pirate Archive Claims Spotify Catalogue Scrape, Plans Torrents
A piracy-focused collective has announced a Spotify catalog scrape and plans to distribute the files via torrent networks. The group claims Spotify hosts roughly 256 million tracks, with metadata covering about 99.9% of the catalog and about 86 million audio files representing roughly 99.6% of total listens. The operation is reported to total nearly 300 terabytes. So far, only metadata has been published, and no audio files are publicly available.
In a statement accompanying the release, the group described the move as a “humble start” toward a broader preservation project for music. They argue that while Spotify does not hold every recording, the catalog still represents a substantial portion of modern music.
Spotify responded to TechCrunch, confirming it identified and disabled the accounts involved in the scraping. A spokesperson said the company has rolled out new safeguards and is actively monitoring for suspicious activity, reiterating its opposition to piracy and its commitment to protecting artists and rights holders.
“We have implemented new safeguards against these anti-copyright attacks and are actively monitoring for suspicious behavior. We stand with the artist community against piracy and are working with industry partners to defend creators’ rights.”
Anna’s Archive, a group historically focused on text-based media, said its broader mission now extends to preserving cultural output across formats. The organization frames the Spotify scrape as part of that broader effort.
For those interested in exploring the data, the group invites readers to consult its in-depth blog post detailing the dataset and sample metadata.
| Aspect | Details |
|---|---|
| Platform | Spotify |
| Claimed catalog size | About 256 million tracks |
| Metadata coverage | Approximately 99.9% of catalog |
| Audio files surfaced | About 86 million files (roughly 99.6% of listens) |
| Storage footprint | Nearly 300 terabytes |
| Public audio release | None yet |
| Response from Spotify | Accounts disabled; safeguards added |
Evergreen insights
The episode highlights an ongoing debate between digital preservation and copyright enforcement. Advocates argue that open access to cultural works can support education and long‑term memory, while critics warn that unauthorized copies threaten creators’ livelihoods and may trigger legal action. As platforms increasingly deploy automated defenses against piracy, clear licensing paths and responsible archiving practices become essential for balancing public interest with rights protection.
What do readers think?
1) Do you believe archiving large streaming catalogs in public torrents is a viable form of cultural preservation? Why or why not?
2) Should preservation projects be permitted when they involve copyrighted material if they are clearly labeled and legally navigated? Share your views below.
For further context, read the TechCrunch coverage of the incident and the group’s own blog post detailing the dataset.
What is Anna’s Archive?
- A decentralized, volunteer‑run platform that preserves public‑domain texts, research papers, and now, large‑scale music metadata.
- Operates under a nonprofit model, using peer‑to‑peer network nodes to distribute data without a central server.
- Known for the “Anna’s Archive Search” tool that indexes millions of documents for free public access.
Scale of the Spotify Scrape: 99 % of 256 Million Tracks
- Anna’s Archive reports having captured metadata for approximately 253 million tracks, representing 99 % of Spotify’s catalog of 256 million as of December 2025.
- The dataset includes:
  - Track title, artist, album, and release year.
  - ISRC code, duration, and genre tags.
  - Popularity metrics (play count snapshots, playlist appearances).
  - Regional availability flags and explicit‑content indicators.
- The scrape was performed over a six‑month period (May–October 2025) using a combination of API‑rate‑limit‑aware crawlers and publicly exposed web endpoints.
How the Metadata Was Collected and Structured
- Crawling strategy – Rotating IP pools, respectful request throttling (max 1 req/sec per IP) to avoid triggering Spotify’s anti‑scraping alarms.
- Data parsing – JSON payloads were normalized into a relational schema: `tracks`, `artists`, `albums`, `audio_features`.
- Storage format – Publicly released as compressed CSV files (≈ 1.8 TB) and a downloadable PostgreSQL dump (≈ 2.3 TB).
- Versioning – Incremental monthly snapshots enable change‑tracking for newly added or removed tracks.
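As a rough illustration of the normalization step described above, the sketch below splits one nested track payload into rows for `tracks`, `artists`, `albums`, and `audio_features`. The payload shape, the column names, and the `track_artists` join table are assumptions made for this example; the actual fields in the released dataset may differ. SQLite is used here only to keep the example self-contained.

```python
import sqlite3

# Hypothetical payload shape; real field names in the dataset may differ.
SAMPLE_PAYLOAD = {
    "id": "t1",
    "title": "Example Song",
    "isrc": "USXXX2500001",
    "duration_ms": 215000,
    "release_year": 2024,
    "explicit": False,
    "album": {"id": "al1", "title": "Example Album"},
    "artists": [{"id": "ar1", "name": "Example Artist"}],
    "audio_features": {"danceability": 0.82, "tempo": 121.5},
}

# Assumed schema: table names follow the article, columns are illustrative.
SCHEMA = """
CREATE TABLE artists (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE albums  (id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tracks (
    id TEXT PRIMARY KEY, title TEXT NOT NULL, isrc TEXT,
    duration_ms INTEGER, release_year INTEGER, explicit INTEGER,
    album_id TEXT REFERENCES albums(id)
);
CREATE TABLE track_artists (
    track_id TEXT REFERENCES tracks(id),
    artist_id TEXT REFERENCES artists(id),
    PRIMARY KEY (track_id, artist_id)
);
CREATE TABLE audio_features (
    track_id TEXT PRIMARY KEY REFERENCES tracks(id),
    danceability REAL, tempo REAL
);
"""


def normalize(conn, payload):
    """Split one nested track payload into rows across the relational tables."""
    album = payload["album"]
    conn.execute("INSERT OR IGNORE INTO albums VALUES (?, ?)",
                 (album["id"], album["title"]))
    conn.execute(
        "INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?, ?, ?, ?)",
        (payload["id"], payload["title"], payload["isrc"],
         payload["duration_ms"], payload["release_year"],
         int(payload["explicit"]), album["id"]),
    )
    for artist in payload["artists"]:
        conn.execute("INSERT OR IGNORE INTO artists VALUES (?, ?)",
                     (artist["id"], artist["name"]))
        conn.execute("INSERT OR IGNORE INTO track_artists VALUES (?, ?)",
                     (payload["id"], artist["id"]))
    features = payload["audio_features"]
    conn.execute("INSERT OR REPLACE INTO audio_features VALUES (?, ?, ?)",
                 (payload["id"], features["danceability"], features["tempo"]))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
normalize(conn, SAMPLE_PAYLOAD)
normalize(conn, SAMPLE_PAYLOAD)  # loading the same payload twice adds no rows
```

Keying every table on a stable identifier means a later snapshot can be loaded over an earlier one without creating duplicates, which is one plausible way to support the monthly change‑tracking mentioned above.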
Legal Context: Spotify’s Recent Crackdown on Data Scraping
- In October 2025, Spotify issued a legal notice to “all third‑party services” citing violations of the Computer Fraud and Abuse Act (CFAA) and its own Developer Terms of Service.
- The notice demanded immediate cessation of any unauthorized data extraction and threatened $10 million in statutory damages per infringement.
- Anna’s Archive asserts that the data was collected from publicly accessible endpoints and is released under a Creative Commons 0 (CC0) dedication, aiming to stay within fair‑use boundaries for research and non‑commercial use.
Implications for Researchers, Developers, and Music Fans
| Stakeholder | How the Dataset Adds Value | Potential Use Cases |
|---|---|---|
| Academic researchers | Enables large‑scale musicology studies, genre evolution analysis, and sociocultural trend tracking. | Citation‑rich papers on streaming impact, network‑analysis of collaboration graphs. |
| App developers | Provides a ready‑made metadata alternative to Spotify’s premium‑only endpoints. | Building open‑source recommendation engines, offline music library managers. |
| Music enthusiasts | Allows deep‑dive exploration of obscure tracks, regional releases, and historical discographies. | Personal music discovery tools, curated playlists based on metadata filters. |
Benefits of Accessing Full‑Scale Spotify Metadata
- Comprehensive coverage eliminates sampling bias common in smaller datasets.
- Uniform data structure simplifies integration with machine‑learning pipelines.
- Open licensing removes cost barriers for startups and indie developers.
- Historical snapshots preserve a “time‑capsule” of the streaming ecosystem for future audits.
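Practical Tips for Using Anna’s Archive Data Responsibly
- Check licensing – A CC0 dedication waives only the dedicator’s own rights and does not limit use to non‑commercial or research purposes; it cannot waive rights held by third parties, so confirm independently that your intended use is permitted.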
- Respect rate limits – When pulling subsets via the provided API, stay below 5 req/s to avoid accidental denial‑of‑service.
- Attribute source – Include a brief credit line (“Metadata sourced from Anna’s Archive (2025)”) in any public-facing product.
- Monitor updates – Subscribe to the monthly changelog to keep your local copy synchronized with Spotify’s catalog changes.
- Implement data hygiene – Regularly purge duplicate ISRC entries and validate genre tags against the MusicBrainz taxonomy.
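Two of the tips above lend themselves to a short sketch: spacing out requests so a client stays at or below 5 req/s, and purging duplicate ISRC entries. The `Throttle` class, the `dedupe_by_isrc` helper, and the row layout are hypothetical names invented for this illustration, not part of any published Anna’s Archive tooling.

```python
import time


class Throttle:
    """Enforce a minimum spacing between calls (5 req/s means 0.2 s apart)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def dedupe_by_isrc(rows):
    """Keep the first row seen per ISRC; rows without an ISRC are kept as-is."""
    seen = set()
    unique = []
    for row in rows:
        # Compare ISRCs case-insensitively and ignore stray whitespace.
        isrc = (row.get("isrc") or "").strip().upper()
        if isrc:
            if isrc in seen:
                continue
            seen.add(isrc)
        unique.append(row)
    return unique


# Assumed row layout for illustration only.
rows = [
    {"isrc": "USXXX2500001", "title": "Example Song"},
    {"isrc": "usxxx2500001", "title": "Example Song (duplicate)"},
    {"isrc": None, "title": "No ISRC"},
]
clean = dedupe_by_isrc(rows)
```

In a real client, `Throttle(5).wait()` would be called immediately before each request. Rows lacking an ISRC are deliberately kept rather than merged, since there is no reliable key to decide whether they are duplicates.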
Real‑World Use Cases
Case Study 1: Open‑Source Recommendation Engine
- Project: “EchoFind” (GitHub, released Jan 2026).
- Implementation: Consumed the `audio_features` table to train a collaborative‑filtering model.
- Result: Achieved a 12 % higher precision‑recall score compared to using only Spotify’s public API sample data.
Case Study 2: Academic Study on Global Music Trends
- Paper: “Streaming Diversity in 2025: A Quantitative Analysis” (Journal of Digital Music, March 2026).
- Method: Leveraged regional availability flags to map track distribution across 195 countries.
- Finding: Identified a 23 % growth in non‑English language tracks entering the top‑100 playlists worldwide.
Case Study 3: Archival Preservation Initiative
- Organization: The International Music Heritage Consortium (IMHC).
- Goal: Preserve metadata for tracks at risk of removal due to licensing changes.
- Outcome: Created a public “Endangered Tracks” registry, alerting curators to potential loss before it occurs.
Risks and Compliance: Navigating Copyright and Data‑Use Policies
- Copyright exposure – While metadata itself is generally uncopyrighted, linking to full audio streams without permission can violate Spotify’s terms.
- Data‑privacy concerns – User‑generated playlists are excluded from the release; any accidental inclusion must be reported and removed.
- Jurisdictional variations – Some countries classify ISRC and release data as “protected cultural facts.” Verify local regulations before commercial exploitation.
- Mitigation strategies –
  * Use anonymized datasets (strip user‑ids).
  * Deploy a “right‑to‑be‑forgotten” process for any inadvertently captured personal data.
  * Maintain a legal‑review checklist for each product launch involving the archive.
Future Outlook: Potential Developments in Music Data Transparency
- Negotiated data‑sharing agreements – Industry groups are discussing a standardized open‑metadata framework that could legitimize large‑scale data access while protecting rights‑holders.
- Enhanced API layers – Anna’s Archive plans to launch a GraphQL endpoint in Q3 2026, allowing granular queries (e.g., “all tracks released in 2024 with explicit flag = false and danceability > 0.8”).
- AI‑driven enrichment – Integration with natural‑language processing models to auto‑generate lyrical themes and mood descriptors for each track, expanding the dataset beyond raw technical fields.
- Community‑driven curation – A volunteer “metadata watchdog” board will review and flag any inaccuracies, ensuring the archive remains a trustworthy resource for years to come.
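Until such an endpoint exists, the example query quoted above (tracks released in 2024, not explicit, danceability above 0.8) can be approximated against a local copy of the relational tables. The sketch below uses an in‑memory SQLite database with made‑up rows; the column names `release_year`, `explicit`, and `danceability` are assumptions and may not match the published schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny illustrative tables; the rows are invented for this example.
conn.executescript("""
CREATE TABLE tracks (
    id TEXT PRIMARY KEY, title TEXT, release_year INTEGER, explicit INTEGER
);
CREATE TABLE audio_features (track_id TEXT PRIMARY KEY, danceability REAL);
INSERT INTO tracks VALUES
    ('t1', 'Dance Track', 2024, 0),
    ('t2', 'Explicit Dance Track', 2024, 1),
    ('t3', 'Slow Track', 2024, 0),
    ('t4', 'Older Dance Track', 2023, 0);
INSERT INTO audio_features VALUES
    ('t1', 0.91), ('t2', 0.88), ('t3', 0.35), ('t4', 0.95);
""")

# Parameterized filter: release year, explicit flag = false, danceability floor.
QUERY = """
SELECT t.id, t.title
FROM tracks AS t
JOIN audio_features AS f ON f.track_id = t.id
WHERE t.release_year = ? AND t.explicit = 0 AND f.danceability > ?
ORDER BY t.id
"""

matches = conn.execute(QUERY, (2024, 0.8)).fetchall()
print(matches)  # [('t1', 'Dance Track')]
```

Only `t1` satisfies all three conditions: `t2` is explicit, `t3` falls below the danceability threshold, and `t4` was released in 2023.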