AI’s progress has hit a critical constraint: access to real-world data. While public datasets and web scraping powered AI’s early breakthroughs, today’s models demand proprietary data from hospitals, enterprises, studios, and regulated environments – data that’s been locked away behind legal, technical, and governance barriers. This bottleneck affects every stage of AI development, from pre-training to evaluation, forcing model builders to rely on synthetic data that can’t fully replicate the complexity of human behavior and real-world scenarios. Protege addresses this fundamental gap by creating a platform where data holders can license their proprietary datasets while maintaining privacy, IP protections, and compliance – enabling AI builders to access clinical records, media content, audio conversations, motion capture data, and other hard-to-find information at scale. Working with data partners across healthcare, media, and motion capture, the company has aggregated access to billions of data points, including over 3B clinical notes, 100M medical images, 500K+ hours of video content, and 500K+ hours of audio across 50+ languages. With their recent acquisition of Calliope Networks and partnerships spanning from the majority of “Magnificent Seven” tech companies to hundreds of data providers, Protege is becoming the central infrastructure layer connecting proprietary data with AI development needs.
AlleyWatch sat down with Protege CEO and Co-Founder Bobby Samuels to learn more about the business, its future plans, recent funding round, and much, much more…
Who were your investors and how much did you raise?
Protege raised $30M in a Series A1 round led by Andreessen Horowitz (a16z). The financing expands the company’s $25M Series A from August 2025 and brings total funding to roughly $65M since Protege’s founding in 2024. The round also includes follow-on participation from existing investors such as Footwork, CRV, Bloomberg Beta, Flex Capital, Shaper Capital, and more.
Tell us about the product or service that Protege offers.
Protege is an AI data platform unlocking access to trusted, real-world data at scale. We are transforming how the world’s real data powers AI — enabling people and institutions to contribute their knowledge safely and shape intelligence built on integrity, expertise, and human purpose. We work with private data holders across healthcare, media, and other industries to license and curate high-quality datasets that AI builders need for training, evaluation, and benchmarking. Our role is to act as the connective tissue between these two sides, making it possible to unlock valuable data while preserving privacy, IP rights, and regulatory compliance.
At its core, Protege is about turning data that’s historically been siloed, sensitive, or underutilized into a responsibly governed asset. We focus on real-world data across industries because that’s what ultimately determines how AI systems perform once they leave the lab and operate in real environments.
What inspired the start of Protege?
While AI models and compute have advanced rapidly, access to the right data has become a bottleneck. The vast majority of the world’s most valuable data, especially in regulated industries like healthcare, is not publicly available, and synthetic or manufactured data can’t fully replicate real-world complexity. Protege was born from the belief that AI’s next leap will come from unlocking real-world data, ethically sourced, expert-curated, and shared on human terms.
My co-founders and I had spent years working in privacy-first data ecosystems, and we saw an opportunity to apply those lessons to AI. We believed there was a better path forward than scraping data from the internet – one that compensated data holders, respected privacy, and enabled AI builders to train systems that can actually work in the real world.
How is Protege different?
We are built around licensed, real-world data from day one. When AI builders come to Protege, they’re looking for real-world data: the most authentic signal of how people and systems actually behave. This is not synthetic data created by AI nor manufactured data created to simulate human behavior. Across every stage of the AI development lifecycle — from pre-training to post-training to fine-tuning to evaluation — AI builders need this data. They’re looking across modalities and industries: healthcare, video, audio, motion capture, gaming, manufacturing, life sciences, real estate, finance, education, and many more. Foundational, multi-modal model-builders (including the majority of the Magnificent Seven) now work with us across multiple domains along with dozens of other model builders.
We also focus on curation and fit-for-purpose datasets rather than solely volume. As AI developers’ needs have matured, they’ve shifted from “more data” to “the right data,” and our platform is designed to meet that demand, whether it’s representative clinical scenarios in healthcare, highly specific content in media, or updated audio and motion capture needs. We unlock revenue for data providers as well, empowering data stewards to share their data assets safely and help AI learn responsibly, so that progress is both powerful and representative of the broader human population.
What market does Protege target and how big is it?
Protege sits at the intersection of AI development and proprietary data, serving both AI builders and data holders across multiple verticals, such as healthcare, media, motion-capture, and more. Fundamentally, there are 3 bottlenecks to AI progress: compute, models, and data. There are already multiple companies in the first two categories worth billions, potentially trillions. There is yet to be a dominant player in the data that is needed for AI development, and that’s the gap that Protege aims to fill.
As AI becomes more multimodal and more embedded in real-world workflows, demand for licensed, domain-specific data will only grow. We believe solving AI’s data access problem is a generational opportunity, and the market spans nearly every industry touched by AI.
What’s your business model?
We currently operate as a two-sided data platform for AI development, where AI builders purchase licensed datasets and data holders are compensated through structured agreements. We earn revenue for facilitating access and providing value-added services like curation and de-identification where appropriate. Over time, we have also expanded into benchmarks and evaluation datasets to support AI development across the full lifecycle, not just initial training.

How are you preparing for a potential economic slowdown?
In our industry, we’ve seen an acceleration in demand across the different verticals that we serve. In particular, we feel well-positioned to take advantage of not only the growing need for data for AI development but also the growing trend towards ethical data licensing for AI across industries.
This has the potential to provide other companies, organizations, and rights-holders who may be in industries that are susceptible to economic slowdowns with an additional revenue stream that didn’t previously exist. These are win-win situations where data rights holders can benefit from their existing assets, and we as a company are able to help package that data and connect data holders with AI builders actively seeking out those proprietary data sources. This helps to insulate us from broader market conditions while also providing others with opportunities beyond their existing business lines.
What was the funding process like?
Protege has been growing quickly, and we were seeing clear signals in the market that there was an opportunity to raise capital in a way that would meaningfully accelerate what we were already doing: expanding data partnerships, hiring thoughtfully, and staying flexible around potential strategic opportunities. a16z stood out as the right partner given their depth in data infrastructure, AI, and healthcare, as well as the long-term orientation they bring to company building.
This round gives us more opportunities to accelerate product development, significantly expand Protege’s data network into new domains and data formats, deepen partnerships with leading institutions, and scale the team and infrastructure required to deliver AI-ready and rights-protected access to real-world data. At the same time, we get to bring on a world-class partner who’s deeply connected to the ecosystem in which we operate.
Having Daisy Wolf, Partner at a16z, invest in us was an important part of that decision, given that her experience in healthcare and data is highly aligned with where we’re going. The round moved quickly and included continued participation from our existing investors, which we see as a strong vote of confidence in both the business and the direction we’re heading.
What are the biggest challenges that you faced while raising capital?
A big factor that’s often overlooked is how we convey our vision for the world and how we as a company fit into it when the world is changing so quickly. This is especially true in the AI space, where new models are released what seems like every week, and innovation (and disruption) is happening left and right. So having a crisp vision that we can communicate clearly to investors is paramount to ensuring that we see eye-to-eye with them quickly. This helps investors develop conviction in our vision and mission, while also ensuring that we feel confident that we’ve chosen the right partner for the long haul.
What factors about your business led your investors to write the check?
For years, the open internet powered rapid advances in AI—but that resource is now largely exhausted. Public datasets, such as Common Crawl, capture only a small slice of the web, while the vast majority of high-value data lives offline, inside hospitals, enterprises, studios, and other regulated or proprietary environments. The real bottleneck has shifted to accessing real-world data responsibly. Investors see Protege as essential infrastructure for that next phase, enabling licensed, privacy-preserving access to the data AI systems need to perform reliably in practice. In addition, folks noted the strength of the team from a variety of backgrounds, ranging from healthcare data to media to tech startups and more.
What are the milestones you plan to achieve in the next six months?
In the next six months, Protege aims to expand its verticals past healthcare, audiovisual, and motion capture, with the goal of becoming a trusted source of licensed, real-world data across domains.
Beyond training data, the Protege platform plans to evolve to support all stages of the AI model development cycle within its infrastructure, including pre-training, post-training, fine-tuning, evaluation and benchmarking, and inference, allowing for more advanced evaluation.
What advice can you offer companies in New York that do not have a fresh injection of capital in the bank?
Similar to previous eras, the one advantage that smaller companies and startups have that incumbents do not is speed. In the age of AI, this is especially true – developing new products, testing new ideas, and reaching new partners at scale has never been faster or cheaper. While this can cause traditional channels to become saturated, it does also create a world where it’s never been easier for great ideas to reach the audiences that care about what you are building.
As a result, leaning into the speed advantage is almost never a bad idea in the early stages. It increases the surface area of opportunities, while also creating more chances to discover new insights and pivot as necessary in the ever-changing landscape.
Where do you see the company going now over the near term?
Over the near term, Protege is focused on becoming the central platform for real-world, licensed data used in AI development across industries, while also being the leading voice in AI data best practices for model builders. We believe that human data that’s reflective of human activity in the real world will continue to play a greater and greater part in AI development. We aim to be the trusted leader for this type of data in the broader AI ecosystem.
What’s your favorite winter destination in and around the city?
I’m a big fan of a new AI-powered karaoke studio called Beatbox. It’s a ton of fun and a great space. (Though full disclosure, my wife and her cofounder opened it up late last year.)



