Datacurve Raises $15 Million Series A Led by Chemistry to Revolutionize AI Data Generation Through Gamified Bounty Platform
Datacurve, a San Francisco–based startup specialising in high-quality coding datasets for AI model training, has raised US $15 million in a Series A funding round. The round was led by Chemistry with participation from engineer-contributors from major AI labs. This raise follows a previous seed round of approximately US $2.7 million, bringing the company’s total funding to about US $17.7 million.
Founded by Serena Ge and Charley Lee and part of Y Combinator’s Winter 2024 batch, Datacurve is deploying a gamified “bounty-hunter” contributor network to tackle one of AI’s most pressing bottlenecks: collecting expert-level software-engineering data for post-training tasks. The company reports distributing more than US $1 million in bounty rewards to over 14,000 skilled software engineers via its platform, which is designed to produce debugging tasks, algorithmic challenges, private-repo benchmarks, and agent-flow traces rather than generic labelling work.
In announcing the financing, the company emphasised that its model treats data generation as a product—users apply, complete tasks, submit deliverables, and receive payouts—rather than a traditional crowd-labelling operation. Co-founder Serena Ge explained that this experience-first mindset is critical to attracting top engineers who can handle the hardest-to-source datasets required for modern AI systems. As AI model development shifts from static datasets toward dynamic, domain-expert contributions and RL-environment traces, Datacurve’s differentiated supply-chain approach is resonating with funders.
The Series A funds will support several strategic priorities: scaling the contributor network, enhancing the platform’s verification and quality-control layers, expanding to enterprise customers that build large code-focused LLMs, and broadening beyond software engineering into domains such as finance, marketing, and healthcare. Datacurve also aims to deepen its tooling for RLHF (reinforcement learning from human feedback) and agentic-workflow training by producing datasets that reflect real developer behaviour, enabling models to learn end-to-end coding, commit workflows, and complex software reasoning.
Investor enthusiasm underscores the emerging importance of post-training data as a major frontier in AI infrastructure. Chemistry partner Mark Goldberg said that Datacurve is building “the next-generation data network” tailored for agentic AI workflows and deep coding tasks—areas where more generic data-labelling firms struggle. The involvement of contributors and staff from leading AI labs is seen as a vote of confidence in the technical rigor and specialised talent base of Datacurve’s platform.
Yet despite the strong backing and momentum, Datacurve faces significant execution challenges. Scaling a high-quality contributor network of expert engineers differs markedly from running a standard labelling operation: the community must stay motivated, the tasks must remain engaging, and the quality must meet enterprise standards. Converting custom dataset projects into recurring revenue and enterprise contracts is a further challenge. The company will also need to demonstrate that its specialised datasets deliver measurable improvements in model performance, not merely larger volumes of collected data. Finally, as large labs increasingly bring AI development in-house, Datacurve must prove the value of an external data-supply layer in an ecosystem where many players build their own datasets.
With the US $15 million raise now secured, Datacurve is positioned to accelerate its transition from an early-stage startup to a data-infrastructure platform serving the next wave of AI agents and code-based models. The company’s progress will likely be measured by how widely its datasets are adopted by major model developers, how well its contributor marketplace scales, and how far its tools and workflow capabilities expand into adjacent domains.