TinyFish launches Bigset, an open-source tool for building live datasets from text prompts

TinyFish has launched Bigset, an open-source system designed to turn a plain-language request into a structured dataset built from live web sources. The company says the tool is aimed at users who want current, exportable data without having to write and maintain scraping scripts.

Bigset works by taking a text prompt and inferring the dataset schema from it. From there, it sends autonomous agents to search the web, gather information from public pages, verify what they find, remove duplicates, and return the results in a clean table. The output can be exported to common formats such as CSV and XLSX.
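As a rough illustration of the final step, the sketch below shows how an inferred schema and a set of collected rows could be written out as CSV. It is not Bigset's code: the `Column` type, the `SCHEMA` columns, the `to_csv` helper, and the sample rows are all hypothetical, chosen only to show the shape of the output.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    description: str


# Hypothetical schema that might be inferred from a prompt such as
# "GPU prices by vendor". Column names are illustrative assumptions.
SCHEMA = [
    Column("gpu_model", "Primary key: the GPU being priced"),
    Column("vendor", "Seller or cloud provider"),
    Column("price_usd", "Listed price in US dollars"),
    Column("source_url", "Page the value was read from"),
]


def to_csv(rows: list[dict[str, str]], schema: list[Column]) -> str:
    """Serialize rows to CSV text, with columns in schema order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[column.name for column in schema],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, not real data.
rows = [
    {"gpu_model": "Example GPU A", "vendor": "Example Store",
     "price_usd": "999", "source_url": "https://example.com/a"},
    {"gpu_model": "Example GPU B", "vendor": "Example Cloud",
     "price_usd": "1499", "source_url": "https://example.com/b"},
]

print(to_csv(rows, SCHEMA))
```

Keeping the schema as data rather than hard-coding column names means the same export routine would work for whatever columns a prompt produces.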

How Bigset is structured

According to TinyFish, the system divides the work between two kinds of agents. An orchestrator agent identifies which rows belong in the dataset and where on the web the relevant information can be found. It then assigns sub-agents to complete the individual records.

The orchestrator does not write data itself. Instead, each sub-agent is assigned a single entity to research and is limited to six tool calls. Those sub-agents use TinyFish Search and Fetch to pull information from real web pages, then insert one verified row along with source URLs and a record of how the data was collected.
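The six-call limit can be pictured as a small budget object that every tool invocation passes through. The sketch below is a minimal illustration under stated assumptions: `ToolBudget`, `research_entity`, and the `fake_search` and `fake_fetch` stand-ins are hypothetical names, and they do not reflect the actual TinyFish Search and Fetch interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TOOL_CALLS = 6  # per-sub-agent limit reported by TinyFish


class ToolBudgetExceeded(RuntimeError):
    """Raised when a sub-agent tries to exceed its tool-call limit."""


@dataclass
class ToolBudget:
    limit: int = MAX_TOOL_CALLS
    log: list[str] = field(default_factory=list)

    def call(self, name: str, tool: Callable[[str], str], arg: str) -> str:
        # Refuse the call once the budget is spent.
        if len(self.log) >= self.limit:
            raise ToolBudgetExceeded(f"limit of {self.limit} tool calls reached")
        # Record how the data was collected before running the tool.
        self.log.append(f"{name}({arg})")
        return tool(arg)


def fake_search(query: str) -> str:
    """Stand-in for a search tool: returns a made-up URL for the query."""
    return "https://example.com/" + query.replace(" ", "-")


def fake_fetch(url: str) -> str:
    """Stand-in for a fetch tool: returns placeholder page content."""
    return f"<html>page at {url}</html>"


def research_entity(entity: str) -> dict:
    """Research one entity and return one row with its sources and call log."""
    budget = ToolBudget()
    url = budget.call("search", fake_search, entity)
    budget.call("fetch", fake_fetch, url)
    return {"entity": entity, "source_urls": [url], "collection_log": budget.log}


row = research_entity("Example Corp")
print(row["collection_log"])
```

Routing every call through one object also yields the collection record as a side effect, since the log that enforces the limit doubles as the trace attached to the row.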

TinyFish says the system is built to avoid fabricated entries. If a value cannot be confirmed, the agent is instructed to leave the field blank rather than guess. The system also rejects duplicate primary keys automatically.
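Both rules map onto a simple table abstraction. The following sketch assumes a hypothetical `Table` class in which unconfirmed values are stored as `None` and a second insert with the same primary key is rejected. Normalizing keys by trimming whitespace and ignoring case is an assumption made for the example, not something TinyFish has described.

```python
from __future__ import annotations


class DuplicateKeyError(ValueError):
    """Raised when a row's primary key is already present in the table."""


class Table:
    def __init__(self, primary_key: str, columns: list[str]) -> None:
        self.primary_key = primary_key
        self.columns = columns
        self._rows: dict[str, dict[str, str | None]] = {}

    def insert(self, row: dict[str, str | None]) -> None:
        key = row.get(self.primary_key)
        if not key:
            raise ValueError("primary key must be a verified, non-empty value")
        # Assumed normalization: ignore case and surrounding whitespace.
        normalized = key.strip().casefold()
        if normalized in self._rows:
            raise DuplicateKeyError(key)
        # Missing or unconfirmed fields stay None (blank) instead of being guessed.
        self._rows[normalized] = {col: row.get(col) for col in self.columns}

    def __len__(self) -> int:
        return len(self._rows)

    def rows(self) -> list[dict[str, str | None]]:
        return list(self._rows.values())


table = Table("company", ["company", "hq_city", "source_url"])
table.insert({"company": "Example Corp", "hq_city": None,
              "source_url": "https://example.com/about"})

try:
    table.insert({"company": " example corp ", "hq_city": "Springfield",
                  "source_url": "https://example.com/contact"})
except DuplicateKeyError:
    print("duplicate rejected")
```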

The orchestrator keeps running until the dataset reaches the requested size. TinyFish says the system gets faster over time as it learns where the data is located.
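The stopping rule amounts to a loop that dispatches work until the table is full. In the sketch below, `run_orchestrator` and `fake_sub_agent` are hypothetical names. The loop also stops if the candidate list runs out, which is an assumption, as is the choice to have the stand-in sub-agent insert nothing when an entity cannot be verified at all. Consistent with the description that the orchestrator does not write data, only the sub-agent callable touches the table.

```python
from typing import Callable, Iterable


def run_orchestrator(
    candidates: Iterable[str],
    dispatch: Callable[[str, list], None],
    table: list,
    target_size: int,
) -> int:
    """Dispatch one sub-agent per candidate until the table reaches target_size."""
    for entity in candidates:
        if len(table) >= target_size:
            break
        # The orchestrator only delegates; the sub-agent performs the insert.
        dispatch(entity, table)
    return len(table)


def fake_sub_agent(entity: str, table: list) -> None:
    """Stand-in sub-agent: inserts a row unless the entity cannot be verified."""
    if entity.startswith("unverifiable"):
        return
    table.append({"entity": entity})


table: list = []
count = run_orchestrator(
    ["a", "unverifiable-b", "c", "d", "e"],
    fake_sub_agent,
    table,
    target_size=3,
)
print(count, [row["entity"] for row in table])
```

Checking the size before each dispatch, instead of after, means a candidate that yields no row simply costs one iteration and the loop moves on to the next.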

Availability and technical details

Bigset is released under the AGPL-3.0 license and is self-hosted through Docker. TinyFish says schema inference is powered by Claude Sonnet 4.6, while the agent roles use Qwen3.7-max by default. Both models are routed through OpenRouter and can be configured separately.

The company describes the project as experimental. It says a dataset usually takes between two and five minutes to generate, and that the tool performs best on topics with publicly available web data. TinyFish also says the free tier includes 2,500 row operations per month.

To give users a starting point, Bigset ships with nine curated public datasets. These cover topics including AI companies that are hiring, GPU prices, model pricing, and top open-source repositories. The datasets can be browsed without creating an account.

TinyFish positions Bigset as an open-source alternative to proprietary natural-language dataset tools. Because it is self-hosted, the company says users get full control over the pipeline, with no per-seat pricing and no domain restrictions.

TinyFish’s broader platform

Bigset is built on top of TinyFish Search and Fetch, the same web infrastructure that underpins the company’s enterprise agent products. TinyFish is based in Palo Alto and says it has raised $47 million in Series A funding led by ICONIQ. The company counts Google, DoorDash, and Amazon among its enterprise customers and says it has processed more than 40 million agent operations.

With Bigset, TinyFish is extending its web automation stack into a public, open-source product focused on dataset generation. The launch reflects growing interest in systems that can convert natural-language instructions into structured data while keeping the resulting records traceable to live sources.