The convergence of Web3 and AI is a pivotal moment for both industries. In 2022 and 2023 alone, nearly $2 billion was poured into startups and initiatives in the Web3+AI industry. Innovations such as AI-specialized DePINs, distributed AI training, decentralized AI marketplaces, and generative AI NFTs have emerged, showcasing the technological potential and promising markets.
Data is fundamental to AI. A huge part of industry leaders’ success is attributable to massive volumes of high-quality data. While the technical details of GPT-4 were not revealed, GPT-3 was reported to have been trained on 45TB of compressed plaintext, resulting in a final model with 175 billion parameters. Stable Diffusion was trained on LAION-5B, one of the largest image-text pair datasets in the world. This 5.85-billion-pair dataset is the foundation of the model behind popular image generators such as Midjourney, NightCafe, and DreamStudio.
The Web3 community has collectively built up a Web3-native training set: millions of users across thousands of blockchains and networks produce tremendous amounts of data daily. However, the nature of decentralization presents a challenge. Each blockchain or network may optimize its data structures for specific functionalities, ending up with varied data schemas and formats. The effort required to aggregate and structure data across networks, and the demand for standardized, AI-ready data, are booming together with the exponential growth of the Web3 ecosystem.
Consequently, an infrastructure that collectively indexes and delivers Web3-native data in an AI-ready format is both beneficial and essential for integrating Web3 and AI. As decentralized networks and protocols proliferate, the data they generate increases dramatically in volume and diversity. Moreover, existing indexers are inadequate for delivering AI-ready data, since AI requires structured, standardized inputs. The effort required to curate and structure data in such decentralized environments thus becomes intensive.
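To make the idea of an “AI-ready format” concrete, the following is a minimal Python sketch of cross-chain normalization. The record shapes, field names, and unit conversions are simplified assumptions for illustration, not the schema of any particular indexer; real EVM and Solana payloads are considerably richer.

```python
from dataclasses import dataclass, asdict

# A standardized, AI-ready transfer record. Field names are illustrative.
@dataclass
class Transfer:
    chain: str
    tx_hash: str
    sender: str
    recipient: str
    amount: float   # normalized to the chain's native unit
    timestamp: int  # Unix epoch seconds

def from_evm(raw: dict) -> Transfer:
    # EVM-style records report value in wei under "from"/"to" fields.
    return Transfer(
        chain="ethereum",
        tx_hash=raw["hash"],
        sender=raw["from"],
        recipient=raw["to"],
        amount=int(raw["value"]) / 1e18,  # wei -> ETH
        timestamp=int(raw["timestamp"]),
    )

def from_solana(raw: dict) -> Transfer:
    # Solana-style records use lamports and a different field layout.
    return Transfer(
        chain="solana",
        tx_hash=raw["signature"],
        sender=raw["source"],
        recipient=raw["destination"],
        amount=int(raw["lamports"]) / 1e9,  # lamports -> SOL
        timestamp=int(raw["blockTime"]),
    )

# Heterogeneous raw records (hypothetical values)...
raw_evm = {"hash": "0xabc", "from": "0x1", "to": "0x2",
           "value": "2500000000000000000", "timestamp": 1700000000}
raw_sol = {"signature": "5KdSig", "source": "A1", "destination": "B2",
           "lamports": 750000000, "blockTime": 1700000050}

# ...collapse into one uniform, model-ready table.
dataset = [asdict(from_evm(raw_evm)), asdict(from_solana(raw_sol))]
print(dataset)
```

Once every chain’s data is mapped into a single schema like this, downstream AI pipelines can consume it without per-chain parsing logic, which is precisely the curation burden a shared indexing layer would absorb.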
Aside from comprehensiveness, Web3-specialized AIs also require timeliness. GPT-4 was reported to have a knowledge cutoff of September 2021. For a general-purpose LLM, this is sufficient, and the model has already proven itself. In a Web3-native use case, however, AIs must access the most up-to-date information to provide accurate domain knowledge and data: the industry is still in a rapidly evolving phase, and innovations and new knowledge emerge every day. Therefore, an infrastructure providing not only the most complete coverage but also real-time, up-to-date data will be a game changer.
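As a simple illustration of what a timeliness guarantee can mean in practice, here is a sketch using the web3.py library that checks whether the newest block from a node is within a freshness budget. The RPC endpoint URL and the 60-second threshold are placeholder assumptions.

```python
import time
from web3 import Web3

# Hypothetical RPC endpoint; substitute any real provider URL.
w3 = Web3(Web3.HTTPProvider("https://eth.example-rpc.com"))

MAX_STALENESS = 60  # seconds; an assumed freshness budget

latest = w3.eth.get_block("latest")
age = time.time() - latest["timestamp"]

if age > MAX_STALENESS:
    print(f"Stale: latest block is {age:.0f}s old")
else:
    print(f"Fresh: block {latest['number']} mined {age:.0f}s ago")
```

A production indexer would of course monitor many networks continuously rather than polling one, but the same staleness check underlies any real-time data guarantee.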
Organizing and providing AI-ready Web3 data as a robust solution can save time and resources for developers and projects. It boosts efficiency by allowing easy access to high-quality, ready-to-use datasets, and it ensures that AI models and applications are reliable and better fitted to Web3 use cases.
The integration of Web3 and AI also addresses the data ownership issue. Traditionally, web users are the core contributors of vast amounts of data but are barely recognized or credited when big companies profit from the user-generated data. Today’s top-notch AI models stand on these datasets, while the people behind them are rarely acknowledged. Web3, on the other hand, introduces the opportunity for individual web participants to reclaim their data ownership.
Integrating AI and Web3 also equips users with tools to control their data and potentially be rewarded for their digital contributions. By leveraging Web3 technologies, it is possible to build AI that is actually “for the people, by the people.” This ensures that the benefits of AI’s emergence do not flow only toward large corporations but are instead distributed more broadly to every individual who participates across the internet.
This potential for individual users to be credited for their contributions fosters a more open and inclusive environment and a healthier ecosystem. Innovations and decisions are more likely to be driven by collective interests and community values instead of corporate needs alone. In this environment, AI is not just a technology developed from the public’s collective input; it also serves as a tool for the public good.
With the blockchain industry anticipated to reach $1 trillion by 2030 and the AI industry to exceed $3 trillion by 2033, the next decade will be crucial in establishing the standards and infrastructure that will define how future AI systems are built. By forming the foundational layer on which AI is trained, open data is a necessary good that will help break down boundaries and advance the merits of decentralization.