Common Crawl - Web Pulse coverage
Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data" :: https://wpnews.pro/news/microsoft-trained-its-mai-models-on-unlicensed-web-data-despite-promising-grade