At the Microsoft Ignite conference recently, I saw a talk that mentioned the Microsoft Garage Project, Trove, which is designed to help people provide data for AI projects in a new way. You can read more about it and get the app for Android mobile devices.
Trove is built to help AI researchers find images and use them in projects. However, the data they get is provided by users, who choose to include their data. This is different from many AI projects, where those doing the work often just gather data from various sources, sometimes without permission, and often without the individuals who own the data understanding where their data is being used or for what purposes.
I like the idea here of people specifically giving permission for their data to be used. It's a good way for volunteers to provide data and have some control over how and where the information they provide might be used. That doesn't mean this is necessarily a good model for the future. First, I'm not sure we can easily verify that the images someone submits are their own. If payments are made, I'm sure people will try to game this and earn more money by using images they don't own. We already have problems with people publishing content they didn't create. I'm sure we'll have plenty more with something like Trove.
The other issue, and likely the biggest one, is that understanding what data is collected and how it's used by many companies is a challenge. Even when there is some disclosure, it can be difficult to understand what is being released. Even when reading this document on SQL Server data collection, I'm not sure what might be collected on my system that could be an issue.
I don't think this is malicious or deceitful on Microsoft's part. I'm just not sure I can understand the implications. That is where I feel we, as a society, and certainly with regard to regulations, are woefully immature. We don't have good controls, but I'm not sure we really know what we'd want.
This is a thorny problem, and one I know we need to find better solutions to over time, especially as we use more and more data for large-scale research and applications in areas such as Artificial Intelligence and Machine Learning.