One of the challenges of machine learning is getting lots of data. Oddly enough, another challenge is having a lot of data.
It's a bit of a paradox. A data scientist or machine learning hacker needs plenty of data to build a model that predicts something accurately, but every feature (or column in a data set) is a potential influence on the decision. Which of these do you choose?
I saw a post about examining loan data and analyzing loans for risk. The post looks at how you might use a notebook to perform some analysis and does a good job of walking through the process with Databricks. It then builds, trains, and tests a model in the classic machine learning sense. The data comes from a public Lending Club dataset of actual loans.
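The build, train, test loop the post walks through can be sketched in miniature. This is not the post's actual notebook; it's a standard-library toy with invented numbers and a single made-up feature (income), just to show the shape of the process:

```python
# A minimal sketch of the classic build/train/test loop, using invented
# (income, repaid) pairs -- the real notebook works with far richer data.
from statistics import mean

data = [(25, 0), (30, 0), (38, 0), (42, 1), (55, 1), (61, 1), (70, 1), (28, 0),
        (33, 0), (48, 1)]

train, test = data[:8], data[8:]  # hold out the last 20% for testing

# "Train": learn a threshold halfway between the class means.
mean_bad = mean(x for x, y in train if y == 0)
mean_good = mean(x for x, y in train if y == 1)
threshold = (mean_bad + mean_good) / 2

def predict(income):
    return 1 if income > threshold else 0

# "Test": accuracy on the held-out rows only.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(threshold, accuracy)
```

The held-out test set is the important part: a model is only judged on data it never saw during training, which is exactly the discipline the post follows with its loan data.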
The post doesn't show the complete data set, and I didn't download it. I hope there isn't any PII in there, but I'm sure that many people who actually analyze loans work with a person's address, gender, and other personally identifiable details. That makes sense for loan companies that need to create a financial contract with an individual, but it also allows humans to use their bias (intentionally or unintentionally) to affect decisions.
One of the promises of machine learning and AI is that this bias won't be present because the machines don't have feelings about individuals. Except the data can reflect past feelings. If the data tends to show that people in a low-income neighborhood don't get loans, then the machine might pick this up as a feature to look for in future data sets. The same goes for address or any number of other variables. Certainly, the person training the model may account for this and supervise the training, but I worry about this. Far too many people trust computers to do the right thing, but they are only as good as the data.
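It's easy to see how historical bias becomes the model's rule. Here's a toy illustration with entirely invented data: past approvals were skewed against one neighborhood, and a "model" that simply learns the historical approval rate per neighborhood faithfully reproduces that skew for new applicants.

```python
# Toy illustration (all data invented): historical loan decisions were
# biased against neighborhood "B", and a model that learns approval
# rates per feature value reproduces that bias on new applicants.

# Past decisions: (neighborhood, income_band, approved)
history = [
    ("A", "high", 1), ("A", "mid", 1), ("A", "mid", 1), ("A", "low", 1),
    ("B", "high", 0), ("B", "mid", 0), ("B", "mid", 1), ("B", "low", 0),
]

def approval_rate(records, neighborhood):
    """Fraction of past applications from this neighborhood that were approved."""
    matching = [r for r in records if r[0] == neighborhood]
    return sum(r[2] for r in matching) / len(matching)

# The "trained" behavior: approve when the historical rate is above 0.5.
def predict(neighborhood):
    return approval_rate(history, neighborhood) > 0.5

# Two applicants identical in every other respect get different answers:
print(predict("A"), predict("B"))  # the bias in the data becomes the rule
```

Nothing in the code is malicious; the model is just summarizing the past. That's the whole point: if the past was biased, a faithful summary of it is too.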
Garbage in, garbage out. It's a phrase many of us have used, and it applies here. Often our data sets reflect the frailty and mistakes of past human decisions, which may make them poor choices for training the machines that will assist us in the future. They are useful, but we ought to be cognizant of the bias that might exist in our data.
I do think machine learning and artificial intelligence are useful in our world. They improve our lives, and they can help reduce the common mistakes that humans make. They can also be flawed, and we ought to be careful about how much we rely on them. Certainly the early deployment of any type of model should be controlled, limited, and carefully monitored, whether that model is driving a car, approving loans, or identifying pictures. All your training and testing will likely be with limited data, so be wary when releasing these items into the wider world.