Machine Learning Challenges

  • Comments posted to this topic are about the item Machine Learning Challenges

  • I like reading 'Voice of the DBA'. Keep it going.
    This is my first post here! ML is an interesting subject, as is AI (is there much of a difference?). However, the goal of ML is often, as you say in your piece, to make predictions based on past outcomes and known significant factors. The most interesting aspect of ML, I believe, is the use of profiling, which has been around for some time - as you may be aware, it's how targeted advertising works - and it's based on similar rules. The biggest problem is that in large datasets we're dealing with statistics. To use your example, the statistic 'how many of the people who visit New York City also go for a run in Central Park' is used by ML as a rule. Now this is where things go awry. Statistics is for groups, not individuals. Statistics is for drawing general conclusions about a population from its data. Applying that generality back to the individual is an error.
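
    A minimal sketch of one way this goes wrong - toy numbers of my own, nothing from the article - where the pooled, group-level correlation points the opposite way from the relationship inside each group (Simpson's paradox), so a "rule" learned from the aggregate misleads you about individuals:

    ```python
    # Toy Simpson's paradox demo: within each group the relationship between
    # x and y is negative, but pooling the groups makes it look positive.
    import numpy as np

    rng = np.random.default_rng(0)

    # Group A: low x, y falls as x rises
    x_a = rng.normal(0.0, 1.0, 500)
    y_a = -x_a + rng.normal(0.0, 0.5, 500)

    # Group B: higher x overall, same negative within-group relationship,
    # but shifted so the group as a whole has higher y too
    x_b = rng.normal(4.0, 1.0, 500)
    y_b = -x_b + 8.0 + rng.normal(0.0, 0.5, 500)

    x_all = np.concatenate([x_a, x_b])
    y_all = np.concatenate([y_a, y_b])

    print("within group A:", np.corrcoef(x_a, y_a)[0, 1])              # ~ -0.9
    print("within group B:", np.corrcoef(x_b, y_b)[0, 1])              # ~ -0.9
    print("pooled, group-level view:", np.corrcoef(x_all, y_all)[0, 1])  # ~ +0.6
    ```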

  • I have to point out that machine learning is not just about making predictions from data. Machine learning can also be about determining something without being explicitly told to do so. For example, ML can be used to classify data and learn to improve that classification each time it classifies something.

    I know that may sound like semantics to some, but there is a difference between manipulation and definition, just as there is a difference between supervised and unsupervised techniques, among other things. While one approach may not suit a given problem, another approach may be better.

    For example, every time you go out for a run, the system could learn what you are doing when you run in order to help you run better, or maybe make that experience more enjoyable. It might analyze your exact location via GPS coordinates, analyze data relevant to that location, and make you aware of events or places you may want to know about: a delay, your favorite store, an unexpected change in the weather, and so forth. Then maybe it classifies these events based on how you react to them, in order to learn over time what is and is not relevant to you and how you respond. That way it's not just spamming you all the time or forcing you to define a series of preferences. (A rough sketch of that incremental-learning idea is at the end of this post.)

    In the Google world, it may also dip into other sources of data you use for other problems and tasks in order to inform itself. For example, if you are looking at a restaurant's menu online, it may use that information to alert you that the restaurant is along the route you are running. Or if you looked up on IMDB a movie you watched last night, it might let you know that an actor from that movie is doing an event nearby while you are jogging.

    Etc, etc.
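
    A rough sketch of the incremental-learning idea above - the data, labels, and helper names here are made up purely for illustration; the real pipelines are obviously nothing this simple. It uses scikit-learn's partial_fit so the classifier keeps updating as each new batch of "events" (with your reaction as the label) arrives:

    ```python
    # Incremental ("online") classification: the model improves a little
    # every time it sees how the user actually reacted to new events.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    classes = np.array([0, 1])          # 0 = ignored, 1 = user found it relevant
    clf = SGDClassifier(random_state=0)

    def days_events(n=50):
        """Fake batch of events: 2 numeric features and a hidden 'relevance' rule."""
        X = rng.normal(size=(n, 2))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
        return X, y

    for day in range(10):
        X, y = days_events()
        if day > 0:
            # How well does yesterday's model classify today's events?
            print(f"day {day}: accuracy on new events = {clf.score(X, y):.2f}")
        clf.partial_fit(X, y, classes=classes)   # learn from today's reactions
    ```

    The accuracy on fresh events tends to climb as the loop runs, which is the "learns how to improve that classification over time" part.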

  • Machine learning has its limitations. There was a recent article on how a single bad pixel "broke" the training of a program designed to recognize images. Then there are the Google algorithms that fail in their ad and video recommendations, often in amusing ways. And the "smart" cars that crash into parked cars on the side of the road.

    Windows 10 brings all sorts of inane suggestions from Cortana. 🙂

  • lars.m.haggqvist - Tuesday, April 17, 2018 3:56 AM

    I like reading 'Voice of the DBA'. Keep it going. ... Statistics is for groups, not individuals. Statistics is for drawing general conclusions about a population from its data. Applying that generality back to the individual is an error.

    This comment caused me to think of something similar that was used to predict the actions of large groups of people (a very emergent-type system, I would think): the science of "psychohistory" from Asimov's Foundation series. To some extent, especially when trying to predict the actions of people, that's what this sounds like. In theory, one could then plug different values into such a system to try to determine what might happen if some factor changes. For example, would people still go jogging in Central Park if it were warm but very windy? What about if there's a police situation in one area of the park? From there, you could potentially alter people's behavior (phone in a fake police situation to keep people out of the park for some reason).

    As it is, ML is only as good as the data fed into it and the quality of the model. The more data you can "feed" the system and the more you can train it, the more likely it is to handle oddities such as emergent behavior. Potentially, with enough time, data, and a strong enough model, one would expect you could fairly accurately predict the stock market.

  • One of the problems is that real-world situations rarely come down to just a few convenient decision points. Going even a little deeper expands the number of points exponentially, which then requires exponentially more (reliable) data. It quickly gets to the point where the amount of data required is beyond anything short of omniscience. At the same time, the amount of experience needed to train such a system expands similarly.

    Do we really want an omniscient Facebook cataloging all our activities on the premise that some of them have predictive value?

    More data does not always improve performance; indeed, there have been documented cases where it was detrimental.
    * More data is necessarily drawn from many sources, of unknown bias and reliability.
    * More data can produce more artifacts: data points that simply do not relate reliably to the question at hand. This has historically been a problem in scattershot medical studies where large amounts of data were examined for statistical correlation; given a wide enough search, all sorts of bogus correlations can be extracted.
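
    A toy simulation of that scattershot effect - fabricated noise, no real medical data - where the outcome and every "measurement" are pure random numbers, yet searching 2,000 variables still turns up several with impressive-looking correlations:

    ```python
    # Multiple-comparisons demo: with enough unrelated variables, some of them
    # will correlate with a purely random outcome just by chance.
    import numpy as np

    rng = np.random.default_rng(42)
    n_patients, n_variables = 100, 2000

    outcome = rng.normal(size=n_patients)                     # no real signal at all
    measurements = rng.normal(size=(n_patients, n_variables))

    corrs = np.array([np.corrcoef(measurements[:, j], outcome)[0, 1]
                      for j in range(n_variables)])
    strongest = np.argsort(-np.abs(corrs))[:5]
    for j in strongest:
        print(f"variable {j}: r = {corrs[j]:+.2f}")           # typically |r| > 0.3
    ```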

    ...

    -- FORTRAN manual for Xerox Computers --

  • Since the introduction of machine learning (ML), I've been quite interested in it. It involves statistics, which I enjoy. But I can see how something like ML might not work in all situations. For example, I work for a state agency. We gather a lot of data and then periodically report it to various Federal agencies to satisfy grant requirements. However, I seriously doubt that one could apply ML to things like the spread of disease in one area of the state. Still, the topic interests me. Hopefully, at some point, somewhere, I'll be able to use ML.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • Rod at work - Wednesday, April 18, 2018 8:44 AM

    ... But I can see how something like ML might not work in all situations. For example, I work for a state agency. ...

    There is an old saying: "if you only have a hammer, every job looks like a nail." It's just too easy to look at small or very constrained problems and assume that more power and more data will solve bigger problems.

    But the number of permutations can quickly exceed the number of electrons in the universe. And non-trivial emergent properties are rarely discernible from statistical data.
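
    A quick back-of-the-envelope check of that claim, taking the commonly quoted ~10^80 electrons in the observable universe as the yardstick (my assumption, not a figure from this thread):

    ```python
    # How many yes/no decision points until the number of combinations
    # exceeds roughly 1e80 (a commonly quoted electron count for the universe)?
    ELECTRONS = 10 ** 80

    n = 1
    while 2 ** n <= ELECTRONS:
        n += 1
    print(n)   # 266 -> a mere 266 binary factors give 2**266 ≈ 1.2e80 combinations
    ```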

    ...

    -- FORTRAN manual for Xerox Computers --

  • jay-h - Wednesday, April 18, 2018 8:54 AM

    Rod at work - Wednesday, April 18, 2018 8:44 AM

    ... But I can see how something like ML might not work in all situations. For example, I work for a state agency. ...

    There is an old saying: "if you only have a hammer, every job looks like a nail." It's just too easy to look at small or very constrained problems and assume that more power and more data will solve bigger problems.

    But the number of permutations can quickly exceed the number of electrons in the universe. And non-trivial emergent properties are rarely discernible from statistical data.

    I agree. That is, I believe, our problem. Well, "problem" is overstating it; "issue" might be better. For example, we collect data on the emergence of hantavirus, as one of the diseases we report on. But things beyond what we collect contribute to how much hantavirus may occur. We've had a mild winter. Anecdotally, this means we're probably going to have a larger-than-normal occurrence of hantavirus, because the population of the mice that carry the disease is likely to be larger. We know this to be true, but we don't correlate weather conditions with the impact they may have on the disease. And what happens if we experience significant forest fires, reducing the population of the hantavirus-carrying mice? That, too, will have an impact on the incidence of hantavirus. All of these things we know will affect it, but we don't collect that data. The sampling set could be quite huge if one were to try to pull in everything that might impact these sorts of things.

    Kindest Regards, Rod Connect with me on LinkedIn.
