Inappropriate Mathematics for Machine Learning « Machine Learning (Theory)

  • jl says:
    6/9/2008 at 3:20 pm

    Drew Bagnell points out Radically Elementary Probability Theory by Edward Nelson, which makes some of the points above.

  • jl says:
    6/9/2008 at 3:23 pm

    I can imagine this happening, but I haven’t run across it personally. Do you have an example?

  • jl says:
    6/9/2008 at 3:37 pm

    I haven’t seen ‘the smallest positive real number’ play a role in Machine Learning.

    It’s perhaps worth pointing out that this post is about Machine Learning, and not about all places that mathematics might be applied.

  • jl says:
    6/9/2008 at 4:05 pm

    I don’t quite understand the formula, because the \alpha are unconstrained. Perhaps there is an \alpha to \omega confusion?

    As I understand it, your point is that adaboost is nonconvergent in the sense that the limit point is never reached by iteration, which is something of an open vs. closed set distinction. Given this knowledge, it becomes clear that some early termination criterion is required.

    I don’t find this argument compelling because I believe the issue is convergence rate rather than an open vs. a closed set. Consider an algorithm adaboost’ which acts exactly like adaboost for the first 10^6 iterations, and then artificially imposes convergence for all future iterations. A basic claim is that adaboost’ is equivalent to adaboost for all practical purposes, since nobody wants to spend the computation for 10^6 rounds of boosting. The same computational constraint implies that some form of early stopping is necessary. Since there is a sound alternative rationale (computational considerations) which covers both adaboost and adaboost’, the argument above doesn’t seem compelling.
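
    The adaboost vs. adaboost’ comparison can be sketched with a toy iteration. This is only an illustration under stated assumptions: `boost_like` is a hypothetical stand-in, not AdaBoost itself, and the fixed margin increment of 0.5 per round is made up. It shows that an uncapped run and a run capped at 10^6 rounds give identical results for any round count below the cap, and that the loss stays positive at every finite round either way.

```python
import math


def boost_like(rounds, cap=None):
    """Toy stand-in for boosting (hypothetical, not real AdaBoost).

    The margin grows by a fixed, made-up amount each round, so the
    exponential loss exp(-margin) shrinks but never reaches its limit 0.
    If `cap` is given, updates stop after `cap` rounds (the adaboost' idea).
    """
    margin = 0.0
    for t in range(rounds):
        if cap is not None and t >= cap:
            break  # adaboost': declare convergence and stop updating
        margin += 0.5
    return math.exp(-margin)


# Below the cap, the capped and uncapped variants cannot be told apart.
uncapped = boost_like(100)
capped = boost_like(100, cap=10**6)
print(uncapped == capped, uncapped > 0.0)
```

    With a small cap the difference becomes visible: `boost_like(20, cap=10)` returns the same value as `boost_like(10)`, because the updates are frozen after round 10.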

  • Simon Lacoste-Julien says:
    3/30/2010 at 12:30 pm

    Given that I talked to John last week about this post and still didn’t have a counter-example, I would like to revisit the question by giving a meta-algorithm that someone could use to find a counter-example which will surely satisfy John. I tend to side with the mathematicians in terms of the need for rigor; my intuition is that we often carry out an analysis inside a powerful mathematical abstraction and then transfer the results back to something concrete. If the analysis is not coherent (i.e. not only non-rigorous, but also *wrong*), something bad could happen when you transfer it back to the concrete example, even though the concrete example seems to be simpler than the abstraction (e.g. finite vs. infinite). My intuition is thus that situations of the kind John is asking for do exist (I just haven’t pinpointed one yet).

    So here is the meta-algorithm:
    1) Think of a concrete machine learning setup (e.g. binary classification).
    2) Suggest an algorithm and ‘prove’ some properties of the algorithm using ‘sloppy maths’ (where ‘sloppy’ here means not bothering with the distinctions 1-3 mentioned in John’s original post). The property of the algorithm should be useful enough for John to care about it.
    3) Do your analysis in such a way that the result is actually wrong (and so depends on distinctions 1-3). Make sure that the result is still wrong when applied on a computer with finite precision.

    For example, one could ‘prove’ that their algorithm will always terminate in finite time, but then show a finite precision example for which the algorithm would actually not terminate in finite time. Or one could ‘prove’ that the generalization error of their algorithm is smaller than some quantity w.h.p, and then give a distribution on a finite set for which the generalization error of their algorithm is actually bigger than this quantity w.h.p.
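
    To make the first kind of example concrete, here is a minimal sketch of a well-known floating-point effect. It is not a machine learning algorithm and not the counter-example being asked for. The function `steps_until_equal` and its parameters are hypothetical names chosen for illustration. Over exact rationals, adding 1/10 repeatedly reaches 1 after exactly 10 steps, so an equality-based stopping test ‘provably’ terminates; in IEEE double precision the running sum skips past 1.0 without ever equaling it, so the same test never fires.

```python
from fractions import Fraction


def steps_until_equal(target, step, max_steps=1000):
    """Add `step` repeatedly and return the step count at which the
    running sum equals `target` exactly, or None if that never happens
    within `max_steps` (a safety cap so the sketch always halts)."""
    x = type(step)(0)
    for n in range(1, max_steps + 1):
        x += step
        if x == target:
            return n
    return None


# Exact rational arithmetic: the 'proof' of termination holds.
print(steps_until_equal(Fraction(1), Fraction(1, 10)))

# Double precision: 0.1 is not representable, and the equality test never fires.
print(steps_until_equal(1.0, 0.1))
```

    A step that is exactly representable in binary, such as 0.125, does terminate in floating point, which is why the failure depends on the specific inputs and is easy to miss.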

    The wrong ‘proofs’ should be simple enough so that John accepts the need to actually be careful with those distinctions to avoid the obtained contradictions in the future…

    Actually, I am starting to think of a few ideas… I will try to work them out during some free time (a scarce resource these days!). But hopefully somebody else can pitch in…
