I feel there is some great unification of information theory, classical signal processing, and controls with machine learning. Fundamentally, they are different angles on applying the concepts of dynamical systems to design widgets that do something useful. Quotes like this one from the top:
> Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.
greatly underappreciate the value of the theory as it applies to things like ML. One day I think we'll have a unifying theory that lets engineers design ML systems with fewer ad-hoc methods and more theoretical basis, the way we design filters and controllers today. The obstacle is the nonlinearity of the networks, but I think we'll find a way there through category theory and topology.
The main conceptual novelty that modern machine learning brings is the addition of computational complexity to the mix: from information theory's question *what is learnable?* to *what is learnable in poly-time?*
(or under similar resource constraints). This was pioneered, as far as I am aware, in Valiant's "A Theory of the Learnable" (please correct me if I'm wrong; I'm not an ML/AI historian). Interestingly, we see a similar evolution in Shannon's thinking about cryptography: from "what is secure information-theoretically, i.e., against computationally unbounded adversaries?" to "what is secure against a poly-time-bounded adversary?"
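For concreteness, here is Valiant's criterion in its standard textbook paraphrase (my summary, not a quote from his paper):

```latex
% PAC learnability, paraphrased in the standard textbook form (not Valiant's
% original wording): a concept class C is learnable if some algorithm A, for
% every target c in C, every distribution D on inputs, and all epsilon, delta
% in (0,1), when given m i.i.d. examples (x, c(x)) with x ~ D, outputs a
% hypothesis h satisfying
\[
  \Pr\bigl[\,\mathrm{err}_D(h) \le \varepsilon\,\bigr] \ge 1 - \delta,
  \qquad
  \mathrm{err}_D(h) \;=\; \Pr_{x \sim D}\bigl[\,h(x) \ne c(x)\,\bigr].
\]
% The complexity-theoretic twist is the word "efficient": efficient PAC
% learning additionally demands that m and the running time of A be
% polynomial in 1/epsilon, 1/delta, and the size of the problem instance.
```

The information-theoretic question ("is there enough signal in the samples?") is still in there; the polynomial bound on time and sample count is the new ingredient.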
The significant thing is not that we're arbitrarily applying the log function to probabilities. It is that there is a relationship between expected code length and the entropy of a source. It is surprising, to me at least, that the optimal description length of a thing is related to its probability. From that relationship flow a number of fascinating connections to many other fields. Check out Cover and Thomas's "Elements of Information Theory" for a very approachable introduction to all of these connections.
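To make that concrete, here's a minimal sketch using a toy distribution of my own (not from the article): for an optimal prefix code, each codeword length is about -log2 p(x), so the expected length lands within one bit of the entropy H(X), and matches it exactly when the probabilities are powers of 1/2:

```python
import heapq
from math import log2

# Toy source distribution (hypothetical example, chosen to be dyadic).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon entropy: H(X) = -sum_x p(x) * log2 p(x)
H = -sum(px * log2(px) for px in p.values())

# Build an optimal prefix (Huffman) code with a min-heap.
# Heap entries are (probability, tiebreaker, {symbol: codeword-so-far}).
heap = [(px, i, {s: ""}) for i, (s, px) in enumerate(p.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    p0, _, c0 = heapq.heappop(heap)  # the two least probable subtrees...
    p1, _, c1 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c0.items()}        # ...get merged, with
    merged.update({s: "1" + w for s, w in c1.items()})  # 0/1 prefixed on
    heapq.heappush(heap, (p0 + p1, tiebreak, merged))
    tiebreak += 1
code = heap[0][2]

# Expected code length L = sum_x p(x) * len(codeword(x)).
# Shannon's bound for an optimal prefix code: H <= L < H + 1.
L = sum(p[s] * len(w) for s, w in code.items())
print(f"entropy H = {H:.3f} bits, expected Huffman length L = {L:.3f} bits")
# Dyadic probabilities make every codeword length exactly -log2 p(x),
# so L equals H here: 1.75 bits in both cases.
```

That "length equals negative log probability" identity is the hinge: it is what lets code lengths, likelihoods, and compression all talk to each other.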