ECE/TCOM5583: Information Theory and Statistical Inference
There has been a strong resurgence of AI in recent years. An important core technology of AI is statistical learning, which aims to automatically “program” machines with data. While the idea can date back to the 50's of the last century, the plethora of data and inexpensive computational power allow the techniques to thrive and penetrate into every aspect of our daily lives — customer behavior prediction, financial market prediction, fully automatic surveillance, selfdriving vehicles, autonomous robots, and beyond.
Information theory was first introduced and developed by the great communications engineer, Claude Shannon in the 50's of the last century. The theory was introduced in an attempt to explain the principle behind pointtopoint communication and data storing. However, the technique has been incorporated into statistical learning and has inspire many of the underlying principles. In this graduate course, we would try to explore the exciting area of statistical learning from the perspectives of information theorists. It facilitates students to have a deeper understanding of the omnipresent field of statistical learning and to appreciate the widespread significance of information theory.
The course will start by providing an overview of information theory and statistical learning. We will then aid students to establish a solid foundation on core information theory principles including information measures, AEP, source and channel coding theory. We will then introduce common and yet powerful statistical techniques such as Bayesian learning, decision forests, and belief propagation algorithms and discuss how these modern statistical learning techniques are connected to information theory. The main reference text is a book by Professor Mackay — Information Theory, Inference, and Learning Algorithms but we will also borrow heavily from materials available online. Other most important reference texts are
Other Reference Materials
Shannon, C. E. (1948), A mathematical theory of communication. Bell Sys. Tech. J. 27: 379423, 623656.
R. W. Yeung, On entropy, information inequalities, and Groups.
Law of Large Number by Terry Tao.
A First Course in Information Theory, by Raymond W. Yeung, New York: Springer.
Information Theory and Reliable Communication by R. Gallager, New York: Wiley.
Information Theory by Csiszar and Korner, New York: Academic Press.
Entropy and Information Theory by R. M. Gray, SpringerVerlag, 1990.
J. S. Yedidia, W. T. Freeman, and Y. Weiss, Understanding Belief Propagation and its Generalizations in Exploring Artificial Intelligence in the New Millennium: Science and Technology Books, 2003.
S. Verdu, Fifty years of Shannon theory, Information Theory, IEEE Transactions on, vol. 44, pp. 20572078, 1998.
I. Csiszar, The method of types, Information Theory, IEEE Transactions on, vol. 44, pp. 25052523, 1998.
Information Theoretic Inequality Prover.
Network Information Theory by El Gamal and Kim.
Stochastic Processes: Theory for Applications by Robert Gallager, 2013.
T Hofmann, B Schölkopf, AJ Smola, Kernel Methods in Machine Learning.
Office Hours
There are no “regular” office hours. But you are welcome to come catch me anytime or contact me through emails.
Course Syllabus (Tentative)
Probability review
Maximum likelihood estimator, MAP, Bayesian estimator
Graphical models and message passing algorithms
Lossless source coding theory, Huffmann coding, and introduction to Arithmetic coding
Asymptotic equipartition property (AEP), typicality and joint typicality
Entropy, conditional entropy, mutual information, and their properties
Channel coding theory, capacity, and Fano’s inequality
Continuous random variables, differential entropy, Gaussian source, and Gaussian channel
Error correcting codes, linear codes, and introduction to lowdensity parity check code
Methods of type, large deviation theory, maximum entropy principle
N.B. You will expect to expose to some Python and Matlab. You won't become an expert on these things after this class. But it is good to get your hands dirty and play with them early.
Projects
Grading
Homework: 30%.
"Midterm": 30%. Closed book test, and no Internet and cellphones, but you can bring calculator and two pieces of lettersize cheat sheet
Final Project: 40%.
Final Grade:
A: \(\sim\) 80 and above
B: \(\sim\) between 60 and 80
C: \(\sim\) between 40 and 60
D: \(\sim\) between 20 and 40
F: Below 20
Calendar
 Topics  Materials 
8/24  Overview of IT, probability overview, univariate normal distribution  Slides from Berkeley CS188, Slides from Wash U CSE 473 
8/29  Hermitian matrices, covariance matrices, independence vs conditional independence  slides, gaussian.pdf 
9/5  ML, MAP, Bayesian inference, marginalizing multivariate normal distribution  slides 
9/12  Principal component analysis (PCA), conditioning of multivariate normal distribution, product of multivariate normal distribution  slides 
9/19  Mixture of gaussian, conjugate prior, Bernoulli distribution, binomial distribution, Beta distribution  slides 
9/26  Multinormial distribution, Dirichlet distribution, Poisson distribution  slides 
10/3  Constraint optimization, Lagrange multiplier, KarushKuhnTucker condition, overview of Source Coding Theorem, Kraft's inequality, converse proof of Source Coding Theorem  slides 
10/10  ShannonFanoElias codes, forward proof of Source Coding Theorem, different entropy, Jensen's inequality  slides 
10/17  Joint entropy, conditional entropy, KullbackLeibler divergence, mutual information, Shannon's perfect secrecy  slides 
10/24  Decison trees, random forests, law of large number, asymptoic equipartion (AEP), typical sequences  slides, Random Forest notebook by Jason Sanchez 
10/31  Joint typical sequences, packing and covering lemmas, channel capacity, Channel Coding Theorem  slides 
11/7  Converse proof of Channel Coding Theorem, ratedistortion problem, ratedistortion theorem  slides 
11/14  Review  
11/21  Proof of ratedistortion theorem, Midterm  slides 
11/28  Method of type, large deviation theory  slides 
12/5  Bayesian networks, factor graphs, belief propagation algorithm  slides

