Andrew Thangaraj: Statistical Investigations into the Unseen - Missing Mass and its Variants
Abstract: Suppose we observe a sequence of samples from a very large alphabet, and the number of samples is much smaller than the alphabet size. Several letters of the alphabet will then be missing, or unseen, in the observed samples. What can be statistically inferred about these unseen letters? Such inferences, though non-obvious and non-trivial, are very useful in applications such as ecology, language models and DNA sequencing. The total probability mass of all missing letters is defined to be the “missing mass” of the samples. Surprisingly, the famous Good-Turing estimator (from the Second World War period!) provides an “optimal” estimate of missing mass irrespective of the distribution or alphabet size. In this talk, we will discuss the missing mass, the Good-Turing estimator and its statistical basis/optimality. In closing, we will discuss new results on some variants of missing mass and on estimators that can potentially shed more light on the unseen parts of more general distributions.
Only a knowledge of probability will be assumed in the talk.
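For readers who want a concrete picture before the talk, here is a minimal Python sketch of the two quantities named in the abstract. It assumes the standard textbook form of the Good-Turing missing-mass estimate, namely the number of letters seen exactly once divided by the number of samples; the abstract itself does not state a formula, and the function names and the toy distribution are illustrative assumptions, not the speaker's code.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate of missing mass (assumed standard form):
    (number of letters seen exactly once) / (number of samples)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def true_missing_mass(samples, probs):
    """True missing mass: total probability of the letters that do not
    appear in the samples. `probs` maps each letter to its probability
    and is only available in a simulation, never in practice."""
    seen = set(samples)
    return sum(p for letter, p in probs.items() if letter not in seen)


# Toy usage with a hypothetical three-letter distribution.
toy_probs = {"a": 0.5, "b": 0.25, "c": 0.25}
toy_samples = ["a", "b", "a", "a"]
estimate = good_turing_missing_mass(toy_samples)    # only "b" is a singleton
actual = true_missing_mass(toy_samples, toy_probs)  # "c" was never seen
```

Note that the estimate uses only the observed samples, with no knowledge of the distribution or the alphabet size, which is the property the abstract highlights.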
Bio: Andrew Thangaraj received his B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT), Madras, India in 1998 and a PhD in Electrical Engineering from the Georgia Institute of Technology, Atlanta, USA in 2003. He was a post-doctoral researcher at the GTL-CNRS Telecom lab at Georgia Tech Lorraine, Metz, France from August 2003 to May 2004. Since June 2004, he has been with the Department of Electrical Engineering, IIT Madras, where he is currently a professor. From Jan 2012 to Jan 2018, he served as Editor for the IEEE Transactions on Communications. From July 2018 to July 2022, he served as an Associate Editor for Coding Techniques for the IEEE Transactions on Information Theory. Since Oct 2011, he has been serving as NPTEL coordinator at IIT Madras. He has played a key role in initiating and running NPTEL online courses and certification. He is currently the PI of the SWAYAM project of the Ministry of Education, Government of India. Since May 2020, he has been serving as Coordinator for the BS (Data Science) Program at IIT Madras. Since May 2024, he has been serving as Chair for the Centre for Outreach and Digital Education (CODE) at IIT Madras.