EE5390 Course Information

Table of Contents

Welcome to EE5390 (Source Coding), 2021 edition. This is a follow-up course to EE2340 (Information Sciences) and EE5847 (Information Theory). This course will cover the basic concepts of data compression. Classes will be online, primarily on Piazza (for asynchronous discussions) and Google Meet/Classroom (for homework submissions and live classes).

Piazza and Google Classroom links will be shared with registered students. If you have not received these links by the first class, please send me an email.

1 Prerequisites

  • Comfort with probability and random processes, and mathematical arguments
  • Ability to program in python/C/C++
  • EE2340 (Information Sciences) or EE5847 (Information theory)

2 Assessment (tentative):

Each student will be expected to

  • attend classes regularly and participate in discussions
  • solve in-class quizzes
  • solve homework assignments, roughly one per week (collaboration is fine as long as it is explicitly declared, but you have to write the solutions in your own words), and participate in class (online and/or offline)
  • present a paper/demo on data compression and submit a report. Details will be announced later.

Attendance and class participation: 10%
Quizzes: 10%
Homeworks: 50%
Paper/demo: 30%

3 Instructor:

Name: Dr. Shashank Vatedka
Office: 214/D, Block-C
Email: shashankvatedka@ee.iith.ac.in

4 Class timings:

  • Slot A: Mondays 9:00-10:00, Wednesdays 11:00-12:00 and Thursdays 10:00-11:00
  • Class venue: Google Classroom
  • Quiz schedule: in regular classes

5 Textbook and references:

Primarily, the material will be drawn from

  • Elements of Information Theory (EIT), Thomas M. Cover and Joy A. Thomas, second edition, Wiley (Amazon link). A limited number of copies are available in the university library. While the first edition of the book has all the material we need for the course, the second edition has a much-expanded problem set.
  • Lecture notes/slides, as and when needed.

Other references:

  • "A mathematical theory of communication", Claude Shannon, Bell systems Tech Journal, 1948. The classic paper that started this field.
  • Information theory, inference, and learning algorithms, David MacKay (soft copy available for free on the author's website). A very nice book, but gives a very different flavour that what will be covered in the course. Suggested for additional reading.
  • A student's guide to coding and information theory, Stefan Moser (amazon link).
  • Information theory: coding theorems for discrete memoryless systems, Imre Csiszar and Janos Korner. An advanced textbook. Recommended reading after covering this course as well as EE6317.

Light reading:

6 Tentative syllabus:

Recap of basic concepts in information theory; Source coding theorem; Fixed versus variable length compression; Typical sets and the asymptotic equipartition property; Proof of the source coding theorem; Variable length compression: nonsingular, uniquely decodable and prefix-free codes; Kraft inequality; Optimal codes and bounds on expected length; Huffman codes and optimality; Shannon-Fano-Elias codes and optimality; Universal codes; Arithmetic codes; Lempel-Ziv codes and optimality; Burrows-Wheeler Transform (if time permits).
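
For a flavour of one of these topics, here is a minimal Huffman construction in Python (an illustrative sketch only; the helper huffman_code and the symbol probabilities below are my own choices, not course material):

  import heapq

  def huffman_code(probs):
      """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
      # Heap entries: (probability, tie-breaker, {symbol: partial codeword})
      heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
      heapq.heapify(heap)
      counter = len(heap)
      while len(heap) > 1:
          p0, _, c0 = heapq.heappop(heap)   # two least likely supersymbols
          p1, _, c1 = heapq.heappop(heap)
          merged = {s: "0" + w for s, w in c0.items()}
          merged.update({s: "1" + w for s, w in c1.items()})
          heapq.heappush(heap, (p0 + p1, counter, merged))
          counter += 1
      return heap[0][2]

  probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
  code = huffman_code(probs)
  avg_len = sum(probs[s] * len(w) for s, w in code.items())
  print(code)     # a prefix-free code, e.g. {'a': '0', 'b': '10', ...}
  print(avg_len)  # 1.75 bits, equal to the entropy for this dyadic source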

7 Tentative schedule

  • Fixed-length compression of memoryless sources (Week 1-2; EIT Chapter 3)
    - Discrete memoryless sources
    - Fixed and variable length codes
    - Typical sets and their properties
    - Proof of the source coding theorem for memoryless sources
  • Variable-length compression (Week 2; EIT Chapter 5)
    - Nonsingular, uniquely decodable and prefix-free codes
    - Source coding theorem for variable-length codes
    - Bounds on optimal code length
    - Kraft inequality, tree codes
  • Huffman and Shannon-Fano-Elias codes (Week 2-3; EIT Chapter 5)
    - Huffman codes, construction
    - Optimality
    - Shannon-Fano-Elias codes
  • Universal codes (Week 3)
    - Introduction (EIT Section 13.3)
    - Arithmetic codes (slides)
    - Properties
  • The Lempel-Ziv algorithms (Week 4-5; EIT Section 13.4)
    - LZ77
    - LZ78
    - Properties, optimality (EIT Section 13.5.1)
    - Trie interpretation

8 Suggested list of papers/projects

You are allowed to work individually or in a group of 2. Each group must give a short presentation (around 15 minutes, including questions) and submit a report. You may also choose a project, in which case you are expected to give a working demo (code).

The topics listed below are only indicative of what is expected. You may have to read more background material. You are also free to choose a related topic not listed below.

Good sources for the latest work on data compression are the annual International Symposium on Information Theory (ISIT, which covers topics in information theory more broadly), the Data Compression Conference (DCC) and the IEEE Transactions on Information Theory. The arXiv preprint server (https://arxiv.org/list/cs.IT/recent) also contains unpublished/yet-to-be-published work, and sometimes more detailed versions of published papers.

8.1 Compression of large collections of genomes

Genomes of two individuals of the same species are highly related. These are perhaps not well modeled by standard distributions, since one DNA sequence can be obtained by performing a small number of edit operations (insertions, deletions, transpositions) on another. Traditional compression algorithms do not do a very good job of compressing large collections of DNA sequences; a toy illustration of this point appears after the references below.

  • Deorowicz et al, "Genome compression: a novel approach for large collections," Bioinformatics
  • Deorowicz et al, "GDC2: Compression of large collections of genomes," Scientific reports

8.2 Compression of FASTQ files

Finding the DNA sequence of an individual is a very interesting process in itself. The most popular technique for sequencing (estimating the DNA sequence) is called shotgun sequencing. This yields short subsequences (reads) of the actual DNA sequence, with errors. These reads, along with quality scores (a score measuring how reliable each read is), are stored in a format called FASTQ. Compressing such files is of great interest; a small sketch of the record structure follows the references below.

  • Deorowicz and Grabowski, "Compression of DNA sequence reads in FASTQ format," Bioinformatics
  • Ochoa et al, "QualComp: a new lossy compressor for quality scores based on rate distortion theory," BMC Bioinformatics
  • Chandak et al, "SPRING: a next-generation compressor for FASTQ data"
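
A FASTQ record spans four lines: an identifier, the read, a '+' separator, and the quality string. The sketch below (my own illustration; separating the streams is a common preprocessing idea, not the specific method of the papers above) parses records and compresses the header, sequence and quality streams separately.

  import gzip
  from io import StringIO

  # Two hand-made records in place of a real file, just to show the format.
  fastq = StringIO(
      "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
      "@read2\nACGTTCGTAC\n+\nIIIIIHIIII\n"
  )

  headers, seqs, quals = [], [], []
  while True:
      header = fastq.readline().rstrip()
      if not header:
          break
      seqs.append(fastq.readline().rstrip())
      fastq.readline()                      # the '+' separator line
      quals.append(fastq.readline().rstrip())
      headers.append(header)

  # Real FASTQ compressors model each stream separately, since headers,
  # bases and quality scores have very different statistics. (The toy data
  # here is too small for the compressed sizes to mean much.)
  for name, stream in [("headers", headers), ("sequences", seqs), ("qualities", quals)]:
      blob = "\n".join(stream).encode()
      print(name, len(blob), "->", len(gzip.compress(blob)), "bytes")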

8.3 Reconstructing strings from random traces

How easy is it to reconstruct an unknown string if we are only given short substrings of it (without being told which locations they are taken from)? This is called the trace reconstruction problem, and is a fascinating topic closely related to the problem of reconstructing DNA from shotgun reads. A toy setup is sketched after the references below.

  • Batu et al, "Reconstructing a string from random traces," SODA 2004
  • Acharya et al, "String reconstruction from its substring compositions," https://arxiv.org/pdf/1403.2439.pdf
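
The following Python snippet (my own illustration, with arbitrary parameters) generates the kind of data a reconstruction algorithm would start from.

  import random

  random.seed(1)
  hidden = ''.join(random.choice('ACGT') for _ in range(50))

  k = 8            # substring length (arbitrary choice)
  num_traces = 30  # number of random substrings we get to observe

  # Each observation is a length-k substring taken from a random position;
  # crucially, the position itself is NOT revealed to the reconstructor.
  traces = []
  for _ in range(num_traces):
      start = random.randrange(len(hidden) - k + 1)
      traces.append(hidden[start:start + k])

  print(sorted(traces))
  # The reconstruction problem: recover `hidden` (up to unavoidable
  # ambiguity) from this multiset of substrings alone.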

8.4 Compression with random insertions and deletions

This is closely related to the problems above, and it also occurs in cloud synchronization. If we have two files, one of which is an updated version of the other (and hence highly correlated), then how do we compress them jointly? In particular, if the second file is obtained by randomly inserting/deleting text in the first, then what is the minimum rate of compression of the second file, given that both the compressor and decompressor have access to the first?
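
A naive baseline (my own illustration, using Python's difflib; not a scheme from the literature) encodes the second file as copy/insert operations relative to the first. The project question is how close such schemes can come to the fundamental limits under random insertions and deletions.

  import difflib
  import random

  random.seed(2)
  file1 = ''.join(random.choice('abcd') for _ in range(2000))

  # file2: file1 with a few random deletions and insertions.
  chars = list(file1)
  for _ in range(10):
      del chars[random.randrange(len(chars))]
  for _ in range(10):
      chars.insert(random.randrange(len(chars)), random.choice('abcd'))
  file2 = ''.join(chars)

  # autojunk=False: the default heuristic would discard frequent symbols,
  # which is every symbol for a 4-letter alphabet.
  ops = difflib.SequenceMatcher(None, file1, file2, autojunk=False).get_opcodes()

  # Only 'insert'/'replace' opcodes carry new symbols; 'equal' spans are
  # described by their positions alone.
  literal_symbols = sum(j2 - j1 for tag, i1, i2, j1, j2 in ops if tag != 'equal')
  print(len(ops), "opcodes,", literal_symbols, "literal symbols to send",
        "(vs", len(file2), "if sent raw)")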

8.5 Zstandard compression algorithm

Zstandard is optimized for compression and decompression speed. See https://facebook.github.io/zstd for more details. The exact values of the parameters are not important here; focus on the basic concepts and on what the building blocks of the algorithm are.
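
If you want to experiment, the third-party Python bindings (pip install zstandard) give quick access to the codec. The snippet below is a minimal usage sketch, not part of the project itself.

  import zstandard as zstd

  data = b"source coding " * 10_000

  # Higher levels trade speed for compression ratio; level 3 is the default.
  cctx = zstd.ZstdCompressor(level=3)
  compressed = cctx.compress(data)

  dctx = zstd.ZstdDecompressor()
  assert dctx.decompress(compressed) == data

  print(len(data), "->", len(compressed), "bytes")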

8.6 Genomic compression standards

MPEG-G is a new set of standards for the storage and compression of sequencing data.

8.7 Graph compression

What if the source is not a sequence but has a different structure, such as a graph? This arises in instances such as compressing a social network, where each user is a vertex and two users are connected by an edge if they are "friends."
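
One classic idea in this area (my own illustration below, not tied to a specific paper) is to store each vertex's sorted neighbour list as gaps; the gaps are small for clustered graphs and compress well under a variable-length integer code.

  # A tiny, made-up adjacency structure.
  adjacency = {
      0: [1, 2, 3, 57, 58, 60],
      1: [0, 2, 3],
      2: [0, 1, 3, 59],
  }

  def gaps(neighbours):
      # First entry is the smallest neighbour, the rest are differences
      # between consecutive neighbours.
      return [neighbours[0]] + [b - a for a, b in zip(neighbours, neighbours[1:])]

  for v, nbrs in adjacency.items():
      print(v, nbrs, "->", gaps(nbrs))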

8.8 Universal compression of memoryless sources over unknown alphabets

In most circumstances we can handle situations where the distribution of the source is unknown. But what happens when the alphabet itself is unknown, or when the alphabet size can be arbitrarily large? In this scenario, it might not be possible to get a reliable estimate of the source distribution.
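
A quick Python illustration of the difficulty (my own toy example): when the alphabet size is comparable to the number of samples, a large fraction of symbols is never observed, so the empirical distribution is a poor estimate.

  import random
  from collections import Counter

  random.seed(3)
  alphabet_size = 10_000
  n_samples = 10_000

  # Draw i.i.d. samples from a uniform source over a large alphabet.
  sample = [random.randrange(alphabet_size) for _ in range(n_samples)]
  counts = Counter(sample)

  unseen = alphabet_size - len(counts)
  print(unseen, "of", alphabet_size, "symbols never appear in", n_samples, "samples")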

8.9 Processing over compressed data

Traditional compression algorithms are designed only for storage. Suppose we have a large database that we need to compress in order to save space, and we use gzip, zip, etc. for compression. If we want to query this database, we have to decompress it first, and only then can we run the query. Can we design a compression algorithm such that queries can be made directly on the compressed file?

8.10 Other topics

Author: Shashank Vatedka

Created: 2021-03-01 Mon 11:04