This course is an introduction to big data analytics. It covers the algorithmic, statistical, and data management aspects of big data analysis. The emphasis is on algorithms whose running time is linear in the size of the input.
Specifically, we will cover the following areas:
- Using notebooks on AWS.
- NumPy and pandas.
- Matplotlib.
- Linear regression, LPC and vocoders.
- Vector quantization and K-Means.
- Singular value decomposition and compression of cyclical signals.
- Kolmogorov complexity and Kolmogorov sufficient statistics.
- HDFS, Hadoop and map-reduce.
  - Word-count
  - Vector-Matrix Multiplication
  - Selections
  - Projections
  - Natural Join
  - Aggregation.
- Estimation through sampling, Hoeffding bound, Glivenko-Cantelli theorem.
- Empirical Bernstein inequality and sequential estimation.
- Stratified sampling.
- HBase and a comparison of HBase to HDFS.
- Hashing
- Min-Hash and finding similar documents.
- Locality Sensitive Hashing.
- LSH for L1 and L2 distances.
- LSH for the entity resolution problem.
- Counting distinct elements.
- Estimating moments (Alon-Matias-Szegedy algorithm)
- Counting ones in a window.
- Finding heavy hitters.
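
To give a flavor of the linear-time emphasis, here is a minimal sketch of one way to find heavy hitters in a single pass over a stream. It assumes the Misra-Gries summary, which is one well-known technique for this problem; the outline above does not say which algorithm the course uses. The function name `misra_gries` and the parameter `k` are illustrative choices, not course material.

```python
def misra_gries(stream, k):
    """One-pass heavy-hitter summary (assumed technique: Misra-Gries).

    Keeps at most k - 1 counters. Any item that occurs more than
    n / k times in a stream of length n is guaranteed to survive
    in the returned dictionary. The stored counts are underestimates
    of the true frequencies, not exact values.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: increment its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A free counter is available: start tracking the item.
            counters[item] = 1
        else:
            # No free counter: decrement every counter and
            # drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # "a" occurs 6 times out of 11, which is more than 11 / 3.
    stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
    print(misra_gries(stream, k=3))
```

Each decrement round is paid for by earlier increments, so the total work is linear in the stream length, and memory stays bounded by `k - 1` counters regardless of how many distinct items appear. A second pass over the data is needed if exact counts for the surviving candidates are required.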