Big Data Analytics

This course is an introduction to big data analytics. It covers the algorithmic, statistical and data management aspects of big data analysis. The emphasis is on algorithms whose running time is linear in the size of the input.

For additional information visit:

============

Specifically, we will cover the following areas:

AWS, EC2, S3, Git and GitHub.

The IPython notebooks

Using notebooks on AWS.
numpy and Pandas.
Matplotlib.

Performance and the memory hierarchy.

I/O efficient sorting.
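
As an illustration of this topic, here is a minimal sketch of an external merge sort: the input is cut into runs small enough to sort in memory, each run is written to disk, and the runs are then merged in a single streaming pass. The function name `external_sort` and the parameter `run_size` are hypothetical, chosen for this example only, and the sketch assumes integer input stored one value per line.

```python
import heapq
import os
import tempfile


def external_sort(values, run_size):
    """Sort an iterable of integers using on-disk sorted runs and a k-way merge.

    Hypothetical helper for illustration; `run_size` stands in for the number
    of items that fit in memory at once.
    """
    run_paths = []

    def flush(run):
        # Sort one memory-sized run and write it to its own temporary file.
        run.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in run)
        run_paths.append(path)

    run = []
    for v in values:
        run.append(v)
        if len(run) >= run_size:
            flush(run)
            run = []
    if run:
        flush(run)

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge reads the runs lazily, keeping one item per run in memory.
        streams = [(int(line) for line in f) for f in files]
        # The merged output is collected into a list only to keep the sketch
        # short; a real pipeline would stream it to an output file instead.
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Each item is written once and read once, so with a single merge pass the number of disk transfers grows linearly with the input size.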

Statistical Models and Compression.

Linear regression, LPC and vocoders.
Vector quantization and K-Means.
Singular value decomposition and compression of cyclical signals.
Kolmogorov complexity and Kolmogorov sufficient statistics.

The Map-Reduce framework.

HDFS, Hadoop and map-reduce.
Word-count
Vector-Matrix Multiplication
Selections
Projections
Natural Join
Aggregation.
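
To make the word-count item concrete, the following sketch imitates the map, shuffle and reduce phases in plain Python on a single machine. It does not use Hadoop or HDFS; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Mapper: emit one (word, 1) pair for every word on the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values that share a key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reducer: sum the counts collected for one word.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real map-reduce job the mapper and reducer run on different machines and the framework performs the shuffle; only the two user-supplied functions would carry over from this sketch.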

The art of sampling

Estimation through sampling, Hoeffding bound, Glivenko-Cantelli theorem.
Empirical Bernstein inequality and sequential estimation.
Stratified sampling.
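
As a small worked example of estimation through sampling, the sketch below uses the two-sided Hoeffding bound for values in [0, 1], P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2), to choose a sample size n that gives error at most eps with probability at least 1 - delta. The function names and the assumption that values lie in [0, 1] are choices made for this illustration.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    # Solve 2 * exp(-2 * n * eps**2) <= delta for n (values assumed in [0, 1]).
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_mean(population, eps, delta, seed=0):
    # Draw n independent samples with replacement and return their average.
    rng = random.Random(seed)
    n = hoeffding_sample_size(eps, delta)
    sample = [rng.choice(population) for _ in range(n)]
    return sum(sample) / n
```

The sample size depends only on eps and delta, not on the size of the population, which is what makes sampling attractive for very large data sets.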

Column-based databases.

HBase, comparison of HBase to HDFS.
Hashing
Min-Hash and finding similar documents.
Locality Sensitive Hashing.
LSH for L1 and L2 distances.
LSH for the Entity resolution problem
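
The Min-Hash idea listed above can be sketched as follows: each document is reduced to a short signature of minimum hash values, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying token sets. The linear hash family, the use of `zlib.crc32` to map tokens to integers, and all function names are assumptions made for this example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def make_hash_params(num_hashes, seed=0):
    # Each (a, b) pair defines one hash function h(x) = (a * x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    # `tokens` must be a non-empty collection of strings.
    ids = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing positions approximates |A & B| / |A | B|.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For example, with `params = make_hash_params(256)`, two documents are compared by computing `minhash_signature` for each token set and passing both signatures to `estimate_jaccard`. More hash functions give a tighter estimate at the cost of longer signatures.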

Streaming algorithms.

Counting distinct elements.
Estimating moments (Alon-Matias-Szegedy algorithm)
Counting ones in a window.
Finding heavy hitters.
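
As one possible illustration of finding heavy hitters in a single pass, here is a sketch of the Misra-Gries summary. The course outline does not say which heavy-hitter algorithm is taught, so this choice, along with the function name `misra_gries` and the parameter `k`, is an assumption made for the example.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters.

    Every item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys; the stored counts are
    lower bounds on the true frequencies, not exact values.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The summary may also contain items that are not heavy hitters, so a second pass over the data is needed when exact frequencies of the candidates are required.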


czarifis/UCSD_BigData
