Algorithms for Big Data

Teaching assistance, Bachelor's course, USTC, 2025

This course introduces theoretical topics including random sampling, dimensionality reduction, data compression, distributed computing, streaming algorithms, clustering, classification, and stochastic optimization. It also covers the underlying theory and mathematical tools such as probabilistic methods, VC dimension, communication complexity, and learning theory in machine learning (statistical learning theory).

Syllabus

  • Probability Review and Concentration Bounds

  • Dimension Reduction
    • Johonson-Linenstrauss Lemma
    • Nearest Neighbor Search
    • Locality Sensitive Hashing
  • Streaming and Sketching Algorithms
    • Probabilistic Counting, Reservoir Sampling
    • Estimating the Number of Distinct Elements
    • Quantiles
    • Heavy Hitters: Misra-Gries Algorithm, Count-Min Sketch, Count Sketch
  • Clustering
    • k-means/median/center
    • Coreset for Clustering
    • Hierarchical Clustering
  • Graph-Structured Data
    • Basics of Spectral Graph Theory
    • Random Walks and Markov Chains
    • Sublinear-Time Algorithms for Graphs
    • Graph Sparsification