NPTEL

IIT Kharagpur - Scalable Data Science 

  • Offered by NPTEL

Scalable Data Science
 at 
NPTEL 
Overview

Duration

8 hours

Total fee

Free

Mode of learning

Online

Difficulty level

Intermediate

Credential

Certificate

Highlights

  • IIT Gandhinagar
  • Instructors - Prof. Anirban Dasgupta, Prof. Sourangshu Bhattacharya
  • AICTE approved FDP course
  • INDUSTRY SUPPORT : Google, Microsoft, Facebook, Amazon, Flipkart, LinkedIn etc.

Course details

Who should do this course?
  • Computer Science and Engineering students
More about this course
  • Consider the following example problems:
  • One is interested in computing summary statistics (word count distributions) for a set of words that occur in the same document in the entire Wikipedia collection (5 million documents). Naive techniques will run out of main memory on most computers.
  • One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. Storing the uncompressed model parameters would exhaust main memory.
  • One is interested in learning a supervised model or finding unsupervised patterns, but the data is distributed over multiple machines. Because communication is the bottleneck, naive methods of adapting existing algorithms to such a distributed setting might perform extremely poorly.
  • In all of the above situations, a simple data mining / machine learning task has been made more complicated by the large scale of the input data, the output results, or both. This course discusses algorithmic techniques as well as software paradigms that allow one to develop scalable algorithms and systems for common data science tasks.
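
To make the word-count example concrete, here is a minimal Python sketch of a Count-Min Sketch, one of the memory-efficient data structures named in the curriculum. It is an illustration under stated assumptions, not material from the course: the class name `CountMinSketch`, the `width` and `depth` defaults, and the use of salted BLAKE2b as the per-row hash are all choices made for this example. The structure keeps a fixed-size table regardless of how many distinct words are seen, at the cost of possibly overestimating counts.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed depth x width table."""

    def __init__(self, width=2000, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in "the cat sat on the mat the end".split():
    sketch.add(word)

# The estimate is never below the true count (3 here), but may exceed it.
print(sketch.estimate("the"))
```

Memory use is fixed at `width * depth` counters; a larger `width` reduces overestimation and a larger `depth` reduces the chance that every row collides.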

Curriculum

Week 1: Background
  • Introduction (30 mins)
  • Probability: Concentration inequalities (30 mins)
  • Linear algebra: PCA, SVD (30 mins)
  • Optimization: Basics, Convex, GD (30 mins)
  • Machine Learning: Supervised, generalization, feature learning, clustering (30 mins)

Week 2: Memory-efficient data structures
  • Hash functions, universal / perfect hash families (30 mins)
  • Bloom filters (30 mins)
  • Sketches for distinct count (1 hr)
  • Misra-Gries sketch (30 mins)

Week 3: Memory-efficient data structures (contd.), Approximate near neighbors search
  • Count Sketch, Count-Min Sketch (1 hr)
  • Approximate near neighbors search: Introduction, kd-trees etc. (30 mins)
  • LSH families, MinHash for Jaccard, SimHash for L2 (1 hr)

Week 4: Approximate near neighbors search (contd.), Randomized Numerical Linear Algebra
  • Extensions, e.g. multi-probe, b-bit hashing, data-dependent variants (1.5 hrs)
  • Randomized Numerical Linear Algebra: Random projection (1 hr)

Week 5: Randomized Numerical Linear Algebra
  • CUR Decomposition (1 hr)
  • Sparse RP, Subspace RP, Kitchen Sink (1.5 hrs)

Week 6: Map-reduce and related paradigms
  • Map reduce: Programming examples (page rank, k-means, matrix multiplication) (1 hr)
  • Big data: computation goes to data + Hadoop ecosystem (1.5 hrs)

Week 7: Map-reduce and related paradigms (contd.), Distributed Machine Learning and Optimization
  • Scala + Spark (1 hr)
  • Distributed Machine Learning and Optimization: Introduction (30 mins)
  • SGD + Proof (1 hr)

Week 8: Distributed Machine Learning and Optimization
  • ADMM + applications (1 hr)
  • Clustering (1 hr)
  • Conclusion (30 mins)

Entry Requirements

Eligibility criteria
Conditional Offer
  • Not mentioned
