Aegis School of Business, Data Science, Cyber Security & Telecommunication

Application fee: 13.61 USD
Course fee: 340.3 USD
GST: 18 %

Bigdata with Apache Spark

Certification Body: Aegis School of Data Science
Location: On-campus (India, Mumbai, Pune, Bangalore)
Type: Certificate course
Director: Dr. Vinay Kulkarni
Coordinator: Ritin Joshi
Language: English
Course fee: 340.3 USD
GST: 18%
Total course fee: 401.55 USD

Course Details

Spark, an alternative for fast data analytics
Although Hadoop attracts the most attention in distributed data analytics, there are alternatives that offer interesting advantages over the typical Hadoop platform. Spark is a scalable data analytics platform built around primitives for in-memory computing, which gives it a performance advantage over Hadoop's disk-based cluster storage approach for some workloads. Spark is implemented in Scala and exposes that language as its environment for data processing. Get to know the Spark approach to cluster computing and how it differs from Hadoop.

Spark is an open source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads: Spark provides in-memory distributed datasets that speed up iterative workloads as well as interactive queries.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is complementary to Hadoop and can run side by side with it over the Hadoop file system. This behavior is supported through a third-party clustering framework called Mesos. Spark was developed at the Algorithms, Machines, and People Lab (AMPLab) of the University of California, Berkeley, to build large-scale, low-latency data analytics applications.

Spark cluster computing architecture
Although Spark has similarities to Hadoop, it is a new cluster computing framework with useful differences. In particular, Spark was designed for a specific type of cluster computing workload: jobs that reuse a working set of data across parallel operations (such as machine learning algorithms). To optimize for these workloads, Spark introduces in-memory cluster computing, in which datasets can be cached in memory to reduce access latency.
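The benefit of keeping a working set in memory can be sketched with a small toy model. The example below is plain Python, not the Spark API: `ToyDataset`, `iterative_mean`, and `load_numbers` are hypothetical names invented for illustration, and the "storage read" is only simulated with a counter. It counts how many times an iterative job has to re-read its input with and without caching. In Spark itself, the comparable step is calling `cache()` on an RDD.

```python
# Toy model (plain Python, NOT the Spark API): count how often a dataset is
# re-read from slow storage during an iterative job, with and without
# keeping the working set in memory.


class ToyDataset:
    """Hypothetical stand-in for a distributed dataset."""

    def __init__(self, loader):
        self._loader = loader  # function that simulates a slow storage read
        self._cached = None    # in-memory copy, filled by cache()
        self.loads = 0         # number of simulated storage reads so far

    def _read(self):
        """Simulate one full read from storage."""
        self.loads += 1
        return list(self._loader())

    def cache(self):
        """Read the data once and keep it in memory for later passes."""
        if self._cached is None:
            self._cached = self._read()
        return self

    def records(self):
        """Return the records from memory if cached, otherwise from storage."""
        if self._cached is not None:
            return self._cached
        return self._read()


def iterative_mean(dataset, iterations):
    """Refine an estimate of the mean with repeated passes over the data."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.records()  # each pass needs the full working set
        true_mean = sum(data) / len(data)
        # Move the estimate halfway toward the true mean on every pass.
        estimate += 0.5 * (true_mean - estimate)
    return estimate


def load_numbers():
    """Simulated storage contents: the integers 1 to 100."""
    return range(1, 101)


uncached = ToyDataset(load_numbers)
iterative_mean(uncached, iterations=10)

cached = ToyDataset(load_numbers).cache()
iterative_mean(cached, iterations=10)

# Storage reads without caching versus with caching.
print(uncached.loads, cached.loads)
```

Both runs compute the same estimate; only the number of simulated storage reads differs, and the gap grows with the number of iterations.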

Course Content:

1. Basics

  • Why Spark?
  • What does it mean to learn Spark?
  • Spark Basics
  • Installation
  • Scala: An Introduction
  • The Spark Context
  • Introduction to RDDs
  • RDDs: Creation / Transformation / Actions
  • Exercises and applications

2. DataFrames, Datasets, Spark SQL, and Spark Streaming

  • Introduction
  • The SQL Context
  • Data I/O
  • Transformations and Actions
  • Concepts and elements of Streaming
  • Working with Spark Streaming
  • Exercises and applications

3. Spark MLlib and Spark GraphX

  • Basics of MLlib
  • Statistics using MLlib
  • Machine Learning using MLlib
  • Basics of Graph Processing
  • GraphX RDDs
  • Applications of GraphX
  • Exercises and applications

4. Case Studies, Applications, and Project Discussions

  • Case studies, applications
  • Project Discussions
  • Spark resources
  • Trends

5. Final Project