Aegis School of Business, Data Science, Cyber Security & Telecommunication

Application fee:	1000 * INR
Course fee:	25000 * INR
GST:	18 %

Bigdata with Apache Spark

Details

Certification Body:	Aegis School of Data Science
Location:	On-campus (India, Mumbai, Pune, Bangalore)
Type:	Certificate course
Director:	Dr. Vinay Kulkarni
Coordinator:	Ritin Joshi
Language:	English
Course fee:	25000 * INR
GST:	18%
Total course fee:	29500 * INR

Course Details

Spark, an alternative for fast data analytics
Although Hadoop captures the most attention for distributed data analytics, there are alternatives that offer interesting advantages over the typical Hadoop platform. Spark is a scalable data analytics platform that incorporates primitives for in-memory computing and therefore has performance advantages over Hadoop's disk-based cluster storage approach. Spark is implemented in the Scala language, which provides a distinctive environment for data processing. Get to know the Spark approach to cluster computing and how it differs from Hadoop.

Spark is an open source cluster computing environment similar to Hadoop, but with some useful differences that make it better suited to certain workloads: Spark enables in-memory distributed datasets that speed up iterative workloads as well as interactive queries.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.
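
To give a feel for what working with local collection objects looks like, here is a minimal sketch that uses only the Scala standard library, so it runs without Spark. The object name, the value names and the sample log lines are invented for illustration. Spark's RDD API offers methods with the same names (filter, map, reduce), so code written against a distributed dataset reads much like this, with the work spread across a cluster rather than running on one machine.

```scala
// Illustrative only: a plain local Scala collection, not a Spark RDD.
// The object name and the sample log lines are made up for this sketch.
object LocalCollectionsSketch {
  val lines: List[String] = List(
    "INFO job started",
    "ERROR disk full",
    "INFO job finished",
    "ERROR network timeout"
  )

  // Keep only the error lines (RDDs offer a filter method of the same name).
  val errors: List[String] = lines.filter(_.startsWith("ERROR"))

  // Turn each error line into its length, then add the lengths together.
  val totalErrorChars: Int = errors.map(_.length).reduce(_ + _)

  def main(args: Array[String]): Unit =
    println(s"${errors.size} error lines, $totalErrorChars characters in total")
}
```

The chain of filter, map and reduce is the style the course exercises build on once RDDs are introduced.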

Although Spark was created to support iterative jobs on distributed datasets, it is complementary to Hadoop and can run side by side with it over the Hadoop file system (HDFS). This is supported through a third-party cluster management framework called Mesos. Spark was developed at the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley, to build large-scale, low-latency data analytics applications.

Spark cluster computing architecture
Although Spark has similarities to Hadoop, it represents a new cluster computing framework with useful differences. First, Spark was designed for a specific type of workload in cluster computing: jobs that reuse a working set of data across parallel operations (such as machine learning algorithms). To optimize for these workloads, Spark introduces the concept of in-memory cluster computing, where datasets can be cached in memory to reduce access latency.
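
The benefit of keeping a working set in memory can be illustrated with a small single-machine analogy in plain Scala. This is not Spark code: the object, the `expensive` function and the `evaluationsFor` helper are hypothetical names chosen for this sketch. A lazy Scala view recomputes its elements on every pass, much as an uncached dataset would be rebuilt for each operation, while materializing it once plays the role that in-memory caching plays in Spark.

```scala
// Illustrative only: plain Scala, not Spark. A counter records how often the
// "expensive" step runs, so the effect of materializing the data is visible.
object CachingSketch {
  var evaluations = 0

  // Hypothetical stand-in for a costly load or parse step.
  def expensive(x: Int): Int = {
    evaluations += 1
    x * x
  }

  // Sums the data `passes` times and returns how many times `expensive` ran.
  def evaluationsFor(cached: Boolean, passes: Int): Int = {
    evaluations = 0
    // A view is lazy: the map function is re-run on every traversal.
    val lazyData = (1 to 5).view.map(expensive)
    // Materializing once keeps the computed values in memory for later passes.
    val data: Iterable[Int] = if (cached) lazyData.toVector else lazyData
    (1 to passes).foreach(_ => data.sum)
    evaluations
  }

  def main(args: Array[String]): Unit = {
    println(s"without caching: ${evaluationsFor(cached = false, passes = 3)} evaluations")
    println(s"with caching:    ${evaluationsFor(cached = true, passes = 3)} evaluations")
  }
}
```

With several passes over the same data, as in an iterative machine learning algorithm, the uncached variant repeats the expensive step on every pass, whereas the materialized variant pays for it only once.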

Course Content:

1. Basics

Why Spark?
What does it mean to learn Spark?
Spark Basics
Installation
Scala: An Introduction
The Spark Context
Introduction to RDDs
RDDs: Creation / Transformations / Actions
Exercises and applications

2. DataFrames, Datasets, Spark SQL and Spark Streaming

Introduction
The SQL Context
Data I/O
Transformations and Actions
Concepts and elements of Streaming
Working with Spark Streaming
Exercises and applications

3. Spark MLlib and Spark GraphX

Basics of MLlib
Statistics using MLlib
Machine Learning using MLlib
Basics of Graph Processing
GraphX RDDs
Applications of GraphX
Exercises and applications

4. Case Studies, Applications and Project Discussions

Case studies, applications
Project Discussions
Spark resources
Trends

5. Final Project