Programming Big Data Applications
In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. These data, commonly referred to as Big Data, are challenging current storage, processing and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data.
The book Programming Big Data Applications introduces and discusses models, programming frameworks and algorithms to process and analyze large amounts of data. In particular, the book provides an in-depth description of the properties and mechanisms of the main programming paradigms for Big Data analysis, including MapReduce, workflow, BSP, message passing, and SQL-like.
Through programming examples, the book also describes the most widely used frameworks for Big Data analysis, such as Hadoop, Spark, MPI, Hive, Storm and others. It discusses and compares these systems by highlighting the main features of each, their diffusion within their communities of developers and users, and their main advantages and disadvantages in implementing Big Data analysis applications.
Programming Big Data Applications
Scalable Tools and Frameworks for Your Needs
https://doi.org/10.1142/q0444 | June 2024
Pages: 296
Domenico Talia, Paolo Trunfio, Fabrizio Marozzo, Loris Belcastro, Riccardo Cantini, and Alessio Orsino
University of Calabria, Italy
Complementing the text, the book includes downloadable lecture slides (in both PDF and PowerPoint formats) tailored to meet the needs of university students, educators, and professionals alike.
Programming Big Data: Lecture Slides (PowerPoint) (64 MB)
Programming Big Data: Lecture Slides (PDF) (26 MB)
How to cite the book
@book{doi:10.1142/q0444,
author = {Talia, Domenico and Trunfio, Paolo and Marozzo, Fabrizio and Belcastro, Loris and Cantini, Riccardo and Orsino, Alessio},
title = {Programming Big Data Applications},
publisher = {WORLD SCIENTIFIC (EUROPE)},
year = {2024},
doi = {10.1142/q0444},
URL = {https://www.worldscientific.com/doi/abs/10.1142/q0444},
eprint = {https://www.worldscientific.com/doi/pdf/10.1142/q0444}
}
Book Exercises
The code for all exercises proposed in the book is available in a public GitHub repository. Users are free to download and use this code.
To facilitate usage, a README.md file has been included in the folder of each exercise, providing details about the code and explaining how to run it in the distributed environment. Additionally, each exercise folder includes a bash script, run.sh, that automates the process of building, setting up, and running the example.
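For instance, once the Docker cluster described below is up and running, an exercise can typically be launched from its own folder through the provided script (the folder path below is a placeholder; refer to each exercise's README.md for the actual location and any exercise-specific details):

cd <exercise-folder>    # placeholder for the folder of one of the exercises listed below
bash run.sh             # builds, sets up, and runs the example on the Docker cluster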
Below, we present the list of available exercises for the different systems covered in the book:
Apache Spark
- 4.3.1.4-1: Market basket analysis
- 4.3.1.4-2: Use of DataFrame APIs for data querying
- 4.5.1.4: PageRank using GraphX
- 5.3.1.1: Trajectory mining
- 5.3.2.2: Real-time network intrusion detection using Spark Streaming
- 5.3.3.2: Region-of-Interest mining from social media data using SparkSQL
- 5.3.4.1: TextRank for extractive summarization using GraphX
Apache Hadoop
Docker containers
To run the provided exercises in a distributed environment, a cluster of Docker containers has been configured as follows:
MASTER
1 instance
It is a single container that acts as:
- Hadoop NameNode/Resource Manager
- Spark Master
- Hive Server
- ZooKeeper Server
- Airflow Scheduler/Triggerer/Webserver
History Server
1 instance
It is a single container acting as the Hadoop/Spark History Server, enabling developers to monitor the metrics and performance of completed Spark/Hadoop applications.
Jupyter
1 instance
A single container that provides an instance of JupyterLab, a browser-based interface that allows developers to work with multiple notebooks. In particular, the provided instance includes two kernels, namely ipykernel and spylon-kernel, for running notebook code in Python and Scala, respectively.
WORKERS
2 instances
Two containers, namely worker1 and worker2, which act as:
- Hadoop DataNode/NodeManager
- Spark Worker
- Storm Supervisor
- Airflow Celery Worker Server
Metastore
1 instance
An instance of a PostgreSQL server acting as the metastore for systems that require a DBMS for storing data (e.g., Apache Hive, Apache Airflow).
RabbitMQ
1 instance
An instance of a RabbitMQ server, a messaging and streaming broker used by Apache Airflow.
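Once the cluster has been deployed (see Getting started below), you can verify the role of each container from the host. The commands below are a minimal sketch: the container name master is an assumption, the workers are named worker1 and worker2 as described above, and the jps tool is assumed to be available inside the images (it ships with the JDK used by Hadoop and Spark):

docker exec worker1 jps    # assumed to list the worker daemons, e.g., DataNode, NodeManager, Spark Worker
docker exec master jps     # assumed container name; should list NameNode, ResourceManager, Spark Master, etc.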
Getting started
Before using the code included in this project, ensure that Git is installed on your computer. For detailed instructions on how to install Git, please refer to this guide.
To run the examples, you also need to install Docker on your machine. Docker is available for multiple platforms (Mac, Windows, Linux). For detailed instructions on installing, setting up, configuring, and using Docker, please refer to the official guide.
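Before proceeding, you can confirm that both tools are correctly installed by checking their versions:

git --version             # prints the installed Git version
docker --version          # prints the installed Docker version
docker compose version    # confirms that the Docker Compose plugin is available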
After installing Docker, you must clone or download the repository:
git clone https://github.com/BigDataProgramming/bigdataprogramming.github.io.git
After downloading or cloning the repository, go into the project root folder and run the following commands to deploy the cluster:
On Linux/macOS:
cd bigdataprogramming.github.io
bash build.sh
docker compose up -d
On Windows:
cd bigdataprogramming.github.io
./build.bat
docker compose up -d
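Once the deployment completes, you can check that all containers are up and inspect their logs using standard Docker Compose commands (not specific to this repository); the cluster can be shut down when no longer needed:

docker compose ps        # all containers of the cluster should be listed as running
docker compose logs -f   # follow the logs of all containers (press Ctrl+C to stop)
docker compose down      # stop and remove the containers when you are done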
Once all the containers are running, you can start using them. In particular, the following endpoints are available:
Apache Hadoop
- ResourceManager: http://localhost:8088
- NameNode: http://localhost:9870
- HistoryServer: http://localhost:19888
- DataNode 1: http://localhost:9864
- DataNode 2: http://localhost:9865
- NodeManager 1: http://localhost:8042
- NodeManager 2: http://localhost:8043
Jupyter Notebook
- Jupyter Lab UI: http://localhost:8888/lab
Apache Spark
- Spark Master: http://localhost:8080
- Spark Worker 1: http://localhost:8081
- Spark Worker 2: http://localhost:8082
Apache Hive
- Hive URI: jdbc:hive2://localhost:10000
Apache Airflow
- Airflow UI: http://localhost:8881
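As a quick connectivity check, you can query the YARN ResourceManager REST API from the host and open a Hive session through the exposed HiveServer2 port. The beeline command below is a sketch that assumes a beeline client is installed on the host; alternatively, it can be run inside the cluster containers via docker exec:

curl http://localhost:8088/ws/v1/cluster/info    # basic cluster information from the ResourceManager REST API
beeline -u jdbc:hive2://localhost:10000          # interactive Hive session through the exposed JDBC endpoint (requires a local beeline client)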