Programming Big Data Applications

In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. These data, commonly referred to as Big Data, are challenging current storage, processing and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data.

The book Programming Big Data Applications introduces and discusses models, programming frameworks and algorithms to process and analyze large amounts of data. In particular, the book provides an in-depth description of the properties and mechanisms of the main programming paradigms for Big Data analysis, including MapReduce, workflow, BSP (bulk synchronous parallel), message passing, and SQL-like models.

Through programming examples, it also describes the most widely used frameworks for Big Data analysis, such as Hadoop, Spark, MPI, Hive, and Storm. The book discusses and compares the different systems, highlighting the main features of each, their diffusion within their communities of developers and users, and their main advantages and disadvantages in implementing Big Data analysis applications.

Programming Big Data Applications

Scalable Tools and Frameworks for Your Needs

https://doi.org/10.1142/q0444 | June 2024

Pages: 296

Domenico Talia, Paolo Trunfio, Fabrizio Marozzo, Loris Belcastro, Riccardo Cantini, and Alessio Orsino

University of Calabria, Italy

Readership: Undergraduate and graduate students in computer science, computer engineering, data science, and data engineering. PhD students and researchers in computer science and engineering, and data science.
How to cite the book
@book{doi:10.1142/q0444,
    author = {Talia, Domenico and Trunfio, Paolo and Marozzo, Fabrizio and Belcastro, Loris and Cantini, Riccardo and Orsino, Alessio},
    title = {Programming Big Data Applications},
    publisher = {WORLD SCIENTIFIC (EUROPE)},
    year = {2024},
    doi = {10.1142/q0444},
    URL = {https://www.worldscientific.com/doi/abs/10.1142/q0444},
    eprint = {https://www.worldscientific.com/doi/pdf/10.1142/q0444}
}

Book Exercises

The code for all exercises proposed in the book is available in a public GitHub repository.

Users are free to download and use this code. To facilitate usage, a README.md file is included in each exercise's folder, providing details about the code and explaining how to run it in the distributed environment. Additionally, each exercise folder includes a bash script, run.sh, that automates the process of building, setting up, and running the example.
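
As an illustration, such a script typically follows the pattern sketched below. This is a hypothetical outline, not the actual content of the repository's scripts; names such as master, app.jar, and example.Main are assumptions.

#!/bin/bash
# Hypothetical outline of an exercise's run.sh (the real scripts may differ)
set -e

# Build the application (assuming a Maven-based exercise)
mvn -q package

# Copy the resulting artifact into the master container
docker cp target/app.jar master:/opt/app.jar

# Launch the application on the cluster
docker exec master spark-submit --class example.Main /opt/app.jar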

Below is the list of available exercises for the different systems covered in the book:

* Please note that the provided programs are not a commercial product and are supplied solely for illustrative purposes. The authors and the publisher are not responsible for any losses or damages resulting from their use.

Docker containers

To run the provided exercises in a distributed environment, a cluster of Docker containers has been configured as follows:

MASTER

1 instance

It is a single container that acts as:

  • Hadoop NameNode/Resource Manager
  • Spark Master
  • Hive Server
  • ZooKeeper Server
  • Airflow Scheduler/Triggerer/Webserver
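
Once the cluster is up, these services can be reached from the host through the master container, for example (the container name master is an assumption; use docker compose ps to see the actual names):

# List the HDFS root directory via the NameNode
docker exec master hdfs dfs -ls /

# Show the worker nodes registered with the YARN ResourceManager
docker exec master yarn node -list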

History Server

1 instance

It is a single container acting as the Hadoop/Spark History Server, enabling developers to monitor the metrics and performance of completed Hadoop and Spark applications.
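
Assuming the compose service is named historyserver and exposes Spark's default History Server port 18080 (both assumptions), its published address can be retrieved as follows:

# Print the host address mapped to the container's port 18080
docker compose port historyserver 18080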

Jupyter

1 instance

A single container running an instance of JupyterLab, a browser-based interface that allows developers to work with multiple notebooks. In particular, the provided instance includes two kernels, ipykernel and spylon-kernel, for running notebook code in Python and Scala, respectively.
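
The two kernels can be verified by listing the kernel specifications installed in the container (the container name jupyter is an assumption; check docker compose ps for the actual service names):

# List the kernels available to JupyterLab inside the container
docker exec jupyter jupyter kernelspec list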

WORKERS

2 instances

Two containers, worker1 and worker2, each acting as:

  • Hadoop DataNode/NodeManager
  • Spark Worker
  • Storm Supervisor
  • Airflow Celery Worker Server
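
A quick way to check which of these daemons are running on a worker is to list its Java processes with jps, assuming the worker image ships a JDK (the worker names come from the description above):

# List the Java daemons (e.g., DataNode, NodeManager, Supervisor) running on worker1
docker exec worker1 jps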

Metastore

1 instance

An instance of a PostgreSQL server acting as the metastore for systems that require a DBMS for storing their data (e.g., Apache Hive, Apache Airflow).

RabbitMQ

1 instance

An instance of a RabbitMQ server, a messaging and streaming broker used by Apache Airflow.

Getting started

Before using the code included in this project, ensure that Git is installed on your computer. For detailed instructions on how to install Git, please refer to the official guide.

To run the examples, you also need to install Docker on your machine. Docker is available for multiple platforms (Mac, Windows, Linux). For detailed instructions on installing, setting up, configuring, and using Docker, please refer to the official guide.

After installing Docker, you must clone or download the repository:

git clone https://github.com/BigDataProgramming/bigdataprogramming.github.io.git

After downloading or cloning the repository, go into the project root folder and run the following commands to deploy the cluster.

On Linux/macOS:

cd bigdataprogramming.github.io
bash build.sh
docker compose up -d

On Windows:

cd bigdataprogramming.github.io
./build.bat
docker compose up -d
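
To check that all the services came up correctly, and to stop the cluster when you are done, the standard Docker Compose commands can be used:

# Show the status and published ports of all containers in the cluster
docker compose ps

# Stop and remove the containers when finished
docker compose down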

Once all the containers are running, you can start using them. In particular, the following endpoints are available:

  • Apache Hadoop
  • Jupyter Notebook
  • Apache Spark
  • Apache Hive
  • Apache Airflow
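
The concrete host ports depend on the mappings defined in the repository's compose file; the ports below are only the frameworks' usual defaults and should be treated as assumptions. Reachability can then be probed from the command line, for example:

# Probe two of the web UIs (ports are framework defaults, not guaranteed mappings)
curl -sI http://localhost:9870   # Hadoop NameNode web UI
curl -sI http://localhost:8888   # JupyterLab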

Authors

Domenico Talia is a Professor of Computer Engineering at University of Calabria, Italy. He is a senior associate editor of ACM Computing Surveys, an associate editor of Computer, and a member of the editorial board of Future Generation Computer Systems, IEEE Transactions on Parallel and Distributed Systems, the International Journal of Web and Grid Services, the Journal of Cloud Computing, Big Data and Cognitive Computing, and the International Journal of Next-Generation Computing. He is a senior member of IEEE and ACM.

Paolo Trunfio is a Professor of Computer Engineering at University of Calabria, Italy. He currently serves as an associate editor of the Journal of Big Data, ACM Computing Surveys and is a member of the editorial boards of several scientific journals, including Future Generation Computer Systems, Big Data and Cognitive Computing, the International Journal of Web Information Systems, and the International Journal of Parallel, Emergent and Distributed Systems. He is a senior member of IEEE and ACM.

Fabrizio Marozzo is an Assistant Professor of Computer Engineering at University of Calabria, Italy. He is in the editorial boards of several journals, including IEEE Transactions on Big Data, Journal of Big Data, Big Data and Cognitive Computing, IEEE Access, Algorithms, Frontiers in Big Data, Heliyon, and SN Computer Science. He served as a guest editor for several journals, including Concurrency and Computation: Practice and Experience. He is a senior member of IEEE.

Loris Belcastro is a Researcher of Computer Engineering at University of Calabria, Italy. He is in the editorial boards of several journals, including Journal of Supercomputing, SN Computer Science, Journal of Autonomous Intelligence. He served as a guest editor for numerous journals, such as Future Generation Computer Systems, Journal of Big Data, Sensors, Algorithms, Applied Sciences, and Frontiers in Big Data.

Riccardo Cantini is a Researcher of Computer Engineering at University of Calabria, Italy. He was a visiting researcher at the Barcelona Supercomputing Center. His research interests include social media and big data analysis, machine and deep learning, natural language processing, opinion mining, topic detection, edge computing, and high-performance data analytics.

Alessio Orsino is currently pursuing his PhD in Information and Communication Technologies at University of Calabria, Italy. He was a visiting researcher at the University of Cambridge, Mobile Systems Research Lab. His research interests include big data analysis, parallel and distributed computing, cloud and edge computing, and machine learning.
