
Post-graduate theses

Identifier 000457364
Title Analysis of server throughput for managed big data analytics frameworks
Alternative Title Ανάλυση της απόδοσης του διακομιστή για πλαίσια ανάλυσης μεγάλου όγκου δεδομένων (Analysis of server performance for big data analytics frameworks)
Author Αναγνωστάκης, Εμμανουήλ Μ.
Thesis advisor Πρατικάκης, Πολύβιος
Reviewer Μπίλας, Άγγελος
Μαγκούτης, Κωνσταντίνος
Abstract Managed big data frameworks, such as Apache Spark and Giraph, demand a large amount of memory per core to process massive datasets effectively. The memory pressure that arises from big data processing leads to high garbage collection (GC) overhead. Big data analytics frameworks attempt to remove this overhead by offloading objects to storage devices. At the same time, infrastructure providers, trying to address the same problem, provision more memory per instance, leaving cores underutilized. For frameworks, avoiding GC by offloading to storage devices leads to high serialization/deserialization (S/D) overhead; for infrastructure, the result is decreased resource utilization. These limitations prevent managed big data frameworks from effectively utilizing the CPU, leading to low server throughput. In this thesis, we conduct a methodological analysis of server throughput for managed big data analytics frameworks. More specifically, we examine whether reducing GC and S/D overhead can increase the effective CPU utilization of the server. We use a system called TeraHeap (TH) that moves objects from the Java managed heap (H1) to a secondary heap over a fast storage device (H2) to reduce GC overhead and eliminate S/D over data. We focus on analyzing the system's performance when multiple memory-bound instances are co-located to utilize all available DRAM, and we study the resulting server throughput. Our detailed methodology includes choosing the DRAM budget for each instance and how to distribute this budget between H1 and the page cache (PC). We try two different distributions of the DRAM budget, one with more H1 and one with more PC, to study the needs of both approaches. We evaluate three different memory-per-core scenarios using Spark and Giraph, with either the native JVM or a JVM with TeraHeap, to check how throughput changes as memory capacity increases.
Our experimental results show that increasing memory per core does not help reach maximum server throughput for analytics. An effective solution is to use systems like TeraHeap that offload objects from the managed heap without increasing CPU load. Moving large parts of the heap to fast storage decreases the DRAM-per-core requirements and increases server utilization. Finally, we include a cost estimation showing that an approach like TeraHeap could reduce the monetary cost of running big data analytics by up to 50% on publicly available clouds such as Amazon EC2, Google Cloud Platform, or Microsoft Azure.
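The co-location methodology in the abstract divides a fixed server DRAM budget equally among instances and splits each instance's share between the managed heap (H1) and the page cache (PC). The arithmetic can be sketched as follows; all concrete numbers (256 GB server, 4 instances, 75/25 splits) are hypothetical illustrations, not values taken from the thesis:

```java
// Hypothetical sketch of the DRAM-budget methodology: each co-located
// instance receives an equal share of server DRAM, and that share is
// divided between the managed heap (H1) and the page cache (PC).
public class DramBudget {
    // Split a per-instance budget (in GB) between H1 and PC,
    // given the fraction of the budget assigned to H1.
    static long[] split(long budgetGB, double h1Fraction) {
        long h1 = Math.round(budgetGB * h1Fraction);
        return new long[] { h1, budgetGB - h1 };
    }

    public static void main(String[] args) {
        long serverDramGB = 256;  // assumed total server DRAM
        int instances = 4;        // assumed number of co-located instances
        long perInstance = serverDramGB / instances;  // 64 GB per instance

        // The two distributions studied: one H1-heavy, one PC-heavy.
        long[] moreH1 = split(perInstance, 0.75);
        long[] morePC = split(perInstance, 0.25);
        System.out.printf("H1-heavy: H1=%d GB, PC=%d GB%n", moreH1[0], moreH1[1]);
        System.out.printf("PC-heavy: H1=%d GB, PC=%d GB%n", morePC[0], morePC[1]);
    }
}
```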
Language English
Subject CPU utilization
Garbage collection
Giraph
Memory per core
Mobile
Serialization / Deserialization
Spark
Storage
Αξιοποίηση επεξεργαστικής ισχύος (CPU utilization)
Αποθήκευση (Storage)
Μεγάλος όγκος δεδομένων (Big data)
Μνήμη ανά πυρήνα (Memory per core)
Σειριοποίηση / Αποσειριοποίηση (Serialization / Deserialization)
Συλλογή σκουπιδιών (Garbage collection)
Issue date 2023-07-21
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses
Permanent Link https://elocus.lib.uoc.gr//dlib/5/f/3/metadata-dlib-1689671822-367861-3237.tkl
