
Post-graduate theses

Identifier 000457364
Title Analysis of server throughput for managed big data analytics frameworks
Alternative Title Ανάλυση της απόδοσης του διακομιστή για πλαίσια ανάλυσης μεγάλου όγκου δεδομένων (Analysis of server performance for big data analytics frameworks)
Author Αναγνωστάκης, Εμμανουήλ Μ.
Thesis advisor Πρατικάκης, Πολύβιος
Reviewer Μπίλας, Άγγελος
Μαγκούτης, Κωνσταντίνος
Abstract Managed big data frameworks, such as Apache Spark and Giraph, demand a large amount of memory per core to process massive datasets effectively. The memory pressure that arises from big data processing leads to high garbage collection (GC) overhead. Big data analytics frameworks attempt to remove this overhead by offloading objects to storage devices. At the same time, infrastructure providers, trying to address the same problem, provision more memory per instance, leaving cores underutilized. For frameworks, avoiding GC by offloading to storage devices leads to high serialization/deserialization (S/D) overhead; for infrastructure, the result is decreased resource utilization. These limitations prevent managed big data frameworks from effectively utilizing the CPU, leading to low server throughput. In this thesis, we conduct a methodological analysis of server throughput for managed big data analytics frameworks. More specifically, we examine whether reducing GC and S/D overhead can increase the effective CPU utilization of the server. We use a system called TeraHeap (TH) that moves objects from the Java managed heap (H1) to a secondary heap over a fast storage device (H2) to reduce GC overhead and eliminate S/D over data. We focus on analyzing the system's performance when multiple memory-bound instances are co-located to utilize all available DRAM, and we study the resulting server throughput. Our detailed methodology includes choosing the DRAM budget for each instance and how to distribute this budget between H1 and the page cache (PC). We try two different distributions of the DRAM budget, one with more H1 and one with more PC, to study the needs of both approaches. We evaluate three different memory-per-core scenarios using Spark and Giraph, with either the native JVM or a JVM with TeraHeap, to check how throughput changes as memory capacity increases.
Our experimental results show that increasing memory per core does not help reach maximum server throughput for analytics. An effective solution is to use systems like TeraHeap that offload objects from the managed heap without increasing CPU load. Moving large parts of the heap to fast storage decreases the DRAM-per-core requirements and increases server utilization. Finally, we include a cost estimation showing that an approach like TeraHeap could reduce the monetary cost of running big data analytics by up to 50% on publicly available clouds such as Amazon EC2, Google Cloud Platform, or Microsoft Azure.
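The co-location methodology in the abstract divides a fixed server DRAM budget equally among instances and splits each instance's share between the managed heap (H1) and the page cache (PC). The arithmetic can be sketched as follows; all concrete numbers (256 GB server, 4 instances, 75/25 splits) are hypothetical illustrations, not values taken from the thesis:

```java
// Hypothetical sketch of the DRAM-budget methodology: each co-located
// instance receives an equal share of server DRAM, and that share is
// divided between the managed heap (H1) and the page cache (PC).
public class DramBudget {
    // Split a per-instance budget (in GB) between H1 and PC,
    // given the fraction of the budget assigned to H1.
    static long[] split(long budgetGB, double h1Fraction) {
        long h1 = Math.round(budgetGB * h1Fraction);
        return new long[] { h1, budgetGB - h1 };
    }

    public static void main(String[] args) {
        long serverDramGB = 256;  // assumed total server DRAM
        int instances = 4;        // assumed number of co-located instances
        long perInstance = serverDramGB / instances;  // 64 GB per instance

        // The two distributions studied: one H1-heavy, one PC-heavy.
        long[] moreH1 = split(perInstance, 0.75);
        long[] morePC = split(perInstance, 0.25);
        System.out.printf("H1-heavy: H1=%d GB, PC=%d GB%n", moreH1[0], moreH1[1]);
        System.out.printf("PC-heavy: H1=%d GB, PC=%d GB%n", morePC[0], morePC[1]);
    }
}
```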
Language English
Subject CPU utilization
Garbage collection
Giraph
Memory per core
Mobile
Serialization / Deserialization
Spark
Storage
Αξιοποίηση επεξεργαστικής ισχύος (CPU utilization)
Αποθήκευση (Storage)
Μεγάλος όγκος δεδομένων (Big data)
Μνήμη ανά πυρήνα (Memory per core)
Σειριοποίηση / Αποσειριοποίηση (Serialization / Deserialization)
Συλλογή σκουπιδιών (Garbage collection)
Issue date 2023-07-21
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses
Permanent Link https://elocus.lib.uoc.gr//dlib/5/f/3/metadata-dlib-1689671822-367861-3237.tkl
