Abstract
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
Original language | English |
---|---|
Article number | 43 |
Journal | ACM Computing Surveys |
Volume | 53 |
Issue number | 2 |
Pages (from-to) | 1-37 |
Number of pages | 37 |
ISSN | 0360-0300 |
Publication status | Published - Apr 2020 |
MoE publication type | A1 Journal article-refereed |
Fields of Science
- 113 Computer and information sciences
- Parameter tuning
- self-tuning
- MapReduce
- Spark
- Storm
- stream
- MAPREDUCE
- PERFORMANCE
- OPTIMIZATION
- MANAGEMENT
- SIMULATION
- TOOLKIT