Abstract: Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward and update phases generate fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5\texttimes faster iterations over state-of-the-art approaches using extensive experiments. BibTex
[2]Tan, N., Assogba, K., Ashworth, W.J., Bogale, B., Cappello, F., Rafique, M.M., Taufer, M. and Nicolae, B. 2024. Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results. MIDDLEWARE ’24: The 25th International Middleware Conference (Hong Kong, China, 2024), 392–403.Details
Keywords: results reproducibility, checkpoint analysis, high-performance computing, error-bounded hashing Abstract: Ensuring reproducibility in high-performance computing (HPC) applications is a significant challenge, particularly when nondeterministic execution can lead to untrustworthy results. Traditional methods that compare final results from multiple runs often fail because they provide sources of discrepancies only a posteriori and require substantial resources, making them impractical and unfeasible. This paper introduces an innovative method to address this issue by using scalable capture and comparing intermediate multi-run results. By capitalizing on intermediate checkpoints and hash-based techniques with user-defined error bounds, our method identifies divergences early in the execution paths. We employ Merkle trees for checkpoint data to reduce the I/O overhead associated with loading historical data. Our evaluations on the nondeterministic HACC cosmology simulation show that our method effectively captures differences above a predefined error bound and significantly reduces I/O overhead. Our solution provides a robust and scalable method for improving reproducibility, ensuring that scientific applications on HPC systems yield trustworthy and reliable results.
BibTex
[3]Nicolae, B. et al. 2024. Diaspora: Resilience-Enabling Services for Real-Time Distributed Workflows. NRDPISI’24: The 1st The 1st International Workshop on Near Real-time Data Processing for Interconnected Scientific Instruments (co-located with eScience’24) (Osaka, Japan, 2024).Details
Keywords: real-time distributed HPC workflows, resilience, high-availability, data streaming, elasticity, anomaly detection and prediction Abstract: The need for real-time processing to enable automated decision making and experimental steering has driven a shift from high-performance computing workflows on a centralized system to a distributed approach that integrates remote data sources, edge devices, and diverse compute facilities. Under this paradigm, data can be processed close to the source where it is generated, thus reducing latency and bandwidth usage. System resilience is thus a key challenge, requiring distributed workflows to survive component failures and to meet stringent qualityof-service requirements, which results in the need to mitigate anomalies such as congestion and low availability of resources. To address these challenges, we propose Diaspora, a unified resilience framework that is inspired by event-driven communication patterns used in public clouds. Specifically, we propose an event fabric that extends across sites, facilities, and computations to provide timely, reliable, and accurate information about data, application, and resource status. On top of the event fabric, we build resilience-enabling services that combine QoS-aware data streaming, resilient data views, resilient compute and data resources, and anomaly detection and prediction, all of which collectively enhance workflow resilience for these scientific cases.
BibTex
[4]Ye, J., Cernuda, J., Rajesh, N., Bateman, K., Yildiz, O., Peterka, T., Nigmetov, A., Morozov, D., Sun, X.-H., Kougkas, A. and Nicolae, B. 2024. Viper: A High-Performance I/O Framework for Transparently Updating, Storing, and Transferring Deep Neural Network Models. ICPP’24: The 53nd International Conference on Parallel Processing (Gotland, Sweden, 2024).Details
Keywords: AI Workflows, Coupled Training and Inferences, Adaptive AI Model Checkpointing, Inferences During Partial Training Abstract: Scientific workflows increasingly need to train a DNN model in real-time during an experiment (e.g. using ground truth from a simulation), while using it at the same time for inferences. Instead of sharing the same model instance, the training (producer) and inference server (consumer) often use different model replicas that are kept synchronized. In addition to efficient I/O techniques to keep the model replica of the producer and consumer synchronized, there is another important trade-off: frequent model updates enhance inference quality but may slow down training; infrequent updates may lead to less precise inference results. To address these challenges, we introduce Viper: a new I/O framework designed to determine a near-optimal checkpoint schedule and accelerate the delivery of the latest model updates. Viper builds an inference performance predictor to identify the optimal checkpoint schedule to balance the trade-off between training slowdown and inference quality improvement. It also creates a memory-first model transfer engine to accelerate model delivery through direct memory-to memory communication. Our experiments show that Viper can reduce the model update latency by 9x using the GPU-to-GPU data transfer engine and 3x using the DRAM-to-DRAM host data transfer. The checkpoint schedule obtained from Viper’s predictor also demonstrates improved cumulative inference accuracy compared to the baseline of epoch-based solutions.
BibTex
[5]Bouvier, T., Nicolae, B., Costan, A., Bicer, T., Foster, I. and Antoniu, G. 2024. Efficient Distributed Continual Learning for Steering Experiments in Real-Time. Future Generation Computer Systems. (2024). DOI:https://doi.org/10.1016/j.future.2024.07.016.Details
Keywords: continual learning, data-parallel training, experience replay, distributed rehearsal buffers, asynchronous data management, scalability, streaming, generative AI Abstract: Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training suffers from catastrophic forgetting (i.e., new patterns are reinforced at the expense of previously acquired knowledge). Training from scratch each time new training data becomes available would result in extremely long training times and massive data accumulation. Rehearsal-based continual learning has shown promise for addressing the catastrophic forgetting challenge, but research to date has not addressed performance and scalability. To fill this gap, we propose an approach based on a distributed rehearsal buffer that efficiently complements data-parallel training on multiple GPUs to achieve high accuracy, short runtime, and scalability. It leverages a set of buffers (local to each GPU) and uses several asynchronous techniques for updating these local buffers in an embarrassingly parallel fashion, all while handling the communication overheads necessary to augment input mini-batches (groups of training samples fed to the model) using unbiased, global sampling. We further propose a generalization of rehearsal buffers to support both classification and generative learning tasks, as well as more advanced rehearsal strategies (notably dark experience replay, leveraging knowledge distillation). We illustrate this approach with a real-life HPC streaming application from the domain of ptychographic image reconstruction. We run extensive experiments on up to 128 GPUs of the ThetaGPU supercomputer to compare our approach with baselines representative of training-from-scratch (the upper bound in terms of accuracy) and incremental training (the lower bound). Results show that rehearsal-based continual learning achieves a top-5 validation accuracy close to the upper bound, while simultaneously exhibiting a runtime close to the lower bound.
BibTex
[6]Gossman, M.J., Nicolae, B. and Calhoun, J.C. 2024. Scalable I/O aggregation for asynchronous multi-level checkpointing. Future Generation Computer Systems. 160, (2024), 420–432. DOI:https://doi.org/10.1016/j.future.2024.06.003.Details
Keywords: checkpoint-restart, asynchronous I/O, distributed I/O aggregation Abstract: Checkpointing distributed HPC applications is a common I/O pattern with many use cases: resilience, job management, reproducibility, revisiting previous intermediate results, etc. This is a difficult pattern for a large number of processes that need to capture massive data sizes and write them persistently to shared storage (e.g., parallel file system), which is subject to I/O bottlenecks due to limited I/O bandwidth under concurrency. In addition to I/O performance and scalability considerations, there are often limits that users impose on the number of files or objects that can be used to capture the checkpoints. For example, users need to move checkpoints between HPC systems or parallel file systems, which is inefficient for a large number of files, or need to use the checkpoints in workflows that expect related objects to be grouped together. As a consequence, I/O aggregation is often used to reduce the number of files and objects persistent to shared storage such that it is much lower than the number of processes. However, I/O aggregation is challenging for two reasons: (1) if more than one process is writing checkpointing data to the same file, this causes additional I/O contention that amplifies the I/O bottlenecks; (2) scalable state-of-art checkpointing techniques are asynchronous and rely on multi-level techniques to capture the data structures to local storage or memory, then flush it from there to shared storage in the background, which competes for resources (I/O, memory, network bandwidth) with the application that is running in the foreground. State of art approaches have addressed the problem of I/O aggregation for synchronous checkpointing but are insufficient for asynchronous checkpointing. To fill this gap, we contribute with a novel I/O aggregation strategy that operates efficiently in the background to complement asynchronous C/R. Specifically, we explore how to (1) develop a network of efficient, thread-safe I/O proxies that persist data via limited-sized write buffers, (2) prioritize remote (from non-proxy processes) and local data on I/O proxies to minimize write overhead, and (3) load-balance flushing on I/O proxies. We analyze trade-offs of developing such strategies and discuss the performance impact on large-scale micro-benchmarks, as well as a real HPC application (HACC).
BibTex
[7]Maurya, A., Underwood, R., Rafique, M., Cappello, F. and Nicolae, B. 2024. DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models. HPDC’24: The 33nd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).Details
Keywords: scalable checkpointing, asynchronous multi-level I/O, machine learning and AI, LLMs and transformers Abstract: LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 4x faster checkpointing and 2.2x faster end-to-end training runtime compared with the state-of-art checkpointing approaches.
BibTex
[8]Underwood, R., Madhyastha, M., Burns, R. and Nicolae, B. 2024. EvoStore: Towards Scalable Storage of Evolving Learning Models. HPDC’24: The 33nd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).Details
Keywords: AI, Model Repository, Network Architecture Search, Regulartized Evolution, Distributed, AI for HPC Abstract: Deep Learning (DL) has seen rapid adoption in all domains. Since training DL models is expensive, both in terms of time and resources, application workflows that make use of DL increasingly need to operate with a large number of derived learning models, which are obtained through transfer learning and fine-tuning. At scale, thousands of such derived DL models are accessed concurrently by a large number of processes. In this context, an important question is how to design and develop specialized DL model repositories that remain scalable under concurrent access, while addressing key challenges: how to query the DL model architectures for specific patterns? How to load/store a subset of layers/tensors from a DL model? How to efficiently share unmodified layers/tensors between DL models derived from each other through transfer learning? How to maintain provenance and answer ancestry queries? State of art leaves a gap regarding these challenges. To fill this gap, we introduce EvoStore, a distributed DL model repository with scalable data and metadata support to store and access derived DL models efficiently. Large-scale experiments on hundreds of GPUs show significant benefits over state-of-art with respect to I/O and metadata performance, as well as storage space utilization.
BibTex
[9]Bouvier, T., Nicolae, B., Chaugier, H., Costan, A., Foster, I. and Antoniu, G. 2024. Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers. CCGrid 2024: IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (Philadelphia, USA, 2024), 1–10.Details
Keywords: continual learning, data-parallel training, experience replay, distributed rehearsal buffers, asynchronous data management, scalability Abstract: Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training suffers from catastrophic forgetting (i.e., new patterns are reinforced at the expense of previously acquired knowledge). Training from scratch each time new training data becomes available would result in extremely long training times and massive data accumulation. Rehearsal-based continual learning has shown promise for addressing the catastrophic forgetting challenge, but research to date has not addressed performance and scalability. To fill this gap, we propose an approach based on a distributed rehearsal buffer that efficiently complements data-parallel training on multiple GPUs, allowing us to achieve short runtime and scalability while retaining high accuracy. It leverages a set of buffers (local to each GPU) and uses several asynchronous techniques for updating these local buffers in an embarrassingly parallel fashion, all while handling the communication overheads necessary to augment input mini-batches (groups of training samples fed to the model) using unbiased, global sampling. In this paper we explore the benefits of this approach for classification models. We run extensive experiments on up to 128 GPUs of the ThetaGPU supercomputer to compare our approach with baselines representative of training-from-scratch (the upper bound in terms of accuracy) and incremental training (the lower bound). Results show that rehearsal-based continual learning achieves a top-5 classification accuracy close to the upper bound, while simultaneously exhibiting a runtime close to the lower bound.
BibTex
[10]Maurya, A., Ye, J., Rafique, M., Cappello, F. and Nicolae, B. 2024. Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers. FlexScience’24: The 14th IEEE/ACM Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (Pisa, Italy, 2024).Details
Keywords: deep learning, distributed caching, data pipelines, reuse or training data Abstract: Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can scale to a large number of GPUs, which reduces the duration of the training but dramatically increases the cost. Even when a large number of GPUs are available, the aggregated GPU memory is often not enough to hold the full training state (optimizer state, model parameters, and gradients). To compensate, state-of-the-art approaches offload the optimizer state at least partially to the host memory and perform hybrid CPU-GPU computations. Such flexible solutions dramatically reduce the GPU memory utilization, which makes it feasible to run the training on a smaller number of GPUs at the cost of performance penalty. Unfortunately, the challenges and bottlenecks of adopting this strategy are not sufficiently studied by state-of-the-art, which results in poor management of the combined host-GPU memory and poor overlapping between data movements and computations. In this paper, we aim to fill this gap by characterizing the behavior of offloaded training using the DeepSpeed runtime. Specifically, we study the GPU memory utilization over time during each iteration, the activity on the PCIe related to transfers between the host memory and the GPU memory, and the relationship between resource utilization and the steps involved in each iteration. Thanks to this study, we reveal opportunities for future improvements of offloading solutions, which enable greater flexibility to optimize the cost-performance trade-off in the context of transformer and LLM training.
BibTex
[11]Assogba, K., Rafique, M. and Nicolae, B. 2023. Optimizing the Training of Co-Located Deep Learning Models Using Cache-Aware Staggering. HIPC’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics (Goa, India, 2023).Details
Keywords: Deep Learning, Caching and Reuse of Training Data, Co-Located Training, Performance Modeling Abstract: Despite significant advances, training deep learning models remains a time-consuming and resource-intensive task. One of the key challenges in this context is the ingestion of the training data, which involves non-trivial overheads: read the training data from a remote repository, apply augmentations and transformations, shuffle the training samples, and assemble them into mini-batches. Despite the introduction of abstractions such as data pipelines that aim to hide such overheads asynchronously, it is often the case that the data ingestion is slower than the training, causing a delay at each training iteration. This problem is further augmented when training multiple deep learning models simultaneously on powerful compute nodes that feature multiple GPUs. In this case, the training data is often reused across different training instances (e.g., in the case of multi-model or ensemble training) or even within the same training instance (e.g., data-parallel training). However, transparent caching solutions (e.g., OS-level POSIX caching) are not suitable to directly mitigate the competition between training instances that reuse the same training data. In this paper, we study the problem of how to minimize the makespan of running two training instances that reuse the same training data. The makespan is subject to a trade-off: if the training instances start at the same time, competition for I/O bandwidth slows down the data pipelines and increases the makespan. If one training instance is staggered, competition is reduced but the makespan increases. We aim to optimize this trade-off by proposing a performance model capable of predicting the makespan based on the staggering between the training instances, which can be used to find the optimal staggering that triggers just enough competition to make optimal use of transparent caching in order to minimize the makespan. Experiments with different combinations of learning models using the same training data demonstrate that (1) staggering is important to minimize the makespan; (2) our performance model is accurate and can predict the optimal staggering in advance based on calibration overhead.
BibTex
[12]Maurya, A., Rafique, M.M., Cappello, F. and Nicolae, B. 2023. Towards Efficient I/O Pipelines using Accumulated Compression. HIPC’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics (Goa, India, 2023).Details
Keywords: GPU compression and checkpointing, data accumulation, fast compression Abstract: High-Performance Computing (HPC) workloads generate large volumes of data at high-frequency during their execution, which needs to be captured concurrently at scale. These workloads exploit accelerators such as GPU for faster performance. However, the limited onboard high-bandwidth memory (HBM) on the GPU, and slow device-to-host memory PCIe interconnects lead to I/O overheads during application execution, thereby exacerbating their overall runtime. To overcome the aforementioned limitations, techniques such as compression and asynchronous transfers have been used by data management runtimes. However, compressing small blocks of data leads to a significant runtime penalty on the application. In this paper, we design and develop strategies to optimize the tradeoff between compressing checkpoints instantly and enqueuing transfers immediately versus accumulating snapshots and delaying compression to achieve faster compression throughput. Our evaluations on synthetic and real-life workloads for different systems and workload configurations demonstrate 1.3× to 8.3× speedup compared to the existing checkpoint approaches.
BibTex
[13]Underwood, R., Madhastha, M., Burns, R. and Nicolae, B. 2023. Understanding Patterns of Deep Learning Model Evolution in Network Architecture Search. HIPC’23: 30th IEEE International Conference on High Performance Computing, Data, and Analytics (Goa, India, 2023).Details
Keywords: Transfer Learning, AI, Network Architecture Search, Regularized Evolution, Characterization Study Abstract: Network Architecture Search and specifically Regularized Evolution is a common way to refine the structure of a deep learning model. However, little is known about how models empirically evolve over time which has design implications for designing caching policies, refining the search algorithm for particular applications, and other important use cases. In this work, we algorithmically analyze and quantitatively characterize the patterns of model evolution for a set of models from the Candle project and the Nasbench-201 search space. We show how the evolution of the model structure is influenced by the regularized evolution algorithm. We describe how evolutionary patterns appear in distributed settings and opportunities for caching and improved scheduling. Lastly, we describe the conditions that affect when particular model architectures rise and fall in popularity based on their frequency of acting as a donor in a sliding window.
BibTex
[14]Li, J., Bouteiller, A., Bosilca, G. and Nicolae, B. 2023. Elastic deep learning through resilient collective operations. AI4S’23: 4th Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (with SC’23) (Denver, USA, 2023), 44–50.Details
Keywords: Distributed deep learning, fault tolerance, elastic training, resilient collective communication Abstract: A robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI resilient capabilities, aka. User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and lightweight failure management and encourages smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management, minimizing downtime, and thereby enhancing the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications.
BibTex
[15]Assogba, K., Dam, H.V., Rafique, M. and Nicolae, B. 2023. Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. SuperCheck’23: The 4th International Symposium on Checkpointing for Supercomputing (with SC’23) (Denver, USA, 2023), 1748–1756.Details
Keywords: result reproducibility, checkpoint analysis, high performance computing, asynchronous multi-level checkpointing Abstract: High-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for reproducibility study faces the challenges of high-frequency checkpointing across thousands of processes. As a result, the runtime overhead affects application performance and intermediate results when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated from scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed using identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed using the NWChem software. Experiment results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30 × and up to 211 × compared to the default checkpointing approach in NWChem.
BibTex
[16]Nicolae, B., Islam, T., Ross, R., Dam, H.V., Assogba, K., Shpilker, P., Titov, M., Turilli, M., Wang, T., Kilic, O., Jha, S. and Pouchard, L. 2023. Building the I (Interoperability) of FAIR for performance reproducibility of large-scale composable workflows in RECUP. REWORDS’23: The 3rd Workshop on Reproducible Workflows, Data Management, and Security (with eScience’23) (Limassol, Cyprus, 2023), 1–7.Details
Keywords: High performance computing, HPC, performance reproducibility, workflow execution patterns, workflow execution provenance, metadata capture, research software engineering, FAIR4RS, FAIR4HPC, RO-Crate Abstract: Scientific computing communities increasingly run their experiments using complex data- and compute-intensive workflows that utilize distributed and heterogeneous architectures targeting numerical simulations and machine learning, often executed on the Department of Energy Leadership Computing Facilities (LCFs). We argue that a principled, systematic approach to implementing FAIR principles at scale, including fine-grained metadata extraction and organization, can help with the numerous challenges to performance reproducibility posed by such workflows. We extract workflow patterns, propose a set of tools to manage the entire life cycle of performance metadata, and aggregate them in an HPC-ready framework for reproducibility (RECUP). We describe the challenges in making these tools interoperable, preliminary work, and lessons learned from this experiment.
BibTex
[17]Gossman, M., Nicolae, B. and Calhoun, J. 2023. Modeling Multi-Threaded Aggregated I/O for Asynchronous Checkpointing on HPC Systems. ISPDC’23: The 22nd IEEE International Conference on Parallel and Distributed Computing (Bucharest, Romania, 2023), 101–105.Details
Keywords: performance evaluation, checkpointing, parallel file systems, multi-threaded I/O Abstract: HPC systems encompass more components with each new generation. As a result, the process of interacting with stable storage systems like parallel file systems (PFS) becomes increasingly difficult. Larger systems often result in more frequent failures, increasing the need and frequency to incorporate fault-tolerant mechanisms. One example is checkpoint-restart (C/R), where applications or systems save their data to non-volatile storage devices, such as a PFS. On failure, the system or application is restored to a saved state and computation continues. Today, asynchronous C/R is gaining traction for its ability to checkpoint data to permanent storage concurrently with the application. However, asynchronous C/R brings about many new challenges. For starters, asynchronous C/R introduces complex resource contention between the application and the C/R implementation. Additionally, some implementations adopt file-per-process writing strategies, which overwhelm PFS’ at high core counts. In this work, we explore how multi-threaded POSIX I/O impacts aggregated throughput. To this extent we characterize the influence of different I/O parameters, such as the number of writer threads and how they access storage devices, has on aggregated I/O. We use the information gathered in this study to identify best practices when performing aggregated I/O as a first step in designing an efficient I/O aggregation scheme for asynchronous C/R.
BibTex
[18]Tan, N., Luettgau, J., Marquez, J., Terianishi, K., Morales, N., Bhowmick, S., Cappello, F., Taufer, M. and Nicolae, B. 2023. Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication. ICPP’23: The 52nd International Conference on Parallel Processing (Salt Lake City, USA, 2023), 665–674.Details
Keywords: checkpointing, data versioning, incremental storage, de-duplication, GPU parallelization Abstract: Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and results in increased storage space utilization especially for workflows that need to capture the evolution of data structures with high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to another can dramatically reduce the checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and assemble the differences in a compact fashion that does not result in excessive metadata. State-of-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal checkpoint data redundancy. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method to exploit GPUs’ high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
BibTex
[19]Underwood, R. and Nicolae, B. 2023. MPIGDB: A Flexible Debugging Infrastructure for MPI Programs. FlexScience’23: The 13th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (with HPDC’23) (Orlando, USA, 2023), 11–18.Details
Keywords: MPI, debugging, distributed state Abstract: The advent of flexible science workflows that span traditional HPC, the cloud, and beyond require more than ever efficient, scalable, portable, feature-ful debugging tools. This work presents the design and implementation of MPIGDB a flexible debugging infrastructure for MPI programs. MPIGDB is designed to be highly capable at scale, available across platforms, easily interactive, and extensible way to debug distributed MPI programs. This work demonstrates that scalability of this our approach to 128 processes on debugging memory access violations in a heat diffusion code as an example use case, while providing features competitive with proprietary debugging tools such as GPU kernel debugging, mixed language support, and scripting abilities. Together these tools allow users to quickly debug challenges where-ever their science is run.
BibTex
[20]Maurya, A., Rafique, M., Tonellot, T., AlSalem, H., Cappello, F. and Nicolae, B. 2023. GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching. HPDC’23: The 32nd International Symposium on High-Performance Parallel and Distributed Computing (Orlando, USA, 2023), 73–85.Details
Keywords: High-Performance Computing (HPC), Graphics Processing Unit (GPU), asynchronous multi-level checkpointing, hierarchical cache management, prefetching Abstract: Checkpointing is an I/O intensive operation increasingly used by High-Performance Computing (HPC) applications to revisit pre- vious intermediate datasets at scale. Unlike the case of resilience, where only the last checkpoint is needed for application restart and rarely accessed to recover from failures, in this scenario, it is important to optimize frequent reads and writes of an entire history of checkpoints. State-of-the-art checkpointing approaches often rely on asynchronous multi-level techniques to hide I/O overheads by writing to fast local tiers (e.g. an SSD) and asynchronously flush- ing to slower, potentially remote tiers (e.g. a parallel file system) in the background, while the application keeps running. However, such approaches have two limitations. First, despite the fact that HPC infrastructures routinely rely on accelerators (e.g. GPUs), and therefore a majority of the checkpoints involve GPU memory, ef- ficient asynchronous data movement between the GPU memory and host memory is lagging behind. Second, revisiting previous data often involves predictable access patterns, which are not exploited to accelerate read operations. In this paper, we address these limitations by proposing a scalable and asynchronous multi-level checkpointing approach optimized for both reading and writing of an arbitrarily long history of checkpoints. Our approach exploits GPU memory as a first-class citizen in the multi-level storage hierarchy to enable informed caching and prefetching of checkpoints by leveraging foreknowledge about the access order passed by the application as hints. Our evaluation using a variety of scenarios under I/O concurrency shows up to 74x faster checkpoint and restore throughput as compared to the state-of-art runtime and optimized unified virtual memory (UVM) based prefetching strategies and at least 2x shorter I/O wait time for the application across various workloads and configurations.
BibTex
[21]Madhyastha, M., Underwood, R., Burns, R. and Nicolae, B. 2023. DStore: A Lightweight Scalable Learning Model Repository with Fine-Grained Tensor-Level Access. ICS’23: The 2023 International Conference on Supercomputing (Orlando, USA, 2023), 133–143.Details
Keywords: DL model repository, fine-grained tensor storage and access, benchmarking Abstract: The ability to share and reuse deep learning (DL) models is a key driver that facilitates the rapid adoption of artificial intelligence (AI) in both industrial and scientific applications. However, stateof-the-art approaches to store and access DL models efficiently at scale lag behind. Most often, DL models are serialized by using various formats (e.g., HDF5, SavedModel) and stored as files on POSIX file systems. While simple and portable, such an approach exhibitshigh serialization and I/O overheads, especially under concurrency. Additionally, the emergence of advanced AI techniques (transfer learning, sensitivity analysis, explainability, etc.) introduces the need for fine-grained access to tensors to facilitate the extraction and reuse of individual or subsets of tensors. Such patterns are underserved by state-of-the-art approaches. Requiring tensors to be read in bulk incurs suboptimal performance, scales poorly, and/or overutilizes network bandwidth. In this paper we propose a lightweight, distributed, RDMA-enabled learning model repository that addresses these challenges. Specifically we introduce several ideas: compact architecture graph representation with stable hashing and client-side metadata caching, scalable load balancing on multiple providers, RDMA-optimized data staging, and direct access to raw tensor data. We evaluate our proposal in extensive experiments that involve different access patterns using learning models of diverse shapes and sizes. Our evaluations show a significant improvement (between 2 and 60x over a variety of state-of-the-art model storage approaches while scaling to half the Cooley cluster at the Argonne Leadership Computing Facility.
BibTex
[22]Peterka, T., Morozov, D., Nigmetov, A., Yildiz, O., Nicolae, B. and Davis, P.E. 2023. LowFive: In Situ Data Transport for High-Performance Workflows. IPDPS’23: The 37th IEEE International Parallel and Distributed Processing Symposium (St. Petersburg, USA, 2023), 985–995.Details
Keywords: workflow, data model, data transport, in situ Abstract: We describe LowFive, a new data transport layer based on the HDF5 data model, for in situ workflows. Executables using LowFive can communicate in situ (usi g in-memory data and MPI message passing), reading and writing traditional HDF5 files to physical storage, and combining the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate the above features in a series of experiments featuring both synthetic benchmarks as well as a representative use case from a scientific workflow, and we also compare with other data transport solutions in the literature.
BibTex
[23]Abhinit, I. et al. 2022. Novel Proposals for FAIR, Automated, Recommendable, and Robust Workflows. WORKS’22: 17th Workshop on Workflows in Support of Large-Scale Science (in conjunction with SC’22) (Dallas, USA, 2022), 84–92.Details
Keywords: reproducibility, scalable data collection, metadata aggregation and indexing Abstract: This paper introduces RECUP, a meta(data) framework for reproducing hybrid workflows with FAIR support. We target reproducibility from two perspectives: results and performance. To this end, we collect both intermediate data and performance metrics throughout the lifetime of repeated runs of the same workflows. We propose lightweight and scalable approaches to curate, aggregate and index the intermediate data and the performance metrics, which enables an in-depth study of their evolution during repeated runs. Starting from this foundation, we propose approaches to determine if repeated runs diverge, when they do so and what is the root cause for the divergence. We envison an implemention of these approaches as a set of comprehensive tools to facilitate reproducibility.
BibTex
[24]Maurya, A., Nicolae, B., Rafique, M.M., Elsayed, A.M., Tonellot, T. and Cappello, F. 2022. Towards Efficient Cache Allocation for High-Frequency Checkpointing. HiPC’22: The 29th IEEE International Conference on High Performance Computing, Data, and Analytics (Bangalore, India, 2022), 262–271.Details
Keywords: GPU checkpointing, multi-level caching, fast initialization Abstract: While many HPC applications are known to have long runtimes, this is not always because of single large runs: in many cases, this is due to ensembles composed of many short runs (runtime in the order of minutes). When each such run needs to checkpoint frequently (e.g. adjoint computations using a checkpoint interval in the order of milliseconds), it is important to minimize both checkpointing overheads at each iteration, as well as initialization overheads. With the rising popularity of GPUs, minimizing both overheads simultaneously is challenging: while it is possible to take advantage of efficient asynchronous data transfers between GPU and host memory, this comes at the cost of high initialization overhead needed to allocate and pin host memory. In this paper, we contribute with an efficient technique to address this challenge. The key idea is to use an adaptive approach that delays the pinning of the host memory buffer holding the checkpoints until all memory pages are touched, which greatly reduces the overhead of registering the host memory with the CUDA driver. To this end, we use a combination of asynchronous touching of memory pages and direct writes of checkpoints to untouched and touched memory pages in order to minimize end-to-end checkpointing overheads based on performance modeling. Our evaluations show a significant improvement over a variety of alternative static allocation strategies and state-of-art approaches.
BibTex
[25]Whitlock, M., Morales, N., Bosilca, G., Bouteiller, A., Nicolae, B., Teranishi, K., Giem, E. and Sarkar, V. 2022. Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach. CLUSTER’22: The 2022 IEEE International Conference on Cluster Computing (Heidelberg, Germany, 2022), 418–428.Details
Keywords: Fault Tolerance, Resilience, Checkpointing, MPI-ULFM, Kokkos, Fenix, HPC Abstract: Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance – in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More applicationspecific choice in resilience strategies allows for better long-term flexibility, performance, and — importantly — simplicity.
BibTex
[26]Liu, J., Nicolae, B. and Li, D. 2022. Lobster: Load Balance-Aware I/O for Distributed DNN Training. ICPP ’22: The 51st International Conference on Parallel Processing (Bordeaux, France, 2022), 26:1–26:11.Details
Keywords: deep learning, data pipelines, collaborative caching, prefetching, load balancing Abstract: The resource-hungry and time-consuming process of training Deep Neural Networks (DNNs) can be accelerated by optimizing and/or scaling computations on accelerators such as GPUs. However, the loading and pre-processing of training samples then often emerges as a new bottleneck. This data loading process engages a complex pipeline that extends from the sampling of training data on external storage to delivery of those data to GPUs, and that comprises not only expensive I/O operations but also decoding, shuffling, batching, augmentation, and other operations. We propose in this paper a new holistic approach to data loading that addresses three challenges not sufficiently addressed by other methods: I/O load imbalances among the GPUs on a node; rigid resource allocations to data loading and data preprocessing steps, which lead to idle resources and bottlenecks; and limited efficiency of caching strategies based on pre-fetching due to eviction of training samples needed soon at the expense of those needed later. We first present a study of key bottlenecks observed as training samples flow through the data loading and preprocessing pipeline. Then, we describe Lobster, a data loading runtime that uses performance modeling and advanced heuristics to combine flexible thread management with optimized eviction for distributed caching in order to mitigate I/O overheads and load imbalances. Experiments with a range of models and datasets show that the Lobster approach reduces both I/O overheads and end-to-end training times by up to 1.5× compared with stateof-the-art approaches.
BibTex
[27]Liu, J., Nicolae, B., Li, D., Wozniak, J.M., Bicer, T., Liu, Z. and Foster, I. 2022. Large Scale Caching and Streaming of Training Data for Online Deep Learning. FlexScience’22: The 12th IEEE/ACM Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (Minneapolis, USA, 2022), 19–26.Details
Keywords: deep learning, distributed caching, data pipelines, reuse or training data Abstract: The training of deep neural network models on large data remains a difficult problem, despite progress towards scalable techniques. In particular, there is a mismatch between the random but predetermined order in which AI flows select training samples and the streaming I/O patterns for which traditional HPC data storage (e.g., parallel file systems) are designed. In addition, as more data are obtained, it is feasible neither simply to train learning models incrementally, due to catastrophic forgetting (i.e., bias towards new samples), nor to train frequently from scratch, due to prohibitive time and/or resource constraints. In this paper, we study data management techniques that combine caching and streaming with rehearsal support in order to enable efficient access to training samples in both offline training and continual learning. We revisit state-of-art streaming approaches based on data pipelines that transparently handle prefetching, caching, shuffling, and data augmentation, and discuss the challenges and opportunities that arise when combining these methods with data-parallel training techniques. We also report on preliminary experiments that evaluate the I/O overheads involved in accessing the training samples from a parallel file system (PFS) under several concurrency scenarios, highlighting the impact of the PFS on the design of the data pipelines.
BibTex
[28]Nicolae, B., Hobson, T., Yildiz, O., Peterka, T. and Morozov, D. 2022. Towards Low-Overhead Resilience for Data Parallel Deep Learning. CCGrid’22: The 22th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (Messina, Italy, 2022), 336–345.Details
Keywords: deep learning, data-parallel training, failure simulation, performance model, trade-off analysis Abstract: Data parallel techniques have been widely adopted both in academia and industry as a tool to enable scalable training of deep learning models. At scale, DL training jobs can fail due to software or hardware bugs, may need to be preempted or terminated due to unexpected events, or may perform suboptimally because they were misconfigured. Under such circumstances, there is a need to recover and/or reconfigure data-parallel DL training jobs on-the-fly, while minimizing the impact on the accuracy of the DNN model and the runtime overhead. In this regard, state-of-art techniques adopted by the HPC community mostly rely on checkpoint-restart, which inevitably leads to loss of progress, thus increasing the runtime overhead. In this paper we explore alternative techniques that exploit the properties of modern deep learning frameworks (overlapping of gradient averaging and weight updates with local gradient computations through pipeline parallelism) to reduce the overhead of resilience/elasticity. To this end we introduce a failure simulation framework and two resilience strategies (immediate mini-batch rollback and lossy forward recovery), which we study compared with checkpoint-restart approaches in a variety of settings in order to understand the trade-offs between the accuracy loss of the DNN model and the runtime overhead.
BibTex
[29]Nicolae, B. 2022. Scalable Multi-Versioning Ordered Key-Value Stores with Persistent Memory Support. IPDPS 2022: The 36th IEEE International Parallel and Distributed Processing Symposium (Lyon, France, 2022), 93–103.Details
Keywords: key-value store, ordered dictionary, versioning control, scalable access under concurrency, persistent memory Abstract: Ordered key-value stores (or sorted maps/dictionaries) are a fundamental building block in a large variety of both sequential and parallel/distributed algorithms. However, most state-of-art approaches are either based on ephemeral inmemory representations that are difficult to persist and/or not scalable enough under concurrent access (e.g., red-black trees, skip lists), and/or not lightweight enough (e.g. database engines). Furthermore, there is an increasing need to provide versioning support, which is needed in a variety of scenarios: introspection, provenance tracking, revisiting previous intermediate results. To address these challenges, we propose a new lightweight dictionary data structure that simultaneously provides support for multiversioning, persistency and scalability under concurrent access. We demonstrate its effectiveness through a series of experiments, in which it outperforms several state-of-art approaches, both in terms of vertical and horizontal scalability.
BibTex
[30]Wozniak, J.M., Liu, Z., Vescovi, R., Chard, R., Nicolae, B. and Foster, I.T. 2021. Braid-DB: Toward AI-Driven Science with Machine Learning Provenance. SMC’21: The 21st Smoky Mountains Computational Sciences and Engineering Conference (Virtual Event, 2021), 247–261.Details
Keywords: provenance, machine learning, database Abstract: Next-generation scientific instruments will collect data at unprecedented rates: multiple GB/s and exceeding TB/day. Such runs will benefit from automation and steering via machine learning methods, but these methods require new data management and policy techniques. We present here the Braid Provenance Engine (Braid-DB), a system that embraces AI-for-science automation in how and when to analyze and retain data, and when to alter experimental configurations. Traditional provenance systems automate record-keeping so that humans and/or machines can recover how a particular result was obtained—and, when failures occur, diagnose causes and enable rapid restart. Related workflow automation efforts need additional recording about model training inputs, including experiments, simulations, and the structures of other learning and analysis activities. Braid-DB combines provenance and version control concepts to provide a robust and usable solution.
BibTex
[31]Maurya, A., Nicolae, B., Rafique, M., Tonellot, T. and Cappello, F. 2021. Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing. MASCOTS’21: The 29th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (Virtual, Portugal, 2021), 1–8.Details
Keywords: GPU checkpointing, asynchronous I/O, peer-to-peer collaborative caching, multi-level checkpointing Abstract: Efficient checkpointing of distributed data structures periodically at key moments during runtime is a recurring fundamental pattern in a large number of uses cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multilevel checkpointing is a popular technique: distributed processes can write their shard of the data independently to fast local storage tiers, then flush asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g. GPU memory) and the increasing checkpoint frequency, the processes often run out of space and need to fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied in order to reduce the checkpoint sizes. Unfortunately, this reduction is not uniform: some processes will have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study the problem of how to leverage this imbalance in order to reduce I/O overheads for multi-level checkpointing. To this end, we solve an optimization problem of how much data to send from each process that runs out of space to the processes that have spare capacity in order to minimize the amount of time spent blocking in I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multilevel checkpointing.
BibTex
[32]Liu, H., Nicolae, B., Di, S., Cappello, F. and Jog, A. 2021. Accelerating DNN Architecture Search at Scale Using Selective Weight Transfer. CLUSTER’21: The 2021 IEEE International Conference on Cluster Computing (Portland, USA, 2021), 82–93.Details
Keywords: deep learning, neural architecture search, checkpointing, reuse of intermediate data states Abstract: Deep learning applications are rapidly gaining traction both in industry and scientific computing. Unsurprisingly, there has been significant interest in adopting deep learning at a very large scale on supercomputing infrastructures for a variety of scientific applications. A key issue in this context is how to find an appropriate model architecture that is suitable to solve the problem. We call this the neural architecture search (NAS) problem. Over time, many automated approaches have been proposed that can explore a large number of candidate models. However, this remains a time-consuming and resource expensive process: the candidates are often trained from scratch for a small number of epochs in order to obtain a set of top-K best performers, which are fully trained in a second phase. To address this problem, we propose a novel method that leverages checkpoints of previously discovered candidates to accelerate NAS. Based on the observation that the candidates feature high structural similarity, we propose the idea that new candidates need not be trained starting from random weights, but rather from the weights of similar layers of previously evaluated candidates. Thanks to this approach, the convergence of the candidate models can be significantly accelerated and produces candidates that are statistically better based on the objective metrics. Furthermore, once the top-K models are identified, our approach provides a significant speed-up (1.4-1.5x on the average) for the full training.
BibTex
[33]Marcu, O., Costan, A., Nicolae, B. and Antoniu, G. 2021. Virtual Log-Structured Storage for High-Performance Streaming. CLUSTER’21: The 2021 IEEE International Conference on Cluster Computing (Portland, USA, 2021), 135–145.Details
Keywords: replicated virtual log, stream storage, log structured, durability, consistent stream ordering Abstract: Over the past decade, given the higher number of data sources (e.g., Cloud applications, Internet of things) and critical business demands, Big Data transitioned from batch-oriented to real-time analytics. Stream storage systems, such as Apache Kafka, are well known for their increasing role in real-time Big Data analytics. F r scalable stream data ingestion and processing, they logically split a data stream topic into multiple partitions. Stream storage systems keep multiple data stream copies to protect against data loss while implementing a stream partition as a replicated log. This architectural choice enables simplified development while trading cluster size with performance and the number of streams optimally managed. This paper introduces a shared virtual log-structured storage approach for improving the cluster throughput when multiple producers and consumers write and consume in parallel data streams. Stream partitions are associated with shared replicated virtual logs transparently to the user, effectively separating the implementation of stream partitioning (and data ordering) from data replication (and durability). We implement the virtual log technique in the KerA stream storage system. When comparing with Apache Kafka, KerA improves the cluster ingestion throughput (for replication factor three) by up to 4x when multiple producers write over hundreds of data streams.
BibTex
[34]Bicer, T., Yu, X., Ching, D.J., Chard, R., Cherukara, M.J., Nicolae, B. and Kettimuthu, R. 2021. High-Performance Ptychographic Reconstruction with Federated Facilities. SMC’21: The 2021 Smoky Mountains Computational Sciences and Engineering Conference (Kingsport, United States, 2021), 173–189.Details
Keywords: ptychography, high-performance computing, synchrotron light source, scientific computing, federation Abstract: Beamlines at synchrotron light source facilities are powerful scientific instruments used to image samples and observe phenomena at high spatial and temporal resolutions. Typically, these facilities are equipped only with modest compute resources for the analysis of generated experimental datasets. However, high data rate experiments can easily generate data in volumes that take days (or even weeks) to process on those local resources. To address this challenge, we present a system that unifies leadership computing and experimental facilities by enabling the automated establishment of data analysis pipelines that extend from edge data acquisition systems at synchrotron beamlines to remote computing facilities; under the covers, our system uses Globus Auth authentication to minimize user interaction, funcX to run user-defined functions on supercomputers, and Globus Flows to define and execute workflows. We describe the application of this system to ptychography, an ultra-high-resolution coherent diffraction imaging technique that can produce 100s of gigabytes to terabytes in a single experiment. When deployed on the DGX A100 ThetaGPU cluster at the Argonne Leadership Computing Facility and a microscopy beamline at the Advanced Photon Source, our system performs analysis as an experiment progresses to provide timely feedback.
BibTex
[35]Morales, N., Teranishi, K., Nicolae, B., Trott, C. and Cappello, F. 2021. Towards High Performance Resilience Using Performance Portable Abstractions. EuroPar’21: 27th International European Conference on Parallel and Distributed Systems (Lisbon, Portugal, 2021), 47–60.Details
Keywords: Performance Portability, Resilience, Fault Tolerance, Checkpointing, Programming Models Abstract: In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components means higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-art resilience techniques based on checkpoint/restart are lagging behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper we aim to close this gap by introducing a set of abstractions that make it easier for the application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint-restart using high performance and scalable asynchronously techniques. We illustrate the feasibility of our proposal using a prototype that combines the Kokkos runtime (HPC performance portability), with the VELOC runtime (large-scale low overhead checkpoint-restart). Our experimental results show negligible performance overhead compared compared with a manually tuned implementation of checkpoint-restart while requiring minimal changes in the application code
BibTex
[36]Tseng, S.-M., Nicolae, B., Cappello, F. and Chandramowlishwaran, A. 2021. Demystifying asynchronous I/O Interference in HPC applications. The International Journal of High Performance Computing Applications. 35, 4 (2021), 391–412. DOI:https://doi.org/10.1177/10943420211016511.Details
Keywords: asynchronous I/O, interference study Abstract: With increasing complexity of HPC workflows, data management services need to perform expensive I/O operations asynchronously in the background, aiming to overlap the I/O with the application runtime. However, this may cause interference due to competition for resources: CPU, memory/network bandwidth. The advent of multi-core architectures has exacerbated this problem, as many I/O operations are issued concurrently, thereby competing not only with the application but also among themselves. Furthermore, the interference patterns can dynamically change as a response to variations in application behavior and I/O subsystems (e.g. multiple users sharing a parallel file system). Without a thorough understanding, I/O operations may perform suboptimally, potentially even worse than in the blocking case. To fill this gap, this paper investigates the causes and consequences of interference due to asynchronous I/O on HPC systems. Specifically, we focus on multi-core CPUs and memory bandwidth, isolating the interference due to each resource. Then, we perform an in-depth study to explain the interplay and contention in a variety of resource sharing scenarios such as varying priority and number of background I/O threads and different I/O strategies: sendfile, read/write, mmap/write underlining trade-offs. The insights from this study are important both to enable guided optimizations of existing background I/O, as well as to open new opportunities to design advanced asynchronous I/O strategies.
BibTex
[37]Hobson, T., Yildiz, O., Nicolae, B., Huang, J. and Peterka, T. 2021. Shared-Memory Communication for Containerized Workflows. CCGrid’21: The 21th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (Virtual, Australia, 2021), 123–132.Details
Keywords: shared memory, workflow systems, containers Abstract: Scientific computation increasingly consists of a workflow of interrelated tasks. Containerization can make workflow systems more manageable, reproducible, and portable, but containers can impede communication due to their focus on encapsulation. In some circumstances, shared-memory regions are an effective way to improve performance of workflows; however sharing memory between containerized workflow tasks is difficult. In this work, we have created a software library called Dhmem that manages shared memory between workflow tasks in separate containers, with minimal code change and performance overhead. Instead of all code being in the same container, Dhmem allows a separate container for each workflow task to be constructed completely independently. Dhmem enables additional functionality: easy integration in existing workflow systems, communication configuration at runtime based on the environment, and scalable performance.
BibTex
[38]Nicolae, B., Moody, A., Kosinovsky, G., Mohror, K. and Cappello, F. 2021. VELOC: VEry Low Overhead Checkpointing in the Age of Exascale. SuperCheck’21: The First International Symposium on Checkpointing for Supercomputing (Virtual Event, 2021).Details
Keywords: HPC, checkpoint-restart, state preservation, resilience Abstract: Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that lead to poor scalability and performance. As modern HPC infrastructures continue to evolve, there is a growing gap between compute capacity vs. I/O capabilities. Furthermore, the storage hierarchy is becoming increasingly heterogeneous: in addition to parallel file systems, it comprises burst buffers, key-value stores, deep memory hierarchies at node level, etc. In this context, state of art is insufficient to deal with the diversity of vendor APIs, performance and persistency characteristics. This extended abstract presents an overview of VeloC (Very Low Overhead Checkpointing System), a checkpointing runtime specifically design to address these challenges for the next generation Exascale HPC applications and systems. VeloC offers a simple API at user level, while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous storage.
BibTex
[39]Wozniak, J., Yoo, H., Mohd-Yusof, J., Nicolae, B., Collier, N., Ozik, J., Brettin, T. and Stevens, R. 2020. High-bypass Learning: Automated Detection of Tumor Cells That Significantly Impact Drug Response. MLHPC’20: The 2020 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (in conjuction with SC’20) (Virtual Event, 2020), 1–10.Details
Keywords: deep learning, sensitivity analysis, outlier detection, ensembles, workflows Abstract: Machine learning in biomedicine is reliant on the availability of large, high-quality data sets. These corpora are used for training statistical or deep lea ning -based models that can be validated against other data sets and ultimately used to guide decisions. The quality of these data sets is an essential component of the quality of the models and their decisions. Thus, identifying and inspecting outlier data is critical for evaluating, curating, and using biomedical data sets. Many techniques are available to look for outlier data, but it is not clear how to evaluate the impact on highly complex deep learning methods. In this paper, we use deep learning ensembles and workflows to construct a system for automatically identifying data subsets that have a large impact on the trained models. These effects can be quantified and presented to the user for further inspection, which could improve data quality overall. We then present results from running this method on the nearexascale Summit supercomputer.
BibTex
[40]Nicolae, B. 2020. DataStates: Towards Lightweight Data Models for Deep Learning. SMC’20: The 2020 Smoky Mountains Computational Sciences and Engineering Conference (Nashville, United States, 2020), 117–129.Details
Keywords: deep learning, state preservation, clone, model reuse Abstract: A key emerging pattern in deep learning applications is the need to capture intermediate DNN model snapshots and preserve or clone them to explore a large number of alternative training and/or inference paths. However, with increasing model complexity and new training approaches that mix data, model, pipeline and layer-wise parallelism, this pattern is challenging to address in a scalable and efficient manner. To this end, this position paper advocates for rethinking how to represent and manipulate DNN learning models. It relies on a broader notion of data states, a collection of annotated, potentially distributed data sets (tensors in the case of DNN models) that AI applications can capture at key moments during the runtime and revisit/reuse later. Instead explicitly interacting with the storage layer (e.g., write to a file), users can "tag" DNN models at key moments during runtime with metadata that expresses attributes and persistency/movement semantics. A high-performance runtime is the responsible to interpret the metadata and perform the necessary actions in the background, while offering a rich interface to find data states of interest. Using this approach has benefits at several levels: new capabilities, performance portability, high performance and scalability.
BibTex
[41]Maurya, A., Nicolae, B., Guliani, I. and Rafique, M.M. 2020. CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters. DS-RT’20: The 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (Prague, Czech Republic, 2020), 167–174.Details
Keywords: high performance computing, job scheduling, checkpointing strategies Abstract: The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components in this ecosystem in a variety of scenarios, such as, steering of experimental instruments , acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by the supercomputing in-frastructures needs to be complemented with support to schedule opportunistic on-demand analytics jobs, leading to the problem of efficient preemption of batch jobs with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-offs arising between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and loss of progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the aforementioned trade-off and take decisions in near real-time. Compared with other state-of-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution, while achieving high performance and scalability.
BibTex
[42]Nicolae, B., Wozniak, J.M., Dorier, M. and Cappello, F. 2020. DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training. CLUSTER’20: The 2020 IEEE International Conference on Cluster Computing (Kobe, Japan, 2020), 226–236.Details
Keywords: deep learning, data-parallel training, layer-wise parallelism, state cloning and replication, large-scale AI Abstract: Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest improve the training throughput, a mix of data parallel, model parallel, pipeline parallel and layer-wise parallel approaches are making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, readiness duration. Compared with state-of-art approaches, DeepClone shows orders of magnitude improvement for several classes of DNN models.
BibTex
[43]Dey, T., Sato, K., Nicolae, B., Guo, J., Domke, J., Yu, W., Cappello, F. and Mohror, K. 2020. Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning. HPS’20: The 2020 IEEE International Workshop on High-Performance Storage (New Orleans, USA, 2020), 1036–1043.Details
Keywords: high performance computing, checkpoint-restat, machine learning optimization Abstract: With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multilevel checkpointing efficiently, it is important to optimize checkpoint/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
BibTex
[44]Nicolae, B., Li, J., Wozniak, J., Bosilca, G., Dorier, M. and Cappello, F. 2020. DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. CGrid’20: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (Melbourne, Australia, 2020), 172–181.Details
Keywords: deep learning, checkpointing, state preservation, multi-level data persistence, fine-grain asynchronous I/O Abstract: In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) shows significant improvement over state-of-art, both in terms of checkpointing duration and runtime overhead.
BibTex
[45]Nicolae, B., Moody, A., Gonsiorowski, E., Mohror, K. and Cappello, F. 2019. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. IPDPS’19: The 2019 IEEE International Parallel and Distributed Processing Symposium (Rio de Janeiro, Brazil, 2019), 911–920.Details
Keywords: parallel I/O, checkpoint-restart, immutable data, adaptive multilevel asynchronous I/O Abstract: Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asyn-chronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
BibTex
[46]Tseng, S.-M., Nicolae, B., Bosilca, G., Jeannot, E., Chandramowlishwaran, A. and Cappello, F. 2019. Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring. EuroPar’19 : 25th International European Conference on Parallel and Distributed Systems (Goettingen, Germany, 2019), 47–60.Details
Keywords: Work stealing, Prediction of resource utilization, Timeseries forecasting, Network monitoring, Online learning Abstract: Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of the future network utilization, enabling the runtime to plan the background operations in advance, such as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase, it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state-of-the-art on two representative applications.
BibTex
[47]Liang, X., Di, S., Li, S., Tao, D., Nicolae, B., Chen, Z. and Cappello, F. 2019. Significantly Improving Lossy Compression Quality Based on an Optimized Hybrid Prediction Model. SC ’19: 32nd International Conference for High Performance Computing, Networking, Storage and Analytics (Denver, USA, 2019), 1–26.Details
Keywords: Error-Bounded Lossy Compression, Rate Distortion, Data Dumping/Loading, Compression Performance Abstract: With the ever-increasing volumes of data produced by today’s large-scale scientific simulations, error-bounded lossy compression techniques have become critical: not only can they significantly reduce the data size but they also can retain high data fidelity for postanalysis. In this paper, we design a strategy to improve the compression quality significantly based on an optimized, hybrid prediction model. Our contribution is fourfold. (1) We propose a novel, transform-based predictor and optimize its compression quality. (2) We significantly improve the coefficient-encoding efficiency for the data-fitting predictor. (3) We propose an adaptive framework that can select the best-fit predictor accurately for different datasets. (4) We evaluate our solution and several existing state-ofthe-art lossy compressors by running real-world applications on a supercomputer with 8,192 cores. Experiments show that our adaptive compressor can improve the compression ratio by 112∼165% compared with the second-best compressor. The parallel I/O performance is improved by about 100% because of the significantly reduced data size. The total I/O time is reduced by up to 60X with our compressor compared with the original I/O time.
BibTex
[48]Liang, X., Di, S., Tao, D., Li, S., Nicolae, B., Chen, Z. and Cappello, F. 2019. Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation. CLUSTER’19: IEEE International Conference on Cluster Computing (Albuquerque, USA, 2019), 1–11.Details
Keywords: lossy compression, efficient data flush, parallel file systems Abstract: Because of the ever-increasing data being produced by today’s high performance computing (HPC) scientific simulations, I/O performance is becoming a significant bottleneck for their executions. An efficient error-controlled lossy compressor is a promising solution to significantly reduce data writing time for scientific simulations running on supercomputers. In this paper, we explore how to optimize the data dumping performance for scientific simulation by leveraging error-bounded lossy compression techniques. The contributions of the paper are threefold. (1) We propose a novel I/O performance profiling model that can effectively represent the I/O performance with different execution scales and data sizes, and optimize the estimation accuracy of data dumping performance using least square method. (2) We develop an adaptive lossy compression framework that can select the bestfit compressor (between two leading lossy compressors SZ and ZFP) with optimized parameter settings with respect to overall data dumping performance. (3) We evaluate our adaptive lossy compression framework with up to 32k cores on a supercomputer facilitated with fast I/O systems and using realworld scientific simulation datasets. Experiments show that our solution can mostly always lead the data dumping performance to the optimal level with very accurate selection of the bestfit lossy compressor and settings. The data dumping performance can be improved by up to 27% at different scales.
BibTex
[49]Nicolae, B., Riteau, P., Zhen, Z. and Keahey, K. 2019. Transparent Throughput Elasticity for Modern Cloud Storage: An Adaptive Block-Level Caching Proposal. Applying Integration Techniques and Methods in Distributed Systems and Technologies. IGI Global. 156–191.Details
Keywords: cloud computing, storage elasticity, adaptive I/O throughput, block-level caching Abstract: Storage elasticity on the cloud is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. In this chapter, the authors explore how to transparently boost the I/O bandwidth during peak utilization to deliver high performance without over-provisioning storage resources. The proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored during runtime. They show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Second, they introduce a corresponding performance and cost prediction methodology. They demonstrate the benefits of our proposal both for micro-benchmarks and for two real-life applications using large-scale experiments. They conclude with a discussion on how these techniques can be generalized for increasingly complex landscape of modern cloud storage.
BibTex
[50]Caino-Lores, S., Carretero, J., Nicolae, B., Yildiz, O. and Peterka, T. 2019. Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY. IEEE Access. 7, (2019), 156929–156955. DOI:https://doi.org/10.1109/ACCESS.2019.2949836.Details
Keywords: big data, HPC, convergence, data model, Spark, DIY Abstract: Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Later, Spark-DIY is analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment.
BibTex
[51]Li, J., Nicolae, B., Wozniak, J. and Bosilca, G. 2019. Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training. MLHPC’19: The 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (in conjuction with SC’19) (Denver, USA, 2019), 1–8.Details
Keywords: deep learning, behavior analysis, Tensorflow, data-parallel learning, tensor parallelism Abstract: In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability is affected by important parameters such as number of nodes, number of workers, threads per node, batch size; (2) how computational phases are interleaved withall-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction and weight updates mitigate the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weights update, which can be leveraged to mask the overhead of additional background operations that are coupled with the training.
BibTex
[52]Nicolae, B., Cappello, F., Moody, A., Gonsiorowski, E. and Mohror, K. 2018. VeloC: Very Low Overhead Checkpointing System. SC ’18: 31th International Conference for High Performance Computing, Networking, Storage and Analysis (Dallas, USA, 2018).Details
Keywords: HPC, resilience, checkpoint-restart Abstract: Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that lead to poor scalability and performance. As modern HPC infrastructures continue to evolve, there is a growing gap between compute capacity vs. I/O capabilities. Furthermore, the storage hierarchy is becoming increasingly heterogeneous: in addition to parallel file systems, it comprises burst buffers, key-value stores, deep memory hierarchies at node level, etc. In this context, state of art is insufficient to deal with the diversity of vendor APIs, performance and persistency characteristics. This poster proposes VeloC, a low-overhead checkpointing system specifically designed to address the checkpointing needs of future exascale HPC systems. VeloC offers a simple API at user level, while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous storage.
BibTex
[53]Clemente-Castello, F.J., Nicolae, B., Mayo, R. and Fernandez, J.C. 2018. Performance Model of MapReduce Iterative Applications for Hybrid Cloud Bursting. IEEE Transactions on Parallel and Distributed Systems. 29, 8 (2018), 1794–1807. DOI:https://doi.org/10.1109/TPDS.2018.2802932.Details
Keywords: cloud computing, hybrid cloud, bursting, MapReduce Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) can be a cost-effective way to deal with the increasing complexity of big data analytics, especially for iterative applications. However, the low throughput, high latency network link between the on-premise and off-premise resources (“weak link”) makes maintaining scalability difficult. While several data locality techniques have been designed for big data bursting on hybrid clouds, their effectiveness is difficult to estimate in advance. Yet such estimations are critical, because they help users decide whether the extra pay-as-you-go cost incurred by using the off-premise resources justifies the runtime speed-up. To this end, the current paper presents a performance model and methodology to estimate the runtime of iterative MapReduce applications in a hybrid cloud-bursting scenario. The paper focuses on the overhead incurred by the weak link at fine granularity, for both the map and the reduce phases. This approach enables high estimation accuracy, as demonstrated by extensive experiments at scale using a mix of real-world iterative MapReduce applications from standard big data benchmarking suites that cover a broad spectrum of data patterns. Not only are the produced estimations accurate in absolute terms compared with experimental results, but they are also up to an order of magnitude more accurate than applying state-of-art estimation approaches originally designed for single-site MapReduce deployments.
BibTex
[54]Caino-Lores, S., Carretero, J., Nicolae, B., Yildiz, O. and Peterka, T. 2018. Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models. BDCAT’18: 5th IEEE/ACM International Conference on Big Data Computing Applications and Technologies (Zurich, Switzerland, 2018), 1–10.Details
Keywords: big data, Spark, high performance computing, convergence Abstract: Today’s scientific applications are increasingly relying on a variety of data sources, storage facilities, and computing infrastructures, and there is a growing demand for data analysis and visualization for these applications. In this context, exploiting Big Data frameworks for scientific computing is an opportunity to incorporate high-level libraries, platforms, and algorithms for machine learning, graph processing, and streaming; inherit their data awareness and fault-tolerance; and increase productivity. Nevertheless, limitations exist when Big Data platforms are integrated with an HPC environment, namely poor scalability, severe memory overhead, and huge development effort. This paper focuses on a popular Big Data framework -Apache Spark- and proposes an architecture to support the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model. The resulting platform preserves the data abstraction and programming interface of Spark, without conducting any changes in the framework, but allows the user to delegate operations to the MPI layer. The evaluation of our prototype shows that our approach integrates Spark and MPI efficiently at scale, so end users can take advantage of the productivity facilitated by the rich ecosystem of high-level Big Data tools and libraries based on Spark, without compromising efficiency and scalability.
BibTex
[55]Marcu, O.-C., Costan, A., Antoniu, G., Perez-Hernandez, M.S., Nicolae, B., Tudoran, R. and Bortoli, S. 2018. KerA: Scalable Data Ingestion for Stream Processing. ICDCS’18: 38th IEEE International Conference on Distributed Computing Systems (Vienna, Austria, 2018), 1480–1485.Details
Keywords: big data, stream computing, data ingestion Abstract: Big Data applications are increasingly moving from batch-oriented execution models to stream-based models that enable them to extract valuable insights close to real-time. To support this model, an essential part of the streaming processing pipeline is data ingestion, i.e., the collection of data from various sources (sensors, NoSQL stores, filesystems, etc.) and their delivery for processing. Data ingestion needs to support high throughput, low latency and must scale to a large number of both data producers and consumers. Since the overall performance of the whole stream processing pipeline is limited by that of the ingestion phase, it is critical to satisfy these performance goals. However, state-of-art data ingestion systems such as Apache Kafka build on static stream partitioning and offset-based record access, trading performance for design simplicity. In this paper we propose KerA, a data ingestion framework that alleviate the limitations of state-of-art thanks to a dynamic partitioning scheme and to lightweight indexing, thereby improving throughput, latency and scalability. Experimental evaluations show that KerA outperforms Kafka up to 4x for ingestion throughput and up to 5x for the overall stream processing throughput. Furthermore, they show that KerA is capable of delivering data fast enough to saturate the big data engine acting as the consumer.
BibTex
[56]Nicolae, B., Costa, C., Misale, C., Katrinis, K. and Park, Y. 2017. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems. 28, 6 (2017), 1663–1674. DOI:https://doi.org/10.1109/TPDS.2016.2627558.Details
Keywords: elastic buffering, big data analytics, data shuffling, memory-efficient I/O, Spark Abstract: Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.
BibTex
[57]Clemente-Castello, F.J., Nicolae, B., Mayo, M.M.R.R. and Fernandez, J.C. 2017. Evaluation of Data Locality Strategies for Hybrid Cloud Bursting of Iterative MapReduce. CCGrid’17 : 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Madrid, Spain, 2017), 181–185.Details
Keywords: hybrid cloud, big data analytics, data locality, data management, scheduling, MapReduce, iterative Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall capacity during peak utilization) is a popular and cost-effective way to deal with the increasing complexity of big data analytics. It is particularly promising for iterative MapReduce applications that reuse massive amounts of input data at each iteration, which compensates for the high overhead and cost of concurrent data transfers from the on-premise to the off-premise VMs over a weak inter-site link that is of limited capacity. In this paper we study how to combine various MapReduce data locality techniques designed for hybrid cloud bursting in order to achieve scalability for iterative MapReduce applications in a cost-effective fashion. This is a non-trivial problem due to the complex interaction between the data movements over the weak link and the scheduling of computational tasks that have to adapt to the shifting data distribution. We show that using the right combination of techniques, iterative MapReduce applications can scale well in a hybrid cloud bursting scenario and come even close to the scalability observed in single sites.
BibTex
[58]Marcu, O.-C., Costan, A., Antoniu, G., Perez-Hernandez, M.S., Tudoran, R., Bortoli, S. and Nicolae, B. 2017. Towards a unified storage and ingestion architecture for stream processing. BigData’17: 2017 IEEE International Conference on Big Data (Boston, USA, 2017), 2402–2407.Details
Keywords: Big Data, Streaming, Storage, Ingestion, Unified Architecture Abstract: Big Data applications are rapidly moving from a batch-oriented execution model to a streaming execution model in order to extract value from the data in real-time. However, processing live data alone is often not enough: in many cases, such applications need to combine the live data with previously archived data to increase the quality of the extracted insights. Current streaming-oriented runtimes and middlewares are not flexible enough to deal with this trend, as they address ingestion (collection and pre-processing of data streams) and persistent storage (archival of intermediate results) using separate services. This separation often leads to I/O redundancy (e.g., write data twice to disk or transfer data twice over the network) and interference (e.g., I/O bottlenecks when collecting data streams and writing archival data simultaneously). In this position paper, we argue for a unified ingestion and storage architecture for streaming data that addresses the aforementioned challenge. We identify a set of constraints and benefits for such a unified model, while highlighting the important architectural aspects required to implement it in real life. Based on these aspects, we briefly sketch our plan for future work that develops the position defended in this paper.
BibTex
[59]Marcu, O.-C., Tudoran, R., Nicolae, B., Costan, A., Antoniu, G. and Perez-Hernandez, M.S. 2017. Exploring Shared State in Key-Value Store for Window-Based Multi-Pattern Streaming Analytics. EBDMA’17: 1st Workshop on the Integration of Extreme Scale Computing and Big Data Management and Analytics (Madrid, Spain, 2017), 1044–1052.Details
Keywords: Big Data, sliding-window aggregations, memory deduplication, Apache Flink, streaming analytics Abstract: We are now witnessing an unprecedented growth of data that needs to be processed at always increasing rates in order to extract valuable insights. Big Data streaming analytics tools have been developed to cope with the online dimension of data processing: they enable real-time handling of live data sources by means of stateful aggregations (operators). Current state-of-art frameworks (e.g. Apache Flink [1]) enable each operator to work in isolation by creating data copies, at the expense of increased memory utilization. In this paper, we explore the feasibility of deduplication techniques to address the challenge of reducing memory footprint for window-based stream processing without significant impact on performance. We design a deduplication method specifically for window-based operators that rely on key-value stores to hold a shared state. We experiment with a synthetically generated workload while considering several deduplication scenarios and based on the results, we identify several potential areas of improvement. Our key finding is that more fine-grained interactions between streaming engines and (key-value) stores need to be designed in order to better respond to scenarios that have to overcome memory scarcity.
BibTex
[60]Nicolae, B., Costa, C., Misale, C., Katrinis, K. and Park, Y. 2016. Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics. CCGrid’16: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Cartagena, Colombia, 2016), 409–412.Details
Keywords: elastic buffering, Big data analytics, data shuffling, memory-efficient I/O, Spark Abstract: Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.
BibTex
[61]Clemente-Castello, F.J., Nicolae, B., Mayo, R., Fernandez, J.C. and Rafique, M. 2016. On Exploiting Data Locality for Iterative MapReduce Applications in Hybrid Clouds . BDCAT’16: 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (Shanghai, China, 2016), 118–122.Details
Keywords: hybrid cloud, bursting, big data analytics, iterative, MapReduce, data locality, data management, scheduling Abstract: Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the capacity during peak utilization), has made significant impact especially for big data analytics, where the explosion of data sizes and increasingly complex computations frequently leads to insufficient local data center capacity. Cloud bursting however introduces a major challenge to runtime systems due to the limited throughput and high latency of data transfers between on-premise and off-premise resources (weak link). This issue and how to address it is not well understood. We contribute with a comprehensive study on what challenges arise in this context, what potential strategies can be applied to address them and what best practices can be leveraged in real-life. Specifically, we focus our study on iterative MapReduce applications, which are a class of large-scale data intensive applications particularly popular on hybrid clouds. In this context, we study how data locality can be leveraged over the weak link both from the storage layer perspective (when and how to move it off-premise) and from the scheduling perspective (when to compute off-premise). We conclude with a brief discussion on how to set up an experimental framework suitable to study the effectiveness of our proposal in future work.
BibTex
[62]Nicolae, B., Kochut, A. and Karve, A. 2016. Towards Scalable On-Demand Collective Data Access in IaaS Clouds: An Adaptive Collaborative Content Exchange Proposal. Journal of Parallel and Distributed Computing. 87, (2016), 67–79. DOI:https://doi.org/10.1016/j.jpdc.2015.09.006.Details
Keywords: IaaS, scalable content dissemination, collective I/O, on-demand data access, high thoughput, collaborative I/O, adaptive prefetching Abstract: A critical feature of IaaS cloud computing is the ability to quickly disseminate the content of a shared dataset at large scale. In this context, a common pattern is collective read, i.e., accessing the same VM image or dataset from a large number of VM instances concurrently. Several approaches deal with this pattern either by means of pre-broadcast before access or on-demand concurrent access to the repository where the image or dataset is stored. We propose a different solution using a hybrid strategy that augments on-demand access with a collaborative scheme in which the VMs leverage similarities between their access pattern in order to anticipate future read accesses and exchange chunks between themselves in order to reduce contention to the remote repository. Large scale experiments show significant improvement over conventional approaches from multiple perspectives: completion time, sustained read throughput, fairness of I/O read operations and bandwidth utilization.
BibTex
[63]Tudoran, R., Nicolae, B. and Brasche, G. 2016. Data Multiverse : The Uncertainty Challenge of Future Big Data Analytics. IKC’16: 2nd International KEYSTONE Conference (Cluj-Napoca, Romania, 2016), 17–22.Details
Keywords: big data analytics, large scale data processing, data access model, data uncertainty, approximate computing Abstract: With the explosion of data sizes, extracting valuable insight out of big data becomes increasingly difficult. New challenges begin to emerge that complement traditional, long-standing challenges related to building scalable infrastructure and runtime systems that can deliver the desired level of performance and resource efficiency. This vision paper focuses on one such challenge, which we refer to as the analytics uncertainty: with so much data available from so many sources, it is difficult to anticipate what the data can be useful for, if at all. As a consequence, it is difficult to anticipate what data processing algorithms and methods are the most appropriate to extract value and insight. In this context, we contribute with a study on current big data analytics state-of-art, the use cases where the analytics uncertainty is emerging as a problem and future research directions to address them.
BibTex
[64]Nicolae, B. 2015. Techniques to improve the scalability of collective checkpointing at large scale. HPCS’15: The 2015 International Conference on High Performance Computing and Simulation (Amsterdam, The Netherlands, 2015), 660–661.Details
Keywords: checkpointing, checkpoint restart, redundancy, scalability, data resilience, high performance computing, adaptive I/O, collective I/O, deduplication Abstract: Scientific and data-intensive computing have matured over the last couple of years in all fields of science and industry. Their rapid increase in complexity and scale has prompted ongoing efforts dedicated to reach exascale infrastructure capability by the end of the decade. However, advances in this context are not homogeneous: I/O capabilities in terms of networking and storage are lagging behind computational power and are often considered a major limitation that that persists even at petascale [1]. A particularly difficult challenge in this context are collective I/O access patterns (which we henceforth refer to as collective checkpointing) where all processes simultaneously dump large amounts of related data simultaneously to persistent storage. This pattern is often exhibited by large-scale, bulk-synchronous applications in a variety of circumstances, e.g., when they use checkpoint-restart fault tolerance techniques to save intermediate computational states at regular time intervals [2] or when intermediate, globally synchronized results are needed during the lifetime of the computation (e.g. to understand how a simulation progresses during key phases). Under such circumstances, a decoupled storage system (e.g. a parallel file system such as GPFS [3] or a specialized storage system such as BlobSeer [4]) does not provide sufficient I/O bandwidth to handle the explosion of data sizes: for example, Jones et al. [5] predict dump times in the order of several hours. In order to overcome the I/O bandwidth limitation, one potential solution is to equip the compute nodes with local storage (i.e., HDDs, SSDs, NVMs, etc.) or use I/O forwarding nodes. Using this approach, a large part of the data can be dumped locally, which completely avoids the need to consume and compete for the I/O bandwidth of a decoupled storage system. However, this is not without drawbacks: the local storage devices or I/O forwarding nodes are prone to failures and as such the data they hold is volatile. Thus, a popular approach in practice is to wait until the local dump has finished, then let the application continue while the checkpoints are in turn dumped to a parallel file system in background. Such a straightforward solution can be effective at hiding the overhead incurred to due I/O bandwidth limitations, but this not necessarily the case: it may happen that there is not enough time to fully flush everything to the parallel file system before the next collective checkpoint request is issued. In fact, this a likely scenario with growing scale, as the failure rate increases, which introduces the need to checkpoint at smaller intervals in order to compensate for this effect. Furthermore, a smaller checkpoint interval also means local dumps are frequent and as such their overhead becomes significant itself.
BibTex
[65]Roman, R.-I., Nicolae, B., Costan, A. and Antoniu, G. 2015. Understanding Spark Performance in Hybrid and Multi-Site Clouds. BDAC-15: 6th International Workshop on Big Data Analytics: Challenges and Opportunities (Austin, USA, 2015).Details
Keywords: big data, Spark, hybrid cloud, network bottleneck Abstract: Recently, hybrid multi-site big data analytics (that combines on-premise with off-premise resources) has gained increasing popularity as a tool to process large amounts of data on-demand, without additional capital investment to increase the size of a single datacenter. However, making the most out of hybrid setups for big data analytics is challenging because on-premise resources can communicate with off-premise resources at significantly lower throughput and higher latency. Understanding the impact of this aspect is not trivial, especially in the context of modern big data an-alytics frameworks that introduce complex communication patterns and are optimized to overlap communication with computation in order to hide data transfer latencies. This paper contributes with a work-in-progress study that aims to identify and explain this impact in relationship to the known behavior on a single cloud. To this end, it analyses a representative big data workload on a hybrid Spark setup. Unlike previous experience that emphasized low end-impact of network communications in Spark, we found significant overhead in the shuffle phase when the bandwidth between the on-premise and off-premise resources is sufficiently small.
BibTex
[66]Nicolae, B., Riteau, P. and Keahey, K. 2015. Towards Transparent Throughput Elasticity for IaaS Cloud Storage: Exploring the Benefits of Adaptive Block-Level Caching. International Journal of Distributed Systems and Technologies. 6, 4 (2015), 21–44. DOI:https://doi.org/10.4018/IJDST.2015100102.Details
Keywords: IaaS, cloud computing, storage elasticity, adaptive I/O, virtual disk, block-level caching, performance prediction, cost prediction, modeling Abstract: Storage elasticity on IaaS clouds is a crucial feature in the age of data-intensive computing, especially when considering fluctuations of I/O throughput. This paper provides a transparent solution that automatically boosts I/O bandwidth during peaks for underlying virtual disks, effectively avoiding over-provisioning without performance loss. Our proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and thus more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored. We show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. Furthermore, we introduce a performance and cost prediction methodology that can be used both independently to estimate in advance what trade-off between performance and cost is possible, as well as an optimization technique that enables better cache size selection to meet the desired performance level with minimal cost. We demonstrate the benefits of our proposal both for microbenchmarks and for two real-life applications using large-scale experiments.
BibTex
[67]Clemente-Castello, F.J., Nicolae, B., Katrinis, K., Rafique, M.M., Mayo, R., Fernandez, J.C. and Loreti, D. 2015. Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce. UCC’15: 8th IEEE/ACM International Conference on Utility and Cloud Computing (Limassol, Cyprus, 2015), 290–299.Details
Keywords: hybrid cloud, big data analytics, iterative, MapReduce, data locality, performance prediction Abstract: The cloud computing model has seen tremendous commercial success through its materialization via two prominent models to date, namely public and private cloud. Recently, a third model combining the former two service models as on-/off-premise resources has been receiving significant market traction: hybrid cloud. While state of art techniques that address workload performance prediction and efficient workload execution over hybrid cloud setups exist, how to address data-intensive workloads - including Big Data Analytics - in similar environments is nascent. This paper addresses this gap by taking on the challenge of bursting over hybrid clouds for the benefit of accelerating iterative MapReduce applications. We first specify the challenges associated with data locality and data movement in such setups. Subsequently, we propose a novel technique to address the locality issue, without requiring changes to the MapReduce framework or the underlying storage layer. In addition, we contribute with a performance prediction methodology that combines modeling with micro-benchmarks to estimate completion time for iterative MapReduce applications, which enables users to estimate cost-to-solution before committing extra resources from public clouds. We show through experimentation in a dual-Openstack hybrid cloud setup that our solutions manage to bring substantial improvement at predictable cost-control for two real-life iterative MapReduce applications: large-scale machine learning and text analysis.
BibTex
[68]Kochut, A., Karve, A. and Nicolae, B. 2015. Towards Efficient On-demand VM Provisioning: Study of VM Runtime I/O Access Patterns to Shared Image Content. IM’15: 13th IFIP/IEEE International Symposium on Integrated Network Management (Ottawa, Canada, 2015), 321–329.Details
Keywords: cloud computing, Iaas, content similarity, deduplication, correlations, I/O access pattern, virtual disk Abstract: IaaS clouds are becoming a standard way of providing elastic compute capacity at an affordable cost. To achieve that, VM provisioning system has to optimally allocate I/O and compute resources. One of the significant optimization opportunities is to leverage content similarity across VM images. While many studies have been devoted to de-duplication of VM images, this paper is to the best of our knowledge, the first one to comprehensively study the relationship between the VM image similarity structure and the runtime I/O access patterns. Our study focuses on block-level similarity and I/O access patterns, revealing correlations between common content of different images and application-level access semantics. Furthermore, it also zooms on several runtime I/O access pattern aspects, such as the similarity of the sequences in which common content is accessed. The results show a strong tendency for access pattern locality within common content clusters across VM images, regardless of the rest of the composition. Furthermore, it reveals a strong tendency for read accesses to refer to the same subsets of blocks within the common content clusters, while preserving the same ordering. These results provide important insights that can be used to optimize on-demand VM image content delivery under concurrency.
BibTex
[69]Nicolae, B. 2015. Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead. IPDPS ’15: 29th IEEE International Parallel and Distributed Processing Symposium (Hyderabad, India, 2015), 1023–1032.Details
Keywords: scalable I/O, checkpoint restart, checkpointing, data replication, deduplication, collective I/O, redundancy, data resilience, high availability Abstract: Dumping large amounts of related data simulta-neously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable re-silience and high availability. However, replication introduces overhead, both in terms of network traffic necessary to distribute replicas, as well as extra storage space requirements. To reduce this overhead, state-of-art techniques often apply redundancy elimination (e.g. compression or deduplication) before replication, ignoring the natural redundancy that is already present. By contrast, this paper proposes a novel scheme that treats redundancy elimination and replication as a single co-optimized phase: remotely duplicated data is detected and directly leveraged to maintain a desired replication factor by keeping only as many replicas as needed and adding more if necessary. In this context, we introduce a series of high performance algorithms specifically designed to operate under tight and controllable constrains at large scale. We present how this idea can be leveraged in practice and demonstrate its viability for two real-life HPC applications.
BibTex
[70]Nicolae, B., Karve, A. and Kochut, A. 2015. Discovering and Leveraging Content Similarity to Optimize Collective On-Demand Data Access to IaaS Cloud Storage. CCGrid’15: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Shenzhen, China, 2015), 211–220.Details
Keywords: collective I/O, content similarity, deduplication, cloud storage, on-demand data access Abstract: A critical feature of IaaS cloud computing is the ability to quickly disseminate the content of a shared dataset at large scale. In this context, a common pattern is collective on-demand read, i.e., accessing the same VM image or dataset from a large number of V Minstances concurrently. There are various techniques that avoid I/Ocontention to the storage service where the dataset is located without relying on pre-broadcast. Most such techniques employ peer-to-peer collaborative behavior where the VM instances exchange information about the content that was accessed during runtime, such that it impossible to fetch the missing data pieces directly from each other rather than the storage system. However, such techniques are often limited within a group that performs a collective read. In light of high data redundancy on large IaaS data centers and multiple users that simultaneously run VM instance groups that perform collective reads, an important opportunity arises: enabling unrelated VMinstances belonging to different groups to collaborate and exchange common data in order to further reduce the I/O pressure on the storage system. This paper deals with the challenges posed by such absolution, which prompt the need for novel techniques to efficiently detect and leverage common data pieces across groups. To this end, we introduce a low-overhead fingerprint based approach that we evaluate and demonstrate to be efficient in practice for a representative scenario on dozens of nodes and a variety of group configurations.
BibTex
[71]Nicolae, B., Riteau, P. and Keahey, K. 2014. Bursting the Cloud Data Bubble: Towards Transparent Storage Elasticity in IaaS Clouds. IPDPS ’14: Proc. 28th IEEE International Parallel and Distributed Processing Symposium (Phoenix, USA, 2014), 135–144.Details
Keywords: adaptive I/O, cloud computing, elastic storage, utilization prediction Abstract: Storage elasticity on IaaS clouds is an important feature for data-intensive workloads: storage requirements can vary greatly during application runtime, making worst-case over-provisioning a poor choice that leads to unnecessarily tied-up storage and extra costs for the user. While the ability to adapt dynamically to storage requirements is thus attractive, how to implement it is not well understood. Current approaches simply rely on users to attach and detach virtual disks to the virtual machine (VM) instances and then manage them manually, thus greatly increasing application complexity while reducing cost efficiency. Unlike such approaches, this paper aims to provide a transparent solution that presents a unified storage space to the VM in the form of a regular POSIX file system that hides the details of attaching and detaching virtual disks by handling those actions transparently based on dynamic application requirements. The main difficulty in this context is to understand the intent of the application and regulate the available storage in order to avoid running out of space while minimizing the performance overhead of doing so. To this end, we propose a storage space prediction scheme that analyzes multiple system parameters and dynamically adapts monitoring based on the intensity of the I/O in order to get as close as possible to the real usage. We show the value of our proposal over static worst-case over-provisioning and simpler elastic schemes that rely on a reactive model to attach and detach virtual disks, using both synthetic benchmarks and real-life data-intensive applications. Our experiments demonstrate that we can reduce storage waste/cost by 30-40% with only 2-5% performance overhead.
BibTex
[72]Nicolae, B., Riteau, P. and Keahey, K. 2014. Transparent Throughput Elasticity for IaaS Cloud Storage Using Guest-Side Block-Level Caching. UCC’14: 7th IEEE/ACM International Conference on Utility and Cloud Computing (London, UK, 2014), 186–195.Details
Keywords: adaptive I/O, block-level caching, cloud computing, elastic storage, virtual disk, utilization prediction Abstract: Storage elasticity on IaaS clouds is a crucial feature in the age of data-intensive computing. However, the traditional provisioning model of leveraging virtual disks of fixed capacity and performance characteristics has limited ability to match the increasingly dynamic nature of I/O application requirements. This mismatch is particularly problematic in the context of scientific applications that interleave periods of I/O inactivity with I/O intensive bursts. In this context, overprovisioning for best performance during peaks leads to significant extra costs because of unnecessarily tied-up resources, while any other trade-off leads to performance loss. This paper provides a transparent solution that automatically boosts I/O bandwidth during peaks for underlying virtual disks, effectively avoiding overprovisioning without performance loss. Our proposal relies on the idea of leveraging short-lived virtual disks of better performance characteristics (and thus more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored. We show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. We demonstrate the benefits of our proposal both for microbenchmarks and for two real-life applications using large-scale experiments.
BibTex
[73]Petcu, D., Gonzalez-Velez, H., Nicolae, B., Garcia-Gomez, J.M., Fuster-Garcia, E. and Sheridan, C. 2014. Next Generation HPC Clouds: A View for Large-Scale Scientific and Data-Intensive Applications. DIHC’14: The 2nd Workshop on Dependability and Interoperability in Heterogeneous Clouds (Porto, Portugal, 2014), 26–37.Details
Keywords: cloud storage, data analytics, heterogeneous clouds, high performance computing Abstract: In spite of the rapid growth of Infrastructure-as-a-Service offers, support to run data-intensive and scientific applications large-scale is still limited. On the user side, existing features and programming models are insufficiently developed to express an application in such way that it can benefit from an elastic infrastructure that dynamically adapts to the requirements, which often leads to unnecessary over-provisioning and extra costs. On the provider side, key performance and scalability issues arise when having to deal with large groups of tightly coupled virtualized resources needed by such applications, which is especially challenging considering the multi-tenant dimension where sharing of physical resources introduces interference both inside and across large virtual machine deployments. This paper contributes with a holistic vision that imagines a tight integration between programming models, runtime middlewares and the virtualization infrastructure in order to provide a framework that transparently handles allocation and utilization of heterogeneous resources while dealing with performance and elasticity issues.
BibTex
[74]Ene, S., Nicolae, B., Costan, A. and Antoniu, G. 2014. To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload. DataCloud ’14: The 5th International Workshop on Data-Intensive Computing in the Clouds (New Orleans, USA, 2014), 9–16.Details
Keywords: big data, data management, incremental processing, MapReduce Abstract: Research on cloud-based Big Data analytics has focused so far on optimizing the performance and cost-effectiveness of the computations, while largely neglecting an important as-pect: users need to upload massive datasets on clouds for their computations. This paper studies the problem of run-ning MapReduce applications when considering the simulta-neous optimization of performance and cost of both the data upload and its corresponding computation taken together. We analyze the feasibility of incremental MapReduce ap-proaches to advance the computation as much as possible during the data upload by using already transferred data to calculate intermediate results. Our key finding shows that overlapping the transfer time with as many incremental com-putations as possible is not always efficient: a better solution is to wait for enough to fill the computational capacity of the MapReduce cluster. Results show significant performance and cost reduction compared with state-of-the-art solutions that leverage incremental computations in a naive fashion.
BibTex
[75]Nicolae, B., Lemarinier, P. and Meneghin, M. 2014. Leveraging Naturally Distributed Data Redundancy to Optimize Collective Replication. SC ’14: 27th International Conference for High Performance Computing, Networking, Storage and Analysis (New Orleans, USA, 2014).Details
Keywords: high performance computing, data resilience, high availability, replication, deduplication, collective I/O, redundancy, fault tolerance Abstract: Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. However, replication introduces overhead, both in terms of network traffic necessary to distribute replicas, as well as extra storage space requirements. To reduce this overhead, state-of-art techniques often apply redundancy elimination (e.g. compression or de-duplication) before replication, ignoring the natural redundancy that is already present. By contrast, this paper proposes a novel scheme that treats redundancy elimination and replication as a single co-optimized phase: remotely duplicated data is detected and directly leveraged to maintain a desired replication factor by keeping only as many replicas as neededand adding more if necessary. In this context, we introduce a series of high performance algorithms specifically designed to operate under tight and controllable constrains at large scale. We present how this idea can be leveraged in practice and demonstrate its viability for two real-life HPC applications.
BibTex
[76]Nicolae, B. 2013. Understanding Vertical Scalability of I/O Virtualization for MapReduce Workloads: Challenges and Opportunities. BigDataCloud ’13: 2nd Workshop on Big Data Management in Clouds (held in conjunction with EuroPar’13) (Aachen, Germany, 2013).Details
Keywords: I/O virtualization, big data, vertical I/O scalability, big data, IaaS, cloud computing Abstract: As the explosion of data sizes continues to push the limits of our abilities to efficiently store and process big data, next generation big data systems face multiple challenges. One such important challenge relates to the limited scalability of I/O, a determining factor in the overall performance of big data applications. Although paradigms like MapReduce have long been used to take advantage of local disks and avoid data movements over the network as much as possible, with increasing core count per node, local storage comes under increasing I/O pressure itself and prompts the need to equip nodes with multiple disks. However, given the rising need to virtualize large datacenters in order to provide a more flexible allocation and consolidation of physical resources (transforming them into public or private/hybrid clouds), the following questions arise: is it possible to take advantage of multiple local disks at virtual machine (VM) level in order to speed up big data analytics? If so, what are the best practices to achieve a high virtualized aggregated I/O throughput? This paper aims to answer these questions in the context of I/O intensive MapReduce workloads: it analyzes and characterizes their behavior under different virtualization scenarios in order to propose best practices for current approaches and speculate on future areas of improvement.
BibTex
[77]Antoniu, G. et al. 2013. Scalable Data Management for Map-Reduce-based Data-Intensive Applications: A View for Cloud and Hybrid Infrastructures. International Journal of Cloud Computing. 2, (2013), 150–170. DOI:https://doi.org/10.1504/IJCC.2013.055265.Details
Keywords: MapReduce, cloud computing, desktop grids, hybrid infrastructures, bioinformatics, task scheduling, fault tolerance, scalable data management, data-intensive, scalable storage, massive data, concurrency control, volatility. Abstract: As map-reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing map-reduce frameworks, in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
BibTex
[78]Nicolae, B. 2013. Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal. IPDPS ’13: The 27th IEEE International Parallel and Distributed Processing Symposium (Boston, USA, 2013), 19–28.Details
Keywords: I/O load balancing, checkpoint restart, deduplication, fault tolerance, high performance computing, checkpointing Abstract: With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes with a novel collective memory contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless whether on the same or different node). We show both how to achieve such a global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such way that it minimizes I/O bottlenecks. Large scale experiments show significant reduction of storage space consumption and performance overhead compared to several state-of-art approaches, both in synthetic benchmarks and for a real life high performance computing application.
BibTex
[79]Nicolae, B. and Cappello, F. 2013. BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds. Journal of Parallel and Distributed Computing. 73, 5 (2013), 698–711. DOI:https://doi.org/10.1016/j.jpdc.2013.01.013.Details
Keywords: checkpoint restart, high performance computing, IaaS, cloud computing, snapshotting, fault tolerance, file system rollback, virtual disk Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
BibTex
[80]Nicolae, B. and Cappello, F. 2013. AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing. HPDC ’13: 22th International ACM Symposium on High-Performance Parallel and Distributed Computing (New York, USA, 2013), 155–166.Details
Keywords: scientific computing, high performance computing, cloud computing, fault tolerance, checkpoint restart, checkpointing, adaptive I/O Abstract: With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, thus making a restart from scratch unfeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications faces additional challenges in this context: not only does it need to minimize the performance overhead on the application due to checkpointing, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we launch the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large scale experiments show up to 60% improvement when compared to state-of-art checkpointing approaches, all this achievable with an extra memory requirement of less than 5% of the total application memory.
BibTex
[81]Nicolae, B. and Rafique, M. 2013. Leveraging Collaborative Content Exchange for On-Demand VM Multi-Deployments in IaaS Clouds. Euro-Par ’13: 19th International Euro-Par Conference on Parallel Processing (Aachen, Germany, 2013).Details
Keywords: IaaS, cloud computing, multi-deployment, VM provisioning, collaborative content exchange Abstract: A critical feature of IaaS cloud computing is the ability to deploy, boot and terminate large groups of interdependent VMs very quickly, which enables users to efficiently exploit the on-demand nature and elasticity of clouds even for large-scale deployments. A common pattern in this context is multi-deployment, i.e., using the same VM image template to instantiate a large number of VMs in parallel. A difficult trade-off arises in this context: access the content of the template on demand but slowly due to I/O bottlenecks or pre -broadcast the full contents of the template on the local storage of the hosting nodes to avoid such bottlenecks. Unlike previous approaches that are biased towards either of the extremes, we propose a scheme that augments on-demand access through a collaborative scheme in which the VMs aim to leverage the similarity of access pattern in order to anticipate future accesses and exchange chunks between themselves in an attempt to reduce contention to the remote storage where the VM image template is stored. Large scale experiments show improvements in read throughput between 30%-40% compared to on demand access schemes that perform in isolation.
BibTex
[82]Antoniu, G. et al. 2012. Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures. ICACON ’12 : 1st International IBM Cloud Academy Conference (Research Triangle Park, USA, 2012).Details
Keywords: MapReduce, cloud computing, data-intensive computing, hybrid infrastructures, BlobSeer, BitDew, Nimbus, HLCM, Grid’5000 Abstract: As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing Map-Reduce frameworks, in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
BibTex
[83]Gomez, L.B., Nicolae, B., Maruyama, N., Cappello, F. and Matsuoka, S. 2012. Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds. Euro-Par ’12: 18th International Euro-Par Conference on Parallel Processing (Rhodes, Greece, 2012).Details
Keywords: Cloud computing, IaaS, storage systems, virtual disk, erasure codes, Reed Solomon Abstract: With increasing interest among mainstream users to run HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often out of the financial capabilities of such users. Also, one of the critical needs of HPC applications is an efficient, scalable and persistent storage. Unfortunately, storage options proposed by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets such as the data generated by Checkpoint-Restart (CR). This local storage offers high throughput and scalability but it needs to be combined with persistency techniques, such as block replication or erasure codes. One of the main challenges that such techniques face is to minimize the overhead of performance and I/O resource utilization (i.e., storage space and bandwidth), while at the same time guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. This is achieved both by modeling and real life experimentation on hundreds of nodes.
BibTex
[84]Tran, V.-T., Nicolae, B. and Antoniu, G. 2012. Towards scalable array-oriented active storage: the Pyramid approach. SIGOPS Oper. Syst. Rev. 46, 1 (2012), 19–25. DOI:https://doi.org/10.1145/2146382.2146387.Details
Keywords: large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning Abstract: The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable to deal with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges. Experimental evaluation demonstrates substantial scalability improvements brought by Pyramid with respect to state-of-art approaches both in weak and strong scaling scenarios, with gains of 100% to 150%.
BibTex
[85]Nicolae, B. and Cappello, F. 2012. A hybrid local storage transfer scheme for live migration of I/O intensive workloads. HPDC ’12: 21th International ACM Symposium on High-Performance Parallel and Distributed Computing (Delft, The Netherlands, 2012), 85–96.Details
Keywords: virtualization, live migration, block migration, local storage transfer, I/O intensive workloads, IaaS, cloud computing, data intensive applications Abstract: Live migration of virtual machines (VMs) is key feature of virtualization that is extensively leveraged in IaaS cloud environments: it is the basic building block of several important features, such as load balancing, pro-active fault tolerance, power management, online maintenance, etc. While most live migration efforts concentrate on how to transfer the memory from source to destination during the migration process, comparatively little attention has been devoted to the transfer of storage. This problem is gaining increasing importance: due to performance reasons, virtual machines that run large-scale, data-intensive applications tend to rely on local storage, which poses a difficult challenge on live migration: it needs to handle storage transfer in addition to memory transfer. This paper proposes a memory-migration independent approach that addresses this challenge. It relies on a hybrid active push / prioritized prefetch strategy, which makes it highly resilient to rapid changes of disk state exhibited by I/O intensive workloads. At the same time, it is minimally intrusive in order to ensure a maximum of portability with a wide range of hypervisors. Large scale experiments that involve multiple simultaneous migrations of both synthetic benchmarks and a real scientific application show improvements of up to 10x faster migration time, 10x less bandwidth consumption and 8x less performance degradation over state-of-art.
BibTex
[86]Nicolae, B. 2011. On the Benefits of Transparent Compression for Cost-Effective Cloud Data Storage. Transactions on Large-Scale Data- and Knowledge-Centered Systems. 3, 3 (2011), 167–184. DOI:https://doi.org/10.1007/978-3-642-23074-5.Details
Keywords: IaaS, cloud computing, scalable storage, high throughput, compression Abstract: Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources that were actually used throughout the runtime of the VMs. This new model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users that need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization, but also from bandwidth consumption: in typical deployments, a large number of data transfers between the VMs and the persistent storage are performed, all under high performance requirements. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of slight computational overhead. We aim at reducing the storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service speciï¬cally designed to sustain a high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) of bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios.
BibTex
[87]Nicolae, B., Antoniu, G., Bouge, L., Moise, D. and Carpen-Amarie, A. 2011. BlobSeer: Next-generation data management for large scale infrastructures. J. Parallel Distrib. Comput. 71, 2 (2011), 169–184. DOI:https://doi.org/10.1016/j.jpdc.2010.08.004.Details
Keywords: scalable storage, data management, high throughput, versioning, decentralized metadata, concurrency control, data model, BlobSeer Abstract: As data volumes increase at a high speed in more and more application fields of science, engineering, information services, etc., the challenges posed by data-intensive computing gain an increasing importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond introduces additional issues for which scalable data management becomes an immediate need. This paper brings several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, both for data and metadata, that enable a high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, that we integrate as a storage backend in the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications: they demonstrate that applying the principles defended in our approach brings substantial benefits to data intensive applications.
BibTex
[88]Nicolae, B., Bresnahan, J., Keahey, K. and Antoniu, G. 2011. Going Back and Forth: Efficient Multideployment and Multisnapshotting on Clouds. HPDC ’11: 20th International ACM Symposium on High-Performance Parallel and Distributed Computing (San José, USA, 2011), 147–158.Details
Keywords: Nimbus, Grid’5000, cloud computing, BlobSeer, VM storage, IaaS, multi-snaphotting, multi-deployment, large scale provisioning Abstract: Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring resources by introducing a simple change: allowing users to lease computational resources from the cloud provider’s datacenter for a short time by deploying virtual machines (VMs) on those resources. This new model raises new challenges in the design and development of IaaS middleware. One of those challenges is the need to deploy a large number (hundreds or even thousands) of VM instances simultaneously. Once the VM instances are deployed, another challenge is to simultaneously take a snapshot of many images and transfer them to persistent storage to support management tasks, such as suspend-resume and migration. With datacenters growing at a fast rate and configurations becoming heterogeneous, it is important to enable efficient concurrent deployment and snapshotting that is at the same time hypervisor independent and ensures a maximum of compatibility with different configurations. This paper addresses these challenges by proposing a virtual file system specifically optimized for virtual machine image storage. It is based on a lazy transfer scheme coupled with object-versioning that demonstrates excellent performance in terms of consumed resources: execution time, network traffic and storage space. Experiments on hundreds of nodes demonstrate performance improvements in concurrent VM deployments ranging from a factor of 2 up to 25 over state-of-art, with storage and bandwidth utilization reduction of as much as 90%, while at the same time keeping comparable snapshotting performance, which comes with the added benefit of high portability.
BibTex
[89]Nicolae, B. and Cappello, F. 2011. BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots. SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis (Seattle, USA, 2011), 34–1.Details
Keywords: IaaS, cloud computing, large scale multi-deployment, checkpoint restart, fault tolerance, virtual disk snapshots, BlobSeer Abstract: Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space performance overhead of checkpoint-restart. We introduce a framework that combines checkpoint-restart protocols at guest level with virtual machine (VM) disk-image multi-snapshotting and multi-deployment at host level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.
BibTex
[90]Nicolae, B., Cappello, F. and Antoniu, G. 2011. Optimizing multi-deployment on clouds by means of self-adaptive prefetching. Euro-Par ’11: 17th International Euro-Par Conference on Parallel Processing (Bordeaux, France, 2011), 503–513.Details
Keywords: IaaS, cloud computing, large scale multi-deployment, provisioning, adaptive I/O Abstract: With Infrastructure-as-a-Service (IaaS) cloud economics getting increasingly complex and dynamic, resource costs can vary greatly over short periods of time. Therefore, a critical issue is the ability to deploy, boot and terminate VMs very quickly, which enables cloud users to exploit elasticity to find the optimal trade-off between the computational needs (number of resources, usage time) and budget constraints. This paper proposes an adaptive prefetching mechanism aiming to reduce the time required to simultaneously boot a large number of VM instances on clouds from the same initial VM image (multi-deployment). Our proposal does not require any foreknowledge of the exact access pattern. It dynamically adapts to it at run time, enabling the slower instances to learn from the experience of the faster ones. Since all booting instances typically access only a small part of the virtual image along almost the same pattern, the required data can be pre-fetched in the background. Large scale experiments under concurrency on hundreds of nodes show that introducing such a prefetching mechanism can achieve a speed-up of up to 35% when compared to simple on-demand fetching.
BibTex
[91]Tran, V.-T., Nicolae, B., Antoniu, G. and Bouge, L. 2011. Efficient support for MPI-I/O atomicity based on versioning. CCGRID ’11: 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (Newport Beach, USA, 2011), 514–523.Details
Keywords: large scale, storage, MPI-IO, atomicity, non-contiguous I/O, versioning Abstract: We consider the challenge of building data management systems that meet an important requirement of today’s data-intensive HPC applications: to provide a high I/O throughput while supporting highly concurrent data accesses. In this context, many applications rely on MPI-IO and require atomic, non-contiguous I/O operations that concurrently access shared data. In most existing implementations the atomicity requirement is often implemented through locking-based schemes, which have proven inefficient, especially for non-contiguous I/O. We claim that using a versioning-enabled storage backend has the potential to avoid expensive synchronization as exhibited by locking-based schemes, which is much more efficient. We describe a prototype implementation on top of ROMIO along this idea, and report on promising experimental results with standard MPI-IO benchmarks specifically designed to evaluate the performance of non-contiguous, overlapped I/O accesses under MPI atomicity guarantees.
BibTex
[92]Tran, V.-T., Nicolae, B., Antoniu, G. and Bouge, L. 2011. Pyramid: A large-scale array-oriented active storage system. LADIS ’11: Proceedings of the 5th Workshop on Large-Scale Distributed Systems and Middleware (Newport Beach, USA, 2011).Details
Keywords: large scale data management, multi-dimensional I/O, concurrency control, parallel array processing, versioning Abstract: The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable to deal with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
BibTex
[93]Suciu, A., Nicolae, B., Antoniu, G., Istvan, Z. and Szakats, I. 2010. Gathering Entropy at Large Scale with HAVEGE and BlobSeer. Automat. Comput. Appl. Math. 19, (2010), 3–11.Details
Keywords: random number generation, large scale, high throughput, high entropy, Blobseer, HAVEGE Abstract: Large sequences of random information are the foundation for a large class of applications: security, online gambling games, large scale Monte-Carlo simulations, etc. Many such applications are distributed and run on large-scale infrastructures such as clouds and grids. In this context, the random generator plays a crucial role: it needs to achieve a high entropy, a high throughput and last but not least a high degree of security. Several ways to generate high-entropy random information securely exist. For example, HAVEGE generates random information by gathering entropy from internal processor states of the machine where it is running alongside the user applications. These internal states are inheritably volatile and impossible to tamper with in a controlled fashion by the applications running on it. A centralized approach however does not scale to the high throughput requirement in a large scale setting. In order to do so, the output of several such instances needs to be combined into a single output stream. While this certainly has a good potential to solve the high throughput requirement, the way the outputs of the instances are combined in a single stream becomes a new weak link that can negatively impact all three requirements and therefore has to be addressed properly. In this paper we propose a distributed random number generator that efficiently addresses the aforementioned issue. We introduce a series of mechanisms to preserve a high entropy and degree of security for the combined output result and implement them on top of BlobSeer, a data storage service specifically designed to offer a high throughput in large-scale deployments even under heavy access concurrency. Large-scale experiments were performed on the G5K testbed and demonstrate substantial benefits for our approach.
BibTex
[94]Montes, J., Nicolae, B., Antoniu, G., Sanchez, A. and Perez, M. 2010. Using Global Behavior Modeling to Improve QoS in Cloud Data Storage Services. CloudCom ’10: Proc. 2nd IEEE International Conference on Cloud Computing Technology and Science (Indianapolis, USA, 2010), 304–311.Details
Keywords: QoS, cloud computing, data storage, behavioral modeling, throughput stabilization, GloBeM, BlobSeer, MapReduce Abstract: The cloud computing model aims to make large-scale data-intensive computing affordable even for users with limited financial resources, that cannot invest into expensive infrastructures necesssary to run them. In this context, MapReduce is emerging as a highly scalable programming paradigm that enables high-throughput data-intensive processing as a cloud service. Its performance is highly dependent on the underlying storage service, responsible to efficiently support massively parallel data accesses by guaranteeing a high throughput under heavy access concurrency. In this context, quality of service plays a crucial role: the storage service needs to sustain a stable throughput for each individual accesss, in addition to achieving a high aggregated throughput under concurrency. In this paper we propose a technique to address this problem using component monitoring, application-side feedback and behavior pattern analysis to automatically infer useful knowledge about the causes of poor quality of service and provide an easy way to reasonin about potential improvements. We apply our proposal to BlobSeer, a representative data storage service specifically designed to achieve high aggregated throughputs and show through extensive experimentation substantial improvements in the stability of individual data read accesses under MapReduce workloads.
BibTex
[95]Nicolae, B. 2010. High Throughput Data-Compression for Cloud Storage. Globe ’10: Proc. 3rd International Conference on Data Management in Grid and P2P Systems (Bilbao, Spain, 2010), 1–12.Details
Keywords: cloud computing, distributed data storage, high throughput, adaptive I/O, data intensive applications Abstract: As data volumes processed by large-scale distributed data-intensive applications grow at high-speed, an increasing I/O pressure is put on the underlying storage service, which is responsible for data management. One particularly difficult challenge, that the storage service has to deal with, is to sustain a high I/O throughput in spite of heavy access concurrency to massive data. In order to do so, massively parallel data transfers need to be performed, which invariably lead to a high bandwidth utilization. With the emergence of cloud computing, data intensive applications become attractive for a wide public that does not have the resources to maintain expensive large scale distributed infrastructures to run such applications. In this context, minimizing the storage space and bandwidth utilization is highly relevant, as these resources are paid for according to the consumption. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of slight computational overhead. We aim at reducing the storage space and bandwidth needs with minimal impact on I/O throughput when under heavy access concurrency. Our solution builds on BlobSeer, a highly parallel distributed data management service specifically designed to enable reading, writing and appending huge data sequences that are fragmented and distributed at a large scale. We demonstrate the benefits of our approach by performing extensive experimentations on the Grid’5000 testbed.
BibTex
[96]Nicolae, B. 2010. BlobSeer: Efficient Data Management for Data-Intensive Applications Distributed at Large-Scale. IPDPS ’10: Proc. 24th IEEE International Symposium on Parallel and Distributed Processing: Workshops and Phd Forum (Atlanta, USA, 2010), 1–4.Details
Keywords: data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management Abstract: Large-scale data-intensive applications are a class of applications that acquire and maintain massive datasets, while performing distributed computations on these datasets. In this context, a a key factor is the storage service responsible for the data management, as it has to efficiently deal with massively parallel data access in order to ensure scalability and performance for the whole system itself. This PhD thesis proposes BlobSeer, a data management service specifically designed to address the needs of large-scale data-intensive applications. Three key design factors: data striping, distributed metadata management and versioning-based concurrency control enable BlobSeer not only to provide efficient support for features commonly used to exploit data-level parallelism, but also enable exploring a set of new features that can be leveraged to further improve parallel data access. Extensive experimentations, both in scale and scope, on the Grid5000 testbed demonstrate clear benefits of using BlobSeer as the underlying storage for a variety of scenarios: data-intensive grid applications, grid file systems, MapReduce datacenters, desktop grids. Further work targets providing efficient storage solutions for cloud computing as well.
BibTex
[97]Nicolae, B. 2010. BlobSeer: Towards Efficient Data Storage Management for Large-Scale, Distributed Systems. University of Rennes 1.Details
Keywords: large scale data storage, cloud storage, versioning, decentralized metadata management, high throughput, heavy access concurrency Abstract: With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions in order to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable a high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, fault tolerance, etc. These results are used to build BlobSeer, an experimental prototype that is used to demonstrate both the theoretical benefits of the approach in synthetic benchmarks, as well as the practical benefits in real-life, applicative scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, as a quality-of-service enabled data storage service for cloud applications. Extensive experimentation on the Grid’5000 testbed shows that BlobSeer remains scalable and sustains a high throughput even under heavy access concurrency, outperforming by a large margin several state-of-art approaches.
BibTex
[98]Nicolae, B., Moise, D., Antoniu, G., Bouge, L. and Dorier, M. 2010. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications. IPDPS ’10: Proc. 24th IEEE International Parallel and Distributed Processing Symposium (Atlanta, USA, 2010), 1–12.Details
Keywords: large-scale distributed computing, data-intensive, MapReduce, distributed file systems, high throughput, heavy access concurrency, Hadoop, BlobSeer Abstract: Hadoop is a software framework supporting the Map-Reduce programming model. It relies on the Hadoop Distributed File System (HDFS) as its primary storage system. The efficiency of HDFS is crucial for the performance of Map-Reduce applications. We substitute the original HDFS layer of Hadoop with a new, concurrency-optimized data storage layer based on the BlobSeer data management service. Thereby, the efficiency of Hadoop is significantly improved for data-intensive Map-Reduce applications, which naturally exhibit a high degree of data access concurrency. Moreover, BlobSeer’s features (built-in versioning, its support for concurrent append operations) open the possibility for Hadoop to further extend its functionalities. We report on extensive experiments conducted on the Grid’5000 testbed. The results illustrate the benefits of our approach over the original HDFS-based implementation of Hadoop.
BibTex
[99]Nicolae, B., Antoniu, G. and Bouge, L. 2009. BlobSeer: How to Enable Efficient Versioning for Large Object Storage under Heavy Access Concurrency. EDBT/ICDT ’09 Workshops (Saint-Petersburg, Russia, 2009), 18–25.Details
Keywords: large scale data storage, concurrency control, versioning, decentralized metadata Abstract: To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing appli cations to efficiently cope with continuously growing, highly dis tributed data. This paper addresses the problem of efficiently stor ing and accessing very large binary data objects (blobs). It proposesan efficient versioning scheme allowing a large number of clients to concurrently read, write and append data to huge blobs that are fragmented and distributed at a very large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme, based on a distributed segment tree built on top of a Distributed Hash Table (DHT). Our approach has been implemented and experimented within our BlobSeer prototype on the Grid’5000 testbed, using up to 175 nodes.
BibTex
[100]Nicolae, B., Antoniu, G. and Bouge, L. 2009. Enabling High Data Throughput in Desktop Grids Through Decentralized Data and Metadata Management: The BlobSeer Approach. Euro-Par ’09 : Proc. 15th International Euro-Par Conference on Parallel Processing (Delft, The Netherlands, 2009), 404–416.Details
Keywords: desktop grids, distributed metadata management, data intensive applications, large data size, heavy access concurrency, high speed writes Abstract: Whereas traditional Desktop Grids rely on centralized servers for data management, some recent progress has been made to enable distributed, large in- put data, using to peer-to-peer (P2P) protocols and Content Distribution Networks (CDN). We make a step further and propose a generic, yet efficient data storage which enables the use of Desktop Grids for applications with high output data re- quirements, where the access grain and the access patterns may be random. Our solution builds on a blob management service enabling a large number of con- current clients to efficiently read/write and append huge data that are fragmented and distributed at a large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme using a distributed segment tree built on top of a Distributed Hash Table (DHT). The proposed approach has been imple- mented and its benefits have successfully been demonstrated within our BlobSeer prototype on the Grid’5000 testbed.
BibTex
[101]Tran, V.-T., Antoniu, G., Nicolae, B. and Bouge, L. 2009. Towards A Grid File System Based On A Large-Scale BLOB Management Service. Grids, P2P and Service Computing (Delft, The Netherlands, 2009), 7–19.Details
Keywords: data intensive applications, large scale, distributed data storage, high throughput, heavy access concurrency, versioning, efficient concurrency control, data striping, distributed metadata management Abstract: Large-scale data-intensive applications are a class of applications that acquire and maintain massive datasets, while performing distributed computations on these datasets. In this context, a a key factor is the storage service responsible for the data management, as it has to efficiently deal with massively parallel data access in order to ensure scalability and performance for the whole system itself. This PhD thesis proposes BlobSeer, a data management service specifically designed to address the needs of large-scale data-intensive applications. Three key design factors: data striping, distributed metadata management and versioning-based concurrency control enable BlobSeer not only to provide efficient support for features commonly used to exploit data-level parallelism, but also enable exploring a set of new features that can be leveraged to further improve parallel data access. Extensive experimentations, both in scale and scope, on the Grid5000 testbed demonstrate clear benefits of using BlobSeer as the underlying storage for a variety of scenarios: data-intensive grid applications, grid file systems, MapReduce datacenters, desktop grids. Further work targets providing efficient storage solutions for cloud computing as well.
BibTex
[102]Nicolae, B., Antoniu, G. and Bouge, L. 2008. Enabling lock-free concurrent fine-grain access to massive distributed data: Application to supernovae detection. Cluster ’08 : Proc. IEEE International Conference on Cluster Computing: Poster Session (Tsukuba, Japan, 2008), 310–315.Details
Keywords: large scale data management, object storage, huge file, versioning, heavy access concurrency Abstract: We consider the problem of efficiently managing massive data in a large-scale distributed environment. We consider data strings of size in the order of Terabytes, shared and accessed by concurrent clients. On each individual access, a segment of a string, of the order of Megabytes, is read or modified. Our goal is to provide the clients with efficient fine-grain access the data string as concurrently as possible, without locking the string itself. This issue is crucial in the context of applications in the field of astronomy, databases, data mining and multimedia. We illustrate these requiremens with the case of an application for searching supernovae. Our solution relies on distributed, RAM-based data storage, while leveraging a DHT-based, parallel metadata management scheme. The proposed architecture and algorithms have been validated through a software prototype and evaluated in a cluster environment.
BibTex
[103]Nicolae, B., Antoniu, G. and Bouge, L. 2008. Distributed Management of Massive Data: An Efficient Fine-Grain Data Access Scheme. VECPAR ’08 : Proc. 8th International Meeting on High Performance Computing for Computational Science (Toulouse, France, 2008), 532–543.Details
Keywords: high performance distributed computing, large scale data sharing, distributed data management, lock-free, fine grain access Abstract: This paper addresses the problem of efficiently storing and accessing massive data blocks in a large-scale distributed environment, while providing efficient fine-grain access to data subsets. This issue is crucial in the context of applications in the field of databases, data mining and multimedia. We propose a data sharing service based on distributed, RAM-based storage of data, while leveraging a DHT-based, natively parallel metadata management scheme. As opposed to the most commonly used grid storage infrastructures that provide mechanisms for explicit data localization and transfer, we provide a transparent access model, where data are accessed through global identifiers. Our proposal has been validated through a prototype implementation whose preliminary evaluation provides promising results.