Authors: S. Jayashree Ananth, Mr Naveen VS

Abstract: Cloud computing systems are dynamic and increasingly complex, so they require efficient fault management techniques to assure the availability and reliability of the services they provide. This paper introduces a deep learning-based fault prediction framework designed specifically for cloud environments. It combines Bidirectional Gated Recurrent Units (Bi-GRU), attention mechanisms, and Graph Neural Networks (GNNs). The proposed model captures temporal dependencies in cloud system telemetry data and also accounts for dependencies between microservices within the system. Testing on real-world datasets such as Google Cluster Trace and Alibaba Cluster data showed 96.2% prediction accuracy, 92.8% precision, and 91.5% recall, outperforming current fault prediction techniques by 8-12%. Additionally, its attention-based architecture lets the model provide explainability by highlighting the important temporal segments and the specific services at risk. The results show that the proposed approach enables proactive fault prevention techniques that reduce the SLA violation rate by 65% and cut recovery times by 55%.
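
The explainability claim rests on attention weights over time steps, which can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attention_pool`, the toy hidden states, and the scores are hypothetical. In a real model the scores would come from a learned layer applied to the Bi-GRU outputs. The sketch only shows how softmax-normalized weights produce a context vector and point to the most influential time step.

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax-normalize per-step scores and return (context, weights).

    hidden_states: list of equal-length vectors, one per time step
                   (stand-ins for Bi-GRU outputs; hypothetical values).
    scores:        one unnormalized relevance score per time step.
    """
    if not scores or len(hidden_states) != len(scores):
        raise ValueError("need one score per hidden state, and at least one step")
    # Subtract the maximum score for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights

# Toy example with three time steps of 2-dimensional hidden states (assumed data).
hidden_states = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1]]
scores = [0.5, 2.0, 0.1]
context, weights = attention_pool(hidden_states, scores)

# The time step with the largest weight is the one the model attended to most.
most_important_step = max(range(len(weights)), key=weights.__getitem__)
```

Reading the weights back, as in `most_important_step`, is one simple way such a model could flag which part of a telemetry window contributed most to a prediction. An analogous weighting over graph nodes could flag individual services, though the abstract does not specify how that is done.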

DOI: https://doi.org/10.5281/zenodo.19698844