Site Reliability Engineering

Runbook Automation

The automated execution of operational procedures and incident response workflows using predefined runbooks, scripts, or orchestration platforms to ensure standardized and repeatable responses across IT systems, integrations, or service monitoring environments.

Incident Archive

A centralized, searchable repository storing records, timelines, and postmortem analyses for all past incidents, supporting compliance, audit, and learning.

Fatigue Management

The process of reducing operational overload and alert fatigue for IT and SRE teams by optimizing on-call scheduling, alert thresholds, and workload distribution.

Alert Fatigue

A state where operators or engineers become desensitized due to excessive or repetitive alerts, risking delayed or missed responses to real incidents.

War Room Review Meeting

A retrospective meeting held after an incident war room to review actions taken, identify lessons learned, and assign follow-up items for process improvement.

Uptime Commitment

A contractual or operational guarantee specifying the minimum service availability, often expressed as a percentage (e.g., 99.9%) over a defined period.

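As a rough illustration of what such a percentage implies, the sketch below converts an availability commitment into the downtime it tolerates over a period. The function name `allowed_downtime` and the 30-day period are assumptions chosen for the example.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime an availability commitment tolerates over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is what the commitment permits.
    return period * (1 - availability_pct / 100)


# Hypothetical commitment: 99.9% over a 30-day period.
budget = allowed_downtime(99.9, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumptions a 99.9% commitment leaves roughly 43 minutes of downtime in 30 days; each additional nine cuts that figure by a factor of ten.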

Learning from Incidents

The process of analyzing incidents to identify root causes, systemic gaps, and actionable improvements for reliability, feeding back into SRE practices and runbooks.

SLO Violation

An event where the service's actual reliability or performance falls short of the agreed Service Level Objective, triggering error budget consumption and potential incident response.

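One way to picture error budget consumption is with request counts. The following is a minimal sketch assuming a request-based SLO; `error_budget_remaining` and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; a negative value means the SLO is violated."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good               # failures actually observed
    return 1 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 500 of them failed, 99.9% target.
remaining = error_budget_remaining(0.999, good=999_500, total=1_000_000)
print(f"{remaining:.0%} of the error budget remains")  # roughly half
```

A result below zero corresponds to the violation described above: more failures occurred than the objective allows.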

Uptime Tracking

Continuous monitoring and reporting of system or integration availability, often presented as dashboards and compliance metrics against SLAs and SLOs.

Post-Incident Analysis

A structured review conducted after a major incident to analyze root causes, assess response effectiveness, and document corrective actions for future prevention.

Failure Drill

A planned exercise simulating system or integration failure scenarios to validate incident response procedures, team readiness, and recovery playbooks.

Incident Drill

A simulated incident event conducted by IT operations or SRE teams to practice and evaluate readiness, coordination, and effectiveness of incident response procedures.

Reliability Audit

A formal assessment of systems, integrations, and processes to verify compliance with reliability standards, SLOs, and operational policies.

Response Rotation

A scheduled system for rotating on-call or incident response responsibilities among team members to ensure fair distribution of operational workload.

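A simple round-robin scheme is one way such a rotation could be computed. The sketch below assumes fixed-length shifts and a hypothetical roster; `on_call_for` is an illustrative name, not the API of any scheduling tool.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day`, rotating through `roster` every `shift_days`."""
    if not roster:
        raise ValueError("roster must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    shift_index = (day - start).days // shift_days  # completed shifts since start
    return roster[shift_index % len(roster)]        # wrap around the roster


# Hypothetical three-person roster with weekly shifts starting 2024-01-01.
roster = ["amira", "bo", "chen"]
print(on_call_for(date(2024, 1, 10), roster, start=date(2024, 1, 1)))
```

Real rotations usually also handle overrides, holidays, and time zones, which this sketch deliberately leaves out.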

Incident Rotation

A scheduled process in site reliability and IT operations where on-call responsibilities for incident response are rotated among team members according to a predefined roster or policy.

Pager Handoff

The formal process of transferring incident response and alert monitoring responsibilities to the next scheduled on-call engineer, typically at shift change.

Service Escalation

A formal process in IT operations and incident management by which unresolved service issues or incidents are elevated to higher support tiers, subject-matter experts, or management levels according to escalation policy.

Service Escalation

A formal incident management process by which unresolved issues are elevated to higher-tier support or management according to predefined escalation policy.

SLA Coverage

The proportion of services or integrations governed by a formal Service Level Agreement, typically tracked for compliance and contractual obligations.

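The proportion can be computed directly from a service inventory. This is a minimal sketch assuming a hypothetical mapping of service names to whether a formal SLA exists; `sla_coverage` is an illustrative name.

```python
def sla_coverage(has_sla: dict[str, bool]) -> float:
    """Share of services (0.0 to 1.0) that are governed by a formal SLA."""
    if not has_sla:
        return 0.0  # an empty inventory has no coverage to report
    return sum(has_sla.values()) / len(has_sla)


# Hypothetical inventory: three of four services have an SLA.
services = {"checkout": True, "search": True, "internal-wiki": False, "billing": True}
print(f"SLA coverage: {sla_coverage(services):.0%}")
```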

Toil Reduction

The ongoing effort to minimize repetitive, manual, and low-value operational work in site reliability engineering by automating tasks and streamlining workflows.

Blameless Culture

An SRE and incident management principle that emphasizes learning and systemic improvement over assigning personal blame for incidents or failures.

On-Call Schedule

A published schedule that allocates on-call duties, support shifts, and escalation responsibilities among operations and SRE team members.

Capacity Event

An incident or alert triggered by a system, application, or integration exceeding its predefined resource limits, such as CPU, memory, or connection thresholds.

Chaos Injection

The deliberate introduction of faults, errors, or failures into a system to test and validate its resilience, observability, and incident response mechanisms.

SLO Breach

An event or period in which actual service performance falls below the established Service Level Objective, often triggering incident response and escalation.

Mitigation Plan

A documented set of actions designed to reduce the impact or likelihood of future incidents, typically produced during incident response or postmortem analysis.

Reliability Score

A quantitative indicator representing the operational reliability of a service, integration, or system, often based on uptime, incident frequency, and SLO compliance.

Response Playbook

A detailed set of documented procedures and best practices for handling specific incident types or operational anomalies in IT and integration environments.

Escalation Ladder

A structured hierarchy that defines the sequence and roles for escalating unresolved incidents to higher support or management levels within IT operations or SRE organizations.

Paging Policy

A formal set of rules and expectations in site reliability engineering that governs the assignment, escalation, and response to pager or alert notifications for operational incidents.

Mitigation Workflow

A defined sequence of steps and automated tasks for responding to, containing, and resolving incidents, usually documented in runbooks or incident playbooks.

Incident War Room

A dedicated virtual or physical space where stakeholders, engineers, and incident commanders collaborate in real time to coordinate resolution of major incidents.

Chaos Simulation

The controlled practice of deliberately introducing faults or failures into production-like environments to test system resilience, observability, and incident response procedures.

Failure Review

A collaborative meeting or process to analyze a system or service failure, uncover contributing factors, and recommend improvements for reliability.

Runbook Reference

A specific pointer or hyperlink to a documented runbook procedure for rapid access during incident diagnosis or resolution.

Escalation Path

A predefined path or sequence of roles to which incidents are escalated based on severity, ensuring prompt resolution and accountability.

Escalation Path

The documented sequence of escalation steps, contacts, and actions to be followed as an incident increases in severity or complexity.

Incident Recorder

A tool or service for real-time recording and tracking of incident events, communications, actions, and status updates during active outages.

Impact Matrix

A visual or tabular tool mapping the severity and probability of incidents or risks, guiding prioritization in response and remediation planning.

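In tabular form, such a matrix is often reduced to a score per risk. The sketch below assumes a three-level scale and a severity-times-probability score, one common convention among several; the risks listed are invented for illustration.

```python
# Assumed three-level ordinal scale shared by severity and probability.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk_score(severity: str, probability: str) -> int:
    """Combine severity and probability into a single priority score (1 to 9)."""
    return LEVELS[severity] * LEVELS[probability]


# Hypothetical risks as (name, severity, probability).
risks = [
    ("certificate expiry", "high", "low"),
    ("disk saturation", "medium", "high"),
    ("regional outage", "high", "medium"),
]

# Highest score first; ties keep their original order.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)
for name, severity, probability in ranked:
    print(f"{risk_score(severity, probability)}  {name}")
```

The ranked list is what guides prioritization: items at the top are addressed first in response and remediation planning.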

Escalation Matrix

A documented framework detailing incident severity levels, responsible roles, and the stepwise escalation path for critical incidents across support tiers.

SRE Calibration

A formal review and alignment process where SRE teams adjust monitoring, incident response, and reliability metrics to ensure consistent standards across services.

Burn Rate

The rate at which error budget or SLO allowance is consumed over time, indicating the pace of reliability degradation in SRE practice and the urgency of corrective actions.

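A commonly used formulation divides the observed error rate by the error rate the SLO allows, so that a burn rate of 1 spends the budget exactly over the SLO window. The sketch below follows that convention; the function names, the 30-day window, and the sample rates are assumptions.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    allowed_error_rate = 1 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1 to leave an error budget")
    return observed_error_rate / allowed_error_rate


def hours_until_exhausted(rate: float, window_hours: float) -> float:
    """Time for a full budget to run out if the current burn rate persists."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_hours / rate


# Hypothetical service: 0.5% of requests failing against a 99.9% SLO over 30 days.
rate = burn_rate(0.005, 0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in about {hours_until_exhausted(rate, 30 * 24):.0f} hours")
```

The higher the burn rate, the shorter the time before the budget is gone, which is why it signals the urgency of corrective action.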

Runbook Library

A centralized repository containing versioned, actionable runbooks for routine tasks, troubleshooting, and incident response in IT operations and SRE.

Health Budget

A quantitative limit representing allowable error or degradation in system health, used in SRE for monitoring, alerting, and balancing feature deployment risk.

Latency Budget

The maximum allowable time allocated for a transaction, operation, or request in an IT system before breaching service-level objectives or user expectations.

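In code, a latency budget is often tracked as a deadline that each step of a request checks before doing more work. The class below is a minimal sketch of that idea; `LatencyBudget`, its methods, and the 300 ms figure are hypothetical.

```python
import time
from typing import Callable


class LatencyBudget:
    """Tracks how much of a per-request time allowance is left."""

    def __init__(self, budget_ms: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock  # injectable so the budget can be tested without sleeping
        self._deadline = clock() + budget_ms / 1000

    def remaining_ms(self) -> float:
        """Milliseconds left before the budget is spent (never negative)."""
        return max(0.0, (self._deadline - self._clock()) * 1000)

    def exhausted(self) -> bool:
        """True once the allotted time has fully elapsed."""
        return self._clock() >= self._deadline


# Hypothetical request with a 300 ms budget.
budget = LatencyBudget(budget_ms=300)
if not budget.exhausted():
    print(f"{budget.remaining_ms():.0f} ms left for downstream calls")
```

Passing the remaining time to downstream calls as their timeout is one way to keep the whole request inside its allowance.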

On-Call Shift

A defined time period during which a designated team member is responsible for responding to alerts and incidents as the primary on-call engineer.

Stability Target

A quantifiable goal specifying the desired level of system stability, typically measured by the frequency and impact of failures, for SRE and integration operations.

Availability Target

A specific level of service uptime expressed as a percentage (e.g., 99.99%), used as an operational or contractual benchmark for system reliability.

Reliability Target

A quantifiable goal set for the availability, performance, or correctness of a system or integration, typically expressed as a percentage (e.g., 99.9% uptime) to guide engineering and operational decisions.
