WarpFleet / documentation

Playbook catalog

Recovery library: 270 playbooks. Each is a YAML file with prechecks, steps, result verification, and a rollback policy; only commands from the playbook's own allowlist are executed.
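As an illustration of the structure described above, here is a minimal sketch of how a playbook could be checked before execution. The playbook is shown as the dict a YAML parser would produce. The field names (`prechecks`, `steps`, `verify`, `rollback`, `allowlist`), the commands, and the rule that the allowlist matches on the executable name are all assumptions made for this example, not the actual WarpFleet schema.

```python
import shlex

# Hypothetical playbook, shown as the dict a YAML parser would produce.
# Every field name and command below is an illustrative assumption.
PLAYBOOK = {
    "problem_class": "kube_kubelet_down_local",
    "risk": "medium",
    "allowlist": ["systemctl", "journalctl"],
    "prechecks": ["systemctl is-enabled kubelet"],
    "steps": ["systemctl restart kubelet"],
    "verify": ["systemctl is-active kubelet"],
    "rollback": {"policy": "none"},
}

REQUIRED_SECTIONS = ("prechecks", "steps", "verify", "rollback", "allowlist")


def missing_sections(playbook):
    """Return the required sections that the playbook does not define."""
    return [name for name in REQUIRED_SECTIONS if name not in playbook]


def is_allowed(command, allowlist):
    """True if the command's executable is named in the allowlist."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in allowlist


def rejected_commands(playbook):
    """Collect every precheck, step, and verify command outside the allowlist."""
    allowlist = set(playbook.get("allowlist", []))
    commands = []
    for section in ("prechecks", "steps", "verify"):
        commands.extend(playbook.get(section, []))
    return [cmd for cmd in commands if not is_allowed(cmd, allowlist)]
```

In this sketch a playbook would be refused when `missing_sections` or `rejected_commands` returns a non-empty list. A real allowlist may well match full command lines or argument patterns rather than only the executable name.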

Risk: low runs automatically; medium runs automatically subject to policy; high runs only after operator approval.
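The risk rule above can be sketched as a small gating function. The names `Risk`, `may_run`, `policy_allows_auto`, and `operator_approved` are hypothetical, and reading "subject to policy" as a single boolean flag is an assumption; the real policy model is not described here.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def may_run(risk, policy_allows_auto=False, operator_approved=False):
    """Decide whether a playbook of the given risk level may execute.

    Assumed reading of the rule: low always runs, medium runs only
    when policy permits automatic execution, high runs only after
    an operator has approved it.
    """
    if risk is Risk.LOW:
        return True
    if risk is Risk.MEDIUM:
        return policy_allows_auto
    return operator_approved
```

For example, `may_run(Risk.HIGH, policy_allows_auto=True)` is `False` in this sketch: a permissive policy does not replace operator approval for high-risk playbooks.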

Kubernetes and containers (50)

Problem class | What it does | Risk
container_image_bloat | Prune unused container images and old exited containers | medium
service_down | Recover containerd | high
service_down | Recover containerd on Oracle Linux | high
coredns_unhealthy | CoreDNS: clear local cache + force pod restart | medium
coredns_unhealthy | Rollout-restart CoreDNS deployment | medium
etcd_alarm_active | Disarm etcd NOSPACE alarm after defrag | medium
etcd_db_pressure | Defragment etcd backend and disarm NOSPACE alarm | medium
etcd_member_down | Restart local etcd member | medium
k3s_disabled_addon_conflict | Diagnose and disable conflicting k3s addon | medium
k3s_agent_unable_to_join | Diagnose k3s-agent join failure and collect evidence | low
k3s_kubeconfig_drift | Restore k3s kubeconfig after CA rotation or cluster restore | high
k3s_server_token_corrupt | K3s: regenerate corrupted server token and restart cluster | high
k3s_sqlite_db_bloat | Compact k3s embedded SQLite backend (VACUUM) | high
kubernetes_api_connectivity_failure | Diagnose and repair Kubernetes API connectivity failure | high
kube_apiserver_crashloop | Reload kubelet to recover stuck kube-apiserver static pod | high
kube_apiserver_crashloop | Surgical restart of crash-looping kube-apiserver static pod via crictl | medium
kube_controller_manager_crashloop | Surgical restart of crash-looping kube-controller-manager via crictl | medium
kube_deployment_progress_deadline_exceeded | Undo stuck Kubernetes deployment rollout | high
kube_etcd_compaction_lag | Compact etcd at current revision and defragment (HIGH RISK) | high
kube_evicted_pods_accumulated | Delete pods stuck in Failed/Evicted phase across the cluster | low
kube_hpa_stuck | Diagnose HPA stuck: ScalingActive=False or metrics absent | low
kube_image_pull_backoff | Diagnose ImagePullBackOff (manual_only, operator review) | low
kube_pod_init_container_failed | Diagnose imagePullSecrets behind ImagePullBackOff (read-only) | low
kube_ingress_controller_unhealthy | Rollout-restart ingress controller (ingress-nginx or traefik) | medium
kube_kubelet_down_local | Restart local kubelet | medium
kube_namespace_terminating_stuck | Force-strip finalizers off a stuck Terminating Namespace (HIGH RISK) | high
kube_node_disk_pressure_local | Clean container logs and images to relieve node disk pressure | high
kube_node_not_ready | Diagnose Kubernetes node NotReady state | low
kube_node_pressure | Diagnose Kubernetes node memory/CPU pressure conditions | low
kube_oom_killed_container | Diagnose containers killed by OOM | low
kube_pod_liveness_probe_failed | Force-delete a Pod stuck in Terminating (grace-period=0) | high
kube_pod_pending_unschedulable | Advisor: Kubernetes pod stuck Pending, scheduler can't place | low
kube_pod_readiness_probe_failed | Diagnose Kubernetes pod readiness probe failures | low
kube_proxy_unhealthy | Recover kube-proxy daemonset via rollout restart | medium
kube_pv_orphaned | Release orphaned PV (reclaimPolicy=Retain) for re-binding | medium
kube_pvc_pending | Diagnose PVC stuck in Pending state | low
kube_replicaset_orphaned | Delete orphaned ReplicaSets with zero desired/ready replicas | medium
kube_scheduler_crashloop | Surgical restart of crash-looping kube-scheduler static pod via crictl | medium
kube_service_endpoint_empty | Diagnose empty Service endpoints (selector vs ready pods) | low
kube_statefulset_rollout_stuck | Diagnose StatefulSet rollout stuck (ordered update wedged) | low
kube_static_pod_crash | Restore crashed static pod manifest and restart kubelet | high
kube_static_pod_surgical_restart | Surgically restart a crashed static pod by moving its manifest | high
cert_expired | Emergency kubeadm cert renewal when API server is already down (CRITICAL) | high
cert_expiry_soon | kubeadm-style cert renewal with PKI backup (HIGH RISK) | high
service_down | Recover kubelet on ALT Linux / Astra Linux | medium
service_down | Recover kubelet on Oracle Linux | high
service_down | Recover kubelet on Ubuntu / Debian | medium
kubernetes_cni_failure | Recover standard Kubernetes CNI plugins on Oracle Linux x86_64 | high
kubernetes_cni_failure | Recover standard Kubernetes CNI plugins on Oracle Linux arm64 | high
kubernetes_control_plane_pressure | Diagnose Kubernetes control-plane pressure on Oracle Linux | medium

PostgreSQL (20)

Problem class | What it does | Risk
pg_extension_outdated | Update outdated PostgreSQL extension to default version | low
pg_stat_statements_bloat | Reset pg_stat_statements to reclaim shared memory | low
pg_table_autovacuum_disabled | Re-enable autovacuum for table that has it disabled | high
pg_2pc_stuck | Roll back orphan prepared transactions older than the freeze horizon | high
pg_archive_command_failing | Capture pg_stat_archiver evidence when archive_command keeps failing | low
pg_table_bloat | Run VACUUM ANALYZE on a bloated table to relieve autovacuum debt | medium
pg_autovacuum_disabled | Re-enable PostgreSQL autovacuum (cluster-wide) | high
pg_buffer_hit_ratio_low | Snapshot shared-buffers hit ratio and sizing for operator review | low
pg_checkpoint_too_frequent | Snapshot checkpoint tuning parameters when forced checkpoints dominate | low
pg_idle_in_txn | Terminate long-running idle-in-transaction sessions | high
pg_index_bloat | Snapshot bloated indexes for operator-scheduled REINDEX CONCURRENTLY | low
pg_lock_contention | Capture lock-wait tree when many backends are blocked on locks | low
pg_long_running_query_critical | Cancel active queries running longer than 30 minutes | high
pg_max_connections_reached | Advisor: PostgreSQL near max_connections, recommend pgbouncer / app pooling | low
pg_replica_idle_disconnect | Advisor: PostgreSQL streaming replica idle/disconnected | low
pg_replication_lag | Resume paused WAL replay on a lagging PostgreSQL standby | medium
pg_replication_slot_orphaned | Drop orphaned replication slots that pin WAL beyond 1 GiB | high
pg_temp_files_high | Advisor: PostgreSQL spilling to temp files, work_mem likely too small | low
pg_wal_fill_critical | Reclaim space on a near-full pg_wal partition | high
pg_xid_wraparound_risk | VACUUM FREEZE the database with the oldest XID horizon | high

MySQL / MariaDB (9)

Problem class | What it does | Risk
mariadb_galera_node_non_primary | Bootstrap MariaDB Galera primary node from non-Primary state | high
mariadb_health_failure | Recover MariaDB service health | high
mysql_health_failure | Recover MySQL service health | high
mysql_innodb_log_full | Capture InnoDB log stall evidence and current tuning values | low
mysql_long_running_query | Kill the top 3 long-running MySQL queries | high
mysql_max_connections_reached | Kill idle MySQL connections older than the threshold | medium
mysql_replication_io_thread_down | Restart MySQL replication IO thread | medium
mysql_replication_sql_thread_down | Skip one event and resume MySQL replication SQL thread | high
mysql_replication_lag_high | Diagnose MySQL/MariaDB replication lag and collect status | low

ClickHouse (6)

Problem class | What it does | Risk
clickhouse_max_memory_usage_exceeded | Relieve ClickHouse memory pressure by dropping server-side caches | medium
clickhouse_disk_full_data_dir | Drop the oldest ClickHouse partition to relieve disk pressure | high
clickhouse_mutations_stuck | Kill a stuck ClickHouse mutation | high
clickhouse_too_many_parts | OPTIMIZE TABLE FINAL on the most fragmented ClickHouse table | high
clickhouse_replica_max_queue_size | Unwedge a stuck ClickHouse replication queue by reinitializing the replica | medium
clickhouse_zookeeper_session_expired | Restart the ClickHouse replica session to recover Zookeeper connectivity | medium

Elasticsearch / Redis / MongoDB (3)

Problem class | What it does | Risk
elasticsearch_circuit_breaker_tripped | Raise ES request and parent circuit-breaker limits (transient) | medium
elasticsearch_too_many_open_files | Raise Elasticsearch open-file limit via systemd drop-in | high
elasticsearch_unassigned_shards | Reroute unassigned ES shards (after consulting allocation/explain) | medium

Web servers and proxies (35)

Problem class | What it does | Risk
apache_modproxy_backend_failed | Recover Apache mod_proxy 502/503 by restarting local upstream | medium
apache_config_syntax_error | Diagnose Apache config syntax error (apache2ctl/httpd/httpd2 -t) | low
apache_excessive_500 | Diagnose Apache 500-flood (upstream app sick) | low
apache_health_failure | Recover Apache service health | medium
apache_health_failure | Recover Apache (httpd2) service health on ALT Linux | medium
apache_health_failure | Recover Apache service health on Astra Linux | medium
apache_health_failure | Recover Apache (httpd) service health on Oracle Linux | medium
apache_health_failure | Recover Apache (httpd-prefork) service health on openSUSE / SLES | medium
apache_keepalive_too_high | Diagnose Apache KeepAlive holding workers idle | low
apache_log_growing_fast | Force logrotate when Apache access.log grows >100 MB/min (DDoS / bot flood) | medium
apache_module_missing | Diagnose Apache LoadModule references whose .so is not on disk | low
apache_worker_mpm_overloaded | Restart Apache after MPM worker exhaustion / OOM | medium
apache_php_fpm_socket_unavailable | Recover Apache mod_proxy_fcgi when PHP-FPM socket is missing or unreadable | medium
apache_rate_limiting_excessive | Diagnose Apache mod_security / mod_evasive over-blocking | low
apache_segfault | Diagnose Apache child SIGSEGV (loaded module crash) | low
apache_ssl_cert_expiry_soon | Renew Let's Encrypt SSL cert for Apache and reload | high
cert_chain_broken | Capture broken TLS chain evidence for operator review | low
haproxy_health_failure | Recover HAProxy service health | medium
nginx_excessive_502_503 | Recover from excessive nginx 502/503 errors | medium
nginx_config_syntax_error | Diagnose nginx config syntax error (no auto-fix) | low
nginx_health_failure | Recover nginx health | medium
nginx_health_failure | Recover nginx health on ALT Linux | medium
nginx_health_failure | Recover nginx health on Astra Linux | medium
nginx_access_log_growing_fast | Force rotate nginx logs (USR1 + logrotate force, no nginx restart) | medium
nginx_no_active_listening | Diagnose and reload nginx when master has no listening socket | high
nginx_worker_oom | Restart nginx workers after OOM kill (preserve connections when possible) | medium
nginx_rate_limit_exceeded | Diagnose nginx limit_req saturation (no auto-tune) | low
nginx_ssl_cert_already_expired | Renew expired nginx TLS cert via certbot and reload nginx | high
nginx_ssl_cert_expiry_soon | Renew Let's Encrypt cert for nginx and reload | medium
nginx_upstream_failed | Diagnose nginx upstream failures (read-only, multi-tool) | low
nginx_worker_too_many_open_files | Raise nginx worker_rlimit_nofile after fd exhaustion | medium
php_fpm_health_failure | Recover PHP-FPM service health | high
tomcat_connector_threads_exhausted | Double Tomcat Connector maxThreads (capped at 1000) | high
tomcat_jdbc_connection_pool_exhausted | Double Tomcat JDBC connection pool size (maxActive / maxTotal) | high
tomcat_session_storage_full | Clear Tomcat session storage + stale temp files | medium

Java / JVM (15)

Problem class | What it does | Risk
java_class_loader_leak | Capture classloader leak evidence via jcmd VM.metaspace | low
java_deadlock_detected | Capture thread dump then restart Java service in deadlock | high
java_dns_caching_stale | Fix infinite JVM DNS cache by setting networkaddress.cache.ttl | medium
java_gc_overhead_limit | Rotate Java GC log (mv + runtime VM.log reconfig on JDK11+) | medium
java_gc_overhead_limit | Take heap dump then restart Java service hitting GC overhead limit | high
java_heap_oom | Collect Java heap dump for offline analysis | low
java_heap_oom | Capture Java heap dump with disk-space pre-flight and size-aware strategy | medium
java_high_thread_count | Collect read-only Java diagnostic bundle (threads + heap histogram + GC stats) | low
java_jfr_recording_stuck | Stop stuck JFR recording and dump flight data | medium
java_metaspace_oom | Diagnose Metaspace OOM and capture JVM flag evidence | low
java_native_memory_leak | Enable Native Memory Tracking + 5min diff snapshot | low
java_old_gen_full_consistent | Capture heap dump before restarting JVM with consistent Old Gen OOM | medium
java_safepoint_long_pause | Capture safepoint pause evidence and JIT counters | low
java_thread_blocked_on_lock | Capture thread dump for JVM with lock contention | low
java_truststore_corrupted | Rebuild $JAVA_HOME/lib/security/cacerts from system CA bundle | high

Network and DNS (23)

Problem class | What it does | Risk
bind9_health_failure | Recover BIND9 service health | medium
bind9_health_failure | Recover BIND (named) service health on Oracle Linux | medium
dns_dnssec_failure | Capture DNSSEC validation failure evidence | low
dns_resolution_failure | Revalidate DNS stack | medium
dns_resolution_failure | Revalidate DNS stack on ALT Linux | medium
dns_resolution_failure | Revalidate DNS stack on Astra Linux | medium
dns_resolution_failure | Revalidate DNS stack on Oracle Linux | medium
conntrack_table_full | Bump nf_conntrack_max to avoid table fillup drops | medium
conntrack_table_full | Bump nf_conntrack_max on ALT Linux | medium
conntrack_table_full | Bump nf_conntrack_max on Astra Linux | medium
network_default_route_lost | Restart ALT etcnet to restore lost default route | high
network_default_route_lost | Restart Astra networking to restore lost default route | high
network_default_route_lost | Restart systemd-networkd to restore lost default route | high
network_default_route_lost | Reload NetworkManager to restore lost default route | high
network_iface_errors | Diagnose NIC errors via ethtool and interface statistics | low
network_link_flap | Reset a flapping network interface (link down/up cycle) | high
network_link_flap | Reset a flapping network interface on ALT Linux | high
network_link_flap | Reset a flapping network interface on Astra Linux | high
network_mtu_blackhole | Lower MTU on affected interface to resolve PMTU blackhole | medium
network_route_failure | Recheck network route | medium
network_route_failure | Recheck network route on ALT Linux | medium
network_route_failure | Recheck network route on Astra Linux | medium
network_route_failure | Recheck network route on Oracle Linux | medium

Disks, filesystems, and storage (15)

Problem class | What it does | Risk
cifs_mount_credential_failed | Refresh CIFS credential file and remount failed network share | high
disk_full | Cleanup disk pressure | medium
disk_full | Cleanup disk pressure on ALT Linux | medium
disk_full | Cleanup disk pressure on Astra Linux | medium
fs_corruption_marker | Diagnose filesystem corruption markers (capture evidence, no fix) | low
fs_quota_exceeded | Report filesystem quota exceeded: identify user and paths | low
inode_exhaustion | Cleanup inode pressure | medium
inode_exhaustion | Cleanup inode pressure on ALT Linux | medium
inode_exhaustion | Cleanup inode pressure on Astra Linux | medium
io_wait_sustained | Capture top IO-wait processes and disk stats (diagnostic only) | low
lvm_volume_inactive | Reactivate inactive LVM logical volume | medium
lvm_metadata_damage | Restore LVM metadata from archive backup | high
lvm_snapshot_full | Diagnose nearly-full LVM snapshot (log state, no auto-extend/auto-merge) | medium
lvm_thin_pool_full | Extend an LVM thin pool when the parent VG has free PE | medium
raid_array_recoverable | Re-add a removed mdadm array member | medium

Memory and processes (13)

Problem class | What it does | Risk
cgroup_oom_detected | Restart workload after cgroup OOM kill | high
core_service_inactive | Start auth/session service after dependency-cascade outage | high
core_service_inactive | Start critical network service after dependency-cascade outage | medium
cpu_pressure | Observe CPU pressure (no destructive action) | low
fd_exhaustion_process | Advisory: process near rlimit-NOFILE cap | low
fd_exhaustion_system | Raise fs.file-max via sysctl drop-in | medium
oom_detected | Recover from OOM pressure | high
oom_detected | Recover from OOM pressure on ALT Linux | high
oom_detected | Recover from OOM pressure on Astra Linux | high
oom_victim_recurring | Advisory: recurring OOM victim needs MemoryHigh tuning | low
swap_exhaustion | Advisory: swap entry saturated, OOM imminent | low
swap_thrashing | Relieve swap thrashing via swapoff/swapon cycle (capture top RAM users first) | high
zombie_process_buildup | Reap zombie processes by signalling their parents (SIGCHLD) | medium

Services and systemd (30)

Problem class | What it does | Risk
arp_table_overflow | Relieve ARP table overflow by raising neighbour GC thresholds | medium
boot_disk_full | Purge old kernels from /boot (Debian/Ubuntu/Astra) | high
boot_disk_full | Purge old kernels from /boot (Rocky/Alma/Oracle/openSUSE) | high
btrfs_scrub_errors | Start btrfs scrub to detect and repair filesystem errors | medium
system_cert_bundle_corrupt | Refresh system CA bundle (Debian update-ca-certificates / RHEL update-ca-trust) | medium
config_drift | Manual review of changed critical config file | low
config_missing | Restore missing critical config file from backup | medium
kube_containerd_down_local | Recover container runtime when kubelet sees no pods (multi-tool) | high
kube_daemonset_pod_crash | Rollout-restart kube-system DaemonSet | medium
dstate_processes | Capture D-state (uninterruptible IO wait) process evidence | low
filesystem_read_only_remount | Remount a kernel-forced read-only filesystem back to read-write | high
firewalld_reload_failure | Reload firewalld safely; restore zone config on failure | medium
transparent_hugepages_pressure | Diagnose transparent/explicit hugepages pressure (log state, no auto-tune) | low
journal_corrupted | Rotate corrupt active journal | low
nfs_mount_stale | Recover stale NFS mount via lazy unmount + remount | medium
pam_auth_failure | Refresh sssd cache + clear pam_tally2 lockouts | medium
sssd_realm_failure | Recover PBIS domain integration on Oracle Linux | high
postgresql_health_failure | Recover PostgreSQL service health | high
process_priority_misuse | Renice high-priority processes abusing CPU scheduling | medium
pvc_stuck | Diagnose stuck PersistentVolumeClaim and suggest remediation | low
resource_undersized | Recommendation: node hardware envelope is too small for its workload | low
service_crash_loop | Recover crash looping service | high
service_crash_loop | Recover crash looping service on ALT Linux | high
service_crash_loop | Recover crash looping service on Astra Linux | high
service_down | Capture failure evidence then restart failed service | high
service_down | Restart failed service on ALT Linux | high
service_down | Capture failure evidence then restart failed service on Astra Linux | high
tcp_syn_flood | Mitigate SYN flood by enabling tcp_syncookies + bumping backlog | high
tmpfs_full | Find largest files on full tmpfs and clean stale temporaries | medium
sssd_realm_failure | Recover winbind domain integration on Oracle Linux | high

Packages and updates (10)

Problem class | What it does | Risk
dnf_module_conflict | Detect dnf module conflict; recommend module reset (no auto-fix) | low
package_manager_failure | Recover package manager | medium
package_manager_failure | Recover package manager (ALT Sisyphus) | medium
package_manager_failure | Recover package manager on Astra Linux | medium
package_manager_failure | Recover package manager on Oracle Linux | medium
package_manager_failure | Recover package manager (SUSE / zypper) | medium
package_state_inconsistent | dpkg --configure -a after half-applied apt upgrade (Debian/Ubuntu/Astra) | medium
package_state_inconsistent | dnf check + rpm verify after partial transaction (Rocky/Alma/Oracle/openSUSE) | medium
repository_mirror_failure | Advisory: package repository mirror unreachable | low
rhsm_subscription_unknown | Detect paid Red Hat subscription failure (alert only) | low

Active Directory / domain (11)

Problem class | What it does | Risk
ad_clock_skew_kerberos | Force chrony makestep to recover Kerberos auth after clock skew | medium
ad_clock_skew_kerberos | Force chrony makestep on ALT Linux to recover Kerberos auth | medium
ad_clock_skew_kerberos | Force chrony makestep on Astra Linux to recover Kerberos auth | medium
ad_clock_skew_kerberos | Restart systemd-timesyncd to recover Kerberos auth after clock skew | medium
ad_dns_srv_missing | Repair /etc/resolv.conf to restore AD DNS SRV record resolution | medium
ad_keytab_corrupt | Re-fetch Kerberos keytab for AD-joined host | high
ad_machine_password_expired | Re-join Active Directory domain after machine password expiry | high
ad_sssd_cache_corrupt | Purge SSSD cache and restart to recover AD user resolution | medium
java_runtime_failure | Clear Keycloak realm/user/keys cache via kcadm.sh | low
sssd_realm_failure | Recover SSSD realm integration | high
sssd_realm_failure | Recover SSSD realm integration on Oracle Linux | high

Time and synchronization (7)

Problem class | What it does | Risk
chrony_no_sources_synced | Recover chrony with no synced sources (makestep + restart) | low
service_down | Recover NTP time synchronization | medium
tcp_time_wait_exhausted | Enable tcp_tw_reuse to relieve TIME_WAIT socket exhaustion | medium
time_skew | Force NTP resync to correct clock skew (timesyncd) | medium
time_skew | Force NTP resync on ALT Linux | medium
time_skew | Force NTP resync on Astra Linux | medium
time_skew | Force NTP resync to correct clock skew (chrony) | medium

Mail (2)

Problem class | What it does | Risk
mail_health_failure | Recover mail stack health | high
mail_queue_stuck | Flush postfix mail queue (force retry) | medium

Security and access (2)

Problem class | What it does | Risk
selinux_denied | Diagnose recent SELinux AVC denials and recommend a fix | low
selinux_denied | Restore SELinux contexts for known service paths when drift is detected | low

Astra Linux / ALT Linux (16)

Problem class | What it does | Risk
alt_kernel_modules_failed | Rebuild initramfs after failed kernel module on ALT Linux | high
alt_rpm_bdb_corrupted | Recover ALT rpmdb Berkeley DB corruption | high
alt_rpm_bdb_severe_corrupt | Severe rpmdb recovery (Packages corrupt): rpm.org canonical procedure | high
alt_tcb_password_expired | Recover ALT TCB service account locked by password expiry | medium
alt_tcb_password_corrupt | Restore corrupted TCB shadow file from backup on ALT Linux | high
alt_apt_rpm_partial_state | Reconcile broken apt-rpm dependency state (ALT) | high
package_manager_failure | Recover stale apt-rpm and rpmdb locks on ALT Linux | medium
alt_control_facility_misconfig | Apply ALT control(8) facility state | medium
alt_etcnet_iface_misconfig | Restart an etcnet-managed interface (ALT) | medium
alt_fcron_job_failure | Restart fcron after repeated job failures (ALT) | low
alt_initramfs_corrupt | Rebuild initramfs with make-initrd (ALT) | high
alt_kernel_update_pending | Apply pending ALT kernel update via update-kernel | high
alt_sisyphus_repo_unreachable | Switch ALT sources.list to a reachable mirror | high
alt_sysconfig_drift | Restore /etc/sysconfig file from owning rpm package (ALT) | medium
astra_parsec_audit_health | Audit Astra PARSEC log presence and rotation | low
astra_mac_label_conflict | Diagnose Astra SE MAC/PARSEC label mismatch on critical paths | low

RHEL / SUSE specifics (3)

Problem class | What it does | Risk
oracle_health_failure | Diagnose Oracle Database health and restart listener/instance | high
suse_btrfs_snapshot_rollback_request | Diagnose btrfs/snapper state for manual rollback | high
suse_transactional_update_pending | Apply pending SUSE transactional-update batch | high