How to Handle Failures in Distributed Systems

By Algomaster
15 June 2025
    Ashish Pratap Singh
    Apr 03, 2025

    In a distributed system, failures aren’t a possibility—they’re a certainty.

    Your database might go down. A service might become unresponsive. A network call might time out. The question is not if these things will happen—but when.

    As engineers, our job is to design systems that embrace this reality and gracefully handle failures.

    In this article, we’ll cover:

    • Types of failures in distributed systems

    • 12 best strategies for handling failures


    Types of Failures in Distributed Systems

    Distributed systems involve multiple independent components communicating over a network.

    Each component, and every network link between them, is a potential failure point:

    1. Network Failures

    The network is often the least reliable part of a distributed architecture.

    • Packets get dropped

    • Connections time out

    • DNS resolution fails

    • Latency spikes suddenly

    • Firewalls misbehave

    Even if two services are running in the same data center, network glitches can still occur.

    2. Node Failures

    A single machine (or container) can go down due to:

    • Power failure

    • OS crash

    • Disk corruption

    • Out-of-memory (OOM)

    • Hardware failure

    In distributed systems, every node is potentially a single point of failure unless redundancy is built in.

    3. Service Failures

    A service may fail even if the machine it’s running on is healthy.

    Common reasons:

    • Code bugs (null pointers, unhandled exceptions)

    • Deadlocks or resource exhaustion

    • Memory leaks causing the service to slow down or crash

    • Misconfigurations (e.g., bad environment variables)

    4. Dependency Failures

    Most services depend on:

    • Databases

    • Caches (like Redis or Memcached)

    • External APIs (payment gateways, 3rd-party auth providers)

    • Message queues (like Kafka, RabbitMQ)

    If any of these are unavailable, misbehaving, or inconsistent, it can cause cascading failures across the system.

    Example: Your checkout service calls the payment API, which calls a bank API, which calls a fraud-detection microservice. Each hop is a potential point of failure.
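
To see why every hop matters, here is a rough sketch of how availability compounds along a chain of synchronous calls. The 99.9% per-hop figure is an assumed value chosen for illustration, not a measurement of any real service, and the calculation assumes the hops fail independently of each other.

```python
from math import prod


def chain_availability(hop_availabilities):
    """Probability that every hop in a synchronous call chain succeeds,
    assuming the hops fail independently of one another."""
    return prod(hop_availabilities)


# Hypothetical figures: checkout -> payment API -> bank API -> fraud detection,
# each assumed to be available 99.9% of the time.
hops = [0.999, 0.999, 0.999, 0.999]
end_to_end = chain_availability(hops)

print(f"End-to-end availability: {end_to_end:.4%}")
```

Under these assumptions the chain as a whole is available only about 99.6% of the time, so the end-to-end request fails roughly four times as often as any single hop does.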

    5. Data Inconsistencies

    Keeping copies of data in more than one place (replicas, shards, caching layers, or eventually consistent stores) can introduce:

    • Out-of-sync states

    • Stale reads

    • Phantom writes

    • Lost updates due to race conditions

    Example: A user updates their address, but due to replication lag, the shipping system fetches the old address and sends the package to the wrong place.
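
The stale-read scenario can be reproduced with a toy model. The sketch below is not a real database client: `Primary`, `Replica`, and `apply_pending` are hypothetical names invented here to stand in for asynchronous replication.

```python
class Replica:
    """Toy read replica that only sees writes once they have been applied."""

    def __init__(self):
        self.data = {}
        self.pending = []  # replicated writes not yet applied (the "lag")

    def apply_pending(self):
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Toy primary that acknowledges writes before the replica catches up."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica

    def write(self, key, value):
        self.data[key] = value
        self.replica.pending.append((key, value))  # shipped asynchronously

    def read(self, key):
        return self.data.get(key)


replica = Replica()
primary = Primary(replica)

primary.write("user:42:address", "1 Old Street")
replica.apply_pending()  # replica is fully caught up

primary.write("user:42:address", "9 New Avenue")  # the user updates the address

stale = replica.read("user:42:address")  # shipping system reads during the lag
fresh = primary.read("user:42:address")  # the primary already has the new value

replica.apply_pending()  # the lag resolves
eventual = replica.read("user:42:address")

print(stale, "|", fresh, "|", eventual)
# prints: 1 Old Street | 9 New Avenue | 9 New Avenue
```

The replica does converge, but any read that lands inside the lag window sees the old address. One possible mitigation, offered here as a design option rather than a rule, is to send reads that must reflect the latest write to the primary.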

    6. Configuration & Deployment Errors

    Failures aren’t always caused by bugs; they’re often caused by misconfigurations and human error:

    • Misconfigured load balancers

    • Missing secrets in the environment

    • Incompatible library updates

    • Deleting the wrong database

    • Rolling out a new version without backward compatibility

    Public incident postmortems from large providers (e.g., AWS, Google) repeatedly trace production outages to bad configuration changes rather than code defects.

    7. Time-Related Issues (Clock Skew, Timeouts)

    Distributed systems often rely on time for:

    • Cache expiration

    • Token validation

    • Event ordering

    • Retry logic

    But system clocks on different machines can drift out of sync (called clock skew), which can wreak havoc.

    Example:

    Machine A: 12:00:01
    Machine B: 11:59:59

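    A short-lived token generated on Machine B looks two seconds older to Machine A than it really is, so A may reject it as “expired” even though it was just created. In the other direction, a token issued on Machine A carries a timestamp that Machine B still considers to be in the future.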
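
A common way to tolerate small amounts of skew is to widen the validity window by a few seconds of leeway. The sketch below uses the two clock readings from the example above; the function name `is_token_valid`, the one-second lifetime, the five-second leeway, and the calendar date are all assumptions picked to make the effect visible.

```python
from datetime import datetime, timedelta


def is_token_valid(issued_at, ttl, now, leeway=timedelta(0)):
    """Accept the token if `now` falls inside [issued_at, issued_at + ttl],
    widened on both sides by `leeway` to absorb clock skew."""
    not_before = issued_at - leeway
    expires_at = issued_at + ttl + leeway
    return not_before <= now <= expires_at


# Clock readings from the example (the date itself is arbitrary).
clock_a = datetime(2025, 1, 1, 12, 0, 1)
clock_b = datetime(2025, 1, 1, 11, 59, 59)
ttl = timedelta(seconds=1)  # deliberately tiny so that 2s of skew matters

# Issued on B, validated immediately on A: looks already expired.
strict_b_to_a = is_token_valid(clock_b, ttl, now=clock_a)
# Issued on A, validated immediately on B: looks like it is from the future.
strict_a_to_b = is_token_valid(clock_a, ttl, now=clock_b)

leeway = timedelta(seconds=5)
lenient_b_to_a = is_token_valid(clock_b, ttl, now=clock_a, leeway=leeway)
lenient_a_to_b = is_token_valid(clock_a, ttl, now=clock_b, leeway=leeway)

print(strict_b_to_a, strict_a_to_b, lenient_b_to_a, lenient_a_to_b)
# prints: False False True True
```

Leeway trades a slightly longer effective token lifetime for robustness, so it should stay small and go hand in hand with keeping clocks synchronized.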


    12 Best Strategies for Handling Failures

    Let’s look at the 12 best strategies that make your system resilient when parts of it inevitably fail.

    1. Set Timeouts for Remote Calls

    A timeout is the maximum time you’re willing to wait for a response from another service. If a service doesn’t respond in that time window, you abort the operation and handle it as a failure.

    Every network call, whether it’s to a REST API, database, message queue, or third-party service, should have a timeout.

    Why?

    Waiting too long can hog threads, pile up requests, and cause cascading failures. It’s better to fail fast and try again (smartly).

    Timeout Best Practices

    To be effective, timeouts should be:

    • Short enough to fail fast

    • Long enough for the request to realistically complete

    • Tuned per operation (e.g., reads vs. writes, internal vs. external calls)

    A good practice is to base timeouts on the service’s typical latency (e.g., use the 99th percentile response time or service SLO, plus a safety margin).

    Example:

    If your downstream service has a p99 latency of 450ms:

    Recommended Timeout = 450ms + 50ms buffer = 500ms

    This ensures most successful responses arrive before the timeout, while truly slow or hung requests get aborted.
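
The same arithmetic can be derived from measured latencies instead of a hand-picked number. This is a minimal sketch: the sample data is made up, the helper names `percentile` and `recommended_timeout_ms` are hypothetical, and the nearest-rank method is just one of several ways to compute a percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_timeout_ms(latencies_ms, pct=99, buffer_ms=50):
    """Timeout = chosen latency percentile plus a fixed safety buffer."""
    return percentile(latencies_ms, pct) + buffer_ms


# Made-up sample: 100 requests, mostly fast, with one badly hung outlier.
latencies_ms = [120] * 90 + [300] * 8 + [450, 2000]

print(recommended_timeout_ms(latencies_ms))
# prints: 500
```

With this sample the p99 is 450ms, so the recommended timeout is 500ms and the single 2000ms outlier is exactly the kind of request that would be aborted.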

    What to Avoid:

    • Never use infinite or unbounded timeouts

    • Don’t assume the caller will enforce a timeout for you

    Code Example
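
A minimal sketch of a per-call timeout using only Python's standard library. The helper name `fetch_with_timeout`, the `ping` payload, and the silent local listener standing in for a hung dependency are all assumptions made for illustration; this is not the article's original listing.

```python
import socket
import time


def fetch_with_timeout(host, port, payload, timeout_s):
    """Send `payload` and wait for a reply, giving up after `timeout_s`.

    The timeout applies to each socket operation (connect, send, recv),
    not to the call as a whole. Returns (ok, data). Errors other than a
    timeout, such as a refused connection, are left to propagate.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(payload)
            return True, conn.recv(4096)
    except (socket.timeout, TimeoutError):
        return False, b""


# Stand-in for a hung dependency: a listener that never accepts or replies.
# The OS completes the TCP handshake, so the client connects and then waits.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
silent.listen()
host, port = silent.getsockname()

start = time.monotonic()
ok, data = fetch_with_timeout(host, port, b"ping", timeout_s=0.5)  # 500ms
elapsed = time.monotonic() - start
silent.close()

print(f"ok={ok} after {elapsed:.1f}s")
```

Without the `timeout` argument the `recv` call would block indefinitely and tie up the calling thread, which is the failure mode described above. Here the caller gets a clear failure after about half a second and can decide what to do next.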


    2. Retry Intelligently, Not Blindly
