How to Handle Failures in Distributed Systems

By Algomaster
15 June 2025

    Ashish Pratap Singh
    Apr 03, 2025

    In a distributed system, failures aren’t a possibility—they’re a certainty.

    Your database might go down. A service might become unresponsive. A network call might time out. The question is not if these things will happen—but when.

    As engineers, our job is to design systems that embrace this reality and gracefully handle failures.

    In this article, we’ll cover:

    • Types of failures in distributed systems

    • 12 best strategies for handling failures


    Types of Failures in Distributed Systems

    Distributed systems involve multiple independent components communicating over a network.

    Each component, and every network link between them, is a potential failure point:

    1. Network Failures

    The network is the most unreliable component in any distributed architecture.

    • Packets get dropped

    • Connections time out

    • DNS resolution fails

    • Latency spikes suddenly

    • Firewalls misbehave

    Even if two services are running in the same data center, network glitches can still occur.

    2. Node Failures

    A single machine (or container) can go down due to:

    • Power failure

    • OS crash

    • Disk corruption

    • Out-of-memory (OOM)

    • Hardware failure

    In distributed systems, every node is potentially a single point of failure unless redundancy is built in.

    3. Service Failures

    A service may fail even if the machine it’s running on is healthy.

    Common reasons:

    • Code bugs (null pointers, unhandled exceptions)

    • Deadlocks or resource exhaustion

    • Memory leaks causing the service to slow down or crash

    • Misconfigurations (e.g., bad environment variables)

    4. Dependency Failures

    Most services depend on:

    • Databases

    • Caches (like Redis or Memcached)

    • External APIs (payment gateways, 3rd-party auth providers)

    • Message queues (like Kafka, RabbitMQ)

    If any of these is unavailable, misbehaving, or inconsistent, the failure can cascade across the system.

    Example: Your checkout service calls the payment API, which calls a bank API, which calls a fraud-detection microservice. Each hop is a potential point of failure.
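    To see how hops compound, here is a rough back-of-the-envelope sketch. It assumes each hop fails independently and uses an illustrative 99.9% availability per hop; neither assumption comes from a real system, and `chain_availability` is a hypothetical helper.

```python
import math

def chain_availability(per_hop):
    """Availability of a serial call chain, assuming independent failures."""
    return math.prod(per_hop)

# Checkout -> payment API -> bank API -> fraud detection: three remote hops.
hops = [0.999, 0.999, 0.999]  # assumed availability of each hop
print(f"{chain_availability(hops):.4%}")
```

    Under these assumptions the chain as a whole is available roughly 99.7% of the time, which is lower than any single hop.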

    5. Data Inconsistencies

    Data replication across systems (like DB sharding, caching layers, or eventual consistency models) can introduce:

    • Out-of-sync states

    • Stale reads

    • Phantom writes

    • Lost updates due to race conditions

    Example: A user updates their address, but due to replication lag, the shipping system fetches the old address and sends the package to the wrong place.
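    The address example can be reproduced with a toy model. This is only a sketch: `ReplicatedStore` is a hypothetical in-memory stand-in for a primary/replica pair, and replication is triggered by hand to imitate lag.

```python
class ReplicatedStore:
    """Toy primary/replica pair with manually triggered replication."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def read_from_replica(self, key):
        return self.replica.get(key)

    def replicate(self):
        """Apply pending writes to the replica (the lag catching up)."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

store = ReplicatedStore()
store.write("user:42:address", "1 Old Street")
store.replicate()

store.write("user:42:address", "9 New Avenue")      # user updates the address
stale = store.read_from_replica("user:42:address")  # shipping reads too early
store.replicate()
fresh = store.read_from_replica("user:42:address")  # replica has caught up

print(stale, "->", fresh)
```

    The first read returns the old address because the replica has not yet seen the update; only after replication does it return the new one.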

    6. Configuration & Deployment Errors

    Failures aren’t always caused by bugs; they’re often caused by misconfigurations and human error:

    • Misconfigured load balancers

    • Missing secrets in the environment

    • Incompatible library updates

    • Deleting the wrong database

    • Rolling out a new version without backward compatibility

    According to multiple incident postmortems (e.g., AWS, Google), a large number of production outages are triggered by bad config changes—not code.

    7. Time-Related Issues (Clock Skew, Timeouts)

    Distributed systems often rely on time for:

    • Cache expiration

    • Token validation

    • Event ordering

    • Retry logic

    But system clocks on different machines can drift out of sync (called clock skew), which can wreak havoc.

    Example:

    Machine A: 12:00:01
    Machine B: 11:59:59

    Machine A’s clock is 2 seconds ahead of Machine B’s, so a short-lived token generated on Machine B might be considered “expired” when validated by Machine A, even if it was just created.
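    A minimal sketch of this scenario follows. The 1-second token lifetime, the calendar date, and the `is_expired` helper are all assumptions made for illustration. The `leeway` parameter shows one common mitigation (tolerating a small amount of skew); it is not something the example above prescribes.

```python
from datetime import datetime, timedelta

def is_expired(issued_at, ttl, now, leeway=timedelta(0)):
    """True if the token is past its lifetime, allowing `leeway` for clock skew."""
    return now > issued_at + ttl + leeway

issued_on_b = datetime(2025, 1, 1, 11, 59, 59)  # timestamp from Machine B's clock
now_on_a = datetime(2025, 1, 1, 12, 0, 1)       # Machine A's clock, 2 s ahead
ttl = timedelta(seconds=1)                      # assumed, deliberately short lifetime

strict = is_expired(issued_on_b, ttl, now_on_a)
tolerant = is_expired(issued_on_b, ttl, now_on_a, leeway=timedelta(seconds=5))
print(strict, tolerant)  # True False
```

    Without leeway, Machine A rejects a token that was just issued; with a few seconds of leeway it accepts it.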


    12 Best Strategies for Handling Failures

    Let’s look at the 12 best strategies that make your system resilient when parts of it inevitably fail.

    1. Set Timeouts for Remote Calls

    A timeout is the maximum time you’re willing to wait for a response from another service. If a service doesn’t respond in that time window, you abort the operation and handle it as a failure.

    Every network call, whether it’s to a REST API, database, message queue, or third-party service, should have a timeout.

    Why?

    Waiting too long can hog threads, pile up requests, and cause cascading failures. It’s better to fail fast and try again (smartly).

    Timeout Best Practices

    To be effective, timeouts should be:

    • Short enough to fail fast

    • Long enough for the request to realistically complete

    • Tuned to the operation (e.g., reads vs writes, internal vs external calls)

    A good practice is to base timeouts on the service’s typical latency (e.g., use the 99th percentile response time or service SLO, plus a safety margin).

    Example:

    If your downstream service has a p99 latency of 450ms:

    Recommended Timeout = 450ms + 50ms buffer = 500ms

    This ensures most successful responses arrive before the timeout, while truly slow or hung requests get aborted.
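    The same calculation can be sketched from raw latency samples. The sample data below is synthetic, the helper names are hypothetical, and the nearest-rank percentile is just one of several reasonable definitions.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def recommended_timeout_ms(latencies_ms, pct=99, buffer_ms=50):
    """Chosen percentile of observed latency plus a fixed safety buffer."""
    return percentile(latencies_ms, pct) + buffer_ms

# Synthetic samples: 98 fast requests, one at 450 ms, one extreme outlier.
latencies = [120] * 98 + [450, 2000]
print(recommended_timeout_ms(latencies))  # 500
```

    Note that the 2000 ms outlier does not move the timeout: it sits above the 99th percentile, so it is exactly the kind of request the timeout is meant to cut off.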

    What to Avoid:

    • Never use infinite or unbounded timeouts

    • Don’t assume the caller will enforce a timeout for you

    Code Example
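    The original code for this section is not available here, so the following is a minimal sketch using only Python’s standard library. `fetch_status` is a hypothetical helper, and the host, port, and 0.5-second default are placeholders rather than recommendations.

```python
import http.client

def fetch_status(host, port, path="/", timeout_s=0.5):
    """Return the HTTP status code, or None if the call fails or times out."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # OSError covers timeouts, refused connections, and DNS failures.
        return None
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder target; a None result means "treat as a failure".
    print(fetch_status("localhost", 8080, "/health"))
```

    Here the timeout bounds each blocking socket operation (connect, send, receive) rather than the request as a whole, so a total deadline would need to be enforced separately.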


    2. Retry Intelligently, Not Blindly
