
How to Handle Failures in Distributed Systems

By Algomaster
15 June 2025


    Ashish Pratap Singh
    Apr 03, 2025

    In a distributed system, failures aren’t a possibility—they’re a certainty.

    Your database might go down. A service might become unresponsive. A network call might time out. The question is not if these things will happen—but when.

    As engineers, our job is to design systems that embrace this reality and gracefully handle failures.

    In this article, we’ll cover:

    • Types of failures in distributed systems

    • 12 best strategies for handling failures


    Types of Failures in Distributed Systems

    Distributed systems involve multiple independent components communicating over a network.

    Each component, and each link between them, is a potential point of failure:

    1. Network Failures

    The network is the most unreliable component in any distributed architecture.

    • Packets get dropped

    • Connections time out

    • DNS resolution fails

    • Latency spikes suddenly

    • Firewalls misbehave

    Even if two services are running in the same data center, network glitches can still occur.

    2. Node Failures

    A single machine (or container) can go down due to:

    • Power failure

    • OS crash

    • Disk corruption

    • Out-of-memory (OOM)

    • Hardware failure

    In distributed systems, every node is potentially a single point of failure unless redundancy is built in.

    3. Service Failures

    A service may fail even if the machine it’s running on is healthy.

    Common reasons:

    • Code bugs (null pointers, unhandled exceptions)

    • Deadlocks or resource exhaustion

    • Memory leaks causing the service to slow down or crash

    • Misconfigurations (e.g., bad environment variables)

    4. Dependency Failures

    Most services depend on:

    • Databases

    • Caches (like Redis or Memcached)

    • External APIs (payment gateways, 3rd-party auth providers)

    • Message queues (like Kafka, RabbitMQ)

    If any of these are unavailable, misbehaving, or inconsistent, it can cause cascading failures across the system.

    Example: Your checkout service calls the payment API, which calls a bank API, which calls a fraud-detection microservice. Each hop is a potential point of failure.
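
To make the cost of each extra hop concrete, here is a minimal sketch of how availability compounds along a serial call chain. It assumes that hops fail independently, and the 99.9% figures are hypothetical, not measurements from any real service.

```python
from math import prod


def chain_availability(hop_availabilities):
    """Probability that every hop in a serial call chain succeeds,
    assuming the hops fail independently of each other."""
    return prod(hop_availabilities)


# Hypothetical figures: payment API, bank API, and fraud-detection
# service, each available 99.9% of the time.
hops = [0.999, 0.999, 0.999]
overall = chain_availability(hops)
print(f"{overall:.4%}")  # prints 99.7003%
```

Under these assumptions the checkout request sees lower availability than any single dependency offers, and every additional hop lowers it further.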

    5. Data Inconsistencies

    Spreading data across systems (through replication, sharding, caching layers, or eventual consistency models) can introduce:

    • Out-of-sync states

    • Stale reads

    • Phantom writes

    • Lost updates due to race conditions

    Example: A user updates their address, but due to replication lag, the shipping system fetches the old address and sends the package to the wrong place.

    6. Configuration & Deployment Errors

    Failures aren’t always caused by bugs. They’re often caused by misconfigurations and human error:

    • Misconfigured load balancers

    • Missing secrets in the environment

    • Incompatible library updates

    • Deleting the wrong database

    • Rolling out a new version without backward compatibility

    According to multiple incident postmortems (e.g., AWS, Google), a large number of production outages are triggered by bad configuration changes rather than by code.

    7. Time-Related Issues (Clock Skew, Timeouts)

    Distributed systems often rely on time for:

    • Cache expiration

    • Token validation

    • Event ordering

    • Retry logic

    But system clocks on different machines can drift out of sync (called clock skew), which can wreak havoc.

    Example:

    Machine A: 12:00:01
    Machine B: 11:59:59

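    Machine A’s clock is two seconds ahead of Machine B’s. A short-lived token generated on Machine B might be considered “expired” when validated by Machine A, even if it was just created, because Machine A believes more time has passed than actually has.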
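
One common mitigation is to allow a small leeway when comparing timestamps across machines. The sketch below reuses the two clock readings from the example; the `is_token_valid` helper, the 1-second token lifetime, and the 5-second `ALLOWED_SKEW` are all illustrative assumptions, not values from any particular auth library.

```python
from datetime import datetime, timedelta, timezone

# Assumed tolerance for clock drift between machines; tune for your environment.
ALLOWED_SKEW = timedelta(seconds=5)


def is_token_valid(issued_at, expires_at, now, leeway=ALLOWED_SKEW):
    """Accept the token if `now` lies within
    [issued_at - leeway, expires_at + leeway]."""
    return issued_at - leeway <= now <= expires_at + leeway


# Machine B (clock reads 11:59:59) issues a token that lives for 1 second.
issued_at = datetime(2025, 1, 1, 11, 59, 59, tzinfo=timezone.utc)
expires_at = issued_at + timedelta(seconds=1)

# Machine A validates it right away, but its own clock reads 12:00:01.
now_on_a = datetime(2025, 1, 1, 12, 0, 1, tzinfo=timezone.utc)

strict = is_token_valid(issued_at, expires_at, now_on_a, leeway=timedelta(0))
tolerant = is_token_valid(issued_at, expires_at, now_on_a)
print(strict, tolerant)  # prints False True
```

The leeway trades a slightly longer effective token lifetime for robustness against drift, so it should stay small relative to the token's lifetime.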


    12 Best Strategies for Handling Failures

    Let’s look at the 12 best strategies that make your system resilient when parts of it inevitably fail.

    1. Set Timeouts for Remote Calls

    A timeout is the maximum time you’re willing to wait for a response from another service. If a service doesn’t respond in that time window, you abort the operation and handle it as a failure.

    Every network call, whether it’s to a REST API, database, message queue, or third-party service, should have a timeout.

    Why?

    Waiting too long can hog threads, pile up requests, and cause cascading failures. It’s better to fail fast and try again (smartly).

    Timeout Best Practices

    To be effective, timeouts should be:

    • Short enough to fail fast

    • Long enough for the request to realistically complete

    • Tuned per operation (e.g., reads vs. writes, internal vs. external calls)

    A good practice is to base timeouts on the service’s typical latency (e.g., use the 99th percentile response time or the service SLO, plus a safety margin).

    Example:

    If your downstream service has a p99 latency of 450ms:

    Recommended Timeout = 450ms + 50ms buffer = 500ms

    This ensures most successful responses arrive before the timeout, while truly slow or hung requests get aborted.
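
The same arithmetic can be derived from measured latencies. This is a minimal sketch using a nearest-rank percentile; the `percentile` and `recommended_timeout_ms` helpers and the sample data are hypothetical, chosen so that the p99 matches the 450ms in the example.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100: the smallest
    sample that has at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids float rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def recommended_timeout_ms(latencies_ms, buffer_ms=50):
    """p99 latency plus a fixed safety margin, in milliseconds."""
    return percentile(latencies_ms, 99) + buffer_ms


# Hypothetical sample: 98 fast calls, one slow call, one outlier.
latencies_ms = [120] * 98 + [450, 900]
print(recommended_timeout_ms(latencies_ms))  # prints 500
```

Note that the 900ms outlier does not inflate the timeout: it sits above the 99th percentile, so that request would be aborted rather than waited for.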

    What to Avoid:

    • Never use infinite or unbounded timeouts

    • Don’t assume the caller will enforce a timeout for you

    Code Example
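
A minimal sketch of a bounded remote call, using only Python's standard `socket` module. The `call_with_timeout` helper is hypothetical, and the silent local listener below merely stands in for a hung downstream service; a real client would apply the same idea through its HTTP or database library's timeout setting.

```python
import socket


def call_with_timeout(host, port, payload, timeout_s):
    """Send payload and wait for a reply. Returns the reply bytes, or None
    if connecting, sending, or receiving exceeds timeout_s seconds."""
    try:
        # The timeout covers the connect and each later socket operation.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(payload)
            return conn.recv(4096)
    except socket.timeout:
        # Fail fast: the caller decides whether to retry or fall back.
        return None


# Stand-in for a hung service: it listens but never accepts or replies.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
silent.listen(1)
host, port = silent.getsockname()

result = call_with_timeout(host, port, b"ping", timeout_s=0.5)
silent.close()
print(result)  # prints None
```

The timeout here applies to each socket operation separately, not to the call as a whole, so a strict end-to-end limit would need a deadline tracked by the caller.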


    2. Retry Intelligently, Not Blindly
