“I’ll just choose X, it’s the DB I know and worked with”.
Most of the developers and students use this statement when it comes to choosing a database for a project. Working with a database you’re already familiar with is perfectly fine if performance is not an important requirement for your system, but consider a situation when your application grows and after a couple of years your app starts facing some issues. It will become a headache for developers and administrators to fix the issue. It doesn’t matter you are working on a project from scratch or you’re already working on a mature project, it’s important to know the limitations of the database and identify when to add another type of DB in your project.
There are more than 300 database management systems available in the market and choosing the one can be overwhelming for developers. You have a variety of options available in relational (MySQL, PostgreSQL, Oracle DB, etc) and non-relational (MongoDB, Apache HBase, Cassandra, etc) database but you need to understand none of them fits on all kinds of projects requirement. Each one of them has some strengths and weaknesses. Let’s look at some case study that illustrates how you should choose the right database for your application.
Choosing the Right Database
How do you make this decision when you’re architecting a given system?.
There are so many databases are available and picking up one database over another is a complicated decision. Well, there is no real formula you can follow but there are a few things you should think about. It’s not an easy decision but people who are good at it make the big bucks. Firstly set aside the idea that you are going to find the one true database that is better than everything else. Now before considering a specific database take some time and ask a few important questions related to your project…
- How much data do you expect to store when the application is mature?
- How many users do you expect to handle simultaneously at peak load?
- What availability, scalability, latency, throughput, and data consistency does your application need?
- How often will your database schemas change?
- What is the geographic distribution of your user population?
- What is the natural “shape” of your data?
- Does your application need online transaction processing (OLTP), analytic queries (OLAP), or both?
- What ratio of reads to writes do you expect in production?
- What are your preferred programming languages?
- Do you have a budget? If so, will it cover licenses and support contracts?
- How strict are you with invalid data being sent to your database? (Ideally, you are very strict and do server-side data validation before persisting it to your database)
Now let’s talk about some key aspects that will answer the above questions and will help you to choose the right database for your application…
1. Integration
The most important thing to consider while choosing the right database is what system you need to integrate together? Make sure that your database management system can be integrated with other tools and services within your project. Different technologies have different connectors for different other technologies. For example, if you have a big analytics job that’s currently running an Apache spark then probably you want to limit yourself to external databases that can connect easily to apache spark. Now suppose you have some frontend system that actually depends on having a SQL interface to a backend and you’re thinking about moving from a monolithic database to a non-relational database. It will be only a good choice if the non-relational database you’re moving to offer some sort of SQL like interface that can be easily migrated to from your frontend application. So think about the pieces that need to talk together in your system and see if they can actually talk together or not with existing off-the-shelf components and whether those components are actually well maintained and up-to-date.
Another example is ArangoDB which has excellent performance but libraries for this DBMS are still young and lack support. Using ArangoDB in combination with other tools may be risky, so the community suggests avoiding ArangoDB for complex projects.
2. Scaling Requirement
It’s important to know the scaling requirement before installing your production database. How much data are you really talking about? Is it really going to grow unbounded over time? if so then you need some sort of database technology that is not limited to the data that you can store on one PC. You need to look at something like Cassandra or MongoDB or HBase where you can actually distribute the storage of your data across an entire cluster and scale horizontally instead of vertically. Many databases can’t handle thousands of users querying terabytes or petabytes of data, because of scaling issues.
While choosing a database you also need to think about the transaction rate or throughput which means how many requests you intend to get per second. Databases with high throughput can support many simultaneous users. If we are talking about thousands then again a single database service is not going to work out. This is especially important when you are working on some big websites where we have a lot of web servers that are serving a lot of people at the same time. You will have to choose a database that is distributed and allows you to spread out a load of those transactions more evenly. In those situations, NoSQL databases are a good choice instead of RDBMS.
3. Support Consideration
Think about the supports you might need for your database. Do you have the in-house expertise to spin up this new technology and actually configure it properly? It’s going to be harder than you think especially if you’re using this in the real world or any sort of situation where you have personally identifiable information in the mix from your end-users. In that case, you need to make sure you’re thinking about the security of your system. The truth is most of the NoSQL database we’ve talked about if you configure them with their default settings there will be no security at all. Anybody at all can connect to these things and retrieve data and write data into them. So make sure you have someone available who knows what they’re doing for setting this up in a secure manner. If you are in a big organization that has these experts in-house then it’s great, but if you’re in a smaller organization you may have to choose the technology that offers professional paid support who can guide you through initial setup decisions in the initial administration of your server over time. You can also outsource the administrators for support. A more corporate solution like MongoDB has paid support and if we talk about the Apache projects then there are some companies that offer paid professional support.
4. CAP Consideration
CAP stands for Consistency, Availability, and Partition tolerance. The theorem states that you cannot achieve all the properties at the best level in a single database, as there are natural trade offs between the items. You can only pick two out of three at a time and that totally depends on your prioritize based on your requirements. For example, if your system needs to be available and partition tolerant, then you must be willing to accept some latency in your consistency requirements.
Traditional relational databases are a natural fit for the CA side whereas Non-relational database engines mostly satisfy AP and CP requirements.
- Consistency means that any read request will return the most recent write. Data consistency is usually “strong” for SQL databases and for NoSQL database consistency may be anything from “eventual” to “strong”.
- Availability means that a non-responding node must respond in a reasonable amount of time. Not every application needs to run 24/7 with 99.999% availability but most likely you will prefer a database with higher availability.
- Partition tolerance means the system will continue to operate despite network or node failures.
The type of application will determine what you want there and only you know the actual requirements. Is it actually ok if your system goes down for a few seconds or a few minutes, if not then availability should be your prime concern? If you’re dealing with something with real transactional information like a stock transaction or financial transactions you might value consistency above all. Try to choose the technology that is best suited to the trade-offs that you want to make.
5. Schemas or Data Model
Relational databases store data in a fixed and predefined structure. It means when you start development you will have to define your data schema in terms of tables and columns. You have to change the schema every time the requirements change. This will lead to creating new columns, defining new relations, reflecting the changes in your application, discussing with your database administrators, etc.
NoSQL database provides much more flexibility when it comes to handling data. There is no requirement to specify the schema to start working with the application. Also, the NoSQL database doesn’t put a restriction on the types of data you can store together. It allows you to add more new types as your needs change. In the application building process, most of the developers prefer high coding velocity and great agility. NoSQL databases have proven to be a much better choice in that regard especially for agile development which requires fast implementation.
You really need to take care of all the 5 points mentioned but above all, the most important advice is to keep everything simple. Don’t choose a database just because it is shiny and trendy in the market. If you don’t need to set up a highly complex NoSQL cluster or something that needs a lot of maintenance like MongoDB or HBase where you have all these external servers that maintain the configuration don’t do it if you don’t need to. Think about the minimum requirements that you need for your system. If you don’t need to deal with the massive scale then there is no need to use a NoSQL database, you can choose MySQL and somewhere it’ll be fine. There is no point to deploy a whole new system that does not have good expertise within your organization unless you really need to. Simple technologies and simple architectures are going to be a lot easier to maintain. After all, you’re not going to be happy when you wake up in the morning at 3:00 am because some random server went down on this overly complex database system that you set up for no good reasons. So keep everything simple as much as possible.