Open-source software (OSS) relational database management systems (RDBMSes) are becoming incredibly popular in the cloud computing world. In this article, I’ll discuss one of the most popular OSS relational databases, PostgreSQL, and the options you have for this database in the Microsoft Azure cloud. It’s worth noting these products are large with many detailed aspects, so consider this article a cursory, high-level overview of each offering.
What is Azure Database for PostgreSQL?
Azure Database for PostgreSQL is the Microsoft platform as a service (PaaS) implementation of the PostgreSQL relational database, hosted on the Azure cloud platform. The PostgreSQL product running on the Microsoft PaaS platform gives OSS developers almost complete feature parity (with fewer available extensions) with the on-premises version of PostgreSQL while automating a large proportion of the daily tasks associated with administering a relational database.
One of the biggest advantages of a PaaS offering is the fact most of the hard work of administration is done for you by the cloud provider. For example, PaaS services for RDBMSes typically take care of the backups and provide some form of high availability, security, maintenance, and logging. Because PaaS takes care of most of the day-to-day administration needs, DBAs and developers have more time to focus on providing business value instead of mundane tasks.
Another advantage of a PaaS service is you can scale up and scale back resources as necessary. This can be incredibly useful if your application has cyclical usage patterns, such as needing more processing resources to complete work during the holiday shopping season.
In Azure, you also have the option to install the PostgreSQL engine on an Azure VM and handle the administration and daily needs as though you were managing the server in your own data center. The benefit of this Infrastructure as a service (IaaS) approach is you still have the highest degree of control over how the machine and database service is administered. You’re responsible for not only the administration of the PostgreSQL database engine but also any required patching for the underlying OS (which will most likely be Linux since we’re discussing PostgreSQL).
High availability for the VM can be provided by Azure, and you also have access to Azure’s underlying storage system. This provides a high degree of availability, as much throughput as your system requires, and security and redundancy.
Azure Database for PostgreSQL–Single Server
Azure Database for PostgreSQL Single Server (which I’ll refer to as Single Server) is Azure’s first PostgreSQL offering on the PaaS platform. With Single Server, most of the regular administration of the database and underlying OS is taken care of for you. For example, backups are automatic, with the ability to restore the database to a point in time if needed.
The underlying OS (Windows for Single Server) is administered for you, and as an administrator, you have no access to it.
There are three pricing tiers to choose when configuring your Single Server:
- Basic – Basic is great for getting a feel for using the service and developing applications. It’s unlikely you’ll use this tier for a production-level application unless the application generates little activity.
- General Purpose – Better suited for production workloads.
- Memory Optimized – The Memory Optimized tier is on a faster set of VMs with more virtual cores available and more memory per virtual core. Also better asuited for production workloads.
The main difference between the General Purpose and Memory Optimized tiers are the resources available to each SKU.
I believe Single Server will eventually be discontinued in lieu of Azure Database for PostgreSQL Flexible Server (discussed below). The reason for this is because of the present limitations of the Single Server product. One big issue is connections coming to Single Server are established through a gateway responsible for the routing to the physical location of the database service. Because this gateway service is additional work in establishing the connection to the database service, this can slow down some connections.
However, the most significant limitation I see for the Single Server service is the fact you’re not able to co-locate your Azure application tier in the same physical location (called an Availability Zone—think ‘data center’) as the database tier. Ideally, your application and your database would be in the same data center to ensure the most negligible possible latency for communication between the two. Unfortunately, you don’t have the option to specify this for Single Server, so performance problems can occur.
For migrating to Azure PostgreSQL from on-premises PostgreSQL or PostgreSQL running on a VM in a cloud provider, your options are to dump the PostgreSQL database and restore the database to Single Server or to use the database migration service.
Azure Database for PostgreSQL–Flexible Server
Azure Database for PostgreSQL Flexible Server provides more granular control of the PostgreSQL database engine configuration and has some useful features not present in Single Server. One main benefit of choosing Flexible Server is you can co-locate your VMs running your applications with the PostgreSQL database engine. This ability is probably the single most attractive feature of using Flexible Server as it avoids the performance problem I described above.
Another great feature of Flexible Server, pgBouncer, is built into the service. pgBouncer is a connection pooling service specific to PostgreSQL, providing a lightweight mechanism for the PostgreSQL engine to maintain a pool of connections for reuse. PostgreSQL connections are notoriously expensive to create in terms of resources, and using pgBouncer (or some other connection pooler) is critical for any application needing to establish and maintain many connections to the database.
Flexible Server also has a configurable Maintenance Window to apply regular patching and periodic maintenance for the PostgreSQL database, which you can configure to suit your application schedule. This lets the administrator choose a specific date and time window to apply these updates, generally during a time of low traffic to the database, ensuring a minimal amount of disruption to your applications when the maintenance occurs.
One of the main drawbacks I see with Flexible Server is the storage size is still restricted to 16TB. While most applications aren’t going to approach this large size limit, there are some applications out there, especially those looking to migrate from Oracle to PostgreSQL, where the 16TB limit is a showstopper. This is where the final tier of PostgreSQL comes into play–Hyperscale.
Azure Database for PostgreSQL–Hyperscale
In early 2019 Microsoft acquired a company named Citus Data, which created an extension for the PostgreSQL database engine to horizontally scale database tables–an activity sometimes referred to as sharding. This feature, since renamed Azure Database for PostgreSQL Hyperscale, enables database developers to design tables so portions of the data are distributed among different worker nodes (i.e., servers) in the topology, allowing for more processing capability than would be possible on a single server.
For each table being distributed, a single distribution column must be selected and used to determine how to deterministically map values from the table to different nodes. Because this distribution column must be specified per table, migration from on-premises or a cloud version of PostgreSQL or from Oracle to Azure Database for PostgreSQL Hyperscale is no trivial task.
There are two types of nodes involved in the Hyperscale configuration (also known as a server group):
- Coordinator node: This is the entry point into the Hyperscale cluster. It accepts application connections and relays SQL queries sent from applications to the different worker nodes, which returns the results to the end user.
- Worker nodes: These are the nodes in the server group storing the distributed data. The query engine knows how the data is distributed among the different worker nodes and pulls the necessary information from each node to the coordinator node to satisfy the queries being issued. You can scale resources on existing worker nodes as demand increases or add worker nodes as necessary. Through the Azure portal, you can scale out your workload up to 20 nodes, with the possibility of expanding to more nodes if you work with Microsoft directly to do so.
High availability for Hyperscale happens through replication to standby replicas. When enabling it, each node (coordinator node and each worker node) receives a standby replica where data is replicated to. As you’re doubling the number of servers involved in the topology, be aware, the cost will also double.
There are also three different table taxonomies in Azure PostgreSQL Hyperscale, and it’s essential to understand when to use which table type:
- Distributed tables: These are horizontally partitioned among the different worker nodes, with each node having a subset of the data in the table. To distribute a table, a distribution column must be chosen when defining the table. Since values from the distribution column map to different worker nodes, much care must be taken when defining the distribution column. Poorly chosen columns can result in poor colocation of joins, which means a fair amount of data movement will need to occur when joins between tables occur. This will be one of the biggest hurdles to using Azure Database for PostgreSQL Hyperscale because most applications will have related tables.
- Reference tables: These are generally smaller lookup system tables frequently joined to larger tables and are copied to each of the worker nodes. While the data is duplicated, the advantage is any join is a local join requiring no data from any other worker node.
- Local tables: These are tables existing solely on the coordinator node and not on any worker node. Examples are small administrative tables never joined to other tables.
Choosing the Right Azure PaaS for PostgreSQL
When migrating to Azure Database for PostgreSQL, it’s very important to choose the right PaaS deployment option. The Single Server offering is aimed at smaller applications where a high degree of throughput isn’t required. Flexible Server is the newer, more advanced offering with better overall throughput than Single Server, along with more flexibility for management such as patching Windows and has built-in features such as pgBouncer. Hyperscale is a horizontal scale-out offering in which data is distributed among worker nodes, with a coordinator node responsible for issuing queries against the worker nodes and coordinating the results to bring the data back to the end user.
Whatever the PaaS deployment option, SolarWinds can provide a PostgreSQL performance monitoring solution to help with query tuning, I/O tuning, and more.
Paul S. Randal is the CEO of SQLskills.com, which he runs with his wife Kimberly L. Tripp. Both Paul and Kimberly are widely-known and respected experts in the SQL Server world, and both are long-time SQL Server MVPs. Paul was a Contributing Editor for TechNet Magazine, where he wrote the bi-monthly SQL Q&A column and feature articles. He also had #1 top-rated workshops and sessions at the PASS Summit and TechEd. Paul is active in the SQL Server community, from user groups to online forums to helping out on Twitter (@PaulRandal – check out the #sqlhelp tag). His popular and widely-referenced blog can be found at https://www.sqlskills.com/blogs/paul/ and he can be reached at paul@sqlskills.com.