Any application or website that grows significantly will eventually need to scale to accommodate the increased traffic. For data-driven applications and websites, it is critical that scaling is done in a way that ensures the security and integrity of the data. It can be hard to predict how popular a website or application will become, or how long that popularity will last, which is why some organizations choose a database architecture that allows them to scale their databases dynamically.
This conceptual article describes one such database architecture: the sharded database. Sharding has received a lot of attention in recent years, but many people don’t have a clear understanding of what it is or of the scenarios in which sharding a database makes sense. This article covers what sharding is, some of its main benefits and drawbacks, and a few common sharding methods.
What is sharding?
Sharding is a database architecture pattern related to horizontal partitioning, the practice of separating the rows of one table into multiple different tables, known as partitions. Each partition has the same schema and columns, but entirely different rows. Likewise, the data held in each partition is unique and independent of the data held in the other partitions.
It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent of the data in all the others, and each partition holds distinct rows and columns. The following figure shows how a table can be partitioned both horizontally and vertically.
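The difference between the two kinds of partitioning can also be sketched in a few lines of Python. This is only an illustration: the `customers` table, its columns, and both helper functions are invented for the example, and the vertical split assumes the `id` key is kept in both tables so that rows can be matched up again.

```python
# Hypothetical customer table, represented as a list of row dictionaries.
customers = [
    {"id": 1, "first_name": "Alice", "last_name": "Anderson", "city": "Austin"},
    {"id": 2, "first_name": "Bob", "last_name": "Brown", "city": "Boston"},
    {"id": 3, "first_name": "Carol", "last_name": "Clark", "city": "Chicago"},
    {"id": 4, "first_name": "Dave", "last_name": "Davis", "city": "Denver"},
]


def horizontal_partition(rows, predicate):
    """Split rows into two partitions that keep the same columns."""
    matching = [row for row in rows if predicate(row)]
    rest = [row for row in rows if not predicate(row)]
    return matching, rest


def vertical_partition(rows, key, columns):
    """Move the given columns into a separate table (the key stays in both)."""
    kept = [{k: v for k, v in row.items() if k not in columns} for row in rows]
    moved = [{key: row[key], **{c: row[c] for c in columns}} for row in rows]
    return kept, moved


# Horizontal: same columns, different rows in each partition.
hp1, hp2 = horizontal_partition(customers, lambda row: row["id"] <= 2)

# Vertical: the "city" column is split off into its own table.
vp1, vp2 = vertical_partition(customers, "id", ["city"])
```

Here `hp1` and `hp2` each hold two complete rows, while `vp1` and `vp2` each hold all four rows but only a subset of the columns.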
Sharding involves breaking up your data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, each of which can hold multiple logical shards. Despite this, the data held in all of the shards collectively represents the entire logical dataset.
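As a rough sketch of how logical shards relate to physical shards, the mapping can be modeled as a simple placement table. The shard count, the node names, and the round-robin assignment below are all assumptions made for illustration, not a prescribed layout.

```python
# Assumed layout: 8 logical shards spread across 3 physical nodes.
LOGICAL_SHARDS = 8
PHYSICAL_NODES = ["db-node-a", "db-node-b", "db-node-c"]  # hypothetical names

# Assign logical shards to nodes round-robin; one node may hold several.
placement = {
    shard_id: PHYSICAL_NODES[shard_id % len(PHYSICAL_NODES)]
    for shard_id in range(LOGICAL_SHARDS)
}


def node_for(shard_id):
    """Return the physical node that stores the given logical shard."""
    return placement[shard_id]


def shards_on(node):
    """Return every logical shard stored on the given physical node."""
    return [shard for shard, owner in placement.items() if owner == node]
```

With this layout, `shards_on("db-node-a")` returns `[0, 3, 6]`: one physical shard holding three logical shards.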
Database sharding exemplifies a shared-nothing architecture. This means that the shards are autonomous: they do not share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, suppose an application’s database depends on fixed conversion rates for weight measurements. Replicating a table that contains the necessary conversion rate data into each shard helps ensure that all of the data required for queries is held in every shard.
Sharding is often implemented at the application level, meaning that the application includes code that defines which shard to send reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.
Given this general overview of sharding, let’s go over some of the positives and negatives associated with this database architecture.
Benefits of sharding
The main appeal of sharding a database is that it can facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load, allow for more traffic, and speed up processing. It is often contrasted with vertical scaling, also known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more memory or CPU.
It is relatively simple to run a relational database on a single machine and scale it up as needed by upgrading its computing resources. Ultimately, though, any non-distributed database is limited in terms of storage and compute power, so having the freedom to scale horizontally makes your setup far more flexible.
Another reason some people choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple tables, though, queries have to go over fewer rows and their result sets are returned more quickly.
Sharding can also make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.
Disadvantages of sharding
While sharding a database can make scaling easier and improve performance, it can also impose certain limitations. This section discusses some of these limitations and why they might be reasons to avoid sharding altogether.
The first difficulty people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If done incorrectly, there is a significant risk that the sharding process will lead to lost data or corrupted tables. Even when done correctly, though, sharding is likely to have a major impact on your team’s workflows. Rather than accessing and managing data from a single entry point, users must manage data across multiple shard locations, which could be disruptive to some teams.
One problem that users sometimes encounter after sharding a database is that the shards eventually become unbalanced. For example, suppose your database has two separate shards, one for customers whose last names begin with the letters A through M and another for those whose names begin with N through Z. However, your application serves an inordinate number of people whose last names start with the letter G. As a result, the A-M shard gradually accrues more data than the N-Z shard and becomes what is known as a database hotspot. In this case, any benefits of sharding the database are canceled out by the slowdowns and crashes. The database would likely need to be repaired and resharded to distribute the data more evenly.
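The A-M and N-Z example can be sketched as follows. The list of last names is invented sample data, deliberately skewed toward the letter G, and `shard_for` is a hypothetical helper that applies the range rule described above.

```python
from collections import Counter


def shard_for(last_name):
    """Range rule from the example: A-M on one shard, N-Z on the other."""
    return "A-M" if last_name[0].upper() <= "M" else "N-Z"


# Invented sample data, skewed toward last names starting with G.
last_names = [
    "Garcia", "Green", "Gupta", "Gomez", "Grant", "Gray",
    "Adams", "Miller", "Nguyen", "Smith",
]

# Count how many customers land on each shard.
load = Counter(shard_for(name) for name in last_names)
```

With this sample, eight of the ten customers land on the A-M shard and only two on the N-Z shard, which is the kind of imbalance that turns a shard into a hotspot.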
Another major drawback is that once a database has been sharded, it can be very difficult to return it to its unsharded architecture. Any backups of the database made before it was sharded won’t include data written since the sharding. Consequently, rebuilding the original unsharded architecture would require either merging the new sharded data with the old backups or transforming the sharded database back into a single database. Both are costly and time-consuming endeavors.
A final disadvantage to consider is that sharding isn’t natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature, although it is possible to shard a PostgreSQL database manually. There are a number of Postgres forks that do include automatic sharding, but these often trail behind the latest PostgreSQL release and lack certain other features. Some specialized database technologies, such as MySQL Cluster or certain database-as-a-service products like MongoDB Atlas, do include automatic sharding as a feature, but vanilla versions of these database management systems do not. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding and tips for troubleshooting problems are often difficult to find.
These are, of course, only some general issues to consider before sharding. Depending on its use case, there may be many more potential drawbacks to sharding a database.
Now that we’ve covered some of the drawbacks and benefits of sharding, we’ll go over a few different architectures for sharded databases.
Sharding architectures
Once you’ve decided to shard your database, the next thing you need to figure out is how you’ll go about doing so. When running queries or distributing incoming data to sharded tables or databases, it’s crucial that the data goes to the correct shard. Otherwise, the result could be lost data or painfully slow queries. This section goes over a few common sharding architectures, each of which uses a slightly different process to distribute data across shards.
Key-based sharding
Key-based sharding, also known as hash-based sharding, involves taking a value from newly written data, such as a customer’s ID number, a client application’s IP address, or a ZIP code, and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes a piece of data (such as a customer’s email) as input and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. In general, the process looks like this: the chosen value is fed into the hash function, and the resulting hash value identifies the shard that receives the row.
To ensure that entries are placed in the correct shards in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as the shard key. In simple terms, shard keys are similar to primary keys in that both are columns used to establish a unique identifier for individual rows. Broadly speaking, a shard key should be static, meaning it shouldn’t contain values that might change over time. Otherwise, it would increase the amount of work that goes into update operations and could slow down performance.
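A minimal sketch of key-based sharding might look like the following. The shard count, the use of SHA-256, and the modulo step are all assumptions chosen for illustration. A stable hash from `hashlib` is used because Python’s built-in `hash()` is randomized per process for strings and so would not route the same key to the same shard across restarts.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for this example


def shard_id(shard_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key value to a shard ID using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce to a shard ID.
    return int.from_bytes(digest[:8], "big") % num_shards


# The same shard key value always maps to the same shard.
first = shard_id("customer-1001")
second = shard_id("customer-1001")
```

Because the shard ID is computed from the key alone, the application does not need to store where each row lives; it only needs the hash function and the shard count.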
While key-based sharding is a fairly common sharding architecture, it can make things tricky when you try to dynamically add servers to or remove servers from a database. As you add servers, each one needs a corresponding hash value, and many of your existing entries, if not all of them, need to be remapped to their new, correct hash value and then migrated to the appropriate server. Once you begin rebalancing the data, neither the new nor the old hash function is valid. Consequently, your servers won’t be able to write any new data during the migration, and your application could be subject to downtime.
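To get a feel for how much data has to move when a server is added, the sketch below counts how many keys change shards when going from four shards to five. It assumes a simple "hash modulo shard count" scheme and invented customer keys; other hashing schemes behave differently.

```python
import hashlib


def shard_id(shard_key, num_shards):
    """Stable hash of the shard key, reduced modulo the shard count."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_id(key, old_count) != shard_id(key, new_count)
    )
    return moved / len(keys)


# Invented keys; only their hash values matter for this experiment.
keys = [f"customer-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_count=4, new_count=5)
```

Under these assumptions a key keeps its shard only when its hash gives the same remainder for 4 and for 5, which happens for about one key in five, so `moved` should come out at roughly 0.8. In other words, adding a single server forces most of the existing entries to migrate.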
The main appeal of this strategy is that it can be used to distribute data evenly and so prevent hotspots. Also, because it distributes data algorithmically, there is no need to maintain a map of where all the data is located, as is necessary with other strategies such as range-based or directory-based sharding.
Range-based sharding
Range-based sharding involves sharding data based on ranges of a given value. For example, suppose you have a database that stores information about all the products in a retailer’s catalog. You could create a few different shards and divide up each product’s information based on the price range it falls into.
The main benefit of range-based sharding is that it is relatively simple to implement. Every shard holds a different set of data, but they all have the same schema as one another and as the original database. The application code simply reads which range the data falls into and writes it to the corresponding shard.
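The price-range example can be sketched with a sorted list of boundaries and a binary search. The boundaries below are assumptions picked for illustration: shard 0 holds prices below 50, shard 1 holds prices from 50 up to but not including 100, and shard 2 holds everything from 100 upward.

```python
from bisect import bisect_right

# Assumed upper bounds (exclusive) of every shard except the last one.
UPPER_BOUNDS = [50, 100]


def shard_for_price(price):
    """Return the index of the shard whose price range contains `price`."""
    if price < 0:
        raise ValueError("price must be non-negative")
    # bisect_right counts how many boundaries are <= price,
    # which is exactly the index of the matching range.
    return bisect_right(UPPER_BOUNDS, price)


cheap = shard_for_price(49.99)
mid_range = shard_for_price(50)
expensive = shard_for_price(250)
```

Here `cheap`, `mid_range`, and `expensive` evaluate to 0, 1, and 2 respectively. Adding a shard means adding a boundary to the list, although existing rows in the split range would still need to be migrated.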
On the other hand, range-based sharding does not protect data from being unevenly distributed, which leads to the database hotspots mentioned above. Going back to the price-range example, even if each shard holds an equal amount of data, specific products are likely to receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads.
Directory-based sharding
To implement directory-based sharding, you must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In other words, a lookup table is a table that holds a static set of information about where specific data can be found. The following figure shows a simple example of directory-based sharding.
Here, the distribution area column is defined as the shard key. Data from the shard key is written to the lookup table along with the shard that each respective row should be written to. This is similar to range-based sharding, but instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard. Directory-based sharding is a good choice over range-based sharding when the shard key has low cardinality and it doesn’t make sense for a shard to store a range of keys. It also differs from key-based sharding in that it does not process the shard key through a hash function. It simply checks the key against the lookup table to see where the data needs to be written.
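A directory-based lookup can be sketched with a plain dictionary standing in for the lookup table. The area names and shard names below are hypothetical, and in a real deployment the lookup table would live in its own data store rather than in application memory.

```python
# Hypothetical lookup table: shard key value -> shard that holds its rows.
directory = {
    "north": "shard-1",
    "south": "shard-2",
    "east": "shard-1",
}


def shard_for(area):
    """Look up the shard for a shard key value; no hashing or ranges involved."""
    try:
        return directory[area]
    except KeyError:
        raise LookupError(f"no shard assigned for area {area!r}") from None


def assign(area, shard):
    """Register a new key, or move an existing one, by editing the directory."""
    directory[area] = shard


# Adding a new area on a new shard only requires a new directory entry.
assign("west", "shard-3")
target = shard_for("west")
```

The flexibility comes from the fact that any rule, or no rule at all, can decide which shard a key is assigned to. The cost is that every read and write has to consult the directory first.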
The main appeal of directory-based sharding is its flexibility. Range-based sharding architectures limit you to specifying ranges of values, while key-based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory-based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it is relatively easy to add shards dynamically with this approach.
While directory-based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can affect your ability to write new data or access existing data.
Should you shard?
Whether or not to implement a sharded database architecture is almost always a matter of debate. Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless absolutely necessary because of the operational complexity it adds.
Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Below are some common scenarios in which sharding a database may be beneficial.
- The amount of data in your application has grown beyond the storage capacity of a single database node.
- The volume of writes or reads to the database exceeds what a single node or its read replicas can handle, resulting in slow response times or timeouts.
- The network bandwidth required by the application exceeds the bandwidth available to a single database node and any read replicas, resulting in slow response times or timeouts.
Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:
- Set up a remote database. If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it to its own machine. Because the database’s tables remain intact, this doesn’t add as much complexity as sharding. However, it still allows you to scale your database vertically, apart from the rest of your infrastructure.
- Implement caching. If your application’s read performance is what’s causing you trouble, caching is one strategy that can help improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.
- Create one or more read replicas. Another strategy that can help improve read performance involves copying the data from one database server (the primary server) to one or more secondary servers. After this, every new write goes to the primary server before being copied over to the secondary servers, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, which helps prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
- Upgrade to a larger server. In most cases, scaling up a database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.
Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.
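Two of the options above, caching and read replicas, can be illustrated together with a toy model. Everything in it is an assumption made for the sketch: plain dictionaries stand in for the primary server, the replicas, and the cache, the `ReplicatedStore` class is hypothetical, and replication happens synchronously, whereas real replication is often asynchronous and replicas can briefly lag behind the primary.

```python
import itertools


class ReplicatedStore:
    """Toy model: dicts stand in for a primary, its read replicas, and a cache."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.cache = {}
        self.replica_reads = 0  # how many reads actually reached a replica

    def write(self, key, value):
        """Writes go to the primary first and are then copied to each replica."""
        self.primary[key] = value
        for replica in self.replicas:
            replica[key] = value
        self.cache.pop(key, None)  # drop any stale cached copy

    def read(self, key):
        """Reads are served from the cache if possible, otherwise from a replica."""
        if key in self.cache:
            return self.cache[key]
        replica = self.replicas[next(self._next_replica)]  # round-robin
        self.replica_reads += 1
        value = replica.get(key)
        if value is not None:
            self.cache[key] = value
        return value


store = ReplicatedStore()
store.write("product:1", "kettle")
first_read = store.read("product:1")   # served by a replica, then cached
second_read = store.read("product:1")  # served from the cache
```

After the two reads, `store.replica_reads` is 1, because the second read never reaches a database server. Neither technique changes the database’s tables, which is why both are usually simpler to adopt than sharding.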
Conclusion
Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential points of failure for your application. Sharding may be necessary for some, but for others the time and resources needed to create and maintain a sharded architecture could outweigh the benefits.
After reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Going forward, you can use this insight to make a more informed decision about whether a sharded database architecture is right for your application.