How table partitioning helped us get data faster
While working on one of our projects, we faced a problem with the speed of retrieving the data needed to generate time-interval statistics, displayed as graphs, for the Model entity.
For large requests, the selection could take up to 5–6 seconds, which significantly slowed down the application and was completely unacceptable.
To store data, we use MariaDB. When we ran into the issue, the model_statistic table contained about 4 billion records. Of course, we had created all the appropriate indexes and made sure they were actually used. It was also important that the table is used only in one particular part of the app and only for the Model entity. For financial reasons, our client didn't want to store the data in a separate database built specifically for statistics and asked us to solve the problem within the existing setup.
To solve the problem, we decided to use partitioning. It allows a table with a huge number of records to be split into logical parts according to criteria chosen by the administrator. With partitioning, the DBMS (database management system) can restrict a query to only the partitions that may contain matching rows instead of scanning the whole table, which significantly speeds up its work.
Problems you may face when implementing partitioning
- Selection can slow down if a query has to touch many partitions.
- Bulk data insertion can slow down.
- Incoming data may be lost during the migration.
For partitioning to actually accelerate queries, you need to choose the partitioning key correctly.
Why is partitioning suitable for our particular task?
- Our table is used in only one section of the application.
- Our queries always filter on the column we apply partitioning to.
If at least one of these points doesn't hold in your case, we wouldn't recommend using this mechanism, as it will most likely only slow down data retrieval or insertion.
Data splitting methods provided by MySQL
By range of values

```sql
PARTITION BY RANGE (store_id) (
    PARTITION p0 VALUES LESS THAN (10),
    PARTITION p1 VALUES LESS THAN (20),
    PARTITION p2 VALUES LESS THAN (30)
);
```
By the exact list of values

```sql
PARTITION BY LIST (store_id) (
    PARTITION pNorth VALUES IN (3, 5, 6, 9, 17),
    PARTITION pEast  VALUES IN (1, 2, 10, 11, 19, 20)
);
```
Why use this method? Most often, partitioning is done to optimize selection, and less often to optimize insertion. Accordingly, the ideal option is to split the table into as many partitions as possible so that about 90% of all selections happen within a single partition. And if your selection logic is complex (e.g. objects located in the northern quarters of the city that have non-consecutive IDs), it may make sense to enter the lists of values manually.
By hash

```sql
PARTITION BY HASH (store_id)
PARTITIONS 4;
```
In this case, you don't control how SQL distributes the data; you just specify which field to build the hash on and how many "subtables" to create. What for? Selecting by the specified field is much faster, and in some cases hashing gives an even spread of rows across partitions and speeds up data recording.
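For an integer column, MySQL's HASH partitioning is simply the value modulo the number of partitions, so you can predict where a row will land. A small illustration (the store IDs are hypothetical):

```sql
-- With PARTITION BY HASH(store_id) PARTITIONS 4, a row is
-- stored in partition number MOD(store_id, 4):
--   store_id = 17 -> MOD(17, 4) = 1 -> partition p1
--   store_id = 20 -> MOD(20, 4) = 0 -> partition p0
SELECT MOD(17, 4) AS p_for_17, MOD(20, 4) AS p_for_20;
```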
By key

```sql
PARTITION BY KEY (s1)
PARTITIONS 10;
```

Very similar to HASH, but here the server applies its own hashing function to the field. Selecting by the specified key field is as efficient as possible.
KEY is well suited, for example, for a visitor counter, where the login is the only identifier needed to select all the other data.
Whichever scheme you lean towards, the first step is to decide which partitioning method fits your data and queries.
Which method we chose for our task
In our case, every query uses one particular column — model_id, the ID of the entity whose data is requested. That's why we chose partitioning by hash.
When a query is executed, the selection starts with the model_id column: SQL locates the partition that contains the given model_id and then fetches the rest of the data from it. This significantly reduces execution time, since data retrieval is confined to a single partition.
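You can check that a query really touches only one partition with EXPLAIN, whose plan lists the partitions to be read (the model ID and date range here are example values):

```sql
-- The "partitions" column of the plan should show a single
-- partition (e.g. p37) rather than all of them:
EXPLAIN SELECT *
FROM model_statistic
WHERE model_id = 12345
  AND creation_date BETWEEN '2020-01-01' AND '2020-01-31';
```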
Moreover, new records arrive constantly and are also tied to a specific entity ID (model_id), so each incoming record goes into exactly one partition.
And finally, we don't need to think about which parts of the table should hold which entities: SQL works out how best to divide the table and then spreads new records evenly across the partitions on its own.
First of all, we create a new table that exactly repeats the structure of the old one and divide it into partitions with the following command:
```sql
CREATE TABLE `model_statistic_with_partition` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `model_id` int(10) unsigned NOT NULL,
  `cam_site_id` int(10) unsigned NOT NULL,
  `minutes_free_chat` int(10) unsigned DEFAULT NULL,
  `minutes_private_chat` int(10) unsigned DEFAULT NULL,
  `minutes_group_chat` int(10) unsigned DEFAULT NULL,
  `minutes_nude_chat` int(10) unsigned DEFAULT NULL,
  `minutes_member_chat` int(10) unsigned DEFAULT NULL,
  `minutes_gold_chat` int(10) unsigned DEFAULT NULL,
  `minutes_p_gold_chat` int(10) unsigned DEFAULT NULL,
  `minutes_away` int(10) unsigned DEFAULT NULL,
  `minutes_true_private` int(10) unsigned DEFAULT NULL,
  `minutes_happy` int(10) unsigned DEFAULT NULL,
  `minutes_paid` int(10) unsigned DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  `creation_date` date DEFAULT NULL,
  PRIMARY KEY (`id`, `model_id`),
  KEY `model_statistic_with_partition_created_at_index` (`created_at`),
  KEY `model_statistic_with_partition_model_id_index` (`model_id`),
  KEY `model_stat_w_part_model_id_creation_date_cam_site_id_index` (`model_id`, `creation_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE model_statistic_with_partition
  PARTITION BY HASH (model_id)
  PARTITIONS 100;
```

Note that the partitioning column, model_id, is part of the primary key — MySQL requires the partitioning key to be included in every unique key of the table.
The number of partitions is selected based on performance tests.
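When tuning the partition count, it also helps to check how evenly rows end up spread across partitions. The row counts in information_schema are estimates, but they are good enough for this:

```sql
-- Approximate row count per partition of the new table
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'model_statistic_with_partition'
ORDER BY PARTITION_ORDINAL_POSITION;
```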
Data is constantly coming into our table: we receive from 2 to 1,000 new records every 4–6 minutes, and we cannot pause the stream, since it comes from our client's third-party service.
To avoid losing the incoming data, we wrote a small script that copies data from the old table to the new partitioned one. At the end of the script, we swap the tables' names.
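The article doesn't show the script itself; a minimal sketch of its core copy step in SQL might look like this (the batch size and the surrounding loop that repeats the step until the old table is exhausted are assumptions):

```sql
-- Copy one batch of rows, advancing @last_id on each iteration.
-- INSERT IGNORE skips rows that already exist, so the step is
-- safe to re-run if it gets interrupted.
SET @last_id = COALESCE(@last_id, 0);

INSERT IGNORE INTO model_statistic_with_partition
SELECT *
FROM model_statistic
WHERE id > @last_id
ORDER BY id
LIMIT 10000;

-- Remember where the batch ended for the next iteration
SELECT MAX(id) INTO @last_id FROM model_statistic_with_partition;
```

Copying in primary-key-ordered batches keeps each transaction small, so the constant stream of inserts into the old table is never blocked for long.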
We rename the old model_statistic table to model_statistic_backup — in case of unforeseen problems, we can recover everything by simply renaming it back. Then we rename the new model_statistic_with_partition table to model_statistic.
Renaming tables in SQL happens very quickly, so we don't lose a single record from the incoming data stream and preserve all keys and dependencies, which is also important.
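In MariaDB/MySQL both renames can even be done in a single atomic statement, so no query ever sees the table missing in between:

```sql
RENAME TABLE
  model_statistic TO model_statistic_backup,
  model_statistic_with_partition TO model_statistic;
```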
As a result, we:
- managed to divide the table into partitions;
- accelerated data retrieval by about 5 times;
- didn't lose any data.
Partitioning is a very useful mechanism for storing data. However, it works well only if you have a reasonable need for it; otherwise it can actually slow down both selection and insertion.
If you have any issues with the speed of getting data, or you want to find out why your app is so slow, check out the Ronas IT Services page to learn more. Our team of experts can help you design and develop applications, as well as implement any technologies your business needs.