Comparing PostgreSQL Batch Insertion Methods (Multi-Value INSERT, Prepared Statement, and COPY)
Have you ever needed to do batch insertion with PostgreSQL? Which method should you use? Write the query directly? Use a prepared statement? Or maybe the COPY command? In this post, we will try to find out which method is the fastest. Of course you want the fastest possible result for your customers, right? Bear in mind that this is not a silver-bullet solution for every insert scenario; we are focusing only on the batch insertion process. But before we start, let's understand which batch insertion options are available in PostgreSQL.
Batch Insertion in PostgreSQL
Batch insertion is the process of inserting multiple rows at once. PostgreSQL provides several options for this. Of course, not all of them are created equal, and some are better than others in certain situations. Some of the supported methods are listed below:
- The standard INSERT query
- The multi-value INSERT query
- The prepared statement INSERT query
- The COPY query
- The pipelined prepared statement INSERT query
Let's see the difference between them and pick the best one for our scenario. Bear in mind that we will wrap the inserts in a transaction, since otherwise each statement would just be a normal standalone insert rather than part of a batch. We will demonstrate everything in Go so we can easily test it.
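All of the snippets in this post are minimal sketches built on a shared scaffold, assuming the pgx driver and a throwaway `items` table; the DSN, table, and helper names are illustrative placeholders rather than exact benchmark code:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

// connect opens a single connection and makes sure the test table exists.
// The DSN and the items table are placeholders; adjust them to your setup.
func connect(ctx context.Context) *pgx.Conn {
	conn, err := pgx.Connect(ctx, "postgres://postgres:secret@localhost:5432/benchdb")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	_, err = conn.Exec(ctx,
		`CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, name text NOT NULL)`)
	if err != nil {
		log.Fatalf("create table: %v", err)
	}
	return conn
}

// timed runs fn and prints how long it took for the given row count.
func timed(count int, fn func() error) {
	start := time.Now()
	if err := fn(); err != nil {
		log.Fatalf("insert %d rows: %v", count, err)
	}
	fmt.Printf("%d rows: %f seconds\n", count, time.Since(start).Seconds())
}

func main() {
	ctx := context.Background()
	conn := connect(ctx)
	defer conn.Close(ctx)

	for _, count := range []int{1, 10, 100, 1000, 10000, 100000} {
		// Swap in any of the insert functions defined in the sections below.
		timed(count, func() error { return insertOneByOne(ctx, conn, count) })
	}
}
```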
The INSERT query
This is probably the most basic way to insert values in PostgreSQL: you just run a normal INSERT for each row inside a transaction. Below is a sketch of the code used to test this approach (the function name is an illustrative placeholder):
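```go
// insertOneByOne runs one INSERT per row inside a single transaction.
// Every Exec call is a separate round trip to the server, and the
// statement is parsed and planned again on each call.
func insertOneByOne(ctx context.Context, conn *pgx.Conn, count int) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op if Commit already succeeded

	for i := 0; i < count; i++ {
		_, err := tx.Exec(ctx,
			"INSERT INTO items (name) VALUES ($1)",
			fmt.Sprintf("name-%d", i))
		if err != nil {
			// Because each row is its own statement, you could choose to
			// skip or retry just this row here instead of aborting.
			return err
		}
	}
	return tx.Commit(ctx)
}
```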
And below is the result of the code execution:
| Count | Seconds Elapsed |
| --- | --- |
| 1 | 0.013899 |
| 10 | 0.003612 |
| 100 | 0.021393 |
| 1000 | 0.183846 |
| 10000 | 1.856627 |
| 100000 | 19.278281 |
| 1000000 | Cancelled (took too long) |
| 10000000 | Cancelled (took too long) |
Looking at the result, we can clearly see that the execution time grows linearly with the input; the first result is a bit of an anomaly, though. Maybe the time was spent printing the result rather than executing the query. Also, as we can see in the code, we can handle each insertion failure ourselves; if we only need some of the rows to succeed, we can commit the ones that worked and be done with it. The major drawback of this approach is that there is a parsing and planning step on every call, which wastes time. Let's move on to the next method: the multi-value insert.
The Multi-Value INSERT query
This is just like the simple query, but instead of executing the inserts one at a time, we send all of the rows in a single statement. Below is a sketch of the code used to run this benchmark:
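```go
// insertMultiValue builds a single INSERT ... VALUES (...),(...),... and
// sends it in one round trip. Note that the values are inlined as SQL
// literals rather than bound parameters, so real user input would need
// manual escaping. (This snippet also needs "strings" in the imports.)
func insertMultiValue(ctx context.Context, conn *pgx.Conn, count int) error {
	var sb strings.Builder
	sb.WriteString("INSERT INTO items (name) VALUES ")
	for i := 0; i < count; i++ {
		if i > 0 {
			sb.WriteByte(',')
		}
		fmt.Fprintf(&sb, "('name-%d')", i)
	}
	_, err := conn.Exec(ctx, sb.String())
	return err
}
```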
And below is the result:
| Count | Seconds Elapsed |
| --- | --- |
| 1 | 0.015370 |
| 10 | 0.001171 |
| 100 | 0.002548 |
| 1000 | 0.014822 |
| 10000 | 0.096488 |
| 100000 | 1.005193 |
| 1000000 | 9.290946 |
| 10000000 | Cancelled (took too long) |
Well, look at that: by building the query in Go first and sending it as a single statement, we got so much of a performance gain that I decided to add extra test cases for it. The obvious drawback of this method is the increased network load from sending one huge payload at once. But that cost is worth it, since we only need to make one call to the database. There is one more problem with this query: it doesn't use placeholder parameters, which means it may be subject to SQL injection if you deal with strings. You may have to escape the strings manually, which takes some time to compute, but that is usually cheap compared to the RTT (round-trip time) between your server and your database.
The Prepared Statement INSERT query
Using a prepared statement, we can reduce the planning cost of each execution. In this scenario, Postgres prepares something like a function that you can call repeatedly, since the statement is parsed and planned once up front. Here is a sketch of the code used to run this:
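```go
// insertPrepared parses and plans the INSERT once, then executes the
// prepared statement once per row inside a transaction. Planning is paid
// a single time, but each Exec is still a network round trip.
func insertPrepared(ctx context.Context, conn *pgx.Conn, count int) error {
	if _, err := conn.Prepare(ctx, "insert_item",
		"INSERT INTO items (name) VALUES ($1)"); err != nil {
		return err
	}

	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op if Commit already succeeded

	for i := 0; i < count; i++ {
		// pgx lets the prepared statement's name stand in for the SQL text.
		if _, err := tx.Exec(ctx, "insert_item", fmt.Sprintf("name-%d", i)); err != nil {
			return err
		}
	}
	return tx.Commit(ctx)
}
```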
And below is the result:
| Count | Seconds Elapsed |
| --- | --- |
| 1 | 0.014564 |
| 10 | 0.002910 |
| 100 | 0.015437 |
| 1000 | 0.108514 |
| 10000 | 1.013381 |
| 100000 | 9.599144 |
| 1000000 | Cancelled (took too long) |
| 10000000 | Cancelled (took too long) |
Compared with the single-insert approach, this one does the same work in about half the time, presumably because we skip part of the per-statement processing. We still pay the RTT cost of each call, since we keep control over each individual insert. The good news is that you can handle each insert's success or failure however you need.
The COPY query
And now we arrive at the COPY query. It is similar to the multi-value insert, but it uses a dedicated PostgreSQL command called COPY. Here is a sketch of the code used to test this:
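```go
// insertCopy streams the rows through PostgreSQL's COPY protocol using
// pgx's CopyFrom. All rows go in one shot; if any row fails, none are kept.
func insertCopy(ctx context.Context, conn *pgx.Conn, count int) error {
	rows := make([][]any, count)
	for i := range rows {
		rows[i] = []any{fmt.Sprintf("name-%d", i)}
	}
	_, err := conn.CopyFrom(ctx,
		pgx.Identifier{"items"}, // target table
		[]string{"name"},        // columns to fill
		pgx.CopyFromRows(rows))
	return err
}
```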
And here is the result:
| Count | Seconds Elapsed |
| --- | --- |
| 1 | 0.013255 |
| 10 | 0.002039 |
| 100 | 0.001680 |
| 1000 | 0.009167 |
| 10000 | 0.051179 |
| 100000 | 0.433359 |
| 1000000 | 3.460290 |
| 10000000 | 60.348227 |
Look at that: it even beats the multi-value insert, and it doesn't require you to build the query yourself. But it shares the same problem as the multi-value insert: you can't control what happens when an insertion fails. If one row fails, all of them fail. You also can't do INSERT ... ON CONFLICT with COPY, which is another drawback. But for these performance gains it might be worth it, especially if you need all of the rows inserted.
Remarks
Well, let's get to the remarks, since this is getting long.
- Use COPY when you need all-or-nothing insertion of the whole data set and have no need for conflict resolution.
- Use the multi-value INSERT when you need all-or-nothing insertion but still want regular INSERT features such as ON CONFLICT.
- Use a prepared statement when you need individual control over each insert operation.
- Use a single INSERT query when you only need to insert a single row.
I think that's all from me. May the knowledge you gain benefit you in the future ~