Comparing PostgreSQL Batch Insertion (Multi Value Insert, Prepared Statement, and Copy)

Updated At Thu, 29 Sep 2022 16:52:02 GMT

Have you ever need to do batch insertion with PostgreSQL? Which one do you need to use? Write the query directly? Or maybe use prepared statement? Or maybe using copy command? In this post, we will try to inspect which method is the fastest. Well of course you want the fastest possible result for your customer right? Bear in mind that this is not a silver bullet solution for every insert scenario but instead only focusing on the batch insertion process. But before we start, let's understand what is batch insertion option is available in PostgreSQL.

Batch Insertion in PostgreSQL

Batch insertion is a process to insert multiple data at once. PostgreSQL provide some option if you want to perform batch insertion in PostgreSQL. Of course not all of them are created equal and some are better than others in certain situation. Some of the method that's supported are listed below :

The standard INSERT query
The multi value INSERT query
The prepared statement INSERT query
The COPY query
The pipelined prepared statement INSERT query

Let's see the difference between all of them and pick the best for our scenario. Bear in mind that we will be using transaction when executing this since otherwise it will be a normal insert and not a batch one. We will demonstrate it by using Golang so we can easily test it.

The INSERT query

This is probably the most basic way to insert value in PostgreSQL where you just do normal INSERT inside a transaction. Below are the code used for testing the insertion process on this query :

package%20main%0A%0Aimport%20(%0A%09%22database/sql%22%0A%09%22fmt%22%0A%09%22time%22%0A%0A%09_%20%22github.com/lib/pq%22%0A)%0A%0Afunc%20main()%20%7B%0A%09dsn%20:=%20%22REDACTED%22%0A%09db,%20err%20:=%20sql.Open(%22postgres%22,%20dsn)%0A%09if%20err%20!=%20nil%20%7B%0A%09%09panic(err)%0A%09%7D%0A%09cases%20:=%20%5B%5Dint%7B%7D%0A%09value%20:=%201%0A%09for%20i%20:=%200;%20i%20%3C%206;%20i++%20%7B%0A%09%09cases%20=%20append(cases,%20value)%0A%09%09value%20*=%2010%0A%09%7D%0A%09for%20_,%20currentCase%20:=%20range%20cases%20%7B%0A%09%09currentTime%20:=%20time.Now()%0A%09%09tx,%20err%20:=%20db.Begin()%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09for%20i%20:=%200;%20i%20%3C=%20currentCase;%20i++%20%7B%0A%09%09%09if%20_,%20err%20:=%20tx.Exec(%22INSERT%20INTO%20multi_insert_bench%20(value)%20VALUES%20($1)%22,%20i);%20err%20!=%20nil%20%7B%0A%09%09%09%09panic(err)%0A%09%09%09%7D%0A%09%09%7D%0A%09%09tx.Commit()%0A%09%09fmt.Printf(%22Case%20%25d,%20Second%20Elapsed%20:%20%25f%5Cn%22,%20currentCase,%20time.Since(currentTime).Seconds())%0A%09%7D%0A%7D%0A

And below are the result of the code execution :

Count	Second Elapsed
1	0.013899
10	0.003612
100	0.021393
1000	0.183846
10000	1.856627
100000	19.278281
1000000	Cancelled Too Long
10000000	Cancelled Too Long

Looking at the result we can clearly see that the execution time grows linearly with the input, the first result is a bit anomaly tough. Maybe it takes more time to print the result instead of executing it. Also, as we can see on the code we can handle each insertion failure by ourself, maybe if we need to allow one of them to succeed we can just commit it and be done with it. The major drawback on this scenario is that there is a planning step on each call which is kinda wasting time. Let's move to the next step which is multi value insert.

The Multi Value INSERT query

This is just like the simple query, but on this one, instead of executing it one at a time we will insert all of it at once. Below are the code used to run this benchmark for this scenario :

package%20main%0A%0Aimport%20(%0A%09%22database/sql%22%0A%09%22fmt%22%0A%09%22strconv%22%0A%09%22strings%22%0A%09%22time%22%0A%0A%09_%20%22github.com/lib/pq%22%0A)%0A%0Afunc%20main()%20%7B%0A%09dsn%20:=%20%22REDACTED%22%0A%09db,%20err%20:=%20sql.Open(%22postgres%22,%20dsn)%0A%09if%20err%20!=%20nil%20%7B%0A%09%09panic(err)%0A%09%7D%0A%09cases%20:=%20%5B%5Dint%7B%7D%0A%09value%20:=%201%0A%09for%20i%20:=%200;%20i%20%3C%207;%20i++%20%7B%0A%09%09cases%20=%20append(cases,%20value)%0A%09%09value%20*=%2010%0A%09%7D%0A%09for%20_,%20currentCase%20:=%20range%20cases%20%7B%0A%09%09currentTime%20:=%20time.Now()%0A%09%09tx,%20err%20:=%20db.Begin()%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09builder%20:=%20strings.Builder%7B%7D%0A%09%09builder.WriteString(%22INSERT%20INTO%20multi_insert_bench%20(value)%20VALUES%20%22)%0A%09%09for%20i%20:=%200;%20i%20%3C=%20currentCase;%20i++%20%7B%0A%09%09%09builder.WriteRune('(')%0A%09%09%09builder.WriteString(strconv.Itoa(i))%0A%09%09%09if%20i%20!=%20currentCase%20%7B%0A%09%09%09%09builder.WriteString(%22),%22)%0A%09%09%09%7D%20else%20%7B%0A%09%09%09%09builder.WriteRune(')')%0A%09%09%09%7D%0A%09%09%7D%0A%09%09if%20_,%20err%20:=%20tx.Exec(builder.String());%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09tx.Commit()%0A%09%09fmt.Printf(%22Case%20%25d,%20Second%20Elapsed%20:%20%25f%5Cn%22,%20currentCase,%20time.Since(currentTime).Seconds())%0A%09%7D%0A%7D%0A

And below are the result of it :

Count	Second Elapsed
1	0.015370
10	0.001171
100	0.002548
1000	0.014822
10000	0.096488
100000	1.005193
1000000	9.290946
10000000	Cancelled Too Long

Well look at that, by building the query first in go and send it as a single query we just got so much performance gain that i decided to add extra test cases for it. The obvious drawback from this method is that we increase the network load by sending huge payload at one. But those cost are worth it since we just need to make one call to DB. There is also one single problem with this query, is that this query didn't use placeholder parameter which mean it might be a subject to sql injection if you deal with string in it. Or you might have to escape the string manually which again might take some time to compute but usually it's not that expensive compared to the RTT (round trip time) between your server and your DB.

The Prepared Statement INSERT query

Well using prepared statement we can reduce the planning cost of each execution with prepared statement. Basically in this scenario Postgres will prepared some sort of function which you can call repeatedly since it already plan in at front. Let's just see the code used to run this :

package%20main%0A%0Aimport%20(%0A%09%22database/sql%22%0A%09%22fmt%22%0A%09%22time%22%0A%0A%09_%20%22github.com/lib/pq%22%0A)%0A%0Afunc%20main()%20%7B%0A%09dsn%20:=%20%22REDACTED%22%0A%09db,%20err%20:=%20sql.Open(%22postgres%22,%20dsn)%0A%09if%20err%20!=%20nil%20%7B%0A%09%09panic(err)%0A%09%7D%0A%09cases%20:=%20%5B%5Dint%7B%7D%0A%09value%20:=%201%0A%09for%20i%20:=%200;%20i%20%3C%206;%20i++%20%7B%0A%09%09cases%20=%20append(cases,%20value)%0A%09%09value%20*=%2010%0A%09%7D%0A%09for%20_,%20currentCase%20:=%20range%20cases%20%7B%0A%09%09currentTime%20:=%20time.Now()%0A%09%09tx,%20err%20:=%20db.Begin()%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09stmt,%20err%20:=%20tx.Prepare(%22INSERT%20INTO%20multi_insert_bench%20(value)%20VALUES%20($1)%22)%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09for%20i%20:=%200;%20i%20%3C=%20currentCase;%20i++%20%7B%0A%09%09%09if%20_,%20err%20:=%20stmt.Exec(i);%20err%20!=%20nil%20%7B%0A%09%09%09%09panic(err)%0A%09%09%09%7D%0A%09%09%7D%0A%09%09tx.Commit()%0A%09%09fmt.Printf(%22Case%20%25d,%20Second%20Elapsed%20:%20%25f%5Cn%22,%20currentCase,%20time.Since(currentTime).Seconds())%0A%09%7D%0A%7D%0A

And below are the result of the query :

Count	Second Elapsed
1	0.014564
10	0.002910
100	0.015437
1000	0.108514
10000	1.013381
100000	9.599144
1000000	Cancelled Too Long
10000000	Cancelled Too Long

Well comparing it with single insert is that this one do the same thing in half the time, this is maybe because we just bypass the some part of the insertion process. We still have to pay the RTT cost of each call since we got control on each individual insert. Well the good news, is that you can now control each insertion process's success and failure based on your needs.

The Copy query

And now we are on the copy query. This one has similarity with the multi value insert query but this one uses special syntax in postgres which is called copy. Anyway here is the code used to test this :

package%20main%0A%0Aimport%20(%0A%09%22database/sql%22%0A%09%22fmt%22%0A%09%22time%22%0A%0A%09%22github.com/lib/pq%22%0A)%0A%0Afunc%20main()%20%7B%0A%09dsn%20:=%20%22REDACTED%22%0A%09db,%20err%20:=%20sql.Open(%22postgres%22,%20dsn)%0A%09if%20err%20!=%20nil%20%7B%0A%09%09panic(err)%0A%09%7D%0A%09cases%20:=%20%5B%5Dint%7B%7D%0A%09value%20:=%201%0A%09for%20i%20:=%200;%20i%20%3C%208;%20i++%20%7B%0A%09%09cases%20=%20append(cases,%20value)%0A%09%09value%20*=%2010%0A%09%7D%0A%09for%20_,%20currentCase%20:=%20range%20cases%20%7B%0A%09%09currentTime%20:=%20time.Now()%0A%09%09tx,%20err%20:=%20db.Begin()%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%0A%09%09stmt,%20err%20:=%20tx.Prepare(pq.CopyIn(%22multi_insert_bench%22,%20%22value%22))%0A%09%09if%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%0A%09%09for%20i%20:=%200;%20i%20%3C=%20currentCase;%20i++%20%7B%0A%09%09%09if%20_,%20err%20:=%20stmt.Exec(i);%20err%20!=%20nil%20%7B%0A%09%09%09%09panic(err)%0A%09%09%09%7D%0A%09%09%7D%0A%09%09if%20_,%20err%20:=%20stmt.Exec();%20err%20!=%20nil%20%7B%0A%09%09%09panic(err)%0A%09%09%7D%0A%09%09tx.Commit()%0A%09%09fmt.Printf(%22Case%20%25d,%20Second%20Elapsed%20:%20%25f%5Cn%22,%20currentCase,%20time.Since(currentTime).Seconds())%0A%09%7D%0A%7D%0A

And here is the result for it :

Count	Second Elapsed
1	0.013255
10	0.002039
100	0.001680
1000	0.009167
10000	0.051179
100000	0.433359
1000000	3.460290
10000000	60.348227

Look at that, it even beat the multi value insert, also this one doesn't require you to perform the query building yourself. But it still has the same problem with the multi value insert one which is you can't control what to do when it fails to perform insertion. If one fails, all of them fails. Also you can't do INSERT ... ON CONFLICT in copy which is another drawback. But for this performance gains, it might be worth it especially if you need all of them executed.

Remarks

Well let's get to the remarks since it's getting long here.

Use COPY when you only need guaranteed execution on all of the data or not at all and have no need for conflict resolution.
Use Multi Value Insert when you only need guaranteed execution on all of the data.
Use Prepared Statement when you need individual control of the insert operation.
Use Single Insert query when you only perform single insert.

I think that's all from me, may the knowledge you gain benefit you in the future ~

Reference

Image by Pexel : https://www.pexels.com/id-id/foto/aneka-warna-nips-718759/