There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.
Let us have a look at the facts for a moment and weigh up if this myth has any merit.
Transaction Logged Data
In traditional, relational, ACID-compliant databases that store rows in pages, we typically distinguish between two types of DML operations from a transaction logging perspective:
- Row Logged
- Allocation Logged
Row logged operations will write a transaction log entry every time a row/tuple is modified. This means that the amount of transaction log traffic generated is proportional to the number of rows touched.
Allocation logged operations (called “minimal logging” in SQL Server) only write the physical allocations to the transaction log, if at all. This means the log traffic (if any) is proportional to the size of the data touched. This typically generates at least an order of magnitude fewer log entries than row logging, and is thus faster… Or is it? Read on…
Typically, ACID databases only allow bulk style loads, index builds and large table/partition truncations and drops to be allocation logged. UPDATE and DELETE statements tend to be fully row logged.
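To make the distinction concrete, here is a small, hypothetical T-SQL sketch (the table names are made up): the bulk load below can be allocation logged when the right conditions are met, while the UPDATE is always row logged.

```sql
-- Hypothetical tables; a sketch of the two logging behaviours, not a benchmark.

-- Bulk load into the target: with the right preconditions (see below),
-- SQL Server only needs to log the page/extent allocations.
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT SalesKey, CustomerKey, Quantity
FROM   dbo.StageSales;

-- UPDATE is fully row logged: one log record per modified row,
-- so log traffic grows with the number of rows touched.
UPDATE dbo.FactSales
SET    Quantity = Quantity + 1
WHERE  CustomerKey = 42;
```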
Myth: This difference in logging makes UPDATE “bad” because the transaction log is the bottleneck.
Reality: The picture is quite a bit more nuanced than that.
First of all, INSERT in bulk mode is not always allocation logged. Typically a lot of conditions have to be true for allocation logging to work. For those of you interested in SQL Server, I have written about this extensively here: The Data Loading Performance Guide.
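As a rough illustration of those conditions (a sketch only, with placeholder names; see the guide above for the full matrix), a minimally logged load in SQL Server looks roughly like this:

```sql
-- The database must not be in the FULL recovery model for minimal logging to apply.
ALTER DATABASE MyWarehouse SET RECOVERY BULK_LOGGED;   -- or SIMPLE

-- Loading a heap with a table lock is the classic minimally logged case;
-- loading into a B-tree carries additional restrictions (empty target, ordering, etc.).
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT SalesKey, CustomerKey, Quantity
FROM   dbo.StageSales;
```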
Second, the transaction log bottleneck “wall” is wildly exaggerated. I have personally driven 750 MB/sec of log write traffic into a single database in SQL Server using a FusionIO card. I have seen colleagues do 120 MB/sec with traditional 15K spindles. True: I have also driven 3 GB/sec (around 10 TB in an hour) using allocation logged INSERT, which is faster than its row logged sibling. However, you have to ask yourself: do you need to go that fast in a single database? In an MPP system you will also have several log files that work together to scale log traffic linearly, with no practical ceiling.
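If you want to check how far away the “wall” is on your own hardware, one way (a sketch; the database name is a placeholder) is to sample the log file’s cumulative write counters before and after a load and divide by the elapsed time:

```sql
-- Cumulative write statistics for the transaction log file (file_id 2 is usually the log).
SELECT  DB_NAME(database_id)   AS database_name,
        num_of_writes,
        num_of_bytes_written,
        io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(DB_ID('MyWarehouse'), 2);
```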
Third, it is perfectly conceivable that you are running a warehouse database that does NOT need to serialize DML operations, or that may even write entire blocks directly to the database instead of the transaction log. Such systems simply do not have the above bottleneck. For example, Netezza uses a “logless” implementation, and so do many NoSQL database systems and Hadoop/Hive. MySQL with the MyISAM engine also allows non-logged operations.
INSERT/DELETE vs. in-place UPDATE
In a traditional, relational database, an update can be implemented as a transactionally wrapped INSERT of the new data followed by a DELETE of the old data (or the other way around).
Myth: Because UPDATE is an INSERT and a DELETE, it can double the amount of data that needs to be written during an UPDATE operation and will make me hit the transaction log wall even faster.
Reality: If you are not modifying the keys in an index or making a column wider, UPDATE statements can be executed as in-place modifications of the row. This allows the database to write only the old/new values of the changed data into the transaction log. This optimization can even be applied on a column-by-column basis, further reducing the transaction log footprint.
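As a hypothetical illustration (made-up table and column names), the deciding factor is whether the modified columns can grow:

```sql
-- Fixed-width column, no index key touched: the row cannot grow,
-- so the engine can modify it in place and log only the change.
UPDATE dbo.FactSales
SET    CustomerKey = 42        -- INT stays 4 bytes wide
WHERE  SalesKey = 1001;

-- Widening a variable-width column: the row may no longer fit where it is,
-- which can force the engine to move it (effectively a delete plus an insert).
UPDATE dbo.DimCustomer
SET    CustomerName = REPLICATE('x', 500)
WHERE  CustomerKey = 42;
```

This leads us to: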
UPDATE vs. INSERT speed
This myth is an amalgam of the above arguments.
Myth: INSERT of row logged data is faster than UPDATE of row logged data
Reality: Let us first settle one thing which I will type on its own line and in red to make it easy to remember:
An in-place, row logged UPDATE operation on a non-compressed page is faster than a row logged INSERT of the same data.
Why is this? Because the INSERT operation has to allocate new physical structures in the index, while the UPDATE can simply reuse database pages without having to allocate more space.
And here are the numbers to prove it, from running INSERT vs. UPDATE of a large dataset in SQL Server (smaller is faster):
True: if your UPDATE statement has to do the INSERT/DELETE trick, it will likely be slower. But if you are NOT changing the row size and you get the in-place UPDATE, it might just be FASTER to run an UPDATE than an INSERT.
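If you want to try this on your own data, a minimal sketch of the comparison (placeholder tables; results will depend on your schema, recovery model and hardware) looks like this:

```sql
SET STATISTICS TIME ON;

-- Row logged INSERT of the rows.
INSERT INTO dbo.TargetTable (Id, Amount)
SELECT Id, Amount
FROM   dbo.SourceRows;

-- In-place UPDATE of the same rows: fixed-width column, no key change, no row growth.
UPDATE dbo.TargetTable
SET    Amount = Amount + 1;

SET STATISTICS TIME OFF;
```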
Also true: compression can change the game quite significantly depending on the compression algorithm you use for the table structure. This is again implementation dependent.
Summary
The myth about UPDATE statements being bad or slow is too simple a way to look at this crucial DML operation. In fact, the myth is outright false in some cases.
Avoiding UPDATE statements should not generally be a major driver for design guidance, or be used as the basis for drawing conclusions about the data modeling techniques you should apply. The picture is quite a bit more nuanced than this.
First of all, the speed of UPDATE statements as compared to bulk inserts will depend on the database engine you run on.
Second, we have seen that even when UPDATE is fully row logged, the transaction log “wall” is very far away on proper hardware and not much of a concern to 99% of all the installations out there. There are of course cases where you will hit the “wall”, but those are largely mitigated in MPP systems or other sharded deployments that have more than one transaction log.
Third, there are cases where UPDATE statements are actually faster than (row logged) INSERT statements. These typically occur when you change columns in such a way that they don’t grow larger than they already are, allowing in-place UPDATE operations. An interesting and highly relevant example of such a case is UPDATE statements that target fact table keys – which (if you follow my guidance) are integers and therefore have constant width.
Bonus Exercise and chance to win (for SQL Server people): Here is an interesting experiment. In theory, it might be possible to create a workload where you UPDATE a very wide table and where the equivalent, minimally logged, super optimized “copy to new table” BULK INSERT or SELECT INTO statement is actually slower than the UPDATE. Where is the crossover point on table width? I will offer a free, one-hour teaching session (database subject of your choice, you host me, I bring the coffee, London City area) to the first person who can provide a test script and the data to show this crossover point. Alternatively, the prize can also be claimed if you can conclusively prove that the crossover point does not exist.
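To get you started, a bare-bones skeleton of the experiment might look like this (a sketch only: table name, column widths and row counts are placeholders you would have to choose yourself):

```sql
-- Candidate 1: in-place UPDATE of a single fixed-width column on a very wide table.
UPDATE dbo.WideTable
SET    Flag = 1;

-- Candidate 2: the minimally logged "copy to a new table" alternative,
-- rewriting every column and applying the change on the way
-- (requires SIMPLE or BULK_LOGGED recovery for minimal logging).
SELECT  WideCol001, WideCol002, /* ...all the other wide columns... */
        1 AS Flag
INTO    dbo.WideTable_Copy
FROM    dbo.WideTable;
```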
More myth busting on the DW and Big Data page.