I was recently developing a fact table with 24 partitions that needed to hold up to 5 billion rows of data.
Data arrives in the warehouse as files, each of which is loaded into its own staging heap table; there are several thousand files per day.
In the process of testing the load, I came across what I think was an interesting find.
For this part we are going to load one slice into the main fact table at a time. This gives us a very controllable method of consuming server resources, along with making the process as parallel as we desire. For the initial cut, I will build an SSIS package with 24 Execute SQL Tasks, each of which first selects the insert statement into a variable (this makes testing much simpler) and then flows into one of a further 24 Execute SQL Tasks that runs the insert.
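As a sketch, the statement each thread builds and runs could look like the following (the staging table, column names and slice boundaries here are placeholders, not the actual warehouse schema):

```sql
-- Hypothetical per-slice insert, one per thread.
-- The slice predicate lines up with the partition function, so each
-- thread writes to exactly one partition of the fact table.
INSERT INTO dbo.Target_table_partitioned (surrogate_key, measure_value)
SELECT surrogate_key, measure_value
FROM dbo.Staging_heap_0001
WHERE surrogate_key >= 1
  AND surrogate_key <  1000000;  -- boundaries of slice 1
```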
An example with 4 threads is shown here:
If this is a successful method then we can easily add/change the control flow to operate off a stack of slices to load into the fact table.
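One way to drive the control flow off a stack of slices (a sketch only; the queue table and status values are hypothetical) is for each thread to atomically claim the next pending slice:

```sql
-- Hypothetical work queue: each thread claims one unprocessed slice,
-- so the thread count can vary independently of the partition count.
DECLARE @slice_id int;

UPDATE TOP (1) dbo.Load_queue WITH (ROWLOCK, READPAST)
SET status    = 'IN_PROGRESS',
    @slice_id = slice_id
WHERE status = 'PENDING';
```

The READPAST hint lets a thread skip slices already locked by another thread instead of blocking on them.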
On running the first test, while I expected some drop-off when loading concurrently over 24 threads (one per partition), I noticed straight away a large, unexpected drop in loading speed.
Load times for the run of 3.2 million rows were around 17 seconds over a number of runs; given that 1 thread running on its own took 1 second, we clearly had a problem. I then decided to try running the package on the server itself. Given that we are only doing two Execute SQL statements per thread, I didn't expect this to yield much improvement. It actually resulted in a slower execution time of 26.8 seconds.
Interestingly, CPU usage was around 10% higher when this was executing on the server compared with my 2-core desktop.
I then started looking into wait stats, latches and spins, which showed very little until it came to the spinlocks. Below is a screenshot of latches, waits and spins.
It is clear from the spinlock SOS_OBJECT_STORE that something is happening that perhaps shouldn’t be. We have a spin count of over 2 billion. Other runs of the test actually showed a much higher number of over 4 billion.
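Spin counts like these can be sampled before and after a run via `sys.dm_os_spinlock_stats` (undocumented prior to SQL Server 2012, but present earlier); a minimal check looks like this:

```sql
-- Sample spinlock activity; capture before and after the load and diff
-- the numbers to see how many spins the workload itself generated.
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name = 'SOS_OBJECT_STORE';
```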
At this point I decided to research this particular type of spinlock and found very little information. However, I did find a blog post by Thomas Kejser of the SQLCAT team (http://blog.kejser.org/2011/05/30/diagnosing-and-fixing-sos_object_store-spins-for-singleton-inserts/). Although it relates to singleton inserts, following the same diagnostic path is useful.
I thought from here that xPerf would be a good tool to hint at which direction to go in, so after installing the Windows Performance Toolkit on the server, I started logging in xPerf with the command:
xperf -on latency -stackwalk profile
I then reran the workload and stopped xPerf with:
xperf -d c:\temp\my.etl
As this was the first time xPerf and xPerfViewer had been used on the server, I set the symbol and symbol cache paths and then started xPerfViewer.
After loading the file into xPerf and loading symbols, I added the CPU by Process graph to the display, chose a region where I saw high and erratic CPU, and brought up the summary table. After drilling into the SQL Server process I finally saw what the source of the spinlocks could be: very high weightings for AcquireLocksForInsert, for InsertRecordOnPage (expected, since we are actually inserting data), and for LocatePageForInsert (there could be contention here due to the locking).
The screenshot below highlights the problem functions.
In order to resolve this, I decided to review what was actually happening on the inserts, i.e. what kind of locks were being taken.
The lock DMVs showed that millions of KEY locks were being taken during the load, which certainly explains the amount of time spent in AcquireLocksForInsert.
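The exact query used isn't shown above; one common way to see the shape of the locking is to aggregate `sys.dm_tran_locks` while the load is running:

```sql
-- Count held/requested locks by type during the inserts; millions of
-- KEY locks here point at per-row locking on the clustered index.
SELECT resource_type, request_mode, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID()
GROUP BY resource_type, request_mode
ORDER BY lock_count DESC;
```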
To resolve this two actions were taken:
- Change the CLUSTERED INDEX to not allow row/page locks.
CREATE CLUSTERED INDEX ix_cu_Target_partitioned
ON Target_table_partitioned (surrogate_key)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);
- Alter the destination table to set LOCK_ESCALATION = AUTO, which according to BOL allows lock escalation to the partition level.
ALTER TABLE [dbo].[Target_table_partitioned] SET (LOCK_ESCALATION = AUTO);
Running the test again we get much faster execution:
There is also much less time spent in AcquireLocksForInsert.
This is also backed up by the locking during the inserts, where there are now only 5 PAGE locks, which escalated to HOBT (partition) locks, one for each partition.
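To confirm that the HOBT locks map one-to-one onto partitions, the lock's `resource_associated_entity_id` can be joined to `sys.partitions` (again a sketch, not the exact query from the test):

```sql
-- Each HOBT lock should resolve to a distinct partition_number of the
-- target table, confirming escalation stopped at the partition level.
SELECT p.partition_number, l.request_mode, l.resource_type
FROM sys.dm_tran_locks AS l
JOIN sys.partitions AS p
  ON p.hobt_id = l.resource_associated_entity_id
WHERE l.resource_type = 'HOBT'
  AND p.object_id = OBJECT_ID('dbo.Target_table_partitioned');
```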
This gives us an insert speed into the fact table of around 1.2 million rows/second.
Some other useful posts