Analysis Services Partition Creation TABULAR and MOLAP

February 6, 2014

Something I've noticed over the past few years is that many people are unsure how many partitions to create within Analysis Services. That in itself is a long conversation, but it should be a balance between:

  • What is a logical sub division of the data
  • How much parallelism you would like when loading the cube (not yet applicable in the case of TABULAR cubes)
  • How many partitions a query is likely to hit
  • How long it actually takes to create/alter/delete them.

I believe it's often the case that people take the first into account, but give less consideration to the items further down the list. To this end, I've run a few tests against a MOLAP and a TABULAR Adventure Works cube to record the timings for creating partitions through AMO and also through XMLA, and have a few interesting observations.

UPDATE: For the purposes of this post, I am reusing some existing code that I have so the single partitions were created in AMO and the batch in XMLA

Tests

  1. Create up to 1000 partitions using AMO
  2. Create 50, 100, 200, 400, 600, 800, 1000 partitions in batches using XMLA

Test 1

As mentioned already, tests were carried out on the TABULAR and MOLAP versions of the Adventure Works cube. This test was coded as a C# script in a loop. A connection was opened and maintained, and then up to 1000 individual partitions were created.

Between each test all partitions were removed so the starting point was always the same.
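For reference, here is a minimal sketch of that loop in AMO. The Adventure Works object names and the query binding are illustrative, and the real test reused existing code, so treat this as indicative only:

// Minimal sketch of the single-partition test loop (AMO).
// Object names and the query binding are illustrative.
using Microsoft.AnalysisServices;

Server svr = new Server();
svr.Connect(serverConnectionString); // connection opened once and maintained

Database db = svr.Databases["Adventure Works DW"];
MeasureGroup mg = db.Cubes["Adventure Works"].MeasureGroups["Internet Sales"];

for (int i = 1; i <= 1000; i++)
{
    Partition p = mg.Partitions.Add("Internet_Sales_" + i);
    p.StorageMode = StorageMode.Molap;
    p.Source = new QueryBinding(db.DataSources[0].ID,
        "SELECT * FROM FactInternetSales WHERE SliceKey = " + i);
    p.Update(); // one server round trip per partition -- this is what was timed
}

svr.Disconnect();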

Below are the results of the test plotted to compare the TABULAR creation times versus the MOLAP creation times.

[Chart: per-partition creation time in ms, TABULAR vs MOLAP]

As can be seen, the 1000th TABULAR partition took 3101ms to create whereas the 1000th MOLAP partition took 297ms. It is very clear that creating partitions in TABULAR is far more expensive than in MOLAP.

 

Another perspective on these results is to look at the cumulative time.

[Chart: cumulative partition creation time, TABULAR vs MOLAP]

So, the cumulative time to create 1000 TABULAR partitions is 1643 seconds (27 minutes and 23 seconds) and in MOLAP it took 221 seconds (3 minutes and 41 seconds).

The TABULAR timing is obviously very long, but given that a major advantage MOLAP partitioning has over TABULAR is parallel processing, we probably wouldn't consider creating so many partitions to begin with. Having said that, the MOLAP time isn't particularly good either.

Test 2

For this test, I again used a C# script task in SSIS to build up various sized XMLA strings to batch-create a number of partitions. Again, all partitions that were created were removed before the next batch.
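A rough sketch of the approach: one XMLA Batch containing N Create commands, executed in a single call. The object IDs are illustrative, and a usable partition definition would also need its Source binding.

// Build one XMLA Batch of N Create commands and execute it in a single call.
// svr is a connected Microsoft.AnalysisServices.Server; object IDs illustrative.
using System.Text;
using Microsoft.AnalysisServices;

var xmla = new StringBuilder();
xmla.Append("<Batch xmlns=\"http://schemas.microsoft.com/analysisservices/2003/engine\" Transaction=\"true\">");

for (int i = 1; i <= partitionCount; i++)
{
    xmla.Append("<Create>");
    xmla.Append("<ParentObject>");
    xmla.Append("<DatabaseID>Adventure Works DW</DatabaseID>");
    xmla.Append("<CubeID>Adventure Works</CubeID>");
    xmla.Append("<MeasureGroupID>Internet Sales</MeasureGroupID>");
    xmla.Append("</ParentObject>");
    xmla.AppendFormat(
        "<ObjectDefinition><Partition><ID>P{0}</ID><Name>P{0}</Name></Partition></ObjectDefinition>", i);
    xmla.Append("</Create>");
}

xmla.Append("</Batch>");
XmlaResultCollection results = svr.Execute(xmla.ToString());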

Below is a chart representing the timings for each of the batches in seconds.

[Chart: batch partition creation times in seconds, TABULAR vs MOLAP]

 

For alternative consumption, here is the above chart as a table:

[Table: batch partition creation timings]

 

So even using XMLA batches, MOLAP is considerably faster. For those of you who suffered the pain of having to create many partitions back in SSAS 2005, where the algorithm timing was considerably worse, the TABULAR partition creation times will seem all too familiar. It seems that we aren't yet seeing in 2012 TABULAR the benefits of those changes that we currently see in MOLAP.

Taking a closer look at the difference in time between creating the TABULAR and MOLAP partitions, it seems that when creating a TABULAR partition there is an additional event, "Metadata Load", which consumes some time, but the main difference is during the commit phase, where adjustments are made to every existing partition in that measure group (table).

 

UPDATE: Thanks to Gerhard for the reminder that it is also possible to do a batch transaction in AMO. This is just as fast as creating the XMLA, but I think a lot easier to read. I won't paste all of the AMO code to do this, but a snippet is below:

Server svr = new Server();
svr.Connect(serverConnectionString);
svr.CaptureXml = true;

//connected to the cube, now create all partitions as usual, including the .Update() calls
//nothing is sent to the server yet -- it is all captured in the log

//when all done, execute the captured log against the server connection
//the second parameter MUST be false, i.e. not parallel, for creating partitions
XmlaResultCollection executeResults = svr.ExecuteCaptureLog(true, false);

foreach (XmlaResult result in executeResults)
{
    foreach (XmlaMessage message in result.Messages)
    {
        MessageBox.Show(message.Description);

        if (message is XmlaError)
        {
            //do something with this error, probably a batch failure
        }
    }
}

 

Summary

If you need to create multiple partitions then do it in a batch in XMLA.

TABULAR is considerably slower than MOLAP at present, but I expect this will change when partitioning becomes more useful in TABULAR.

There are many advantages to partitioning, but there are also costs, and the metadata cost with both TABULAR and MOLAP is a significant one that can certainly slow things down, especially in the commit phase!

 

Chris

Analysis Services Testing Best Practices

November 20, 2012

An important but often forgotten part of building Analysis Services cubes is performance testing. Yes, we frequently test that queries return the correct results, but generally not enough performance and regression testing is done.

Relatively briefly I will try to outline some good practices for this.

Ingredients

  1. A real set of production queries:
    1. Identify if these queries are from different applications that query the cube and split them up as well as create a “mix” set which includes a sample of all
    2. Make sure that when you run these queries they will actually “hit” data in your cube. No point having many useful queries if they are for a date which isn’t in your cube.
  2. What concurrency, i.e. how many users, you want to test with
    1. Pick REALISTIC numbers. Don’t say 1000 if the most you will ever have running on the cube is 20. This will likely be a waste of your time
    2. If you do think you will have higher concurrency, then choose intermediate numbers. So if max concurrency will be 1000, try testing for 10, 100, 400, 1000, etc. This will give you an indication of how your cube performs as the concurrency increases for your different application queries.
  3. A query test tool
    1. This could be something like the ASCMD_StressTestingScripts on Analysis Services codeplex project here (Some more info by Stacia Misner here also on ASCMD usage)
    2. Build your own. This is not a simple undertaking, but if it's for a long-term project then it may well be worth it in the long run (a minimal sketch follows this list).
  4. A Perfmon collector Set
    1. This is in case you see any surprises in the testing results. Here is a minimum sample of what should be included.

    \LogicalDisk(*)\*
    \MSOLAP$MSAS11:Memory\*
    \MSOLAP$MSAS11:Storage Engine Query\*
    \MSOLAP$MSAS11:Threads\*
    \Process(msmdsrv)\% Privileged Time
    \Process(msmdsrv)\% Processor Time
    \Process(msmdsrv)\% User Time
    \Process(*)\% Privileged Time
    \Process(*)\% Processor Time
    \Process(*)\% User Time
    \Processor(_Total)\% Privileged Time
    \Processor(_Total)\% Processor Time
    \Processor(_Total)\% User Time
    \System\Context Switches/sec

    You may also want to think about the network, just to check you aren't getting slow network response times or maxing out the network connection (on both the query server and the SSAS server).

  5. Caching
    1. Test on a cold cache and a warm cache. I would recommend a couple of runs of each; I prefer cold, warm, cold, warm for practical reasons. To clear the cache you can use a couple of methods (a ClearCache sketch follows this list):
      1. The ASSP codeplex project, which contains assp.ClearAllCaches(). This clears all SSAS caches as well as the file system cache
      2. RAMMAP. To use this:
        1. Open RAMMAP. It can be left open during testing, and will take quite a long time (around 5 minutes) to open on a system with lots of memory (1/2 TB)
        2. Stop the SSAS service
        3. On the Empty menu, choose "Empty System Working Set", then "Empty Modified Page List", then "Empty Standby List"
        4. Start the SSAS service
  6. Plan your Tests before starting
    1. I find Excel pretty good for logging all the tests, and you can generate the test names as you go. For example, include the application the queries are from, number of queries, concurrency, warm/cold cache, run number, etc.
    2. Testing can actually be quite boring (surprise!), so it can be easy to lose concentration, especially considering a run can take quite a long time depending on the number of tests, queries, cube size, etc.
    3. Name your tests and make sure to keep your output from the tests organised. It is quite annoying if you complete your tests and realise you've lost track of where you were and have to start again.
  7. Test Results
    1. Consider not including the top and bottom 1% of your query timings. You may find that there is one query that just continues to run at the end and skews the test time.
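As promised above, here is a minimal sketch of a home-grown query runner using the ADOMD.NET client (Microsoft.AnalysisServices.AdomdClient), one simulated user per thread. The connection string and query set are whatever you captured in ingredient 1; everything else is illustrative.

// Minimal sketch of a DIY query test tool: each simulated user replays the
// captured query set on its own connection and logs each duration.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.AnalysisServices.AdomdClient;

static void RunUser(int user, string connStr, string[] queries)
{
    using (var conn = new AdomdConnection(connStr))
    {
        conn.Open();
        foreach (string mdx in queries)
        {
            var sw = Stopwatch.StartNew();
            using (var cmd = new AdomdCommand(mdx, conn))
            using (var rdr = cmd.ExecuteReader())
            {
                while (rdr.Read()) { } // drain the result
            }
            Console.WriteLine("user {0}: {1} ms", user, sw.ElapsedMilliseconds);
        }
    }
}

// e.g. 10 concurrent users replaying the same captured set:
// Parallel.For(0, 10, u => RunUser(u, connectionString, capturedQueries));

And for the SSAS side of cache clearing, independently of ASSP you can send the engine's ClearCache XMLA command yourself through a connected AMO Server object (database ID illustrative; note this does not touch the file system cache, hence the RAMMAP steps above):

// Clear the Analysis Services cache for one database by sending the raw
// XMLA ClearCache command (svr is a connected Microsoft.AnalysisServices.Server).
string clearCache =
    "<ClearCache xmlns=\"http://schemas.microsoft.com/analysisservices/2003/engine\">" +
    "<Object><DatabaseID>Adventure Works DW</DatabaseID></Object>" +
    "</ClearCache>";
svr.Execute(clearCache);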

Practicalities

When running your tests, make sure you have a strict pattern to follow. For example:

  • Update test log to indicate current test in progress
  • Create folder for current test
  • Clear SSAS cache
  • Start Perfmon
  • Start test and wait to complete
  • Stop Perfmon, rename to test name
  • Copy perfmon and query results to the folder for the current test
  • Update log to say the test is completed
  • Go back to start of pattern..

As you can see, having something like this can really help reduce the number of mistakes during testing, and there will be some regardless of how well you keep track.

The above testing pattern can be used along with the SSAS performance/operations guide to really tune SSAS for your environment, or just to give you confidence that the changes you've made aren't going to fall off a cliff as soon as people start querying the cube post release.

Last but certainly not least:

BACKUP your test results!

Make sure that no one else is running queries on the server at the same time!

Chris

SQL 2012 Configuration Tuning

November 11, 2012

I have been working for some time now on a very large Analysis Services 2008R2 MOLAP cube. We have been looking to upgrade this to 2012 and, a few weeks before it got into production, spotted some "interesting" behaviour. So I thought it best to write up some of the symptoms and how to diagnose and potentially resolve them.

The server itself is pretty beefy. It's an HP DL580-G7 which has four CPUs with 10 cores each clocked at 2.3GHz, 512GB of memory, and some very fast Fusion IO storage. Hyper-Threading is turned off; if it were ON it would double the number of logical processors and cause 2 processor groups to be created, each with 2 NUMA nodes, and each NUMA node would have 20 processors.

I have tried to summarise below the CPU layout of our HT OFF system.

[Diagram: CPU layout, 4 NUMA nodes with 10 cores each, HT off]

Hopefully people are aware that as part of SSAS 2012 some changes were made to the thread pools; of specific interest is the new IO Processing Pool, which is now a separate pool (or pools) from the Processing pool.

I suggest at this point you go here to have a read of a few things:

Welcome back! (if you left)

So now you can see that the changes to introduce the IO thread pool, and then to actually split it into multiple subpools, have been done to help minimise costly cross-NUMA access on the server, where accessing file data from a different CPU's memory is an expensive operation.

I realise that I have been quite brief in the above statement, and hopefully Wayne will give more info on this in his upcoming post.

Symptoms

So it was noticed during querying of the cube (processing was fine and VERY fast) that there was some very high kernel/privileged time, and also some very slow query response speeds. Below find a gory picture of this in action:

[Screenshot: perfmon showing very high privileged time during querying]

Yep, nearly makes you want to cry :(

This was observed on 2012 RTM CU2. It was also noticed that as the cache got warmed, less of this behaviour was observed, but as a rule of thumb I would say that we should be seeing on average less than 20% privileged time (there is a good, basic introduction to this at http://www.codinghorror.com/blog/2008/01/understanding-user-and-kernel-mode.html; please just read up to section 2, User Mode). That figure is just my opinion, but in my experience, when things have been fast, less than 20% on average was observed.

In addition to this, there has been some extensive tuning done on the configuration file for this system: for the types of queries, the number of concurrent users, processing, and index and aggregation builds, taking into account the processor layout and, last but not least, the file system, which is pretty fast!

So, you will have to check all this for YOUR system for the queries and concurrency that you experience. I expect that for most SSAS cubes, the 2012 out of the box config will cope very well with most cases, but for larger cubes on multi NUMA systems, you will have to check everything is behaving well and to your expectation. This is a good investment of time as most likely the servers you are working on will have cost a LOT of money!

Diagnose

At this point if you haven’t read my post on testing mentioned earlier, it can be found here (Sorry for sending you all over the place, but hopefully it will be worth it)

As mentioned in the testing blog, you should be monitoring in perfmon at least the following counters

\LogicalDisk(*)\*
\MSOLAP$IAS01DEV2012:Memory\Memory Usage KB
\MSOLAP$IAS01DEV2012:Threads\*
\Process(msmdsrv)\% Privileged Time
\Process(msmdsrv)\% Processor Time
\Process(msmdsrv)\% User Time
\Process(*)\% Privileged Time
\Process(*)\% Processor Time
\Process(*)\% User Time
\Processor(_Total)\% Privileged Time
\Processor(_Total)\% Processor Time
\Processor(_Total)\% User Time
\System\Context Switches/sec

For your testing, you should aim to run on both a clear cache and a warm cache. A clear cache in this case means a clear Analysis Services cache and also a clear file system cache.

What you are looking for here is the proportion of "\Process(msmdsrv)\% Privileged Time" relative to "\Process(msmdsrv)\% Processor Time", and for the test to be meaningful you need to be pushing the server, so ideally "\Process(*)\% Processor Time" should be > 50%.

Seeing a "high" proportion here means you might be experiencing the behaviour. You should also check "\System\Context Switches/sec"; if this is high too, it might also be an indicator. The other indicator is the actual query performance itself: is it in line with your expectations?
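If you want to spot-check that ratio from code rather than eyeballing the perfmon chart, here is a rough sketch using the standard System.Diagnostics counters (five-second sample; the instance name assumes a single msmdsrv process):

// Sample msmdsrv privileged vs total processor time and print the proportion.
using System;
using System.Diagnostics;
using System.Threading;

var priv  = new PerformanceCounter("Process", "% Privileged Time", "msmdsrv");
var total = new PerformanceCounter("Process", "% Processor Time",  "msmdsrv");
priv.NextValue(); total.NextValue();   // the first sample always reads 0
Thread.Sleep(5000);                    // sample window
float p = priv.NextValue(), t = total.NextValue();
Console.WriteLine("Privileged share: {0:P0}", t > 0 ? p / t : 0);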

What you can do to help

Well, I will start with the bad news. If you are pre-SP1 then there isn't a huge amount you can do to maintain full use of the server. You can try to affinitize the IO processing pool across the first two NUMA nodes, and you may see that there is less contention. Here is an example of the msmdsrv.ini ThreadPool \ IOProcess settings for when Hyper-Threading is turned OFF and we affinitize across the first two NUMA nodes:

<IOProcess>
    <MinThreads>-20</MinThreads>
    <MaxThreads>-20</MaxThreads>
    <PriorityRatio>2</PriorityRatio>
    <Concurrency>2</Concurrency>
    <StackSizeKB>0</StackSizeKB>
    <GroupAffinity>0x00000000000FFFFF</GroupAffinity>
    <PerNumaNode>1</PerNumaNode>
</IOProcess>

Also change RandomFileAccessMode to 1:

<RandomFileAccessMode>1</RandomFileAccessMode>

Quoting from MSDN

"Do not change this property except under the guidance of Microsoft Support. For more information, see Understanding GroupAffinity Server Properties in this topic." Obviously, that doesn't stop you doing it, but we should follow the guidelines.

This got us into a position where half the server was being used for querying and processing. Pretty far from ideal, but at least it got us back to a working system.

2012 SP1

Now that SP1 is publicly available there is good news! With the correct settings we were able to get MUCH nicer behaviour and some very good query throughput, as per this screenshot:

[Screenshot: perfmon showing low privileged time and healthy query throughput]

Much nicer, eh!

The config we settled on was:

80 query threads
16 short/long parsing
392 processing pool threads
pernumanode -2
randomfileaccessmode 1
Affinitise across all cores
Hyper Threading OFF

Throughput for our system went from 1000 queries using 10 threads taking nearly 12 minutes on a cold cache to 2 ½ minutes on a cold cache. So you can see there, that this really can make a huge difference.

To get to this point, we changed one thing at a time: the number of threads for the Query, short parsing, long parsing and IO thread pools, as well as the RandomFileAccessMode and PerNumaNode settings. This generates a LOT of tests. We also tested using a variety of sets of queries, and at different concurrency (numbers of users).

The key points here are:

  • Decide on the queries for testing
  • Decide on what configuration you are going to try for each test, changing ONLY ONE thing at a time.
  • Decide on the number of users you are going to test with (10, 100, 1000)..
  • Be VERY consistent in warm cache and cold cache tests
  • Name your tests before you run them and keep all your perfmon files and test run times together.
  • Possibly think about excluding the top and bottom 1% of the queries from the results to remove any random skew.

And a last point for now: more threads doesn't mean it will go faster, as you may find you overload the server.

So hopefully this will all be of use to people. I would be happy to hear any feedback, and I would be very interested to hear from people who are running SSAS 2012 with Hyper-Threading ON whether they found it better than HT OFF. I am sure the product team would be happy to hear this also, and I will make sure to pass it on.

Finally, a very big thank you for the help from the SSAS Product Team, in particular Akshai. Also, to CSS team who were part of the days and days of testing and reviewers of many xperf and perfmon traces :)

Chris

Analysis Services 2012 Configuration settings

November 11, 2012

Having recently moved from 2008R2 to 2012, I thought I would take the opportunity to write up (in hopefully simple terms) some of the config settings that are interesting. The following information may only be useful for people running SSAS on Machines with more than 2 NUMA nodes, but it may be of interest anyway.

As usual with these things, you should only really change them on the advice of Microsoft support. You can of course change them yourself, but please test as much as you possibly can before you think about promoting past your development environment.

SSAS Config/Log File

Hopefully people reading this will be aware that Analysis Services has a configuration file, normally located in the installation directory:

C:\Program Files\Microsoft SQL Server\MSAS11.MSSQLSERVER\OLAP\Config\msmdsrv.ini

For the settings I will discuss below please always check the Analysis Services Log file, which tries to output the running settings. This is by default located in the installation directory:

C:\Program Files\Microsoft SQL Server\MSAS11.MSSQLSERVER\OLAP\Log\msmdsrv.log

Remember to check the end of the log file for the settings written since the most recent startup.

Background

As part of SSAS 2012, there have been quite a few under-the-covers changes to remove some of the known bottlenecks when running Analysis Services 2008R2 and below on machines with 4 NUMA nodes or more.

Please have a read of this presentation -> Akshai’s Tuning AS for High End Hardware Presentation (specifically page 22) for a bit of background.

One area that has changed is the separation of the Processing Thread Pool and the IO Processing Thread Pool. In brief, the Processing Thread Pool is for handling processing-related jobs, file-related jobs (create/delete) and ROLAP, whereas the IO Processing Thread Pool is for reading data from the files.

I suggest at this point you go here to have a read of Wayne’s post on more of the changes with 2012 and come back! CSS: Wayne Robertson – Analysis Services Thread Pool Changes in SQL Server 2012

I will go through some of the settings such as PerNumaNode, GroupAffinity, MinThreads/MaxThreads and RandomFileAccessMode. I will discuss their known input values, what that means and their output values. I will try to include examples and will base everything on a 16 core 4 NUMA node system. (I have no idea if this exists, but using a real 4 NUMA 40 Core machine would probably be unnecessary). I will not be discussing here what happens internally, as I will leave that for Wayne’s current post and potentially a new one out in the coming weeks.

GroupAffinity (DEFAULT: BLANK)
There is a GroupAffinity tag for all of the thread pools, and it simply controls which processors the threads in that pool can run on. It is in HEX format, so a value of 0xF translates to binary 1111, which maps to processors 3, 2, 1, 0.

An Example.
For us to create an affinity mask across all of our processors we would need to translate 16 binary 1s into HEX (the leftmost 1 maps to processor number 15 and the rightmost to processor number 0):

1111 1111 1111 1111
0XFFFF

If we wanted to just use processors 4, 5, 6, 7, 12, 13, 14:

0111 0000 1111 0000
0X70F0

This is a very contrived example, and please don't think about using it in real life :)
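If you'd rather not do the binary-to-hex translation by hand, the arithmetic is trivial in code; this little helper reproduces the contrived example above:

// Build a GroupAffinity hex mask from a list of logical processor numbers
// (bit n of the mask corresponds to processor n).
ulong mask = 0;
foreach (int cpu in new[] { 4, 5, 6, 7, 12, 13, 14 })
    mask |= 1UL << cpu;
Console.WriteLine("0x{0:X4}", mask); // prints 0x70F0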

By default the SSAS config file specifies no affinity mask for any of the pools (equivalent to my first example), which means that the pools run across ALL cores on the machine (providing you haven't tweaked any Windows Server settings).

If you have a server that has more than 64 logical processors then Windows splits them into processor groups with an equal number of processors and NUMA nodes in each. The affinity mask for this is specified comma-separated. So if we take as an example a server with 80 cores and 4 NUMA nodes, this would be represented in Windows as 2 processor groups, each with 2 NUMA nodes, and each NUMA node would have 20 processors. The FULL affinity mask for this would be:

0XFFFFFFFFFF,0XFFFFFFFFFF

Ok, so hopefully that is Affinity Masks explained.

PerNumaNode (DEFAULT: 0)
This relates to the IO processing thread pool and can have values of -1, 0, 1 and 2 (2 is NEW in SP1, which has just been released: SQL Server 2012 SP1). This setting allows us to create one or more SUBpools across the processors Analysis Services is affinitized to.

-1 : will create an IO processing subpool for each of the NUMA nodes. You should only use -1 when you have a system with 4 NUMA nodes and the affinity mask is BLANK or covers ALL processors.

0 : SSAS will create one IO processing pool (this means we go back to SSAS 2008 R2 and earlier behaviour) across all the processors specified in the group affinity mask, where BLANK means all processors.

1 : will create an IO processing subpool for each NUMA node. With an affinity mask of

0xFFF0

note that processors 0, 1, 2, 3 don't have an IO subpool, as they are excluded from the mask.

2 : will create an IO processing subpool for each LOGICAL processor.

MinThreads and MaxThreads (DEFAULT: 0)
These are SIGNED integers.

Defaults

The default value of 0 uses some internal algorithms to try and “figure out” the best number of threads for your system, and usually does a pretty good job.

Query Thread Pool = number of CPUs (16 on our example system) * 2 = 32

Short/Long Parsing Thread Pools = number of CPUs / 4 = 4

Processing Thread Pool = number of CPUs * 8 = 128

IO Processing Thread Pool = number of CPUs * 10 = 160

Possible Values

All examples assume affinity mask across ALL processors

Firstly, for the NON-IO processing pool threads:

When > 0, it specifies the total number of threads you would like.

When = 0, SSAS will decide the number of threads based on the number of CPUs (see defaults above).

Now, for the IO Processing Pool threads:

When > 0, it specifies the total number of threads you would like.

When = 0, SSAS will decide the number of threads based on the number of CPUs (see defaults above).

When < 0, it represents the number of threads per IO processing subpool.

Below is a table of some examples, so please try to follow through the examples to fully understand. Please pay particular attention to the PerNumaNode setting, as it is actually the combination of Threads, PerNumaNode and GroupAffinity that decides how many (sub)pools are created, across which CPUs, and then the number of threads per (sub)pool.

[Table: example combinations of MaxThreads, PerNumaNode and GroupAffinity]

I think for PerNumaNode 0, -1 and 1 it should be fairly simple to work out, but for PerNumaNode 2 I will put the examples into words. Also, for the last row in the table, the total threads is 192 as there isn't a multiple on this machine that goes into 160.

Let's start with MaxThreads = 160. For this we are saying we want a maximum of 160 threads; we have an affinity across all CPUs, there are 16 CPUs and therefore 16 subpools. So it is simply 160 / 16 = 10 threads per subpool.

For MaxThreads = -3, we are saying that we want, per subpool (in this case per CPU), 3 threads * the number of CPUs per NUMA node. So 3 * 4 = 12 threads per CPU, making a total of 192.
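The same arithmetic in code form for our hypothetical 16-core, 4-NUMA-node machine, defaults first and then the two PerNumaNode = 2 cases just described:

// Worked arithmetic for the hypothetical 16-core / 4-NUMA-node machine.
int cpus = 16, cpusPerNumaNode = 4;

// Defaults (MinThreads = MaxThreads = 0):
Console.WriteLine(cpus * 2);  // Query pool         = 32
Console.WriteLine(cpus / 4);  // Short/long parsing = 4
Console.WriteLine(cpus * 8);  // Processing pool    = 128
Console.WriteLine(cpus * 10); // IO processing pool = 160

// PerNumaNode = 2 means one IO subpool per logical processor (16 subpools):
int maxThreads = 160;                             // positive: total across subpools
Console.WriteLine(maxThreads / cpus);             // 10 threads per subpool
maxThreads = -3;                                  // negative: per-subpool multiplier
Console.WriteLine(-maxThreads * cpusPerNumaNode); // 12 per subpool, 16 * 12 = 192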

RandomFileAccessMode

This is already mentioned in Wayne's blog as per the link above. It simply "changes the file cache access behaviour".

Chris

Non Materialized Reference Dimensions, ProcessUpdate and Indexes

(Sorry in advance for the catchy title :))

I recently had a requirement to dynamically (some processing allowed) make certain fact data appear depending on whether a product was "active" or not.

One potential design for this is to make use of a default member of is_active = Y on the product dimension.  The downside to this would be having to run a ProcessUpdate on the product dimension and then having to update the indexes on the related fact tables.

This might be OK if you have a small fact table, but for now let's assume we have a few billion rows of fact data per day, just to discount updating all of the indexes as a plausible idea.

So, another design is to use the default member again, but on a non-materialized reference dimension.

Simple design illustrated below:

[Diagram: simple design with the Ref Product non-materialized reference dimension]

So to test, we build out the database with 20,000 rows in each table for simplicity (and also to move us past the default index build threshold of 4096 rows).

Now we design and build the cube. The only two things to note here are the reference dimension setup:

[Screenshot: reference dimension relationship settings, Materialize unchecked]

and the default member on the Ref Product dimension:

[Screenshot: Ref Product default member definition]

Now, after we have deployed the cube, we can test to see if the indexes are built on the fact for the non-materialized reference dimension attributes and, importantly, whether they are used.

So, looking at one of the partition directories, we can see that Ref Product.* files are present. This is a bit of a surprise.

[Screenshot: partition directory listing showing Ref Product.* index files]

But since SSAS has gone to the trouble of building them, it should probably make use of them.

Now, let's check using procmon whether a query uses them:

[Screenshot: procmon trace of file access during the query]

And we can see that the files aren't touched at all! So it built them, but didn't use them!

OK, so now let us try to get it to not build them, by setting the "AttributeHierarchyOptimizedState" to "NotOptimized" for both of the Ref Product attributes and redeploying the cube. (To be thorough, we delete the existing cube first.)

[Screenshot: AttributeHierarchyOptimizedState set to NotOptimized]

Now we can see that the fact indexes for Ref Product aren't built. Running the query again and monitoring through procmon, needless to say, the indexes aren't used.

So, ALL GOOD! Well, not quite. We now have to test that if we make a product is_active = N and run a ProcessUpdate, we have minimal impact on processing time, i.e. ONLY the Ref Product dimension should be touched.

Firstly, to make sure it does something, mark all but one ref product as inactive.

Checking the file versions before we run the ProcessUpdate, they are all set to version 1, as this is a brand new cube.

Here is the info.xml file from one of the fact partitions:

[Screenshot: partition info.xml showing index version 1]

(Just confirming we are all at version 1 of the indexes.)
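For reference, here is a sketch of issuing that ProcessUpdate through AMO; the database and dimension names are illustrative, and ProcessAffectedObjects is false by default:

// Sketch: ProcessUpdate on JUST the Ref Product dimension via AMO.
using Microsoft.AnalysisServices;

Server svr = new Server();
svr.Connect(serverConnectionString);

// Affected objects (the cube and its partitions) are not explicitly reprocessed.
Dimension dim = svr.Databases["RefDimTest"].Dimensions["Ref Product"];
dim.Process(ProcessType.ProcessUpdate);

svr.Disconnect();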

So after running a ProcessUpdate ONLY on the Ref Product dimension (process affected objects false), I can see that the index versions on the Ref Product dimension are updated to version 2 (expected), BUT I can also see that the partition indexes have been updated to version 2!!

[Screenshot: Ref Product dimension files at version 2]

and from the partition info.xml file:

[Screenshot: partition info.xml showing index version 2]

This is not good.

To put some perspective on this: if the fact indexes and aggregations per partition are around 500MB and we have more than 4,000 partitions, it will copy the 500MB for each of the partitions under the new index version number and then delete all of the old version files.

That is copying 1.9TB of data and then deleting the old versions, without actually doing anything with it at all apart from changing the file version number.

So in Summary

  1. Indexes are built for non-materialized reference dimensions, but they aren't used.
  2. Even when the indexes are not built (AttributeHierarchyOptimizedState = NotOptimized), a ProcessUpdate on just that dimension causes the partition index versions to be updated.

Connect item for this can be found here

https://connect.microsoft.com/SQLServer/feedback/details/753153/non-materialized-reference-processupdate-and-indexs

PLEASE VOTE if you agree

Alberto Ferrari has also written something on reference dimensions here http://sqlblog.com/blogs/alberto_ferrari/archive/2009/02/25/ssas-reference-materialized-dimension-might-produce-incorrect-results.aspx

Management Studio Reporting wrong connection

November 16, 2011

I was recently on the wrong end of discovering a bug in Management Studio, where the server name in the status bar and the tab hover-over was reporting one server, but I was actually connected to a different server.

It doesn't seem that this is fixed yet, but I have yet to upgrade to SP1 for 2008 R2, so I will update here if that corrects the problem.

Below are the steps for this as a repro.

[Screenshot: query window connected to server 0019]

Click the "New Query" button on the menu bar and, AS SOON AS the new window is opening, right-click.

This will bring up the context menu, where you can change the connection:

[Screenshot: query window right-click context menu]

As you could see in the first screen grab, I was connected to 0019. Now I have chosen 0218.

[Screenshot: connection changed to server 0218]

Now, in that window, I run SELECT @@SERVERNAME again, and I fully expect to see 0218. Correct?

[Screenshot: SELECT @@SERVERNAME returning 0019]

WRONG!

I see that although SSMS reports in the status bar that I am connected to 0218 as expected, I am actually connected to 0019.

Unfortunately, I discovered this bug when an update was run against a production server instead of DEV (yes, the debate about privileges in production is for another day).

Jamie Thomson has identified a good working practice here of using different colours for the status bar. Please note that when I tested this, the bug reproduced above still existed.

http://sqlblog.com/blogs/jamie_thomson/archive/2011/08/11/query-window-custom-colouring-now-works-properly-ssms.aspx

Partitioned Table Spinlocks

September 27, 2011

I was recently working on the development of a fact table with 24 partitions, which needed to hold up to 5 billion rows of data.

The data is received in the warehouse through files being loaded into staging heap tables. One heap table is created per file, and there are several thousand files per day.

In the process of testing the load, I came across what I think was an interesting find.

For this part we are going to adopt loading one slice into the main fact table at a time. This way we have a very controllable method of consuming server resources, along with making the process as parallel as we desire. For the initial cut, I will build an SSIS package that has 24 execute SQL statements, where each one will firstly select into a variable the insert statement (this makes testing much simpler), and each of these will then flow to another 24 execute SQL tasks which will run the insert.

[Screenshot: SSIS control flow, example of 4 threads shown]

If this is a successful method then we can easily add/change the control flow to operate off a stack of slices to load into the fact table.

On running the first test, while I expected some drop-off when loading concurrently over 24 threads (one per partition), I noticed straight away that there was a large drop in loading speeds, which was unexpected.

Load times for the run of 3.2 million rows were around 17 seconds over a number of runs; given that 1 thread running on its own took 1 second, we clearly had a problem. I then decided to try running the package on the server itself. Given that we are only doing two execute SQL statements per thread, I didn't imagine this would yield much improvement. It actually resulted in a slower execution time of 26.8 seconds.

[Screenshot: package execution on the server, 26.8 seconds]

Interestingly, the CPU usage was quite a lot higher (around 10% higher) when this was executing on the server compared with my 2 core desktop.

I then started looking into wait stats, latches and spins, which showed very little until it came to the spinlocks. Below is a screenshot of the latches, waits and spins:

[Screenshot: latch, wait and spinlock statistics]

It is clear from the spinlock SOS_OBJECT_STORE that something is happening that perhaps shouldn’t be. We have a spin count of over 2 billion. Other runs of the test actually showed a much higher number of over 4 billion.

At this point I decided to perform some research on this particular type of spinlock and found very little information. However, I found a blog by Thomas Kejser of the SQLCAT team (http://blog.kejser.org/2011/05/30/diagnosing-and-fixing-sos_object_store-spins-for-singleton-inserts/). Although it relates to singleton inserts, following the same diagnostic path is useful.

I thought from here that xPerf would be a good tool to give me a hint of which direction to go in, so after installing Windows Performance Toolkit on the server, I started logging in xPerf with the command

xperf -on latency -stackwalk profile

I then reran the workload and stopped xPerf with

xperf -d c:\temp\my.etl

As this was the first time using xPerf and xPerfViewer on the server, I set the symbol and symbol cache paths and started xPerfViewer.

After loading the file into xPerf and loading symbols, I added CPU by Process to the display, chose a position where I saw high and erratic CPU, and brought up the summary table. After drilling into the SQL Server process I finally saw what the source of the spinlocks could be: very high weightings for AcquireLocksForInsert, and also for InsertRecordOnPage (expected, since we are actually inserting data) and LocatePageForInsert (could be contention here due to the locking).

The screenshot below highlights the problem functions.

[Screenshot: xperf summary table highlighting AcquireLocksForInsert]

In order to resolve this, I decided to review what was actually happening on the inserts, i.e. what kind of locks were being taken.

It appeared in the lock DMVs that millions of KEY locks were being taken during the load, which certainly explains the amount of time spent in "AcquireLocksForInsert".
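As a rough sketch, this is the sort of check, run from a second session while the load is in flight:

-- Count the locks currently held in this database, by resource type and mode.
SELECT resource_type, request_mode, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID()
GROUP BY resource_type, request_mode
ORDER BY lock_count DESC;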

To resolve this two actions were taken:

  • Change the CLUSTERED INDEX to not allow row/page locks.

CREATE CLUSTERED INDEX ix_cu_Target_partitioned
ON Target_table_partitioned (surrogate_key)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF)
ON [ps_hash_pos_24](partition_key)

  • Alter the destination table to set the LOCK_ESCALATION=AUTO, which according to BOL will allow lock escalation to the partition level.

ALTER TABLE [dbo].[Target_table_partitioned] SET (LOCK_ESCALATION = AUTO)

Running the test again we get much faster execution:

[Screenshot: much faster execution times]

Also, much less time spent on “AcquireLocksForInsert”

[Screenshot: xperf summary with far less time in AcquireLocksForInsert]

This is also backed up by the locking during the inserts, where there are only 5 PAGE locks, which escalated to HOBT (partition) locks, one for each partition.

[Screenshot: lock DMV output showing HOBT-level locks]

This gives us an insert speed into the fact table of around 1.2 million rows/second.

 

Some other useful posts

http://blog.kejser.org/2011/05/30/diagnosing-and-fixing-sos_object_store-spins-for-singleton-inserts/

http://sqlblog.com/blogs/adam_machanic/archive/2009/10/30/sql-server-2008-lock-escalation-inserts-and-a-potential-bug.aspx

http://msdn.microsoft.com/en-us/library/ms184286.aspx

http://msdn.microsoft.com/en-us/library/ms190273.aspx

http://jmkehayias.blogspot.com/2008/12/troubleshooting-locking-in-database.html

http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx
