Saturday, December 22, 2012

SSAS - Designing approaches for dimensional model






In SSAS, there are two ways to design dimensional models: top-down or bottom-up.

Top Down Approach
  • This option is generally not used much.
  • The relational data source does not exist in this case; the tables are generated from the dimensional model defined in SSAS.
  • The Cube and Dimension Wizards have an option to implement this approach.
  • The Schema Generation Wizard is used to generate and deploy the schema in the database.
  • DBO permission is required in order to create the database objects.

Bottom Up Approach
  • Most of the time, this is the approach that is taken.
  • The data mart already exists.
  • The dimensional model and cubes are designed from the Relational tables in the Data Mart.
  • Cube and Dimension wizards have the option to implement this solution.
 

Monday, December 10, 2012

SQL Server - Cost estimation issues even with up-to-date statistics

The only way to know whether statistics are up-to-date or outdated is to look at the Actual and Estimated row counts in an execution plan. A mismatch indicates that statistics need to be updated. But there are exceptions to this rule: there are situations where, even though statistics are up-to-date, there is still a difference between the estimated and actual counts.


First, let's check the result with a fixed-parameter query. In my example I have bloated the data in AdventureWorksDW2008R2.FactInternetSales by duplicating it multiple times (of course, ensuring that the primary key remains unique).

T-SQL with fixed query (OK)
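Here is a minimal sketch of the fixed-value form (the column and literal value are my own illustrative choices). With a literal in the predicate, the optimizer can read the statistics histogram for that exact value, so the estimated and actual row counts match.

-- Fixed (literal) predicate: the estimate comes straight from the histogram.
SELECT COUNT(*)
FROM dbo.FactInternetSales
WHERE OrderDateKey >= 20080101;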



T-SQL with local variables (Issue)
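A minimal sketch of the local-variable version (column and value are again illustrative). Because the value of a local variable is not known at plan-compile time, the optimizer falls back to an average-density estimate instead of the histogram, so the estimated count can drift far from the actual count.

-- Local variable: value unknown at compile time, so the estimate is density-based.
DECLARE @DateKey INT = 20080101;

SELECT COUNT(*)
FROM dbo.FactInternetSales
WHERE OrderDateKey >= @DateKey;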


Stored procedure with local variables (Issue)
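A minimal sketch of the same pattern inside a stored procedure (the procedure and parameter names are mine, not from the original post): the parameter is copied into a local variable before being used in the predicate, which again hides the actual value from the optimizer.

CREATE PROCEDURE dbo.usp_GetSales_LocalVariable
    @DateKey INT
AS
BEGIN
    DECLARE @LocalDateKey INT = @DateKey;   -- copying the parameter defeats parameter sniffing

    SELECT COUNT(*)
    FROM dbo.FactInternetSales
    WHERE OrderDateKey >= @LocalDateKey;    -- estimate comes from density, not the histogram
END;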

The scenarios above show that the same query, implemented in different styles, may or may not give matching estimated and actual counts.








Possible Solutions...

Here are the possible solutions to correct the issue.
    Stored procedure without local variables (OK)
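A minimal sketch of the corrected procedure (names are illustrative): the parameter is used directly in the predicate, so its value is sniffed at compile time and the histogram is used, giving matching estimated and actual counts.

CREATE PROCEDURE dbo.usp_GetSales_Parameter
    @DateKey INT
AS
BEGIN
    SELECT COUNT(*)
    FROM dbo.FactInternetSales
    WHERE OrderDateKey >= @DateKey;   -- parameter value is sniffed, histogram is used
END;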
 


 
    Dynamic sql with sp_executesql (OK)
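A minimal sketch using parameterized dynamic SQL. sp_executesql passes the value as a typed parameter, so it is sniffed just like a stored procedure parameter.

DECLARE @DateKey INT = 20080101;

EXEC sp_executesql
     N'SELECT COUNT(*) FROM dbo.FactInternetSales WHERE OrderDateKey >= @DateKey;',
     N'@DateKey INT',
     @DateKey = @DateKey;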



Dynamic sql with EXEC (OK)
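A minimal sketch using EXEC with a concatenated string. The value ends up as a literal in the statement text, so the optimizer sees it like a fixed query (at the cost of a separate plan per value and the usual dynamic SQL caveats).

DECLARE @DateKey INT = 20080101;
DECLARE @sql NVARCHAR(MAX);

SET @sql = N'SELECT COUNT(*) FROM dbo.FactInternetSales '
         + N'WHERE OrderDateKey >= ' + CAST(@DateKey AS NVARCHAR(10)) + N';';

EXEC (@sql);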





Saturday, December 1, 2012

Big Data - 1: A big picture


Every day we are bombarded with terminology connected with big data: NoSQL, RavenDB, Hadoop, MongoDB and so on. Since this is an emerging area in IT, we struggle to understand what these terms mean and how these technologies can help us. I am not an expert in big data, but I will share what I have learned so far.



The diagram above gives a high level view of various terminologies in the big data arena and their relationship with each other.
  • Big Data : 
    • Big data is the data that is big in terms of
      • Volume (terabytes, petabytes and above)
      • Velocity (rate of transmission and reception) 
      • Variety (complexity, text, pictures, videos etc).
    • Big data by itself is messy and does not make sense as it is, because it is unstructured or semi-structured unlike traditional data. The only way it can be useful is to pull out just the relevant information, and pulling relevant information out of such big data is the real problem that traditional systems cannot help with. This is where big data technologies come into the picture.
There are three technology classes that support big data: Hadoop, NoSQL and Massively Parallel Processing (MPP) databases. I will start with MPP databases as they have been around for decades.
  • MPP Databases: These databases spread the data across independent storage and CPUs, achieving parallelism and great processing capability. They are special hardware and software combinations that help you achieve such parallelism. The big products in the market are IBM's Netezza, HP's Vertica and EMC's Greenplum. Microsoft's SQL Server Parallel Data Warehouse (PDW) is an upcoming MPP database.
  • Hadoop : Hadoop is one of the technology classes that support big data. It is an open-source implementation of MapReduce created by the Apache Software Foundation. In general, MapReduce is a parallel programming framework.
  • NoSQL : NoSQL stands for "not only SQL". It represents a class of databases that do not follow RDBMS principles (such as transactional integrity) and mainly deal with big data. There are four categories of NoSQL databases. Please refer to the diagram above for the database products that fall under the categories below. A few of the NoSQL databases work in conjunction with Hadoop components.
    • Document store
    • Graph Database
    • Key-Value store
    • Wide Column store
I hope this post was useful in understanding the basic terminology of big data and related technologies.

In the next big data post, I will provide more details on the NoSQL database categories.

Monday, November 26, 2012

SQL Server - Partitioning Design : 4 - Partitioned Tables + Partitioned View

This post is an extension of the previous post - Partitioning Design Patterns; you will need to go through that post in order to understand this one. Here I will write in depth about why this design is an optimal solution, covering only those points that make it a preferred implementation style compared to the other designs.


 

How does this work?

  • Please refer to the diagram above.
  • Data for each year is stored in a separate physical table. In the example above: FactTransaction_2010, FactTransaction_2011, FactTransaction_2012, FactTransaction_2013.
  • Each physical table is partitioned into one partition per month.
  • Each table is stored on a different filegroup, with four files per filegroup.
  • Each filegroup is placed on a different storage drive based on workload. More recent, very frequently accessed data is stored on a high-performance storage option; older, less frequently accessed data is placed on a lower-performance, less expensive storage option.
  • A view (FactTransaction_vw, a UNION ALL over all the tables) exposes the data in all the tables. The application queries use this view for their reporting needs.
  • A synonym (FactTransaction_syn) points to the underlying table for the current year (FactTransaction_2012). All queries that insert or update data for the current year use the synonym. Next year, the synonym is re-pointed to the relevant table without any impact on existing code.
  • Most importantly, the table for each year has a check constraint on the date column (the partition column) for the valid date range. For example, a check constraint on the date column in FactTransaction_2012 ensures that only data from 1 Jan 2012 to 31 Dec 2012 can be stored. With this, SQL Server performs table elimination when searching for a particular date, and since each table is further partitioned per month, SQL Server also performs partition elimination. For instance, if a user queries data for 10 Nov 2012, SQL Server jumps directly to FactTransaction_2012 (table elimination, which the check constraint makes possible) without touching any other table, and then within FactTransaction_2012 it jumps to the partition for November (partition elimination). This greatly helps search operations based on the partition column. A minimal T-SQL sketch of this layout follows the list.
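Below is a minimal T-SQL sketch of this layout. The column, filegroup and constraint names are my own assumptions, and the per-month partition scheme is omitted for brevity; the point is only to make the table/view/synonym arrangement concrete.

-- One physical table per year, each with a CHECK constraint on the partition column.
CREATE TABLE dbo.FactTransaction_2012
(
    TransactionID   BIGINT   NOT NULL,
    TransactionDate DATETIME NOT NULL,
    Amount          MONEY    NOT NULL,
    CONSTRAINT CK_FactTransaction_2012_Date
        CHECK (TransactionDate >= '20120101' AND TransactionDate < '20130101')
) ON [FG_2012];   -- in the full design this would be a monthly partition scheme placed on FG_2012

-- Similar tables exist for 2010, 2011 and 2013 on their own filegroups.

-- The view exposes all years to the reporting queries.
CREATE VIEW dbo.FactTransaction_vw
AS
SELECT * FROM dbo.FactTransaction_2010
UNION ALL
SELECT * FROM dbo.FactTransaction_2011
UNION ALL
SELECT * FROM dbo.FactTransaction_2012
UNION ALL
SELECT * FROM dbo.FactTransaction_2013;

-- Inserts and updates for the current year go through a synonym that is re-pointed every year.
CREATE SYNONYM dbo.FactTransaction_syn FOR dbo.FactTransaction_2012;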

Variations of this design depending on data volume


       With some more creativity, this design can be tweaked if daily/monthly data volumes are too heavy. But test thoroughly whether this really works for you. Below is one such variation.
    • One table for each year (FactTransaction_2010, FactTransaction_20xx, and so on).
    • Each yearly table can be partitioned per month (or even per day).
    • One separate table (FactTransaction_CurrentMonth) stores the data for the current month (or quarter, depending on data volume).
    • Daily rebuilds are done only on the FactTransaction_CurrentMonth table, which is a very small fraction of a huge historic table. Maintenance activities like index rebuilds or statistics updates therefore do not put undue pressure on system resources, which definitely benefits system availability.
    • Once the current month is over (e.g. Nov 2012), the data from FactTransaction_CurrentMonth is merged into the FactTransaction_2012 table; FactTransaction_CurrentMonth is then reused to store Dec 2012 data, and so on. A minimal roll-over sketch follows this list.
    • Monthly rebuilds need to be done on the yearly table after merging the current month's data from FactTransaction_CurrentMonth into FactTransaction_2012.
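As a minimal sketch of the month-end roll-over (object names assumed from the variation above), the simplest form is an insert into the yearly table followed by emptying the current-month table. Depending on how the tables are partitioned, partition switching could be used instead.

-- Month end: move the completed month into the yearly table, then reuse the small table.
INSERT INTO dbo.FactTransaction_2012 WITH (TABLOCK)
SELECT *
FROM dbo.FactTransaction_CurrentMonth;

TRUNCATE TABLE dbo.FactTransaction_CurrentMonth;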

Pros

The benefits below give this design an edge over other implementations. Please remember, I have not mentioned the other benefits of this design that overlap with other designs.
  • Relieves SQL Server from unwanted CPU and IO pressure from online/offline index rebuilds.
    • Index rebuilds are something that cannot be avoided by any database implementation. Based on index design and fragmentation levels, indexes might need to be rebuilt as part of the daily batch. Index rebuilds are generally a problem for huge databases due to space requirements and resource consumption.
    • Let's understand this with an example. Say there is a table with 1 terabyte (TB) of data storing 4 years of history. If any partitioning design pattern except Design-4 is implemented, an index rebuild leads to a very big hit on CPU and IO. For a clustered index, because it includes the actual data pages plus the other index pages, the clustered index size is 1 TB (data size) plus the size of the other index pages; rebuilding such a clustered index is a huge spike on the server in terms of IO and CPU consumption. Non-clustered indexes, even though they do not contain the data pages the way a clustered index does, still hit CPU and IO, because each non-clustered index structure stores entries for all the historic data in the entire table. Say there are 5 non-clustered indexes and each index is (hypothetically, based on design) 100 GB; a rebuild then means rebuilding 500 GB of data, which is a big hit on server resources.
    • It's worth a thought: why should the index data for previous years be rebuilt over and over again when only the current year's data really needs to be rebuilt? With the recommended multiple-table design, the data is already split into multiple separate physical structures. Obviously, rebuilding only the current year's indexes is very light on SQL Server compared to rebuilding the indexes on a single table holding all the historic data.
 
  • Better availability with online rebuilds.
    • A particular partition can be rebuilt only in offline mode; the online option is available only for the entire table. This can be an issue for applications that need to be online 24x7, because in offline mode the partition being rebuilt is not available for querying.
    • With a table for each year, the data is divided into separate structures, so rebuilding a single year's table in ONLINE mode becomes much more practical.
    • If a single year's data is huge, the solution can be tweaked to suit your needs; refer to the section "Variations of this design depending on data volume". As mentioned in that example, an online rebuild takes a very small fraction of the time, as only a very small fraction of the data is rebuilt. For example, in FactTransaction_CurrentMonth, on day 1 only one day of data needs to be rebuilt, on day 2 two days, and so on until day 31. So at most 31 days of data is rebuilt, while the rest of the data for all the years is untouched. That cuts down a lot of maintenance time and resources and enhances application availability (see the maintenance sketch at the end of this Pros section).

  • The space issue associated with online rebuilds is eliminated.
    • Rebuilding indexes in online mode on huge tables needs a lot of space, and many times the rebuilds fail due to lack of space. Let's understand this with an example.
    • For instance, if a clustered index on a table with 500 GB of data needs to be rebuilt, it would need about 1.1 TB of space, as follows:
      • A copy of the data is created, which needs 500 GB.
      • Around 20% of sorting space, which is 100 GB.
      • 500 GB of log is generated.
      • If mirroring is implemented, it also generates a lot of network traffic.
    • Now let's see the space requirement with this design or its variation (FactTransaction_CurrentMonth). For instance, if the FactTransaction_CurrentMonth table, which stores at most one month of data, holds 10 GB at the end of the month, it would need at most around 22 GB of space:
      • A copy of data is created which would need 10 GB
      • Around 20% of sorting space, that would be 2 GB.
      • 10 GB of log is generated.

  • Relieves SQL Server from unwanted CPU and IO pressure from statistics recompute for a huge base data.
    • A huge table will have huge regular indexes. There can be situations where performance is bad because statistics are skewed, as the sampling size is not sufficient. Scanning such huge data with a higher sampling rate to rebuild statistics would hammer CPU and IO for a long time, since SQL Server has many more data pages to scan through to create more accurate statistics.
    • With one huge table storing data for all the previous years, a statistics update leads to re-computation of statistics for all the previous years. This is unnecessary work. Would it not be better if only the current year's statistics were recomputed? Previous years' statistics can be computed once and reused by all the queries.

  • Helps SQL Server in maintaining accurate statistics which is a major contributor in better query performance.
    • Apart from the undue pressure on CPU and IO, huge tables also lead to skewed statistics. The rule of thumb is that small tables have accurate statistics, while huge tables by default have skewed statistics. In this particular design, because each table is a separate physical structure, the statistics too are maintained separately for each year's table. If it were one big table with all the historic data, there would be one huge set of statistics maintained for that table.
    • So why are accurate statistics beneficial for better execution plans? Because accurate statistics give SQL Server enough information to come up with a good execution plan. If the statistics are skewed, SQL Server cannot come up with a good execution plan. This follows the standard rule: "garbage in, garbage out".
    • Since this blog is not dedicated to statistics, I cannot include more details here on why and how statistics get skewed on huge tables; I will do that in a separate post. But in general, stats get skewed on huge tables because SQL Server computes statistics only on the occurrence of certain events, and when it does, it uses a sampling algorithm. The samples might not be enough to compute accurate statistics.
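As a minimal maintenance sketch under this design (object names assumed from the sections above), only the small current structures are rebuilt and have their statistics refreshed, while the historic yearly tables are left untouched:

-- Daily: rebuild only the small current-month table, online and cheap.
ALTER INDEX ALL ON dbo.FactTransaction_CurrentMonth REBUILD WITH (ONLINE = ON);

-- Monthly, after the roll-over: rebuild and refresh statistics for the current year only.
ALTER INDEX ALL ON dbo.FactTransaction_2012 REBUILD WITH (ONLINE = ON);
UPDATE STATISTICS dbo.FactTransaction_2012 WITH FULLSCAN;   -- a full scan stays affordable on one year of data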

Cons    

  • Any change in the table schema needs to be reflected in all the underlying tables. If you have 10 tables for 10 years, then all of them need to be updated with the new schema. There is another way to avoid this (one can create a dummy value column in the view), but I would not prefer that, as a personal choice.
  • A view can reference a maximum of 255 tables. Practically this should not be an issue, as it would mean 255+ years of historic data, which is next to impossible for all practical purposes.

Please leave your comments if you liked this blog post or have suggestions for it.


Sunday, November 25, 2012

SQL Server - Wait Statistics : 2 - SQL Server Thread and Task Architecture


 
  • Each Physical CPU has one or many mapped Logical CPUs. A Logical CPU has to have one and only one mapped Physical CPU.
  • One scheduler is assigned to one Logical CPU. However, a logical CPU can have multiple schedulers.
  • A scheduler maintains two structures: the Runnable Queue and the Wait List. The Runnable Queue holds the collection of runnable tasks. Since a queue is first-in, first-out, the first task in the queue waiting for CPU time is executed first, then the second in the queue, and so on. The Wait (or Suspended) List is an unordered set of tasks waiting for some resource. Once the resource becomes available, the task is pushed to the Runnable Queue, so a task in the Wait List waits until the resource it needs is available.
  • A scheduler (logical CPU) is assigned a set of workers (threads or fibers, depending on configuration) to execute tasks.
  • If you pause and think about the previous four points, it becomes logical that each logical CPU or scheduler maintains its own Runnable Queue and Wait List. And since each logical CPU is an independent CPU as far as SQLOS is concerned, each logical CPU/scheduler has its own work and tasks assigned to it. Thus, multiple CPUs lead to parallelism.
  • The worker is bound to its task until the task is finished. This means that if for some reason the task is suspended due to unavailable resources, the worker waits too. The worker is assigned a new task only when it finishes the current task.
  • At a given point in time, a task can be in the Runnable, Suspended or Running state. It may change state in the ways described below; a small DMV illustration follows this list.
    • Running >> Suspended >> Runnable >> Running
      While a task is running, it might need a resource (a data page, memory, a lock etc.) which might not be available. If the resource is unavailable, the task is changed to the Suspended state, and it remains there until it gets the resource it is waiting for. Once the resource is available, the task is moved to the Runnable state. This means the task is ready to be processed and is waiting in the queue for processor time. It gets executed once the tasks ahead of it in the queue have been executed.
    • Running >> Runnable >> Running
      While a task is running, it might enter the Runnable state directly, without going into the Suspended state. This happens when the task reaches quantum exhaustion: the quantum is the time limit for which a task can use processor time, and in SQL Server it is 4 milliseconds. After 4 milliseconds, the current task is pushed to the bottom of the Runnable Queue and the task at the top of the queue is popped for processing.
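As a small illustration (not from the original post), the DMVs below show these states on a live server: sys.dm_exec_requests reports whether each request is currently running, runnable or suspended, and sys.dm_os_schedulers shows the length of each scheduler's runnable queue.

-- Current requests and their state (running / runnable / suspended) plus the resource they wait on.
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.scheduler_id
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50;   -- rough filter to skip most system sessions

-- Per-scheduler counts: tasks currently assigned, tasks waiting in the runnable queue, and workers.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';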

Tuesday, November 20, 2012

SQL Server - Find number of physical processors, cores, logical processors and RAM using T-SQL


There are times when you want to know the SQL Server box configuration, specifically the processors. It can be frustrating if you have to wait a couple of days for another group in your organization to give you that information. Here is something you can use to find out the details yourself.

I found many posts that help you find the physical and logical CPUs, but I was specifically interested in the number of cores as well. A few posts mentioned that there is no way to find the number of cores using T-SQL.

Here is a query that gives you the number of cores too. Test it yourself and let me know if it does not work in your scenario.

SELECT ( cpu_count / hyperthread_ratio )                 AS NumberOfPhysicalCPUs   -- physical sockets
     , CASE
           WHEN hyperthread_ratio = cpu_count THEN cpu_count
           ELSE ( cpu_count - hyperthread_ratio ) / ( cpu_count / hyperthread_ratio )
       END                                               AS NumberOfCoresInEachCPU -- cores per socket
     , CASE
           WHEN hyperthread_ratio = cpu_count THEN cpu_count
           ELSE ( cpu_count / hyperthread_ratio ) * ( ( cpu_count - hyperthread_ratio ) / ( cpu_count / hyperthread_ratio ) )
       END                                               AS TotalNumberOfCores     -- cores across all sockets
     , cpu_count                                         AS NumberOfLogicalCPUs    -- logical processors visible to the OS
     , CONVERT(MONEY, ROUND(physical_memory_in_bytes / 1073741824.0, 0)) AS TotalRAMInGB
FROM sys.dm_os_sys_info
 
Please keep in mind that you need the VIEW SERVER STATE permission to use the DMV sys.dm_os_sys_info.

 

Monday, November 19, 2012

SQL Server - Wait Statistics : 1 - SQL Server Operating System (SQLOS) Basics

A few basic concepts around SQLOS, a core component of the SQL Server architecture, are critical for understanding SQL Server behaviour. If you want to be good at troubleshooting performance issues in SQL Server, these concepts will help you build that solid foundation. This blog is intended to explain the details around SQLOS and wait statistics in a simplified form. The entire topic will be covered in a series of posts.

Let's go through these terms before we dive deep.
  • SQLOS can be visualized as a mini operating system built for SQL Server that works on top of the Windows operating system, helping SQL Server perform some critical functions. This means that SQLOS in many cases also depends on the Windows OS. SQLOS was first introduced in SQL Server 2005. In general, SQLOS performs two major functions: scheduling and memory management.
  • Physical CPU is a multicore or single-core CPU on the SQL Server box. Each core acts as an individual processor. For example, a quad-core CPU means there are 4 CPUs on a single die.
  • Logical CPU is a CPU visible to the operating system (OS) for its processing needs, irrespective of the number of physical CPUs. There could be a quad-core physical CPU (4 physical CPUs), but with the help of technologies like hyperthreading, double the number of physical CPUs (i.e. 8 CPUs) is exposed to the OS.
  • Scheduler is a logical CPU on the SQL Server side. Each scheduler is mapped to one logical processor exposed by the operating system; however, one logical processor can have multiple schedulers assigned to it. Schedulers can be visible or hidden. A visible scheduler serves SQL Server user processes, whereas a hidden scheduler serves only SQL Server internal processes. One scheduler is dedicated to the DAC (Dedicated Admin Connection), which is a back entry into SQL Server for the DBA if the server hangs.
  • Worker is what executes a task; a worker is also called a worker thread. Workers in SQL Server are mapped to Windows threads or fibers by the scheduler.
  • Task is a unit of work assigned to SQL Server. Generally a session submits a batch, the batch is broken into multiple tasks, and these are processed by multiple workers. This leads to parallelism. (A DMV sketch mapping sessions to tasks, workers and schedulers follows this list.)
  • Session or SPID is a connection over which requests are sent to SQL Server.
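As a small illustration (assumed, not from the original post), the query below maps active user sessions to their tasks, workers and schedulers, mirroring the relationships described above.

-- Session -> task -> worker -> scheduler, for user sessions only.
SELECT s.session_id,
       t.task_state,               -- e.g. PENDING, RUNNABLE, RUNNING, SUSPENDED
       w.state AS worker_state,
       t.scheduler_id
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_os_tasks      AS t ON t.session_id = s.session_id
JOIN sys.dm_os_workers    AS w ON w.worker_address = t.worker_address
WHERE s.is_user_process = 1;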