How does this work?
- Please refer to the diagram above.
- Data for each year is stored in a separate physical table. In the example above: FactTransaction_2010, FactTransaction_2011, FactTransaction_2012, and FactTransaction_2013.
- Each physical table is partitioned into one partition per month.
- Each table is stored on a different filegroup, and each filegroup contains four files.
- Each filegroup is stored on a different storage drive based on workload. More recent data (accessed very frequently) is stored on a high-performance storage option, while older data, generally accessed less frequently, is placed on a lower-performance, less expensive storage option.
- A view (FactTransaction_vw, a UNION ALL over all the tables) exposes the data in all the tables. Application queries use this view for their reporting needs.
- A synonym (FactTransaction_syn) points to the underlying table for the current year (FactTransaction_2012). All queries that insert or update current-year data use the synonym. Next year, the synonym can be re-pointed to the relevant table without any impact on existing code.
- Most importantly, the table for each year has a check constraint on the date column (the partition column) enforcing the valid date range. For example, a check constraint on the date column in FactTransaction_2012 ensures that only data from 1 Jan 2012 to 31 Dec 2012 can be stored. With this, SQL Server performs table elimination when searching for a particular date. Since each table is further partitioned per month, SQL Server also performs partition elimination. For instance, if a user queries data for 10 Nov 2012, SQL Server jumps directly to FactTransaction_2012 (table elimination, enabled by the check constraint) without touching any other table, and then within FactTransaction_2012 it jumps to the partition for November (partition elimination). This greatly helps search operations on the partition column.
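The pieces above can be sketched roughly as follows. This is an illustrative, simplified sketch: the schema (a date column plus an amount) and constraint names are my assumptions, and partition schemes/filegroups are omitted for brevity.

```sql
-- One yearly table with a range check constraint on the partition column.
CREATE TABLE dbo.FactTransaction_2012
(
    TransactionDate DATE  NOT NULL,
    Amount          MONEY NOT NULL,
    -- The check constraint is what lets the optimizer eliminate this
    -- table entirely for queries outside the 2012 date range.
    CONSTRAINT CK_FactTransaction_2012_Date
        CHECK (TransactionDate >= '20120101' AND TransactionDate < '20130101')
);
GO

-- One view exposing all years to reporting queries.
CREATE VIEW dbo.FactTransaction_vw
AS
SELECT TransactionDate, Amount FROM dbo.FactTransaction_2010
UNION ALL
SELECT TransactionDate, Amount FROM dbo.FactTransaction_2011
UNION ALL
SELECT TransactionDate, Amount FROM dbo.FactTransaction_2012
UNION ALL
SELECT TransactionDate, Amount FROM dbo.FactTransaction_2013;
GO

-- Inserts/updates for the current year go through the synonym; next year,
-- drop and re-create it to point at FactTransaction_2013 with no code change.
CREATE SYNONYM dbo.FactTransaction_syn FOR dbo.FactTransaction_2012;
GO
```

A query such as `SELECT * FROM dbo.FactTransaction_vw WHERE TransactionDate = '20121110'` then only touches FactTransaction_2012, thanks to the check constraints on the other tables.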
Variations of this design depending on data volume
With some more creativity, this design can be tweaked if daily or monthly data volumes are too heavy. But test thoroughly whether it really works for you. Below is one such variation.
- One table for each year (FactTransaction_2010, FactTransaction_20xx, and so on).
- Each yearly table can be partitioned per month (or even per day).
- One separate table (FactTransaction_CurrentMonth) that would store data for current month (or quarter depending on data volume).
- Daily rebuilds need to be done only on the FactTransaction_CurrentMonth table, which is a very small fraction of a huge historic table. Maintenance activities like index rebuilds or statistics updates would not put undue pressure on system resources and would definitely benefit system availability.
- Once the current month is over (e.g. Nov 2012), the data from FactTransaction_CurrentMonth can be merged into the FactTransaction_2012 table. FactTransaction_CurrentMonth is then reused to store Dec 2012 data, and so on.
- Monthly rebuilds need to be done on the yearly table after merging the current month's data from FactTransaction_CurrentMonth into FactTransaction_2012.
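A minimal month-end rollover might look like the sketch below. It assumes the two tables share the same (hypothetical) schema and uses a plain INSERT...SELECT; with aligned partitioning and filegroups, ALTER TABLE ... SWITCH could move the data as a near-instant metadata-only operation instead.

```sql
-- Illustrative month-end rollover (table and column names assumed).
BEGIN TRANSACTION;

-- Move November's rows into the yearly table.
INSERT INTO dbo.FactTransaction_2012 (TransactionDate, Amount)
SELECT TransactionDate, Amount
FROM dbo.FactTransaction_CurrentMonth;

-- Empty the current-month table so it can receive December's data.
TRUNCATE TABLE dbo.FactTransaction_CurrentMonth;

COMMIT TRANSACTION;
```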
- Relieves SQL Server from unwanted CPU and IO pressure caused by online/offline index rebuilds.
- Index rebuilds cannot be avoided by any database implementation. Based on index design and fragmentation levels, indexes might need to be rebuilt as part of a daily batch. Index rebuilds are generally a problem for huge databases due to space requirements and resource consumption.
- Let's understand this with an example. Say there is a table with 1 terabyte (TB) of data, storing 4 years of history. If any partitioning design pattern except Design-4 is implemented, an index rebuild leads to a very big hit on CPU and IO. A clustered index includes the actual data pages plus the other index pages, so its size would be 1 TB (data size) plus the size of the other index pages; rebuilding it causes a huge spike in IO and CPU consumption. Non-clustered indexes do not contain the data pages, but rebuilding them still hits CPU and IO, because each non-clustered index structure stores entries for the entire table's history. Say there are 5 non-clustered indexes and each is (hypothetically, depending on design) 100 GB; a rebuild then reprocesses 500 GB of data, which again is a big hit on server resources.
- It's worth a thought: why should index data for previous years be rebuilt over and over again, when all that really needs rebuilding is the current year's data? With the recommended multiple-table design, the data is already split into separate physical structures. Rebuilding only the current-year indexes is obviously very light on SQL Server compared to rebuilding indexes on a single table holding all the historic data.
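Under this design, the daily maintenance job can target just the current-year table (name assumed from the earlier example) and leave the historic yearly tables untouched:

```sql
-- Rebuild only the current year's indexes; FactTransaction_2010,
-- FactTransaction_2011, etc. keep their existing, unfragmented indexes.
ALTER INDEX ALL ON dbo.FactTransaction_2012 REBUILD;
```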
- Better availability with online rebuilds.
- A particular partition can be rebuilt only in offline mode; the ONLINE option is available only for the entire table. This can be an issue for applications that need to be online 24x7, because in offline mode the partition being rebuilt is not available for querying.
- With a table for each year, the data is divided into separate structures, so rebuilding a single year's table becomes practical in ONLINE mode.
- If a single year's data is huge, the solution can be tweaked to suit your needs; refer to the section "Variations of this design depending on data volume". As mentioned in that example, an online rebuild would take a very small fraction of the time, as only a very small fraction of the data is rebuilt. For example, in FactTransaction_CurrentMonth, on day 1 only one day of data needs to be rebuilt; on day 2, two days; and so on up to day 31. So at most 31 days of data is rebuilt, while the rest of the data for all the years is untouched. That cuts down a lot of maintenance time and resources and enhances application availability.
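Because the whole FactTransaction_CurrentMonth table is small, rebuilding it online stays cheap; a sketch (note that ONLINE = ON requires Enterprise Edition):

```sql
-- Online rebuild of the small current-month table only;
-- queries can keep reading the table while the rebuild runs.
ALTER INDEX ALL ON dbo.FactTransaction_CurrentMonth
REBUILD WITH (ONLINE = ON);
```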
- The space issue associated with online rebuilds is eliminated.
- Rebuilding indexes in online mode on huge tables needs a lot of space, and many times the rebuilds fail due to lack of space. Let's understand this with an example.
- For instance, if a clustered index on a table with 500 GB of data needs to be rebuilt, it would need around 1.1 TB of space, broken down as follows:
- A copy of the data is created, which needs 500 GB.
- Around 20% sorting space, which would be 100 GB.
- Around 500 GB of transaction log is generated.
- If mirroring is implemented, then it would generate a lot of network traffic.
- Let's see the space requirement with this design or its variation (FactTransaction_CurrentMonth). For instance, suppose the FactTransaction_CurrentMonth table, which stores at most one month of data, holds 10 GB at the end of the month. At most it would need around 22 GB of space:
- A copy of the data is created, which needs 10 GB.
- Around 20% sorting space, which would be 2 GB.
- Around 10 GB of transaction log is generated.
- Relieves SQL Server from unwanted CPU and IO pressure caused by statistics recomputation over a huge base table.
- A huge table has correspondingly huge indexes. There can be situations where performance is bad due to skewed statistics, because the sampling size is not enough. Scanning such huge data with a higher sampling rate to rebuild statistics would hammer CPU and IO for a long time, as SQL Server has many more data pages to scan through to create more accurate statistics.
- With one huge table storing data for all the previous years, a statistics update recomputes statistics across all the previous years. This is unnecessary work. Would it not be great if only the current year's statistics were recomputed? Previous years' statistics can be computed once and reused by all queries.
- Helps SQL Server maintain accurate statistics, which are a major contributor to better query performance.
- Apart from the undue pressure on CPU and IO, huge tables also lead to skewed statistics. As a rule of thumb, small tables have accurate statistics while huge tables, by default, have skewed statistics. In this particular design, because each table is a separate physical structure, statistics are maintained separately for each yearly table. If it were one big table with all the historic data, a single huge statistics object would be maintained for it.
- So why are accurate statistics beneficial for execution plans? Because accurate statistics give SQL Server enough information to come up with a good execution plan. If the statistics are skewed, SQL Server cannot come up with a good plan. This follows the standard rule: "garbage in, garbage out".
- Since this blog is not dedicated to statistics, I cannot include more details on why and how statistics get skewed on huge tables; I will do that in a separate post. But in general, statistics get skewed on huge tables because SQL Server recomputes statistics only when certain events occur, and when it does, it uses a sampling algorithm. The sample might not be enough to compute accurate statistics.
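With per-year tables, a maintenance job can afford a full-scan statistics refresh on just the current year's table (name assumed from the earlier example), since the historic tables keep their once-computed statistics:

```sql
-- Full-scan statistics update on the current-year table only.
-- FactTransaction_2010, FactTransaction_2011, etc. are not rescanned.
UPDATE STATISTICS dbo.FactTransaction_2012 WITH FULLSCAN;
```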
- Any change in the table schema needs to be applied to all the underlying tables. If you have 10 tables for 10 years, then all of them need to be updated with the new schema. There is another way to avoid this: one can add a dummy-value column in the view. But, as a personal choice, I would not prefer that.
- A view can reference a maximum of 255 tables. Practically this should not be an issue, as it would mean 255+ years of historic data, which is next to impossible for all practical purposes.
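To make the schema-change drawback concrete, here is a sketch of adding a column (StoreId is a hypothetical example column): every yearly table must be altered, and then the view re-created to expose it.

```sql
-- The new column must be added to every yearly table...
ALTER TABLE dbo.FactTransaction_2010 ADD StoreId INT NULL;
ALTER TABLE dbo.FactTransaction_2011 ADD StoreId INT NULL;
ALTER TABLE dbo.FactTransaction_2012 ADD StoreId INT NULL;
ALTER TABLE dbo.FactTransaction_2013 ADD StoreId INT NULL;
GO
-- ...and FactTransaction_vw must then be altered (or dropped and
-- re-created) so its SELECT lists include the new column.
```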
Please leave a comment if you liked this blog post or have suggestions.