SQL Server Index Design
- Contents
- Index Design Recommendations
- Examine the WHERE clause and join criteria columns
- Use Narrow Indexes (A few columns as much as possible).
- Examine column uniqueness (High Selectivity, lots of unique rows)
- Examine column data types
- Consider column order
- Consider type of index (clustered vs non-clustered)
- Clustered Index
- Non Clustered Index
- Defining the Bookmark Lookup
- Covering Index
- Index Intersections
- Index Joins
- Filtered Index
- Index Views
- Index Compression
Index Design Recommendations
- Examine the WHERE clause and join criteria columns
- Use narrow indexes
- Examine column uniqueness
- Examine column data types
- Consider column order
- Consider the type of index (clustered vs non-clustered)
Examine the WHERE clause and join criteria columns
SQL Server query optimizer tries to find the best data access mechanism for every table referred to in the query.
- Optimizer identifies the column in the WHERE clause and join criteria.
- Optimizer examines indexes on those columns
- Optimizer assesses the usefulness of each index by determining the selectivity (number of rows returned), from statistics maintained on the index
- Optimizer estimates the least costly method of retrieving the qualifying rows, based on the information gathered in the previous steps.
The behaviour of the query optimizer is that the WHERE clause column helps the optimizer choose an optimal indexing operations for a query. Also applicable for column used in the join criterial between two tables. Optimizer looks for the indexes on the WHERE clause column or join criterion column and if available, consider using the index to retrieve the rows from the table.
Therefore, having indexes on the frequently used columns in the WHERE clause and the join criteria of SQL query helps the optimizer avoid scanning a base table.
When the amount of a data inside a table is so small that it fits onto a single page (8 kb), a table scan may work better than an index seek. If you have a good index in place but you’re still getting a scan, consider this issue.
Use Narrow Indexes (A few columns as much as possible).
You can create an indexes on a combination of columns in a table. For best performance, use a few columns in an index as you can.
Also avoid wide data type columns, or with string data types (CHAR,VARCHAR, NCHAR, and NVARCHAR), can be quite wide as can binary, unless it’s absolutely necessary.
A narrow index can accommodate more rows in an 8 Kb index page than a wide index.
- Reduces I/O (by having to read a fewer 8 Kb pages).
- Make database caching more effectively, SQL can cache fewer index pages, reducing logical reads required for the index pages in the memory
- Reduces storage space to the database.
Examine column uniqueness (High Selectivity, lots of unique rows)
Creating index on column with a very low range of possible values (such as gender), will not benefit performance, query optimizer will not use index effectively to narrow down the rows to be returned.
It is always preferable to have a column in the WHERE clause with lots of unique rows (or high selectivity) to limit the number of rows accessed. You should create index on those columns to help the optimizer access a small result set.
While creating index on multiple columns, referred to as composite index, column order matters. Using the most selective column first will help filter the index rows more efficiently.
It is important to know the selectivity of a column before creating an index on it. The column with the highest number of unique values (or selectivity) can be the best candidate for indexing when referred to in a WHERE clause or a join criterion.
Examine column data types
The data type of index matters. The index search on integer keys is very fast because of the small size and easy arithmetic manipulation of the INT data type. You can also use variations of INT, such as BIGINT, SMALLINT, and TINYINT) for index column.
CHAR, VARCHAR, NCHAR, NVARCHAR require a string match operation, which is usually costlier than an integer match operation. The CPU doesn't understand how to perform arithmetic operations on this data, and therefore converts it to integer data type before arithmetic operations.
Consider column order
An index key is sorted on the first column of the index and then sub sorted on the next column within each value of the previous column. The first column in a compound index is frequently referred to as the leading edge of the index.
Consider type of index (clustered vs non-clustered)
Two main index types: clustered and non clustered. Both have a B-tree structure.
The main difference between the two types is that the leaf page in a clustered index are the data pages of the table and therefore in the same order as the data. The clustered index is the table.
Clustered Index
The leaf page of a clustered index and the data pages of the table are one and the same. Rows are physically sorted on the clustered index column(s), only one physical sort order, and therefore only one clustered index per table.
SQL Server default behaviour create a primary key constraint as unique clustered index. This is not a requirement, it's just default behaviour, which you can change prior to creating the table.
Heap tables, are table with no clustered index. Data row are not stored in a particular order or linked to the adjacent pages in the table. This unorganized structure usually increase overhead accessing large heap when compared to accessing a large non heap table (with clustered index).
Relationship with Non Clustered indexes
An index row of a nonclustered index contains pointer to the corresponding data row of the table. This pointer is called row locator. The value of row locator depends whether the data page are stored in a heap or a clustered.
For nonclustered index, the row locator is a pointer to the RID for the data row in a heap.
For a table with a clustered index, the row locator is the clustered index key value.
Clustered Index Recommendation
- Create the clustered index first
- If the nonclustered index are built before a clustered index, then the nonclustered row locator will contain pointer to the corresponding RID. Creating a clustered index later will modify all nonclustered indexes to contain clustered index key as a row locator value.
- Keep index narrow
- Since nonclustered index hold clustered keys as row locator, keep the overall byte size of the clustered index as small as possible. Thus keep the number of columns in the clustered index to a minimum. A column of INT data type is a good candidate, string data type will be less than optimal choice.
- Rebuild the clustered Index in a Single Step
- Because of the dependency of nonclustered indexes on the clustered index, rebuilding the clustered index as separate
DROP INDEXand CREATE INDEX statements causes all the nonclustered indexes to be rebuild twice.
- To avoid this, use DROP_EXISTING clause of the CREATE INDEX statement to rebuild the clustered index in a single atomic step. You can also use this in nonclustered index.
CREATE NONCLUSTERED INDEX ixn_SomeTable__SomeIndexName_ic1
on dbo.SomeTable ( Column1 )
include ( Column2, IncludeThisNewColumn3 )
with ( sort_in_tempdb = on, drop_existing = on, online = on, fillfactor = 95 ) on [SomeFileGroup]
GOWhen to use clustered index
- Retrieving a range of data
Since the leaf pages of a clustered index and the data page are the same, the order of clustered index is the physical order of the data rows, then the disk head can read all the rows sequentially, without much disk movement. Since physical disk head constitutes a major portion of the costs of a disk operation, sorting rows physically order on the disk optimizes I/O costs.
Foreign key (FK) on another table are accessed frequently in a relational systems, which is a great candidate for inclusion in the clustered index.
- Retrieving pre-sorted data
Clustered index are particularly efficient when data retrieval needs to be sorted, which the row is already physically sorted on disk, eliminating overhead of sorting the data after it is retrieved.
When not to use Clustered Index
- Frequently Updatable columns
This will cause the row locator of all the non clustered indexes to be updated accordingly, significantly increase the cost of the relevant action query. Also affects database concurrency by blocking all other queries.
- Wide keys
Non clustered index holds clustering key as row locator, for performance reasons, avoid creating clustered index on a very wide column.
- Too many concurrent inserts in sequential order
If adding many rows concurrently, then it may be a better for performance to distribute those across the data pages of the table. If you add all the rows in the same order as imposed by clustered index, then all the inserts will be attempted on the last page of the table, which may cause a huge "hot spot" on sector of the disk.
The inserts can be randomized throughout the table by creating a clustered index on another column that doesn’t arrange the row in the same order as the new rows.
The caveats to this recommendation, allowing inserts at the bottom of the row, prevents page splits on the intermediate page of the B-tree to accommodate the new rows.
However, if the disk hot spot becomes a bottleneck, reduce the fill factor of the table to likely prevent page split. In addition, the "hot" pages
Non Clustered Index
It does not affect the order of the data in the table pages, because the leaf page of a nonclustered index and the data page of the table are separate.
- Nonclustered index is most useful when all you want to retrieve is a small number of rows from a large table. As the number of rows increases, the overhead costs of the bookmark lookup rises proportionately. To retrieve a small number of rows from a table, the index columns should have a very high selectivity.
- Clustered index can be an alternative index for (since it is not suitable for clustered keys, but remain cautious)
- frequently updatable columns (It doesn’t affect other indexes, update is limited to base table and the non clustered key)
- wide keys (not as costly as clustered index, it doesn’t increase the size of other index).
- Nonclustered index also helps resolve blocking and deadlocking issues
- Nonclustered index are not suitable for querying large number of rows, which are better served by clustered index, which doesn't require a separate bookmark lookup, which require additional logical reads.
Non Clustered Index Maintenance
The row locator value of the nonclustered indexes continues to have the same clustered index value, even when the clustered index rows are physically relocated.
To optimize this maintenance costs, SQL Server adds a pointer to the old data page to point to the new data page after a page split,instead of updating the row locator of all relevant nonclustered indexes.
Although it reduces maintenance costs, it increase navigation costs from the nonclustered index row to the data row, since extra link is added between the old data page and the new data page.
Having a clustered index as the row locator decreases this overhead associated with the non-clustered index.Defining the Bookmark Lookup
When the query requests columns that are not part of the nonclustered index chosen by the optimizer, a lookup is required.
This maybe a key lookup when going against a clustered index or an RID lookup when perform against a heap.
Common term for this is bookmark lookup. It fetches the corresponding data row from the table by following the row locator value from the index row, requiring a logical read on the data page besides the logical read on the index page.
However, if the columns required by the query are available in the index itself, then access to the data page is not required, this is known as the covering index.
Bookmark lookup are the reason that large resultsets are better served with a clustered index. A clustered index does not require a bookmark lookup, since the leaf pages and the data pages for a clustered index are the same.Covering Index
A covering index is a nonclustered index built upon all the columns required to satisfy a SQL query without going to the base table.
If a query encounters an index and does not need to refer to the underlying data table at all, then the index can be considered a covering index.
Adding column using the INCLUDE statement makes this functionality easier to achieve without adding to the number of columns in an index or the size of the index key since the included columns are stored only at the leaf level of the index.
The INCLUDE is best used in the following cases:- You dont want to increase the size of the index keys, but you still want to make the index a covering index
- You're indexing a data type that can't be indexed (except text, ntext, and images).
- You've already exceeded the maximum number of key columns for an index.
Index Intersections
If a table has multiple indexes, then SQL Server can use multiple indexes to execute a query, selecting a small subsets of data based on each index and then performing the INTERSECTION of the two subsets.
SQL Server can exploit multiple indexes on a table and then employ a join algorithm to obtain the index intersection between the two subsets.
The existence of multiple nonclustered index is possible when an existing nonclustered index key may be already quite wide.Index Joins
The index joins is a variation of index intersection, where the covering index technique is applied to the index intersection. If no single index covers a query but multiple indexes together can cover the query, SQL Server can use an indexes to satisfy the query fully without going to the base table..
Filtered Index
Is a nonclustered index that uses a filter, basically a WHERE clause, to create a highly selective set of keys against a column or columns that may not have a good selectivity.
For example, a column with a large number of NULL values may be stored as a sparse column to reduce the overhead of those null values. Adding filtered index will allow you to have an index available on the data that is not null.Index Views
A database view in SQL Server is like a virtual table that represents the output of a SELECT statement. In general it doesn’t stored any data -- every time a view is queried, it further queries the underlying tables by executing its associated SELECT statement.
A database view can be materialized on the disk by creating a unique clustered index on the view, referred to as index view or materialized view, persisted in physical storage in the database, saving overhead of performing costly operations during query execution. After the view is materialized, multiple NON clustered indexes can be created on the indexed view.
Benefit:
- Aggregation can be precomputed and stored, to minimize expensive computation
- Table can be pre-joined
- Combination of joins and computation can be materialized
Overhead:
- Any change in base table has to be reflected in the indexed view by executing the view's SELECT statement.
- Any changes to base table may initiate the nonclustered indexes on the index view.
- Additional storage is required
Index Compression
Was introduced in SQL Server 2008, compressing the index means getting more key information onto a single page, which can lead to performance improvements because fewer pages are needed to store the index.
There will be overhead on CPU and memory as the key values in the index are compressed and decompressed, so this may not be a solution for all indexes.
There are two types of compression: row and page level compression. Non leaf pages in an index receive no compression under the page type. - Index Design Recommendations