Wiki source code of SQL Server Index Design

Last modified by Billie D on 2021/02/12 18:04
Hide line numbers
  1: {{box cssClass="floatinginfobox" title="**Contents**"}}
  2: {{toc /}}
  3: {{/box}}
  4: 
  5: [[image:About Us.WebHome@myitguide_small_01.jpg]]
  6: 
  7: = Index Design Recommendations =
  8: 
  9: * Examine the **WHERE clause** and join criteria columns
 10: * Use **narrow** indexes
 11: * Examine column **uniqueness**
 12: * Examine column **data types**
 13: * Consider **column order**
 14: * Consider the **type of index** (clustered vs non-clustered)
 15: 
 16: == Examine the WHERE clause and join criteria columns ==
 17: 
 18: SQL Server query optimizer tries to find the best data access mechanism for every table referred to in the query.
 19: 
 20: * **Optimizer** identifies the column in the **WHERE clause** and **join criteria**.
 21: * Optimizer examines indexes on those columns
 22: * Optimizer assesses the **usefulness** of each index by determining the **selectivity (number of rows returned),** from statistics maintained on the index
 23: * Optimizer estimates the least costly method of retrieving the qualifying rows, based on the information gathered in the previous steps.
 24: 
 25: The behaviour of the query optimizer is that the WHERE clause column helps the optimizer choose an optimal indexing operations for a query.  Also applicable for column used in the join criterial between two tables.  Optimizer looks for the indexes on the WHERE clause column or join criterion column and if available, consider using the index to retrieve the rows from the table.
 26: 
 27: Therefore, having **indexes on the frequently used columns** in the** WHERE clause** and the** join criteria** of SQL query **helps the optimizer avoid scanning a base table**.
 28: 
 29: When the amount of a data inside a table is so small that it fits onto a single page (8 kb), a table scan may work better than an index seek.  If you have a good index in place but you’re still getting a scan, consider this issue.
 30: 
 31: == Use Narrow Indexes (A few columns as much as possible). ==
 32: 
 33: You can create an indexes on a combination of columns in a table.  For best performance, use a few columns in an index as you can.
 34: 
 35: Also **avoid** **wide data type** columns, or with string data types (**CHAR,VARCHAR, NCHAR**, and **NVARCHAR**), can be quite wide as can binary, unless it’s absolutely necessary.
 36: 
 37: A __narrow index can accommodate more rows in an 8 Kb index page__ than a wide index.
 38: 
 39: * Reduces I/O (by having to read a fewer 8 Kb pages).
 40: * Make database caching more effectively, SQL can cache fewer index pages, reducing logical reads required for the index pages in the memory
 41: * Reduces storage space to the database.
 42: 
 43: == Examine column uniqueness (High Selectivity, lots of unique rows) ==
 44: 
 45: Creating index on column with a__ very low range of possible values (such as gender), will not benefit performance__, query optimizer will not use index effectively to narrow down the rows to be returned.
 46: 
 47: It is always __preferable to have a column in the WHERE__ clause with __**lots of unique rows (or high selectivity**) __to limit the number of rows accessed.  You should create index on those columns to help the optimizer access a small result set.
 48: 
 49: While creating index on multiple columns, referred to as __composite index__, **column order matters**.  **__Using the most selective column__ first** will help **filter the index rows more efficiently**.
 50: 
 51: It is important to know the selectivity of a column before creating an index on it.  The __column with the highest number of unique values (or selectivity) can be the best candidate__ for indexing when referred to in a WHERE clause or a join criterion.
 52: 
 53: == Examine column data types ==
 54: 
 55: The data type of index matters. The index search on **__integer keys__ is __very fast__** because of the **__small size and easy arithmetic manipulation__** of the INT data type. You can also use variations of INT, such as BIGINT, SMALLINT, and TINYINT) for index column.
 56: 
 57: CHAR, VARCHAR, NCHAR, NVARCHAR require a **__string match operation, which is usually costlier__** than an integer match operation. The CPU doesn't understand how to perform arithmetic operations on this data, and therefore converts it to integer data type before arithmetic operations.
 58: 
 59: == Consider column order ==
 60: 
 61: An index key is sorted on the first column of the index and then sub sorted on the next column within each value of the previous column. The __**first column** in a compound index__ is frequently referred to as the **__leading edge of the index__.**
 62: 
 63: == Consider type of index (clustered vs non-clustered) ==
 64: 
 65: Two main index types: **clustered** and **non clustered**.  Both have a B-tree structure.
 66: 
 67: The main difference between the two types is that the __leaf page in a clustered index are the data pages__ of the table and therefore in the same order as the data. The clustered index is the table.
 68: 
 69: == Clustered Index ==
 70: 
 71: The **leaf page** of a clustered index and the **data pages** of the table **are one and the same**.  **Rows are physically sorted on the clustered index column**(s), **only one physical sort order**, and therefore only one clustered index per table.
 72: 
 73: SQL Server **default behaviour** create a **primary key constraint as unique clustered index**.  This is **not a requirement**, it's just default behaviour, which you can change prior to creating the table.
 74: 
 75: **Heap tables**, are table with **no clustered index**.  Data row are not stored in a particular order or linked to the adjacent pages in the table.  This **__unorganized structure usually increase overhead__ //accessing l__arge heap__ when compared to accessing a large non heap table (with clustered index).//**
 76: 
 77: === Relationship with Non Clustered indexes ===
 78: 
 79: An index row of a nonclustered index contains **pointer** to the corresponding data row of the table.  This pointer is called **__row locator__**.  The value of **row locator depends** whether the data page are stored in a **heap** or a **clustered**.
 80: 
 81: For __nonclustered index__, the row locator is a pointer to the **__RID__** for the data row in a heap.
 82: 
 83: For a table with a __clustered index__, the row locator is the **__clustered index key value__**.
 84: 
 85: === Clustered Index Recommendation ===
 86: 
 87: * //Create the **clustered index first**//
 88:  
 89: * If the nonclustered index are built before a clustered index, then the nonclustered row locator will contain pointer to the corresponding RID. __Creating a clustered index **later **//will modify all nonclustered indexes to contain clustered index key as a row locator value//__.
 90:  
 91: * Keep index** narrow**
 92:  
 93: * Since nonclustered index hold clustered keys as row locator, **keep the overall byte size of the clustered index as small as possible**.  Thus keep the number of columns in the clustered index to a minimum.  A column of INT data type is a good candidate, string data type will be less than optimal choice.
 94:  
 95: * Rebuild the clustered Index in a Single Step
 96:  
 97: * Because of the dependency of nonclustered indexes on the clustered index, **rebuilding** the __**clustered index**__ as separate --**DROP INDEX**-- and **CREATE INDEX** statements //__causes all the nonclustered indexes to be rebuild twice__//.
 98:  
 99: * To avoid this, __use **DROP_EXISTING** clause of the CREATE INDEX statement to **rebuild the clustered index in a single atomic step**__.  You can also **use this in nonclustered index.**
100: 
101: {{code language="SQL"}}
102: CREATE NONCLUSTERED INDEX ixn_SomeTable__SomeIndexName_ic1 
103: on dbo.SomeTable                ( Column1 )
104: include                         ( Column2, IncludeThisNewColumn3 )
105: with ( sort_in_tempdb = on, drop_existing = on, online = on, fillfactor = 95 ) on [SomeFileGroup]
106: GO
107: {{/code}}
108: 
109: ==== When to use clustered index ====
110: 
111: 
112: * Retrieving a **range of data**
113: \\Since the leaf pages of a clustered index and the data page are the same, the order of clustered index is the physical order of the data rows, then the __disk head can read all the rows sequentially, without much disk movement__.  Since physical disk head constitutes a major portion of the costs of a disk operation, sorting rows physically order on the disk optimizes I/O costs.
114: \\Foreign key (**FK**) on another table are accessed frequently in a relational systems, which is a great candidate for inclusion in the clustered index.
115:  
116: *  Retrieving **pre-sorted data**
117: \\Clustered index are particularly **efficient when data retrieval **needs to be **sorted**, which the row is already physically sorted on disk, **__eliminating overhead of sorting the data after it is retrieved__**.
118: 
119: ==== When not to use Clustered Index ====
120: 
121: * Frequently **Updatable columns**
122: \\This will cause the row locator of all the non clustered indexes to be updated accordingly, significantly **increase the cost of the relevant action query**.  Also affects database concurrency by **blocking all other queries**.
123:  
124: * **Wide keys**
125: \\Non clustered index holds clustering key as row locator, for performance reasons, avoid creating clustered index on a very wide column.
126:  
127: * Too many **__concurrent inserts in sequential order__**
128: \\If adding many rows concurrently, then it may be a better for performance to **__distribute__ those across the data pages of the table**.  If you add all the rows in the same order as imposed by clustered index, then all the inserts will be attempted on the last page of the table, which may **cause a huge "hot spot" on sector of the disk**.
129: \\The __inserts can be **randomized** throughout the table by **creating a clustered index on another column** that **doesn’t arrange the row in the same order as the new rows**__.
130: \\The caveats to this recommendation, __allowing inserts at the bottom of the row, prevents page splits on the intermediate page of the B-tree to accommodate the new rows__.
131: \\However, if the **disk hot spot becomes a bottleneck**, __reduce the fill factor of the table to likely prevent page split__.  In addition, the "hot" pages
132: 
133: == Non Clustered Index ==
134: 
135: It does not affect the order of the data in the table pages, because the leaf page of a nonclustered index and the data page of the table are separate.
136: 
137: * __Nonclustered index is most useful when all you want to retrieve is a **small number of rows from a large table**__.  //As the **number of rows increases**, the **overhead costs of the bookmark lookup rises proportionately**//.  To retrieve a small number of rows from a table, the index columns should have a **very high selectivity.**
138:  
139: * Clustered index can be an **alternative index** for (since it is not suitable for clustered keys, but remain cautious)
140: ** frequently **updatable columns**  (It doesn’t affect other indexes, update is limited to base table and the non clustered key)
141: ** **wide keys**  (not as costly as clustered index, it doesn’t increase the size of other index).
142: ** __Nonclustered index also helps **resolve blocking and deadlocking issues**__
143: * Nonclustered index are **not**__ suitable for **querying large number of rows**__, which are__ better served by **clustered index**__, which **__doesn't require a separate bookmark lookup__**, which **require additional logical reads**.
144: 
145: === Non Clustered Index Maintenance ===
146: 
147: The row locator value of the nonclustered indexes continues to have the same clustered index value, even when the clustered index rows are physically relocated.
148: \\To optimize this maintenance costs, SQL Server adds a pointer to the old data page to point to the new data page after a page split,instead of updating the row locator of all relevant nonclustered indexes.
149: \\Although it reduces maintenance costs, it increase navigation costs from the nonclustered index row to the data row, since extra link is added between the old data page and the new data page.
150: \\Having a clustered index as the row locator decreases this overhead associated with the non-clustered index.
151: 
152: = Defining the Bookmark Lookup =
153: 
154: When the query requests **columns** that are __**not part** of the **nonclustered index**__ chosen by the optimizer, **a lookup is required**.
155: \\This maybe a **key lookup** __when going against a clustered index or an RID lookup__ when perform against a heap.
156: \\Common term for this is **//bookmark lookup//**.  It __fetches the corresponding data row from the table by following the row locator value__ from the index row, **requiring a logical read** on the data page **besides the logical read on the index page**.
157: \\However, __if the columns required by the query are available in the index itself, then access to the data page is not required__, this is known as the //**covering index**.//
158: \\**Bookmark lookup are the reason that large resultsets are better served with a clustered index**.  A **clustered index** does __**not **__require a **bookmark lookup**, since the **leaf pages** and the **data pages** for a **clustered index** are the **same**.
159: 
160: = Covering Index =
161: 
162: A covering index is a **__nonclustered index__** built upon __all the columns required to satisfy a SQL query **without going to the base table**__.
163: \\If a query encounters an index and does not need to refer to the underlying data table at all, then the index can be considered a covering index.
164: \\Adding column using the **__INCLUDE__** statement makes this functionality easier to achieve __without adding to the number of columns in an index or the size of the index key__ since the **included columns are stored only at the leaf level of the index**.
165: \\The INCLUDE is best used in the following cases:
166: 
167: * You **__dont want to increase the size of the index keys__**, but you still want to make the index a **covering index**
168:  
169: * You're indexing a data type that can't be indexed (except text, ntext, and images).
170:  
171: * You've already exceeded the maximum number of key columns for an index.
172: 
173: = Index Intersections =
174: 
175: If a table has multiple indexes, then SQL Server can use multiple indexes to execute a query, selecting a small subsets of data based on each index and then performing the **INTERSECTION** of the **two subsets**.
176: \\SQL Server can __exploit **multiple indexes**__ on a table and then __employ a join algorithm to obtain the **index intersection** between the two subsets__.
177: \\The existence of multiple nonclustered index is possible when an existing nonclustered index key may be already quite wide.
178: 
179: = Index Joins =
180: 
181: The index joins is a __variation of **index intersection**__, where the covering index technique is applied to the index intersection.  If no single index covers a query but __multiple indexes together can cover the query__, SQL Server can use an indexes to satisfy the query fully without going to the base table..
182: 
183: = Filtered Index =
184: 
185: Is a nonclustered index that uses a filter, basically a WHERE clause, **__to create a highly selective set of keys__** against a column or columns that may not have a good selectivity.
186: \\For example, a column with a __large number of NULL values may be stored as a sparse column to reduce the overhead of those null values__.  Adding filtered index will allow you to have an index available on the data that is not null.
187: 
188: = Index Views =
189: 
190: A database view in SQL Server is __like a **virtual table**__ that represents the output of a SELECT statement.  __In general it doesn’t stored any data__ ~-~- every time a view is queried, it further queries the underlying tables by executing its associated SELECT statement.
191: \\A database view can be **__materialized__** on the disk by __creating a **unique clustered index** on the view__, referred to as **index view** or **materialized view**, __**persisted in physical storage** in the database__, saving overhead of performing costly operations during query execution.  After the view is materialized, multiple NON clustered indexes can be created on the indexed view.
192:  
193: 
194: Benefit:
195: 
196: * **Aggregation** can be** precomputed** and stored, to minimize expensive computation
197: * Table can be **pre-joined**
198: * Combination of joins and computation can be materialized
199: 
200: Overhead:
201: 
202: * **Any change in base table** has to be reflected in the indexed view by executing the view's SELECT statement.
203: * Any changes to base table may initiate the nonclustered indexes on the index view.
204: * Additional storage is required
205: 
206: = Index Compression =
207: 
208: Was __introduced in SQL Server 2008__, compressing the index means __getting more key information onto a single page__, which can __lead to **performance improvements**__ because __fewer pages__ are needed to store the index.
209: \\There __will be **overhead on CPU** and memory__ as the key values in the __index are compressed and decompressed__, so this may not be a solution for all indexes.
210: \\There are two types of compression:  **row and page level compression**. Non leaf pages in an index receive no compression under the page type.