Skip to content
Fix Code Error

How does database indexing work?

March 13, 2021 by Code Error
Posted By: Anonymous

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?

For information on queries to index a field, check out How do I index a database column.

Solution

Why is it needed?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire tablespace must be searched at N block accesses.

Whereas with a sorted field, a Binary Search may be used, which has log2 N block accesses. Also since the data is sorted given a non-key field, the rest of the table doesn’t need to be searched for duplicate values, once a higher value is found. Thus the performance increase is substantial.

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

The downside to indexing is that these indices require additional space on the disk since the indices are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

Example 1 – sorted vs unsorted fields

Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2 – indexing

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. To find the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file systems size limits, careful thought must be used to select the correct fields to index.

Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would be simply a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 1,000 records. With such a low cardinality the effectiveness is reduced to a linear sort, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.

Answered By: Anonymous

Related Articles

  • What exactly is std::atomic?
  • Convert array to nested JSON object - Angular Material tree
  • How can i display data from xml api in flutter?
  • Navbar not filling width of page when reduced to mobile view
  • How do I use arrays in C++?
  • Callback functions in C++
  • How to smoothly hide part of an element
  • Good way of getting the user's location in Android
  • Jquery fadeToggle Trouble
  • How do I keep only the first map and when the game…
  • Logging best practices
  • Centering in CSS Grid
  • Memcached vs. Redis?
  • How to change the color of vaadin-select-text-field…
  • What does "dereferencing" a pointer mean?
  • SQL query return data from multiple tables
  • Creating an dynamic array, but getting segmentation…
  • Fastest way to iterate over all the chars in a String
  • Why my "title" bloc doesn't go to bottom?
  • Database development mistakes made by application developers
  • What's the difference between the atomic and…
  • How does PHP 'foreach' actually work?
  • Show only certain items at certain condition on…
  • Is it possible to apply CSS to half of a character?
  • What is the difference between NULL, '' and 0?
  • Implementation of user defined array on stack and/or…
  • InfluxDB - ERR_EMPTY_RESPONSE - Used to come up in…
  • How do I include certain conditions in SQL Count
  • How to make Javascript font responsive?
  • The definitive guide to form-based website authentication
  • What does "atomic" mean in programming?
  • Active tab issue on page load HTML
  • python 3.2 UnicodeEncodeError: 'charmap' codec can't…
  • Polymer 1.0 'array-style' path accessors,…
  • Text size and different android screen sizes
  • JNA - How to turn `char ***devices`…
  • Vue + VUEX + Typescript + Vue Router. component not…
  • Start redis-server with config file
  • Unresponsive hamburger icon in HTML
  • What is a NullReferenceException, and how do I fix it?
  • How should a model be structured in MVC?
  • data.table vs dplyr: can one do something well the…
  • What are the new features in C++17?
  • Firestore, Security Rules, how to limit a size of…
  • What is the difference between const int*, const int…
  • How do SO_REUSEADDR and SO_REUSEPORT differ?
  • C++ Passing Pointer to Function (Howto) + C++…
  • Why does the function named "traverse" not work on my code?
  • When to use AtomicReference in Java?
  • 3 column layout HTML/CSS
  • Correct format specifier to print pointer or address?
  • Having trouble with my nav bar/header, It used to…
  • What is a "cache-friendly" code?
  • Volatile Vs Atomic
  • "Large data" workflows using pandas
  • vaadin combobox load wrong custom style
  • Hibernate: ids for this class must be manually…
  • How to get current location in Android
  • SDL_SetRenderTarget() hangs for a while if multiple…
  • C++11 introduced a standardized memory model. What…
  • What is the difference between localStorage,…
  • What is the difference between atomic / volatile /…
  • Android + Pair devices via bluetooth programmatically
  • How to upload images to an auto-generated doc…
  • What database does Google use?
  • Footer not staying at the bottom when scrolling
  • With the microservices outbox pattern, can I write…
  • The static keyword and its various uses in C++
  • Including npm packages in Vue.js app
  • How to get the current location latitude and…
  • Head pointer not accessible when creating a method…
  • How can I find the product GUID of an installed MSI setup?
  • How to import XML file into MySQL database table…
  • How can I persist local storage data from one page…
  • How to allocate aligned memory only using the…
  • How to fix overlapping issue in the Qweb reports?
  • AWS EFS vs EBS vs S3 (differences & when to use?)
  • Using media breakpoints in Bootstrap 4-alpha
  • CSS media queries for screen sizes
  • Playing HTML5 video on fullscreen in android webview
  • Do we really need to use Interlocked for global int…
  • Geofire query events not firing
  • Simple way to query connected USB devices info in Python?
  • Component Inheritance with vue js
  • How do you clear the SQL Server transaction log?
  • Python Linked List
  • UITableView with fixed section headers
  • What is a smart pointer and when should I use one?
  • Relational room database: The class must be either…
  • Smart way to truncate long strings
  • Android Location Providers - GPS or Network Provider?
  • What are the undocumented features and limitations…
  • Xamarin iOS new StackTrace() kills App: Assertion…
  • Virtual Memory Usage from Java under Linux, too much…
  • VUE Error when run test unit
  • show and hide previous and next button jQuery
  • How can I show current location on a Google Map on…
  • Pandas - Get first row value of a given column
  • What does "Fatal error: Unexpectedly found nil while…
  • Find location of a removable SD card

Disclaimer: This content is shared under creative common license cc-by-sa 3.0. It is generated from StackExchange Website Network.

Post navigation

Previous Post:

Pad a string with leading zeros so it’s 3 characters long in SQL Server 2008

Next Post:

Programmatically navigate using React router

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

.net ajax android angular arrays aurelia backbone.js bash c++ css dataframe ember-data ember.js excel git html ios java javascript jquery json laravel linux list mysql next.js node.js pandas php polymer polymer-1.0 python python-3.x r reactjs regex sql sql-server string svelte typescript vue-component vue.js vuejs2 vuetify.js

  • you shouldn’t need to use z-index
  • No column in target database, but getting “The schema update is terminating because data loss might occur”
  • Angular – expected call-signature: ‘changePassword’ to have a typedeftslint(typedef)
  • trying to implement NativeAdFactory imports deprecated method by default in flutter java project
  • What should I use to get an attribute out of my foreign table in Laravel?
© 2022 Fix Code Error