When new programmers start out, their main goal is just to make the code work correctly. However, as they gain experience, they learn that code quality has other important measures. One key measure is code maintainability, which means the code is readable, well-organized, and modular. Another crucial aspect is code efficiency, which is how fast the code runs. For example, two functions can do the same thing, but one might be much faster:

# Version 1
def print_numbers_version_one():
	number = 2
	while number <= 100:
		# If number is even, print it:
		if number % 2 == 0:
			print(number)
		number += 1
		
# Version 2
def print_numbers_version_two():
	number = 2
	while number <= 100:
		print(number)
		# Increase number by 2, which, by definition,
		# is the next even number:
		number += 2

The example given shows that iterating (looping) through a list and checking for even numbers (Version 1) is twice as slow as simply skipping non-even numbers by incrementing by 2 (Version 2).

Data Structures

Data is a general term for all types of information (like numbers and strings). Data structures are about how data is organized. The same data can be organized in different ways, such as independent variables or within an array. The way you choose to organize your data is extremely important because it can significantly impact how fast your program runs—sometimes by a huge margin. Choosing the right data structure can be the difference between software that works under a heavy load and software that crashes.

This chapter starts by analyzing the performance of two specific data structures: arrays (or lists) and sets.

The Array: The Foundational Data Structure 📝

The array (often called a list in Python) is one of the most basic and versatile ways to store a list of data elements:

  • The size of an array is simply the number of items it holds.
  • The index is the number that identifies the specific location of a data element within the array. Index counting usually starts at 0.

Data Structure Operations

To understand how a data structure performs, we analyze its speed for four common ways we interact with it, known as operations:

  1. Read: Looking up a value at a specific spot (like looking up an item using its index in an array).
  2. Search: Looking for a particular value to see if it exists and, if so, finding its index (like checking if “dates” is on a shopping list).
  3. Insert: Adding a new value to the data structure (like adding “figs” to a shopping list).
  4. Delete: Removing a value from the data structure (like removing “bananas” from a shopping list).

Measuring Speed

When we measure how fast or efficient an operation is, we don’t look at the time it takes (like seconds). Instead, we measure it by the number of steps (or computational steps) it takes.

  • Why steps, not time? Measuring time is unreliable because the result constantly changes based on the hardware the code runs on (it will be different on an old computer versus a new one).
  • Measuring the number of steps is consistent. If Operation A takes 5 steps and Operation B takes 500 steps, Operation A will always be faster, regardless of the machine.

Measuring the speed of an operation by the number of steps it takes is called measuring its time complexity. The terms speed, time complexity, efficiency, performance, and runtime are used interchangeably and all refer to the number of steps.

Reading

The Reading operation is looking up the value at a particular index inside an array.

A computer can perform the Read operation in just one step. This is because of how arrays are stored in the computer’s memory. Here’s how:

  1. Memory Cells: Computer memory is like a grid of cells, and each cell has a unique memory address (like a street address).
  2. Contiguous Storage: When a program creates an array, the computer finds a set of empty cells that are all right next to each other (contiguous) in memory:
  3. Every cell in a computer’s memory has a specific address. It’s sort of like a street address (for example, 123 Main St.), except that it’s represented with a number. Each cell’s memory address is one number greater than the previous cell’s address:
  4. Direct Access:
    • The computer can jump directly to any specific memory address in one step.
    • The computer remembers the starting memory address of the array (the location of index 0).
  5. Simple Calculation: To find the value at any other index (say, index 3), the computer simply takes the starting address and adds the index number. (Since the cells are sequential, index 3 will be exactly 3 addresses past index 0).

Since the computer can calculate the address and then jump straight to that location in a single step, Reading from an array is an efficient and fastest type of operation.

Searching

Searching an array means looking for a specific value to find out two things: if it exists and, if so, at what index it’s located. It is the opposite of Reading, where you give the computer the index and it gives you the value.

While Reading is fast (one step) because the computer can jump immediately to any memory address/index, Searching is less efficient.

  • A computer has immediate access to its memory addresses, but it does not know what values are stored at those addresses beforehand.
  • To search for a value, the computer has no choice but to inspect each cell, one at a time, until it finds the item. This process is called linear search.

The number of steps the search takes depends on where the value is located:

  • If the value is found at index 3, the search took 4 steps (checking indices 0, 1, 2, and 3).
  • The maximum number of steps is equal to the size of the array. This happens in two scenarios:
    1. The value you are looking for is in the very last cell.
    2. The value you are looking for is not in the array at all (the computer must check every single cell to confirm it’s missing).

If an array has N cells (size N), a linear search could take a maximum of N steps. Because searching can take many steps depending on the array’s size, it is less efficient than reading, which always takes only one step.

Insertion

The efficiency of adding new data into an array depends on where you put it.

Adding a new item to the end of an array is fast, taking only one step:

  • This works because the computer keeps track of the array’s size and its starting memory address.
  • It can quickly calculate the address for the next spot and insert the new value there immediately.
  • Note: Sometimes, if the array is full, the computer may have to automatically find and allocate a new, larger block of memory for the array, but this is often hidden from the programmer.

Inserting a new item at the beginning or in the middle of an array is a different, slower process. To make room for the new value, existing data needs to be shifted to the right:

  • This takes multiple steps: you must move elements one by one before the final insertion can happen.
  • The worst-case scenario is inserting at the very beginning of the array. In this case, all other values have to be shifted over.
  • If the array has N elements, the worst-case insertion takes N + 1 steps (N steps to shift all the data, plus 1 step for the final insertion).

Deletion

Deletion is about removing a value at a specific index in an array.

The actual removal of the item takes only one step. However, arrays don’t work well with gaps in the middle. Because of this, after an item is deleted, the process requires extra work: all the elements that followed the deleted item must be shifted to the left to close the empty space.

Like insertion, the worst-case scenario for deletion is removing the element at index 0 (the very first item). This forces the program to shift all remaining elements to the left. If the array has N elements, the maximum number of steps deletion can take is N steps (1 step for the delete, and steps to shift the remaining data).

The Efficiency Summary

You’ve analyzed the speed (or time complexity) of the foundational array data structure.

  • The Read operation is the fastest, always taking just 1 step.
  • The Search operation can take up to N steps in the worst case, as the computer may have to check every single element.
  • Insertion and Deletion can also be slow, potentially taking up to N or N+1 steps if you insert or delete from the beginning, because of all the necessary data shifting.

Understanding these performance differences is crucial. Choosing between an array and another data structure, like the set (which you’ll look at next), can greatly impact how fast your software runs.

Sets: How a Single Rule Can Affect Efficiency

A set is a data structure that does not allow duplicate values. This simple rule, or constraint, is the only difference between an array and the array-based set being discussed.

Sets are useful when you need to ensure unique data, such as making sure a phone number is listed only once in a phone book.

For most operations, the set behaves exactly like the array:

  • Reading: It takes 1 step (same as an array). The computer can instantly calculate and jump to the memory address of any index.
  • Searching: It takes up to N steps (same as an array). The computer must check each value one by one.
  • Deletion: It takes up to N steps (same as an array). This is due to the deletion itself plus the necessary shifting to close the gap.

Insertion is where the array and the set are very different in terms of speed. The problem is that before a set can insert a new value, it must first search the entire set to ensure the value doesn’t already exist (because duplicates aren’t allowed):

  1. Search First: The computer has to perform a full linear search (up to N steps) on the set.
  2. Then Insert: Only once the search confirms the value is unique can the actual insertion step occur.

Efficiency Breakdown:

  • Best Case (Inserting at the end): Even inserting at the end is slow because of the required search. It takes up to N + 1 steps (N steps to search every element + 1 step for the final insertion). This is much slower than the single step required for insertion at the end of a regular array.
  • Worst Case (Inserting at the beginning): This is the slowest scenario, requiring two major phases:
    1. Search: N steps to check for duplicates.
    2. Shift/Insert: N steps to shift all existing data to the right, plus 1 step for the final insertion.
      • This totals steps. This is far less efficient than the steps needed for worst-case insertion into a regular array.

Conclusion: You shouldn’t avoid sets; they are essential when you need to prevent duplicate data. However, if uniqueness isn’t a requirement, an array is generally more efficient for insertion. You must always analyze your application’s needs to choose the best data structure.