Introduction to Algorithms

Study Notes on Introduction to Algorithms
Lectures by Charles Leiserson and Erik Demaine
Notes by Qianqian Shan
Contents
1 Analysis of Algorithms, Insertion- and Merge-Sort
2 Asymptotic Notation, Recurrences, Substitution, Master Method
3 Divide and Conquer (Divide and Rule): Strassen, Fibonacci, Polynomial Multiplication
4 Quick Sort (Tony Hoare, 1962) and Randomized Quick Sort
5 Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort
6 Order Statistics, Median
7 Hashing, Hash Functions
8 Universal Hashing, Perfect Hashing
9 Relation of BSTs to Quicksort - Analysis of Random BST
10 Red-black Tree, Rotations, Insertions, Deletions
11 Augmenting Data Structures, Dynamic Order Statistics, Interval Trees
12 Skip Lists
13 Amortized Algorithms, Table Doubling, Potential Method
14 Competitive Analysis: Self-organizing Lists
15 Dynamic Programming, Longest Common Subsequence
16 Greedy Algorithms, Minimum Spanning Trees
17 Shortest Paths I: Properties, Dijkstra's Algorithm, Breadth-first Search
18 Shortest Paths II: Bellman-Ford, Linear Programming, Difference Constraints
19 Shortest Paths III: All-pairs Shortest Paths, Matrix Multiplication, Floyd-Warshall, Johnson
20 Advanced Topics: Parallel Algorithms I
21 Advanced Topics (cont.)
Background
The following notes were taken while I studied the open course Introduction to Algorithms, taught by Prof. Charles Leiserson and Prof. Erik Demaine, at https://ocw.mit.edu.
1 Analysis of Algorithms, Insertion- and Merge-Sort
Algorithm analysis is the theoretical study of computer-program performance and resource usage.
Q: In programming, what's more important than performance?
A: Features, modularity, security, user-friendliness, ...
Q: Why study algorithms and performance?
A: 1. Performance draws the line between the feasible and the infeasible. 2. Algorithms give a pervasive language for talking about program behavior.
Problem: Sorting
Input: a sequence <a_1, a_2, ..., a_n> of numbers.
Output: a permutation <a'_1, a'_2, ..., a'_n> of the input such that a'_1 ≤ a'_2 ≤ ··· ≤ a'_n.
Algorithm 1 Insertion-Sort(A, n)
1: for j ← 2 to n do
2:   key ← A[j]
3:   i ← j − 1
4:   while i > 0 and A[i] > key do
5:     A[i + 1] ← A[i]
6:     i ← i − 1
7:   end while
8:   A[i + 1] ← key
9: end for
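For reference, a minimal runnable Python version of the same procedure (my own transcription of Algorithm 1, using 0-based indexing; the function name is mine):

def insertion_sort(a):
    """Sort the list a in place, mirroring Algorithm 1 with 0-based indexing."""
    for j in range(1, len(a)):            # corresponds to j = 2..n
        key = a[j]
        i = j - 1
        while i >= 0 and a[i] > key:      # shift larger elements one slot right
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return a

# insertion_sort([6, 10, 13, 5, 8, 3, 2, 11]) -> [2, 3, 5, 6, 8, 10, 11, 13]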
Running time of the algorithm:
- Depends on the input (e.g., already sorted).
- Depends on the input size (6 elements vs. 6 × 10^9 elements); parameterize the running time by the input size.
- We want upper bounds on the running time, because an upper bound is a guarantee to the user.
Kinds of analysis:
- Worst-case (usually): T(n) = maximum time on any input of size n.
- Average-case (sometimes): T(n) = expected time over all inputs of size n (needs an assumption about the statistical distribution of inputs).
- Best-case (bogus): easy to cheat.
Q: What is insertion sort's worst-case time?
A: It depends on the computer (relative speed on the same machine, absolute speed on different machines).
BIG IDEA! Asymptotic analysis:
1. Ignore machine-dependent constants.
2. Look at the growth of T(n) as n → ∞.
Asymptotic notation:
Θ: drop the low-order terms and ignore the leading constants. For example, 3n^3 + 90n^2 + 6046 = Θ(n^3). As n → ∞, a Θ(n^2) algorithm always beats a Θ(n^3) algorithm.
Insertion sort analysis:
Worst case: input reverse-sorted,
T(n) = Σ_{j=2}^{n} Θ(j) = Θ(n^2),
an arithmetic series. For small n the algorithm is moderately fast, but not at all for large n.
Algorithm 2 Merge-Sort(A, n)
1: If n = 1, done.   Θ(1)
2: Recursively sort A[1..⌈n/2⌉] and A[⌈n/2⌉+1..n].   2T(n/2)
3: Merge the two sorted lists.   Θ(n) (linear-time merge)
The running time is
T(n) = Θ(1) if n = 1 (usually omitted), and T(n) = 2T(n/2) + Θ(n) if n > 1.
Use a recursion tree to solve T(n) = 2T(n/2) + cn with constant c > 0:
T(n) = cn + T(n/2) + T(n/2)
     = cn + c(n/2) + c(n/2) + T(n/4) + T(n/4) + T(n/4) + T(n/4)
     = ···
     = cn·log n + Θ(n) = Θ(n log n).
Note: log is an abbreviation for log_2, not the natural log.
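A short Python sketch of the same recurrence in code (my own helper names; not tuned, just illustrating the divide/conquer/merge structure):

def merge_sort(a):
    if len(a) <= 1:                       # T(1) = Theta(1)
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])            # 2 T(n/2)
    right = merge_sort(a[mid:])
    return merge(left, right)             # Theta(n) linear-time merge

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]     # append the leftover tail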
2 Asymptotic Notation, Recurrences, Substitution, Master Method
Asymptotic notation:
1. Upper bounds.
f(n) = O(g(n)) means there are constants c > 0, n_0 > 0 such that 0 ≤ f(n) ≤ c·g(n) for all sufficiently large n ≥ n_0. For example, 2n^2 = O(n^3).
Set definition of O:
O(g(n)) = {f(n) : there exist c > 0, n_0 > 0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n_0}.
Macro convention: a set in a formula represents an anonymous function in that set.
Ex: f(n) = n^3 + O(n^2) means there is a function h(n) = O(n^2) such that f(n) = n^3 + h(n).
Table 1: Analogies between asymptotic notations and comparison relations: O corresponds to ≤, Ω to ≥, Θ to =, o to <, and ω to >.
Ex: n^2 + O(n) = O(n^2). Here '=' means that everything on the left-hand side of the equation is also on the right-hand side: anything that is n^2 + O(n) is also O(n^2), but not the other way around (the relation is not symmetric). Interpretation: for any f(n) ∈ O(n), there is an h(n) ∈ O(n^2) such that n^2 + f(n) = h(n).
2. Lower bounds.
Ω(g(n)) = {f(n) : there exist constants c > 0, n_0 > 0 s.t. 0 ≤ c·g(n) ≤ f(n) for all n ≥ n_0}.
Ex: √n = Ω(log n): up to constant factors, √n is at least log n for sufficiently large n.
3. Θ(g(n)) = O(g(n)) ∩ Ω(g(n)).
4. Little-o corresponds to <, and ω corresponds to >. Their definitions are similar to O and Ω; the difference is that the relationship between f and g must hold for every constant c > 0.
Ex: 2n^2 = o(n^3), with n_0 = 2/c.
Ex: (1/2)n^2 = Θ(n^2), but (1/2)n^2 ≠ o(n^2).
Solving recurrences
1. Substitution method:
- Guess the form of the solution.
- Verify that the recurrence satisfies this bound by induction.
- Solve for the constants.
Ex: T(n) = 4T(n/2) + n, trivial case T(1) = Θ(1).
(a) Guess T(n) = O(n^3).
(b) Assume T(k) ≤ c·k^3 for k < n.
(c) Then
T(n) = 4T(n/2) + n
     ≤ 4c(n/2)^3 + n
     = (1/2)c·n^3 + n
     = c·n^3 − ((1/2)c·n^3 − n)
     ≤ c·n^3, provided (1/2)c·n^3 − n ≥ 0.
This bound is valid but not tight.
Ex: Prove T(n) = O(n^2).
(a) Assume T(k) ≤ c·k^2 for k < n.
(b) Then
T(n) = 4T(n/2) + n
     ≤ 4c(n/2)^2 + n
     = c·n^2 + n,
which is NOT ≤ c·n^2.
The statement is true but useless for completing the induction. To fix this, assume a strengthened induction hypothesis, T(k) ≤ c_1·k^2 − c_2·k for k < n:
T(n) = 4T(n/2) + n
     ≤ 4[c_1(n/2)^2 − c_2(n/2)] + n
     = c_1·n^2 + (1 − 2c_2)n
     = c_1·n^2 − c_2·n − (c_2 − 1)n
     ≤ c_1·n^2 − c_2·n,
provided (c_2 − 1)n ≥ 0, i.e., c_2 ≥ 1. For the base case, T(1) = Θ(1) ≤ c_1 − c_2 holds if c_1 is sufficiently large w.r.t. c_2.
2. Recursion-tree method: a particular way of adding up a recurrence.
Ex: T(n) = T(n/4) + T(n/2) + n^2, see Figure 1.
Total cost (level by level):
T(n) = (1 + 5/16 + (5/16)^2 + ··· + (5/16)^k + ···)·n^2 ≤ 2n^2 = Θ(n^2),
since Σ_{k=0}^{∞} (5/16)^k < Σ_{k=0}^{∞} (1/2)^k = 2.
Figure 1: Recursion tree of T(n) = T(n/4) + T(n/2) + n^2.
3. Master method: applies only to recurrences of the particular form
T(n) = aT(n/b) + f(n),
where a ≥ 1, b > 1, and f(n) is asymptotically positive (f(n) > 0 for n ≥ n_0).
Compare f(n) with n^{log_b a} (the number of leaves in the recursion tree):
(a) f(n) = O(n^{log_b a − ε}) for some ε > 0  ⟹  T(n) = Θ(n^{log_b a}).
(b) f(n) = Θ(n^{log_b a} (log n)^k) for some k ≥ 0  ⟹  T(n) = Θ(n^{log_b a} (log n)^{k+1}).
(c) f(n) = Ω(n^{log_b a + ε}) for some ε > 0, and a·f(n/b) ≤ (1 − ε')·f(n) for some ε' > 0  ⟹  T(n) = Θ(f(n)).
Ex: T(n) = 4T(n/2) + n, with a = 4, b = 2, and f(n) = n. Then n^{log_b a} = n^2, so case (a) applies and T(n) = Θ(n^2).
Ex: T(n) = 4T(n/2) + n^2: case (b) with k = 0, so T(n) = Θ(n^2 log n).
Ex: T(n) = 4T(n/2) + n^3: case (c), so T(n) = Θ(n^3).
Ex: T(n) = 4T(n/2) + n^2/log n: the master method does not apply (n^2/log n is not polynomially smaller than n^2, and case (b) requires k ≥ 0).
Figure 2: Proof sketch tree of master method.
Proof sketch/intuition, see Figure 2:
Case (a): the per-level costs increase geometrically going down the tree; the total is dominated by the leaves, Θ(n^{log_b a}).
Case (b): the cost at each level is roughly the same, especially at the uppermost levels; the total is about f(n)·log_b n.
Case (c): the per-level costs decrease geometrically going down the tree; the total is dominated by f(n) at the root.
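As a rough illustration only (my own sketch, not part of the lecture): for driving functions that are pure powers, f(n) = n^p, comparing p with log_b a decides the case. The log-factor refinement of case (b) with k > 0 is omitted, and the comparison uses floating point, so this is only meant for small illustrative examples.

import math

def master_case(a, b, p):
    """Classify T(n) = a*T(n/b) + n^p by comparing p with log_b a."""
    crit = math.log(a, b)                 # critical exponent log_b a
    if p < crit:
        return f"case (a): T(n) = Theta(n^{crit:g})"
    if p == crit:
        return f"case (b) with k = 0: T(n) = Theta(n^{crit:g} log n)"
    return f"case (c): T(n) = Theta(n^{p})"   # regularity condition holds for pure powers

# master_case(4, 2, 1) -> case (a);  master_case(4, 2, 2) -> case (b);  master_case(4, 2, 3) -> case (c)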
3 Divide and Conquer (Divide and Rule): Strassen, Fibonacci, Polynomial Multiplication
A joke on divide and conquer: a tried and tested way of conquering a land is to divide it into sections of some kind (for example, different political factions) and make them no longer like each other, like starting a family feud.
Divide and conquer is a basic and powerful algorithm design technique.
Algorithm:
1. Divide the problem (instance) into one or more subproblems.
2. Conquer each subproblem recursively.
3. Combine the solutions.
Quick review of merge sort, phrased in divide-and-conquer steps:
1. Divide the array into two halves (trivial).
2. Recursively sort each subarray.
3. Combine with a linear-time merge.
Running time of merge sort:
T(n) = 2T(n/2) + Θ(n),
where n/2 is the size of the subproblems (two problems of size n/2) and Θ(n) is the non-recursive divide-and-combine cost. This is case (b) of the master method with k = 0, so T(n) = Θ(n log n).
Ex: Binary search
Find x in a sorted array, in the divide-and-conquer paradigm:
1. Divide: compare x with the middle element of the array.
2. Conquer: recurse in one subarray.
3. Combine: trivial.
T(n) = 1·T(n/2) + Θ(1) = Θ(log n): the problem is divided into one subproblem plus one comparison with the middle element.
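A minimal iterative Python version of this search (my own sketch; returns the index of x in a sorted list, or -1 if absent):

def binary_search(a, x):
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # divide: compare with the middle element
        if a[mid] == x:
            return mid
        elif a[mid] < x:
            lo = mid + 1                  # conquer: recurse into the right half
        else:
            hi = mid - 1                  # conquer: recurse into the left half
    return -1                             # combine: trivial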
Ex: Powering a number
Given a number x and an integer n ≥ 0, compute x^n.
Naive algorithm: x · x ··· x = x^n, Θ(n) time due to n multiplications.
Divide and conquer algorithm:
x^n = x^{n/2} · x^{n/2} if n is even;  x^n = x^{(n−1)/2} · x^{(n−1)/2} · x if n is odd.
Since the first and second factors are the same in both cases, only one of them needs to be computed.
T(n) = T(n/2) + Θ(1) = Θ(log n).
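A direct Python transcription of this recursion (a sketch for integer n ≥ 0; Python's built-in pow already does the same thing):

def power(x, n):
    """Compute x**n with Theta(log n) multiplications."""
    if n == 0:
        return 1
    half = power(x, n // 2)               # compute x^(floor(n/2)) only once
    return half * half if n % 2 == 0 else half * half * x

# power(3, 10) == 59049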
Ex: Fibonacci numbers
F_n = 0 if n = 0;  F_n = 1 if n = 1;  F_n = F_{n−1} + F_{n−2} if n ≥ 2.
Naive recursive algorithm: for each n ≥ 2, compute F(n − 1) and F(n − 2) recursively. The time is exponential, Ω(φ^n) with φ = (1 + √5)/2.
Bottom-up algorithm: compute F_0, F_1, ..., F_n in order. The time is linear in n: Θ(n).
Naive recursive squaring algorithm: F_n = φ^n/√5 rounded to the nearest integer. However, this is not allowed here: floating-point round-off can ruin the computation.
Recursive squaring algorithm: time Θ(log n). The problem is similar to powering a number with divide and conquer, but here we power a matrix.
Theorem 1. For the Fibonacci numbers,
[F_{n+1}  F_n; F_n  F_{n−1}] = [1 1; 1 0]^n   (2×2 matrices written row by row),
which implies that the time is Θ(log n) by recursive squaring of the matrix.
Proof. By induction.
Base case, n = 1:
[F_2  F_1; F_1  F_0] = [1 1; 1 0] = [1 1; 1 0]^1.
Inductive step: suppose
[F_n  F_{n−1}; F_{n−1}  F_{n−2}] = [1 1; 1 0]^{n−1};
then
[F_{n+1}  F_n; F_n  F_{n−1}] = [F_n  F_{n−1}; F_{n−1}  F_{n−2}] · [1 1; 1 0] = [1 1; 1 0]^n.
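Combining Theorem 1 with recursive squaring gives a Θ(log n) Fibonacci algorithm, counting each arithmetic operation as constant time. A small Python sketch with a hand-rolled 2×2 matrix product (the representation and names are mine):

def mat_mult(A, B):
    """Multiply 2x2 matrices given as ((a, b), (c, d))."""
    return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
            (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

def mat_pow(A, n):
    """Raise a 2x2 matrix to the n-th power (n >= 1) by recursive squaring."""
    if n == 1:
        return A
    half = mat_pow(A, n // 2)
    sq = mat_mult(half, half)
    return mat_mult(sq, A) if n % 2 else sq

def fib(n):
    if n == 0:
        return 0
    return mat_pow(((1, 1), (1, 0)), n)[0][1]   # F_n is the off-diagonal entry

# [fib(k) for k in range(8)] == [0, 1, 1, 2, 3, 5, 8, 13]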
Ex: Matrix multiplication
C = [c_ij] = A·B, with A = [a_ij] and B = [b_ij] for i, j = 1, 2, ..., n.
Standard algorithm: c_ij = Σ_{k=1}^{n} a_ik·b_kj, i.e., n operations for each of the n^2 entries, so the time is Θ(n^3).
Divide and conquer algorithm:
Idea: view an n × n matrix as a 2 × 2 block matrix of (n/2) × (n/2) sub-matrices:
C = [r s; t u] = A·B = [a b; c d] · [e f; g h],
so r = ae + bg, s = af + bh, t = ce + dg, u = cf + dh. There are 8 recursive multiplications (ae, bg, af, ...) of (n/2) × (n/2) matrices and 4 additions (Θ(n^2)).
T(n) = 8T(n/2) + Θ(n^2) = Θ(n^3),
by case (a) of the master method.
Strassen's algorithm:
Idea: reduce the number of multiplications in the divide-and-conquer algorithm from 8 to 7:
p_1 = a(f − h)
p_2 = (a + b)h
p_3 = (c + d)e
p_4 = d(g − e)
p_5 = (a + d)(e + h)
p_6 = (b − d)(g + h)
p_7 = (a − c)(e + f)
As a result,
r = p_5 + p_4 − p_2 + p_6
s = p_1 + p_2
t = p_3 + p_4
u = p_5 + p_1 − p_3 − p_7
Algorithm:
1. Divide A and B, and compute the terms needed for the products p_1, ..., p_7. Time Θ(n^2).
2. Conquer: recursively compute p_1, ..., p_7.
3. Combine r, s, t, u from the p's. Time Θ(n^2).
The recurrence is T(n) = 7T(n/2) + Θ(n^2) = Θ(n^{log_2 7}) = O(n^{2.81}) (log_2 7 is strictly less than 2.81).
This is not the best algorithm known, but it is the simplest one that beats n^3. The best exponent so far is about 2.376.
Ex: Very Large Scale Integration (VLSI) layout
Problem: embed a complete binary tree with n leaves in a grid with minimum area.
Naive embedding as shown in Figure 3:
Height of the tree: H(n) = H(n/2) + Θ(1) = Θ(log n).
Width of the tree: W(n) = 2W(n/2) + O(1) = Θ(n).
The area is Θ(n log n).
Goal: get W(n) = Θ(√n) and H(n) = Θ(√n), so that the area is Θ(n); for example, a recurrence of the form L(n) = 2L(n/4) + O(1), achieved by the H layout below, solves to Θ(√n).
H layout as shown in Figure 4:
L(n) = 2L(n/4) + Θ(1) = Θ(√n)   (case (a) of the master method).
Figure 3: Naive embedding.
Figure 4: H layout.
4 Quick Sort (Tony Hoare, 1962) and Randomized Quick Sort
1. It is a divide-and-conquer algorithm.
2. It sorts "in place" (i.e., it rearranges the elements within the array).
3. It is very practical (with tuning).
Divide and conquer paradigm:
1. Divide: partition the array into 2 subarrays around a pivot x such that the elements in the lower subarray are ≤ x and the elements in the upper subarray are ≥ x:
[ ≤ x | x | ≥ x ]
2. Conquer: recursively sort the 2 subarrays.
3. Combine: trivial.
The key is the linear-time (Θ(n)) partitioning subroutine shown in Algo. 3.
Table 2: The invariant during partition. The boundaries p and q do not move; A[p] = x is the pivot, A[p+1..i] holds elements ≤ x, A[i+1..j−1] holds elements ≥ x, and A[j..q] is not yet examined. The partition loops through all elements and moves every element smaller than the pivot to the lower side.
The running time of the partition is T(n) = Θ(n) for n elements.
Ex: Array 6, 10, 13, 5, 8, 3, 2, 11, choose pivot x = 6:
6, 5, 13, 10, 8, 3, 2, 11
6, 5, 3, 10, 8, 13, 2, 11
6, 5, 3, 2, 8, 13, 10, 11
After the final swap of the pivot into place we have two subarrays, 2, 5, 3 and 8, 13, 10, 11, that are smaller and bigger than the pivot 6 respectively.
Algorithm 3 Partition(A, p, q), i.e., partition A[p..q]
1: x ← A[p]   (pivot = A[p])
2: i ← p
3: for j ← p + 1 to q do   (loop through all elements)
4:   if A[j] ≤ x then
5:     i ← i + 1
6:     exchange A[i] ↔ A[j]   (swap)
7:   end if
8: end for
9: exchange A[p] ↔ A[i]
10: return i   (the final location of the pivot)
Algorithm 4 Quicksort(A, p, q)
1: if p < q then
2:   r ← Partition(A, p, q)
3:   Quicksort(A, p, r − 1)
4:   Quicksort(A, r + 1, q)
5: end if
Initial call: Quicksort(A, 1, n).
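A runnable Python sketch of Algorithms 3 and 4, with the randomized pivot choice described later in this section (0-based indexing; my own transcription):

import random

def partition(a, p, q):
    """Partition a[p..q] around the pivot a[p]; return the pivot's final index."""
    x, i = a[p], p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def quicksort(a, p=0, q=None):
    if q is None:
        q = len(a) - 1
    if p < q:
        r = random.randint(p, q)          # randomized pivot: swap it to the front
        a[p], a[r] = a[r], a[p]
        m = partition(a, p, q)
        quicksort(a, p, m - 1)
        quicksort(a, m + 1, q)
    return a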
Analysis (assume all elements are distinct):
1. T(n) = worst-case time.
If the input is sorted or reverse-sorted, one side of the partition has no elements:
T(n) = T(0) + T(n − 1) + Θ(n)
     = Θ(1) + T(n − 1) + Θ(n)
     = T(n − 1) + Θ(n)
     = Θ(n^2)   (arithmetic series, as for insertion sort).
The worst-case time is no better than insertion sort.
Recursion tree in Figure 5:
cn + c(n − 1) + ··· + c·1 = Θ(Σ_{k=1}^{n} ck) = Θ(n^2).
Figure 5: Quick sort tree.
The total time is then
T(n) = Θ(n) + Θ(n^2) = Θ(n^2)
(the tree height contributes Θ(n) and the per-level costs sum to Θ(n^2)).
2. For intuition only, a best-case analysis: if we are really lucky, the partition splits the array n/2 : n/2,
T(n) = 2T(n/2) + Θ(n) = Θ(n log n).
Suppose the split is always 1/10 : 9/10; then
T(n) = T(n/10) + T(9n/10) + Θ(n) = Θ(n log n);
see the recursion tree in Figure 6.
3. Suppose we alternate lucky, unlucky, lucky, ...:
L(n) = 2U(n/2) + Θ(n) if lucky,
U(n) = L(n − 1) + Θ(n) if unlucky,
then
Figure 6: (1/10) : (9/10) split tree.
L(n) = 2[L(n/2 − 1) + Θ(n/2)] + Θ(n)
     = 2L(n/2 − 1) + Θ(n)
     = Θ(n log n).
Q: How can we make sure that we are usually lucky?
A: Randomly arrange the elements, or randomly choose the pivot.
Randomized quicksort: pivot on a random element (in code, swap the first element with some other randomly chosen element of the array, then run the ordinary partition).
- The running time is independent of the input ordering.
- It makes no assumptions about the input distribution.
- No specific input elicits the worst-case behavior.
- The worst case is determined only by the random-number generator.
Analysis:
Let T(n) be the random variable for the running time, assuming the random numbers are independent.
For k = 0, 1, ..., n − 1, let
X_k = 1 if the partition generates a k : (n − k − 1) split, and 0 otherwise.
X_k is an indicator random variable with E[X_k] = Pr(X_k = 1) = 1/n.
T(n) = T(0) + T(n − 1) + Θ(n) if the split is 0 : (n − 1),
       T(1) + T(n − 2) + Θ(n) if the split is 1 : (n − 2),
       ...
       T(n − 1) + T(0) + Θ(n) if the split is (n − 1) : 0
     = Σ_{k=0}^{n−1} X_k [T(k) + T(n − k − 1) + Θ(n)].
So the expected value of T(n) is
E[T(n)] = E[ Σ_{k=0}^{n−1} X_k (T(k) + T(n − k − 1) + Θ(n)) ]
        = Σ_{k=0}^{n−1} E[ X_k (T(k) + T(n − k − 1) + Θ(n)) ]
        = Σ_{k=0}^{n−1} E[X_k] · E[T(k) + T(n − k − 1) + Θ(n)]   (independence)
        = (1/n) Σ_{k=0}^{n−1} E[T(k)] + (1/n) Σ_{k=0}^{n−1} E[T(n − k − 1)] + (1/n) Σ_{k=0}^{n−1} Θ(n)
        = (2/n) Σ_{k=0}^{n−1} E[T(k)] + Θ(n)   (the first two sums are equal)   ···(1)
        = (2/n) Σ_{k=2}^{n−1} E[T(k)] + Θ(n).
(1): absorb the k = 0, 1 terms into Θ(n) for technical convenience.
Now prove E[T(n)] ≤ a·n·log n for some constant a > 0.
Proof. Choose a big enough so that a·n·log n ≥ E[T(n)] for small n.
Use the fact
Σ_{k=2}^{n−1} k·log k ≤ (1/2)n^2 log n − (1/8)n^2,
which can be proved either by splitting the summation or by an integral bound.
Use substitution:
E[T(n)] = (2/n) Σ_{k=2}^{n−1} E[T(k)] + Θ(n)
        ≤ (2/n) Σ_{k=2}^{n−1} a·k·log k + Θ(n)
        ≤ (2a/n) [ (1/2)n^2 log n − (1/8)n^2 ] + Θ(n)
        = a·n·log n − ( a·n/4 − Θ(n) )
        ≤ a·n·log n, if a·n/4 − Θ(n) ≥ 0, i.e., a is big enough.
In practice, randomized quicksort is typically three or more times faster than merge sort. It also tends to work well with caches and with virtual memory.
5 Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort
Q: How fast can we sort?
A: It depends on the computational model, i.e., on what we are allowed to do with the elements. For example: randomized quicksort Θ(n log n), heapsort Θ(n log n), merge sort Θ(n log n), insertion sort Θ(n^2).
Q: Can we do better than Θ(n log n)?
A:
Comparison sorting algorithms (model): only use comparisons to determine the relative order of elements.
Decision-tree model: more general than the comparison model. It is a graphical representation of what the algorithm does.
Ex: Suppose we'd like to sort <a_1, a_2, a_3>; see Figure 7.
In general, with <a_1, ..., a_n> to be sorted, each internal node has a label i : j with i, j ∈ {1, 2, ..., n}, meaning "compare a_i vs. a_j".
Figure 7: Decision tree for 3 elements.
The left subtree shows the subsequent comparisons if a_i ≤ a_j.
The right subtree shows the subsequent comparisons if a_i > a_j.
Each leaf gives a permutation <π(1), π(2), ..., π(n)> such that a_{π(1)} ≤ a_{π(2)} ≤ ··· ≤ a_{π(n)}.
Decision trees model comparison sorts:
- one tree for each input size n;
- view the algorithm as splitting into two branches whenever it compares two elements;
- the tree lists all possible executions of the algorithm (instruction traces);
- the running time (number of comparisons) of one execution is the length of the corresponding root-to-leaf path;
- therefore, the worst-case running time over all inputs of length n is the longest root-to-leaf path, i.e., the height of the tree.
Lower bound for decision-tree sorting:
Theorem 2. Any decision tree that sorts n elements has height Ω(n log n).
Proof. Facts to use:
- The number of leaves must be at least n! to get the right answer for every possible input.
- A binary tree of height h has at most 2^h leaves.
So n! ≤ 2^h; taking log_2 of both sides,
h ≥ log(n!)
  ≥ log[(n/e)^n]   (Stirling)
  = n log n − n log e
  = Ω(n log n).
Corollary: merge sort and heapsort are asymptotically optimal (in terms of growth in n) among comparison sorting algorithms. Randomized quicksort is asymptotically optimal in expectation.
Sorting in linear time:
Two algorithms that sort faster than n log n by stepping outside the comparison model.
1. Counting sort.
Input: A[1..n], where each A[j] ∈ {1, 2, ..., k}.
Output: B[1..n], a sorted permutation of A.
Auxiliary storage: C[1..k], indexed by the range of the input values.
Algorithm 5 Counting sort
1: for i ← 1 to k do
2:   C[i] ← 0   (initialize the counts C)
3: end for
4: for j ← 1 to n do
5:   C[A[j]] ← C[A[j]] + 1   (now C[i] = |{j : A[j] = i}|, the number of elements equal to i)
6: end for
7: for i ← 2 to k do
8:   C[i] ← C[i] + C[i − 1]   (now C[i] = |{j : A[j] ≤ i}|)
9: end for
10: for j ← n downto 1 do
11:   B[C[A[j]]] ← A[j]
12:   C[A[j]] ← C[A[j]] − 1   (distribution step)
13: end for
Figure 8: Radix sort example operation.
Ex: Suppose A = [4, 1, 3, 4, 3]; apply the algorithm for a simple demonstration (a runnable sketch follows below).
(a) The running time of the four loops is Θ(k) + Θ(n) + Θ(k) + Θ(n) = Θ(n + k); when k = O(n), counting sort takes Θ(n) time.
(b) If the numbers being sorted are small, counting sort may work well; however, the auxiliary storage becomes excessive when k is too big.
(c) Counting sort is NOT a comparison sort.
(d) There are no comparisons between elements in counting sort.
(e) An important property of counting sort is stability: it preserves the input order of equal elements.
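The runnable sketch mentioned above, in Python (my own transcription of Algorithm 5; values are assumed to lie in {1, ..., k}):

def counting_sort(a, k):
    """Stable Theta(n + k) sort of a list a with values in {1, ..., k}."""
    c = [0] * (k + 1)
    for x in a:                     # c[i] = number of elements equal to i
        c[x] += 1
    for i in range(2, k + 1):       # prefix sums: c[i] = number of elements <= i
        c[i] += c[i - 1]
    b = [None] * len(a)
    for x in reversed(a):           # scan from the right so equal keys stay in order
        b[c[x] - 1] = x             # c[x] is a 1-based output position
        c[x] -= 1
    return b

# counting_sort([4, 1, 3, 4, 3], 4) -> [1, 3, 3, 4, 4]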
2. Radix sort. It sorts a much larger range of numbers in linear time. It is one of the oldest sorting algorithms [Herman Hollerith, 1890]. It sorts on the least-significant digit first, using an auxiliary stable sort for each digit.
Ex: Figure 8.
(a) Correctness of radix sort: induct on the digit position t.
i. Assume by induction that the input is sorted on the low-order t − 1 digits.
ii. Sort on digit t. If two elements have the same t-th digit, the stable sort keeps them in their current order, so by the induction hypothesis they remain sorted; if two elements have different t-th digits, the sort on digit t puts them in the right order.
(b) Analysis.
i. Use counting sort for each digit.
ii. Say we have n integers, each b bits long (range [0, 2^b − 1]). Split each integer into b/r "digits" of r bits each (think of the number as written in base 2^r). Then b/r is the number of rounds, and the base 2^r plays, in some sense, the role of k in counting sort.
Ex: split a 32-bit word (b = 32) into 4 digits (r = 8).
iii. Running time:
T(n, b) = Θ((b/r) · (n + 2^r)),
i.e., the number of rounds times the time per round.
iv. Choose r to minimize T(n, b), e.g., by differentiating with respect to r and setting the derivative to zero.
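A Python sketch of LSD radix sort on non-negative integers, using r-bit digits; Python's stable built-in sort stands in for the per-digit counting sort (parameter names and defaults are mine):

def radix_sort(a, b=32, r=8):
    """Sort non-negative b-bit integers by r-bit digits, least-significant digit first."""
    mask = (1 << r) - 1
    for shift in range(0, b, r):                    # b/r rounds
        # sorted() is stable, so ties on this digit keep their previous order,
        # which is exactly what the correctness induction above requires.
        a = sorted(a, key=lambda x: (x >> shift) & mask)
    return a

# radix_sort([329, 457, 657, 839, 436, 720, 355]) -> [329, 355, 436, 457, 657, 720, 839]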
6 Order Statistics, Median
Order statistics: given n elements in an array, find the i-th smallest element.
Naive algorithm: sort and return the i-th element.
The median is the element of rank i = ⌊(n + 1)/2⌋ or ⌈(n + 1)/2⌉.
1. Randomized divide-and-conquer order-statistic selection, Algo. 6.
Intuition (all of today's analysis assumes that all elements are distinct):
(a) Lucky case, e.g., a (1/10) : (9/10) split every time; even if the target element is always in the 9/10 part,
T(n) = T(9n/10) + Θ(n) = Θ(n),   by case (c) of the master method.
(b) Unlucky case (worst-case analysis): one side of the partition has no elements,
T(n) = T(n − 1) + Θ(n) = Θ(n^2).
Algorithm 6 RAND-SELECT(A, p, q, i): find the i-th smallest element of A[p..q]
1: if p = q then
2:   return A[p]
3: end if
4: r ← RAND-PARTITION(A, p, q)   (r is the location of the pivot; see Algo. 3)
5: k ← r − p + 1   (k = rank(A[r]) within A[p..q], the number of elements ≤ the pivot)
6: if i = k then
7:   return A[r]   (Case 1)
8: end if
9: if i < k then
10:   return RAND-SELECT(A, p, r − 1, i)   (Case 2)
11: else
12:   return RAND-SELECT(A, r + 1, q, i − k)   (Case 3)
13: end if
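A Python sketch of RAND-SELECT (0-based array, rank i counted from 1; assumes distinct elements, as in the analysis below; my own transcription):

import random

def partition(a, p, q):
    x, i = a[p], p                        # pivot x = a[p], as in Algo. 3
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def rand_select(a, i, p=0, q=None):
    """Return the i-th smallest element (1 <= i <= q - p + 1) of a[p..q]."""
    if q is None:
        q = len(a) - 1
    if p == q:
        return a[p]
    r = random.randint(p, q)              # RAND-PARTITION: random pivot swapped to the front
    a[p], a[r] = a[r], a[p]
    m = partition(a, p, q)
    k = m - p + 1                         # rank of the pivot within a[p..q]
    if i == k:
        return a[m]
    if i < k:
        return rand_select(a, i, p, m - 1)
    return rand_select(a, i - k, m + 1, q)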
2. Analysis of the expected time of the randomized divide-and-conquer selection. The analysis is similar to that of randomized quicksort.
(a) Let T(n) be the random variable for the running time of RAND-SELECT on an input of size n. Assume the random variables are independent.
(b) Define the indicator variable X_k for k = 0, 1, ..., n − 1:
X_k = 1 if the partition generates a k : (n − k − 1) split, and 0 otherwise.
(c) Upper bound: assume that the i-th element always falls in the larger side of the partition,
T(n) = Σ_{k=0}^{n−1} X_k [ T(max{k, n − k − 1}) + Θ(n) ],
where T(max{k, n − k − 1}) + Θ(n) is the running time for a k : (n − k − 1) split.
Take expectations of both sides; similarly to the randomized quicksort analysis,
E[T(n)] = E[ Σ_{k=0}^{n−1} X_k (T(max{k, n − k − 1}) + Θ(n)) ]
        = (1/n) Σ_{k=0}^{n−1} E[T(max{k, n − k − 1})] + Θ(n)
        ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + Θ(n).
Claim: E[T(n)] ≤ c·n for some constant c > 0.
Proof. Use substitution:
E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} c·k + Θ(n)
        = (2c/n) Σ_{k=⌊n/2⌋}^{n−1} k + Θ(n)
        ≤ c·n − ( c·n/4 − Θ(n) )   ······(1)
        ≤ c·n, if c is chosen large enough that c·n/4 > Θ(n).
(1): uses the fact Σ_{k=⌊n/2⌋}^{n−1} k ≤ (3/8)n^2.
(d) To summarize, the algorithm works well and fast in practice, but its worst case is bad (Θ(n^2)).
3. Worst-case linear-time order statistics
(a) An algorithm that runs in linear time in worst-case (Blum, Floyd,
Pratt, Rivest and Tarjan [1973]).
(b) Key idea: Generate a good pivot recursively. A graphical descrip-
tion of the algorithm is in Figure 9.
7 Hashing, Hash Functions
1. Direct-access tables in Figure 10.
Figure 9: Worst-case linear-time order statistics.
Algorithm 7 SELECT(i, n): select the i-th smallest element from an array of size n
1: Divide the n elements into groups of 5 and find the median of each 5-element group.   Time Θ(n)
2: Recursively SELECT the median x of the ⌊n/5⌋ group medians to be the pivot.   Time T(⌊n/5⌋)
3: Partition around the pivot x; let k ← rank(x).   Time Θ(n)
4: if i = k then
5:   return x
6: end if
7: if i < k then
8:   recursively SELECT the i-th smallest element in the lower part
9: else
10:   recursively SELECT the (i − k)-th smallest element in the upper part
11: end if   (Time ≤ T(3n/4))
Figure 10: Direct access table.
(a) Operations
- INSERT(S, x): S ← S ∪ {x}.
- DELETE(S, x): S ← S − {x} (it takes the record x, not just a key).
- SEARCH(S, k): return the element x with key[x] = k, or NIL if no such x exists (search for a given key).
A set supporting operations that change it, like INSERT and DELETE, is called a "dynamic set".
(b) Direct-access table: the simplest implementation of these operations. It works when the keys are drawn from a small universe U = {0, 1, ..., m − 1} (assume the keys are distinct). It works by setting up an array T[0..m − 1] to represent the dynamic set S such that
T[k] = x if x ∈ S and key[x] = k, and NIL otherwise.
The operations take constant time Θ(1).
(c) Issues/limitations: m − 1 could be a huge number, and the situation is even worse for character-string keys.
(d) Hashing keeps the table small while still preserving some of these properties.
2. Hashing: resolving collisions by chaining
(a) A hash function h maps keys "randomly" into the slots of table T; see Figure 11 (left).
When a record to be inserted maps to an already occupied slot, a collision occurs.
Resolving collisions by chaining: link the records in the same slot into a list (create a list for each slot and put all the elements that hash to that slot into its list).
Ex: Figure 11 (right).
(b) Analysis.
i. Worst case: every key hashes to the same slot, so the data structure degenerates into a single linked list. Access time is Θ(n) if |S| = n.
ii. Average case, under the assumption of simple uniform hashing: (1) each key k ∈ S is equally likely to be hashed to any slot of T;
Figure 11: Hash map (left); Resolving collisions by chaining (right).
(2) each key is hashed independently of where the other keys are hashed.
Load factor of a hash table:
α = n/m = average number of keys per slot,
where n is the number of keys and m is the number of slots.
Expected time of an unsuccessful search for a record with a given key (looking for something that is not in the table, so the result is NIL):
Θ(1 + α),
where the 1 is the cost of applying the hash function and accessing the slot, and α is the cost of searching the list.
The expected search time is Θ(1) when α = O(1), or equivalently n = O(m) (the number of elements is bounded by a constant times the number of slots).
The expected time of a successful search is also asymptotically Θ(1 + α); see the textbook for the proof.
3. Choosing hash functions
- A hash function should distribute the keys uniformly into the slots.
- Regularity in the key distribution should not affect this uniformity (for example, a regularity you often see is that all the keys being inserted are even numbers).
Division method (a popular way to get a quick hash function):
h(k) = k mod m,
where m is the number of slots in the table.
Don't pick an m with a small divisor d. For example, if d = 2 and all keys are even, then the odd slots are never used. Another extreme example: if m = 2^r, then the hash doesn't depend on all the bits of k; if k = 10101010001110110 (binary) and r = 6, then h(k) = 110110, just the last six bits of k.
A good choice is to pick m to be a prime not too close to a power of 2 or 10.
Multiplication method (usually superior; a heuristic):
Assume the number of slots is m = 2^r and the computer has w-bit words (w = 32 or 64, for example).
Hash function:
h(k) = ((A · k) mod 2^w) right-shifted by (w − r) bits,
where A is an odd integer in the range 2^{w−1} < A < 2^w.
- Don't pick A too close to 2^{w−1} or 2^w.
- This method is fast: multiplication mod 2^w is fast, and a bitwise right shift is fast.
Ex: m = 8 = 2^3 and w = 7. Think of A·k as a fixed-point fractional number; only the bits kept by the mod and the shift matter. See Figure 12. (A small code sketch of both methods follows below.)
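The sketch of both methods, in Python (w, r, and the particular constant A below are example values of my own choosing, not from the lecture):

def hash_division(k, m):
    """Division method: m should be a prime not too close to a power of 2 or 10."""
    return k % m

def hash_multiplication(k, r=3, w=64, A=2654435761 * 2**32 + 1):
    """Multiplication method with m = 2^r slots on a w-bit machine.
    A is an odd integer with 2^(w-1) < A < 2^w; the default is just an example."""
    return ((A * k) % (2 ** w)) >> (w - r)

# hash_division(91, 7) == 0;  hash_multiplication(123456) is a slot in range(8)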
4. Resolving collisions by open addressing
Idea: no storage for links. Probe the table systematically until an empty slot is found (if the key hashes to a slot that is full, just hash again with a different (second) hash function; keep following the probe sequence until an empty slot is found).
The hash function depends on both the key and the probe number:
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1},
where U is the universe of keys, the first {0, 1, ..., m − 1} is the probe number, and the second {0, 1, ..., m − 1} is the slot.
Figure 12: Multiplication method.
The probe sequence <h(k, 0), ..., h(k, m − 1)> should be a permutation of {0, 1, ..., m − 1}.
Issues: the table may fill up, so the number of elements must stay ≤ m, the number of slots. Deletion is also difficult: if you remove a key from the table, a later search whose probe sequence would have passed through that key now finds an empty slot and stops too early. (You can delete items but keep their slots marked; it's messy.)
Ex: Figure 13.
Searches use the same probe sequence, as shown in Figure 13: a successful search finds the record, an unsuccessful search finds NIL.
Probing strategies:
(a) Linear probing:
h(k, i) = (h(k, 0) + i) mod m.
Its problem is primary clustering: regions of the hash table get very full, and anything that hashes into such a region has to scan through all the occupied slots there (long runs of occupied slots build up).
(b) Double hashing:
h(k, i) = (h_1(k) + i · h_2(k)) mod m.
Figure 13: Resolve collisions with open addressing.
This is an excellent method; usually one picks m = 2^r and h_2(k) odd (which guarantees that the probe sequence hits every slot). A small illustrative sketch follows below.
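The illustrative sketch, in Python: an open-addressing table with double hashing (m = 2^r, h_2 forced to be odd; deletion is deliberately omitted since, as noted above, it is messy; all names are mine):

class OpenAddressTable:
    def __init__(self, r=3):
        self.m = 2 ** r
        self.slots = [None] * self.m

    def _probe(self, k, i):
        h1 = k % self.m
        h2 = (k // self.m) | 1            # an odd step size guarantees a full cycle
        return (h1 + i * h2) % self.m

    def insert(self, k):
        for i in range(self.m):           # follow the probe sequence h(k,0), h(k,1), ...
            j = self._probe(k, i)
            if self.slots[j] is None or self.slots[j] == k:
                self.slots[j] = k
                return j
        raise RuntimeError("table is full")

    def search(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is None:     # empty slot: k cannot be in the table
                return None
            if self.slots[j] == k:
                return j
        return None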
Analysis of open addressing:
Assume uniform hashing: each key is equally likely to have any one of the m! permutations as its probe sequence, independently of the other keys.
Theorem 3. If α = n/m < 1, the expected number of probes in an unsuccessful search satisfies
E[# probes] ≤ 1/(1 − α).
Proof. For an unsuccessful search:
- One probe is always necessary.
- With probability n/m, the first probe hits an occupied slot and a second probe is needed.
- With probability (n − 1)/(m − 1), the second probe also collides and a third probe is needed; similarly for the subsequent probes.
Note that (n − i)/(m − i) < n/m = α for i = 1, 2, ..., n − 1.
So the expected number of probes is at most
1 + (n/m)·(1 + ((n − 1)/(m − 1))·(1 + ··· (1 + 1/(m − n + 1)) ···))
≤ 1 + α(1 + α(1 + α(···(1 + α)···)))
≤ 1 + α + α^2 + α^3 + ···
= Σ_{i=0}^{∞} α^i = 1/(1 − α).
See the textbook for a more rigorous proof and for the analysis of successful searches.
- If the table is half full, the expected number of probes is 1/(1 − 0.5) = 2.
- If the table is 90% full, the expected number of probes is 1/(1 − 0.9) = 10. The cost grows dramatically; typically you don't want to let the table get too full.
8 Universal Hashing, Perfect Hashing
Fundamental weakness of hashing: for any fixed choice of hash function, there exists a bad set of keys that all hash to the same slot.
Idea: choose the hash function at random, independently of the keys (universal hashing).
Universal hashing: let U be a universe of keys and let H be a finite collection of hash functions mapping U to {0, 1, ..., m − 1}. H is universal if for all x, y ∈ U with x ≠ y,
|{h ∈ H : h(x) = h(y)}| = |H|/m,
that is, the number of hash functions in H that hash x and y to the same slot is |H|/m. Equivalently, if h is chosen randomly from H, the probability of a collision between x and y is 1/m.
Theorem 4. Choose h randomly from H and suppose h hashes n keys into the m slots of table T. Then, for a given key x,
E[# collisions with x] < n/m = α   (the load factor).
Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and define the indicator
c_{xy} = 1 if h(x) = h(y), and 0 otherwise.
Note that E[c_{xy}] = 1/m and C_x = Σ_{y ∈ T − {x}} c_{xy}. The expected total number of collisions is
E[C_x] = E[ Σ_{y ∈ T − {x}} c_{xy} ]
       = Σ_{y ∈ T − {x}} E[c_{xy}]   (linearity of expectation)
       = Σ_{y ∈ T − {x}} 1/m
       = (n − 1)/m < n/m.
Constructing a universal hash function:
Let m be prime. Decompose any key k into r + 1 digits (the representation of k in base m),
k = <k_0, k_1, ..., k_r>,
where 0 ≤ k_i ≤ m − 1. The random strategy is:
Pick a = <a_0, a_1, ..., a_r>, where each a_i is chosen randomly from {0, 1, ..., m − 1}, and define
h_a(k) = ( Σ_{i=0}^{r} a_i k_i ) mod m.
The number of different hash functions is |H| = m^{r+1} (there are m choices for each a_i).
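A Python sketch of this construction (m must be prime; keys are assumed to fit in r + 1 base-m digits; the random vector a is drawn once when the function is created):

import random

def make_universal_hash(m, r):
    """Return a random member h_a of the universal family H for base-m keys."""
    a = [random.randrange(m) for _ in range(r + 1)]   # a_0, ..., a_r
    def h_a(k):
        total = 0
        for i in range(r + 1):
            k, digit = divmod(k, m)                   # digit k_i of k in base m
            total += a[i] * digit
        return total % m
    return h_a

# h = make_universal_hash(m=101, r=3); h(123456) is a slot in range(101)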
Theorem 5. H is universal.
Proof. Pick two distinct keys x = <x_0, x_1, ..., x_r> and y = <y_0, y_1, ..., y_r>. They must differ in at least one digit; without loss of generality, they differ in position 0. For how many h_a ∈ H do x and y collide?
We must have h_a(x) = h_a(y), that is,
z        1 2 3 4 5 6
z^{−1}   1 4 5 2 3 6
Table 3: Inverses modulo m = 7 (example for Theorem 6).
Σ_{i=0}^{r} a_i x_i ≡ Σ_{i=0}^{r} a_i y_i   (mod m)
⟹ Σ_{i=0}^{r} a_i (x_i − y_i) ≡ 0   (mod m)
⟹ a_0(x_0 − y_0) + Σ_{i=1}^{r} a_i (x_i − y_i) ≡ 0   (mod m)
⟹ a_0(x_0 − y_0) ≡ − Σ_{i=1}^{r} a_i (x_i − y_i)   (mod m)
⟹ a_0 ≡ [ − Σ_{i=1}^{r} a_i (x_i − y_i) ] · (x_0 − y_0)^{−1}   (mod m)   ······(1)
(1): since x_0 ≠ y_0, Theorem 6 guarantees that the inverse (x_0 − y_0)^{−1} exists.
Thus, for any choice of a_1, a_2, ..., a_r, exactly ONE of the m choices for a_0 causes x and y to collide, and there is no collision for the other m − 1 choices of a_0.
The number of h_a's that cause x and y to collide is therefore 1 · m · m ··· m = m^r = |H|/m, so H is universal by definition.
A fact from number theory:
Theorem 6. Let m be prime. For any z ∈ Z_m (the integers mod m) with z ≢ 0, there exists a unique z^{−1} ∈ Z_m such that
z · z^{−1} ≡ 1 (mod m),
i.e., z·z^{−1} mod m equals 1.
Note: this is not true if m is not prime; think about m = 10.
Ex: m = 7, Table 3. For z = 2, 2 · 4 mod 7 = 1.
Figure 14: Perfect hashing.
Perfect hashing:
Given n keys, construct a static hash table of size m = O(n) such that SEARCH takes O(1) time in the worst case.
IDEA: use a two-level scheme with universal hashing at both levels, with NO collisions at level 2. If n_i items hash to level-1 slot i, then use m_i = n_i^2 slots in the level-2 table S_i (so the level-2 tables are very sparse). See Figure 14 for an example.
Analysis of level 2 of perfect hashing:
Theorem 7. If we hash n keys into m = n^2 slots using a random h from a universal family H, then
E[# of collisions] < 1/2.
Proof. The probability that 2 given keys collide under h is 1/m = 1/n^2. There are C(n, 2) = n(n − 1)/2 pairs of keys that could possibly collide, so the expected number of collisions is
E[# of collisions] = C(n, 2) · 1/n^2 = (n − 1)/(2n) < 1/2.
Markov's inequality: for a random variable X ≥ 0 and any t > 0,
Pr(X ≥ t) ≤ E[X]/t.
Corollary: the probability of no collisions is at least 1/2.
Proof. Apply Markov's inequality with t = 1: Pr(X ≥ 1) ≤ E[X] < 1/2, so Pr(X = 0) ≥ 1/2.
To find a good level-2 hash function, just test a few functions from H; we will find one quickly, since at least half of them work.
Analysis of the storage of perfect hashing:
1. Level 1: choose m = n.
2. Level 2: let n_i be the random variable for the number of keys that hash to slot i of T, and use m_i = n_i^2 slots in each level-2 table S_i. The total expected storage is
n + E[ Σ_{i=0}^{m−1} Θ(n_i^2) ] = Θ(n),
by the same analysis as for bucket sort.
9 Relation of BSTs to Quicksort - Analysis of Random BST
Randomly built binary search tree (BST) definition:
Binary-search-tree sort for an array A is given in Algo. 8.
Ex: A = [3, 1, 8, 2, 6, 7, 5]; running time: O(n) for the walk, Ω(n log n) for the n tree-inserts (Figure 15).
Worst case: if the array is already sorted or reverse-sorted, the time is Σ_{x ∈ Tree} depth(x) = Θ(n^2), where x ranges over the nodes.
Lucky case: the tree is balanced, its height is O(log n), and the time is O(n log n).
The running time is the same as quicksort's.
Algorithm 8 Binary-Search-Tree-Sort(A)
1: T ← create an empty BST
2: for i = 1 to n do
3:   TREE-INSERT(T, A[i])   (insert A[i] into the tree)
4: end for
5: Perform an inorder tree walk of T   (inorder traversal)
Figure 15: Inorder tree traversal.
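A Python sketch of BST sort with a plain (unbalanced) binary search tree stored as nested lists (my own representation; TREE-INSERT and the inorder walk follow Algo. 8):

def tree_insert(t, key):
    """Insert key into BST t, where t is None or [key, left, right]."""
    if t is None:
        return [key, None, None]
    if key < t[0]:
        t[1] = tree_insert(t[1], key)
    else:
        t[2] = tree_insert(t[2], key)
    return t

def inorder(t, out):
    if t is not None:
        inorder(t[1], out)
        out.append(t[0])
        inorder(t[2], out)
    return out

def bst_sort(a):
    t = None
    for x in a:                           # n tree-inserts
        t = tree_insert(t, x)
    return inorder(t, [])                 # O(n) inorder walk gives the sorted order

# bst_sort([3, 1, 8, 2, 6, 7, 5]) -> [1, 2, 3, 5, 6, 7, 8]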
Figure 16: Difference between height of tree and expected average depth.
Relation to quicksort: BST sort and quicksort make the same comparisons, just in a different order.
Randomized BST sort:
- Randomly permute A.
- BST-sort(A).
Its time equals the time of randomized quicksort on the same permutation, so E[time] = E[randomized quicksort time] = Θ(n log n). The corresponding tree is called a randomly built BST.
Expected average depth:
E[ (1/n) Σ_{x∈T} depth(x) ] = (1/n) Θ(n log n) = Θ(log n),
which is NOT the height of the tree (the maximum depth of any node). See Figure 16 for an example.
Theorem 8.
E[height of a randomly built BST] = O(log n).
Proof. Outline:
1. Prove Jensen's inequality, f(E[X]) ≤ E[f(X)] for a convex function f.
2. Instead of analyzing X_n, the random variable for the height of a randomly built BST on n nodes, analyze its exponentiation Y_n = 2^{X_n}.
3. Prove that E[Y_n] = O(n^3).
4. Conclude that
2^{E[X_n]} ≤ E[2^{X_n}] = E[Y_n] = O(n^3),
so E[X_n] ≤ log O(n^3) = 3 log n + O(1).
Step 1:
f : R → R is convex if for all α, β ≥ 0 with α + β = 1 and any x, y ∈ R,
f(αx + βy) ≤ αf(x) + βf(y).
Lemma: if f : R → R is convex, x_1, x_2, ..., x_n ∈ R, α_1, ..., α_n ≥ 0 and Σ_{k=1}^{n} α_k = 1, then
f( Σ_{k=1}^{n} α_k x_k ) ≤ Σ_{k=1}^{n} α_k f(x_k).
Prove the lemma by induction, using the definition of convexity.
Jensen's inequality (letting n → ∞):
f(E[X]) = f( Σ_{k=−∞}^{∞} k·Pr(X = k) ) ≤ Σ_{k=−∞}^{∞} f(k)·Pr(X = k) = E[f(X)].
Step 2:
Let X_n be the random variable for the height of a randomly built BST on n nodes, and let Y_n = 2^{X_n} (2^x is a convex function of x). Here rank(k) means the rank in sorted order: for example, if the root of a BST has k − 1 elements to its left (smaller than the root value) and n − k elements to its right, then the rank of the root is k.
If the root r has rank k, then the height of the tree is the maximum of the heights of the two subtrees plus 1 (for the top level):
X_n = 1 + max{X_{k−1}, X_{n−k}},
and therefore
Y_n = 2 · max{Y_{k−1}, Y_{n−k}}.
Step 3:
Define the indicator random variable Z_{nk} as
Z_{nk} = 1 if the root has rank k, and 0 otherwise.
So Pr(Z_{nk} = 1) = E[Z_{nk}] = 1/n, and
Y_n = Σ_{k=1}^{n} Z_{nk} · [ 2 max{Y_{k−1}, Y_{n−k}} ].
The expectation is
E[Y_n] = E[ Σ_{k=1}^{n} Z_{nk} · 2 max{Y_{k−1}, Y_{n−k}} ]
       = Σ_{k=1}^{n} E[ Z_{nk} · 2 max{Y_{k−1}, Y_{n−k}} ]
       = Σ_{k=1}^{n} 2 E[Z_{nk}] · E[ max{Y_{k−1}, Y_{n−k}} ]   (independence of the root's rank and the subtree heights)
       = (2/n) Σ_{k=1}^{n} E[ max{Y_{k−1}, Y_{n−k}} ]
       ≤ (2/n) Σ_{k=1}^{n} E[ Y_{k−1} + Y_{n−k} ]
       = (2/n) ( Σ_{k=1}^{n} E[Y_{k−1}] + Σ_{k=1}^{n} E[Y_{n−k}] )
       = (4/n) Σ_{k=1}^{n} E[Y_{k−1}]
       = (4/n) Σ_{k=0}^{n−1} E[Y_k].
Claim: E[Y_n] ≤ c·n^3 for some constant c.
Prove it by substitution. Base case: n = O(1), true if c is sufficiently large. Induction:
E[Y_n] = (4/n) Σ_{k=0}^{n−1} E[Y_k]
       ≤ (4/n) Σ_{k=0}^{n−1} c·k^3
       ≤ (4c/n) ∫_0^n x^3 dx
       = c·n^3.
Step 4:
Since f(x) = 2^x is convex, Jensen's inequality gives
2^{E[X_n]} ≤ E[2^{X_n}] = E[Y_n] ≤ c·n^3.
Taking log_2 of both sides,
E[X_n] ≤ 3 log n + O(1).
A sharper analysis shows E[X_n] ≈ 2.9882 log n (Luc Devroye, 1986).
10 Red-black Tree, Rotations, Insertions, Deletions
Goal: Get a search tree structure so we can insert, delete and search.
Balanced search trees: search-tree data structures that maintain a dynamic set of n elements using a tree of height O(log n). Balanced search trees are not necessarily binary; this lecture focuses on the binary case. Examples of balanced search trees:
AVL trees
2-3 trees
2-3-4 trees
B-trees
Red-black trees (binary search tree)
Skip lists (more or less a tree)
treaps
Red-black tree: a binary search tree with an extra color field in each node, satisfying the following red-black properties:
1. Every node is either red or black.
Figure 17: Simple paths of red-black tree (property 4).
2. The root and the leaves (NILs) are black.
3. Every red node has a black parent (there can never be two consecutive red nodes on a path).
4. All simple paths (paths that don't repeat any vertices) from a node x to a descendant leaf of x have the same number of black nodes, called black-height(x) (the count does not include x itself). See Figure 17.
Ex: See Figure 18 for an example of a red-black tree.
Height of a red-black tree:
Theorem 9. A red-black tree with n keys has height
h ≤ 2 log(n + 1).
Proof. Proof sketch (see the textbook for a detailed proof by induction).
Merge each red node into its black parent, producing a 2-3-4 tree as in Figure 19:
- every internal node has 2 to 4 children;
- every leaf has the same depth (the black-height of the root), by property 4 of the red-black tree.
Figure 18: Red-black tree example.
Figure 19: Merged red-black tree.
Denote the height of the original tree by h and of the merged tree by h'. First bound h' and then relate it to h.
1. The number of leaves of each tree is n + 1, where n is the number of internal nodes.
2. In the 2-3-4 tree, the number of leaves is in the range [2^{h'}, 4^{h'}]: every node branches at least 2 ways and at most 4 ways.
3. 2^{h'} ≤ n + 1 ≤ 4^{h'} ⟹ h' ≤ log(n + 1).
4. The red nodes make up at most half of any root-to-leaf path, so h ≤ 2h'.
5. Therefore h ≤ 2 log(n + 1).
Corollary: Queries
Queries (SEARCH, MIN, MAX, SUCCESSOR, PREDECESSOR) all run in O(log n) time on a red-black tree with n nodes.
Updates (INSERT, DELETE) must modify the tree; to preserve the red-black properties they use:
- BST operations (tree insert, tree delete),
- color changes of some of the nodes,
- restructuring of links via rotations.
Rotation: see Figure 20 for left and right rotations. A rotation (1) preserves the inorder ordering of the keys: for a ∈ α, b ∈ β, c ∈ γ we have a ≤ A ≤ b ≤ B ≤ c; and (2) can be performed in O(1) time.
RB-INSERT(x):
Idea:
1. Insert x into the binary search tree (search for where x is supposed to go and create a new node hanging off an existing node).
Figure 20: Rotation operations of tree.
2. Color x red (so only the "red node has a black parent" property can be violated, since x's parent might be red); move the violation up the tree by recoloring until it can be fixed with rotations and recoloring.
Ex: Insert 15 into the previous example red-black tree, as in Figure 21.
Algorithm 9 RB-Insert(T, x)
1: TREE-INSERT(T, x)
2: color[x] ← RED
3: while x ≠ root[T] and color[p[x]] = RED do   (p[x] is the parent of x)
4:   if p[x] = left[p[p[x]]] then   (Case A: p[x] is a left child of x's grandparent)
5:     y ← right[p[p[x]]]   (y is x's uncle/aunt, the right child of the grandparent)
6:     if color[y] = RED then
7:       Case 1
8:     else if x = right[p[x]] then   (Case 2: x is a right child of its parent)
9:       Case 2, then Case 3
10:     else   (Case 3: x is a left child of its parent)
11:       Case 3
12:     end if
13:   else   (Case B: symmetric, swap "left" and "right" above)
14:   end if
15: end while
16: color[root[T]] ← BLACK
The three cases are in Figure 22.
Figure 21: RB-INSERTION of 15.
Summary of RB-INSERT:
1. RB-INSERT adds x to the dynamic set and preserves red-blackness.
2. Case 1 only recolors; cases 2 and 3 perform 1 or 2 rotations and then terminate.
3. Case 1 can occur at most O(log n) times, while cases 2 and 3 occur at most once each, so there are at most 2 rotations in total.
4. See the textbook for RB-DELETE.
11 Augmenting Data Structures, Dynamic Order Statistics, Interval Trees
This lecture starts the design phase of the class: how to design efficient data structures and efficient algorithms for various problems (see the textbook, Appendix B).
Dynamic order statistics
OS-SELECT(i, S): returns the i-th smallest element of the dynamic set S.
OS-RANK(x, S): returns the rank of x ∈ S in the sorted order of S's elements.
Idea: use a red-black tree for the set S and keep the size of each subtree in its root node. See Figure 23 for an example OS-tree; the size of x is
size[x] = size[left[x]] + size[right[x]] + 1,
where left[x] is the left child of x.
Note: the size of NIL is zero, and in practice accessing size[NIL] through a null pointer is a common source of errors. The implementation trick is to use a sentinel (dummy record) with size[NIL] = 0: instead of a null pointer anywhere NIL would appear in the tree, use a special whole record called NIL whose size field is 0. This is a common programming trick to simplify code.
Figure 22: Three cases in Algo 9. RB-INSERT.
Figure 23: OS-tree example.
The code for OS-SELECT is given in Algo. 10.
Algorithm 10 OS-SELECT(x, i): return the i-th smallest element in the subtree rooted at x
1: k ← size[left[x]] + 1   (k is the rank of x within its subtree)
2: if i = k then
3:   return x
4: end if
5: if i < k then
6:   return OS-SELECT(left[x], i)
7: else
8:   return OS-SELECT(right[x], i − k)
9: end if
Ex: OS-SELECT(root, 5) returns a pointer to the node whose key is H. Try it.
Q: Why not keep the ranks themselves in the nodes instead of subtree sizes?
A: Ranks are hard to maintain when the red-black tree is modified by INSERT and DELETE.
Strategy: update the subtree sizes when inserting or deleting.
Ex: Insert an element with key K after key H and update the subtree sizes as shown
Figure 24: OS-tree insertion example.
in Figure 24. But we must also handle the rebalancing of the tree: (1) red-black color changes have no effect on subtree sizes; (2) rotations: look at the children and fix up the sizes in O(1) time per rotation.
So INSERT and DELETE still take O(log n) time.
The data-structure augmentation methodology, illustrated with order-statistics trees. Checklist:
1. Choose an underlying data structure (red-black tree).
2. Determine the additional information to be stored in the data structure (subtree sizes).
3. Verify that this information CAN be maintained by the modifying operations (RB-INSERT, RB-DELETE; don't forget rotations).
4. Develop new operations that use the stored information (OS-SELECT, OS-RANK).
Figure 25: Interval tree example.
There are usually interactions between the above steps.
Ex: Interval trees
Maintain a dynamic set of intervals (e.g., time intervals).
Query: find an interval in the set that overlaps a given query interval. See Figure 25.
Methodology:
1. Choose an underlying data structure (a red-black tree keyed on the low/left endpoint).
2. Store in each node x the largest endpoint value m[x] in the subtree rooted at x:
m[x] = max{high[int[x]], m[left[x]], m[right[x]]}.
3. Modifying operations:
INSERT: fix the m's on the way down.
ROTATIONS: fixing up the m's during a rotation takes O(1) time.
So the total INSERT time is O(log n); DELETE is similar.
Algorithm 11 INTERVAL-SEARCH(i): find an interval that overlaps interval i
1: x ← root
2: while x ≠ NIL and (low[i] > high[int[x]] or low[int[x]] > high[i]) do
3:   (i and the current int[x] don't overlap; move x down to a subtree)
4:   if left[x] ≠ NIL and low[i] ≤ m[left[x]] then
5:     x ← left[x]
6:   else
7:     x ← right[x]
8:   end if
9: end while
10: return x
4. New operations: INTERVAL-SEARCH(i). See Figure 25 and Algo. 11.
Analysis:
(a) The time is O(log n): the search walks down the tree, taking time proportional to the height of the tree.
(b) To list all overlapping intervals (suppose there are k of them): search, list, delete, and repeat, k times; this takes O(k log n) time (an output-sensitive bound).
(c) The best bound to date is O(k + log n), using a different structure.
Correctness of Algo. 11:
Theorem 10. Let L be the set of intervals in the left subtree of x and R the set of intervals in the right subtree.
- If the search goes right, then {i' ∈ L : i' overlaps i} = ∅.
- If the search goes left, then {i' ∈ L : i' overlaps i} = ∅ ⟹ {i' ∈ R : i' overlaps i} = ∅.
That is, it is always safe to examine only 1 of the 2 children: we will either find an overlapping interval or there was none to be found.
Proof. (a) Suppose the search goes right. If left[x] = NIL, we are done.
Otherwise, Algorithm 11 went right because low[i] > m[left[x]], and m[left[x]] = high[j] is the high endpoint of some interval j ∈ L; by definition no other interval in L has a larger high endpoint than high[j]. Hence no interval in L can overlap i, i.e., {i' ∈ L : i' overlaps i} = ∅.
(b) Suppose the search goes left and {i' ∈ L : i' overlaps i} = ∅. By Algo. 11, low[i] ≤ m[left[x]] = high[j] for some j ∈ L. Since j ∈ L does not overlap i, we must have high[i] < low[j]. By the binary-search-tree property (the tree is keyed on low endpoints), low[j] ≤ low[i'] for all i' ∈ R. Hence high[i] < low[i'] for all i' ∈ R, so no interval in R overlaps i either.
12 Skip Lists
Skip lists are another balanced search structure that maintains a dynamic set subject to insertion, deletion, and search. (Other efficient structures include treaps, red-black trees, and B-trees.)
- a dynamic search structure (William Pugh, 1989)
- a randomized structure
- simple
- the running time of each operation is O(log n) in expectation and with high probability (≥ 1 − 1/n^α), which is a strong bound on the tail of the distribution.
1. Starting from scratch
A sorted linked list is the simplest data structure; searches take Θ(n) time in the worst case.
2. Two sorted linked lists
Ex: What is this sequence? 14, 23, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110, 116, 125.
It is the stations of the NYC 7th Avenue Line. The idea: two subway lines, an express line connecting a few of the stations and a local line connecting all stations, with links between the two lines at the common stations. See Figure 26.
Searching in two linked lists (Algo. 12):
Q: Which keys should go in L_1 to get the best worst case?
A: It is best to spread them out uniformly.
Analysis:
Figure 26: Two linked lists.
Algorithm 12 SEARCH(x)
1: Walk right in the top linked list L_1 until going right would go too far.
2: Walk down to the bottom linked list L_2.
3: Walk right in L_2 until x is found (or some value greater than x, if x is not in the list).
The cost of a search is about
|L_1| + |L_2|/|L_1|   (keys of L_1 uniformly spread).
Minimize the cost of a search: denote |L_2| = n. Up to constant factors, the minimum is attained when
|L_1| = n/|L_1|, i.e., |L_1| = √n.
The search cost is then roughly 2√n.
3. log n linked lists
For k sorted linked lists, the cost is roughly k · n^{1/k}.
For k = log n: log n · n^{1/log n} = 2 log n (since n^{1/log n} = 2).
Ideal skip list: the log n linked-list structure. See Figure 27.
4. INSERTION
A skip list maintains this structure only roughly under updates (insertion and deletion); it cannot be maintained exactly. To insert an element x into a skip list:
Figure 27: log n linked lists.
- SEARCH(x) to see where x fits in the bottom list.
- Always insert x into the bottom list (invariant: the bottom list contains all elements).
- Issue: many insertions could cluster around a certain node and break the balance of the lists. To which other lists should we also add x?
A: Flip a (fair) coin; on HEADS, promote x to the next level up and flip again.
On average:
- 1/2 of the elements are promoted 0 levels,
- 1/4 of the elements are promoted 1 level,
- 1/8 of the elements are promoted 2 levels,
- ···
Try building a skip list from scratch by repeated insertion using a real coin.
A small change to the previous structure: add −∞ to every list, so that search can be implemented with the same algorithm.
5. To summarize, a skip list is the result of insertions (and deletions) from an initially empty structure (containing only −∞). Skip lists are (almost always) good in terms of speed and balance.
Theorem 11. With high probability, every search in an n-element skip list costs O(log n).
Definition (with high probability, w.h.p.): event E occurs w.h.p. if, for any α ≥ 1, there is a suitable choice of constants for which E occurs with probability at least 1 − O(1/n^α). (This is an informal definition.)
Boole's inequality / union bound: for any random events E_1, E_2, ..., E_k,
Pr(E_1 ∪ E_2 ∪ ··· ∪ E_k) ≤ Pr(E_1) + Pr(E_2) + ··· + Pr(E_k).
If k = n^{O(1)} and each E_i occurs w.h.p., then so does E_1 ∩ ··· ∩ E_k.
Proof. Warm up:
Lemma: with high probability, an n-element skip list has O(log n) levels.
Proof: for any constant c,
Pr(more than c log n levels)
≤ n · Pr(element x is promoted at least c log n times)   (Boole's inequality)
= n · (1/2)^{c log n}
= n · (1/n^c)
= 1/n^{c−1}.
This failure probability is polynomially small, and c can be chosen arbitrarily large.
COOL IDEA: analyze the search backward (from leaf to root).
- The search starts (ends) at a node in the bottom list.
- At each node visited: if the node was not promoted higher, we go left; if the node was promoted higher, we go up.
- The search stops (starts) at the root (or −∞).
Proof of the theorem:
- The number of "up" moves < the number of levels ≤ c log n (w.h.p., by the lemma).
- So w.h.p., the total number of moves ≤ the number of coin flips needed to get c log n HEADS (up moves).
- The number of coin flips needed to get c log n HEADS is Θ(log n) w.h.p. (see the proof in the textbook).
13 Amortized Algorithms, Table Doubling, Potential Method
Motivating question: how large should a hash table be?
A: As large as possible so that searching is cheap, but as small as possible so that space is not wasted. A happy medium is to make it Θ(n), where n is the number of items stored.
Q: What if we don't know the proper size in advance?
A: Dynamic tables.
IDEA: whenever the table overflows (gets too full), "grow" it:
- allocate a new, larger table (malloc in C or new in Java);
- move all items from the old table to the new one;
- free the storage of the old table.
Ex: Dynamic table example in Figure 28. Whenever there is overflow, create
a table of twice the size.
Analysis of a sequence of n INSERT operations:
- Worst-case cost of 1 insert = Θ(n).
- Worst-case cost of n inserts = n · Θ(n) = Θ(n^2)? This analysis is wrong: n inserts in fact take Θ(n) time in total.
Correct analysis:
Let c_i be the cost of the i-th insertion:
c_i = i if i − 1 is an exact power of 2, and 1 otherwise.
See Figure 29 for an example of the c_i values.
Figure 28: Dynamic table example.
Figure 29: Amortized costs.
The cost of n insertions is
Σ_{i=1}^{n} c_i ≤ n + Σ_{j=0}^{⌊log(n−1)⌋} 2^j ≤ 3n = Θ(n)
(the second term is a geometric series).
Thus, the average cost of each dynamic-table operation is Θ(n)/n = Θ(1): this is amortized analysis.
Amortized analysis: analyze a sequence of operations to show that the average cost per operation is small, even though a single operation in the sequence may be expensive. Probability is NOT involved, even though we take an average.
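A tiny Python experiment (my own illustration) that mimics table doubling and records the per-insert costs c_i, so the Θ(n) total and Θ(1) average can be checked numerically:

def doubling_costs(n):
    """Costs c_i for n inserts into a doubling table: 1 per insert, plus one
    unit per item moved when the table overflows and is doubled."""
    costs, size, used = [], 1, 0
    for _ in range(n):
        if used == size:                  # overflow: allocate a table twice as large
            cost = used + 1               # move 'used' old items, then do the insert
            size *= 2
        else:
            cost = 1
        used += 1
        costs.append(cost)
    return costs

c = doubling_costs(1000)
print(sum(c), sum(c) / len(c))            # total is Theta(n); the average is below 3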
Types of amortized analysis:
1. Aggregate method (just seen; less precise).
2. Accounting method (more precise, because it assigns a specific amortized cost to each operation).
- Charge the i-th operation a fictitious amortized cost ĉ_i ($1 pays for 1 unit of work).
- The fee is consumed to perform the operation.
- Any unused amount is stored in a "bank" for use by later operations.
- The bank balance must NOT go negative. That is,
Σ_{i=1}^{n} c_i ≤ Σ_{i=1}^{n} ĉ_i   (equivalently, Σ_{i=1}^{n} ĉ_i − Σ_{i=1}^{n} c_i ≥ 0)
for all n.
Back to the dynamic-table example shown in Figure 29:
- Charge ĉ_i = $3 for the i-th insert: $1 pays for the immediate insert and $2 is stored for the next table doubling.
- When the table doubles, $1 pays to move a recent item (inserted since the last doubling) and $1 pays to move an old item.
3. Potential method (also more precise, again assigning a specific amortized cost to each operation).
The "bank account" is viewed as the potential energy of the dynamic set.
Framework:
- Start with an initial data structure D_0.
- Operation i transforms D_{i−1} into D_i.
- The cost of operation i is c_i.
- Define a potential function Φ : {D_i} → R such that Φ(D_0) = 0 and Φ(D_i) ≥ 0 for all i.
Analysis:
Define the amortized cost ĉ_i with respect to Φ as
ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}),
where ΔΦ_i = Φ(D_i) − Φ(D_{i−1}) is the potential difference.
- If ΔΦ_i > 0, then ĉ_i > c_i: operation i stores work in the data structure for later use.
- If ΔΦ_i < 0, then ĉ_i < c_i: the data structure delivers up stored work to help pay for operation i.
The total amortized cost of n operations is
Σ_{i=1}^{n} ĉ_i = Σ_{i=1}^{n} [ c_i + Φ(D_i) − Φ(D_{i−1}) ]
             = ( Σ_{i=1}^{n} c_i ) + Φ(D_n) − Φ(D_0)
             ≥ Σ_{i=1}^{n} c_i   (by the definition of the potential function).
So the total amortized cost is an upper bound on the total true cost.
Table doubling potential analysis.
Define the potential of the table after the i
th
insertion (assume
2
dlog 0e
= 0)
Φ(D
i
) = 2i 2
dlog ie
,
which satisfies the definition of potential function (non-negative).
Amortized csot of i
th
insert.
ˆc
i
= c
i
+ Φ(D
i
) Φ(D
i1
)
=
n
i if i 1 is an exact power of 2
1 o.w.
o
+
2i 2
dlog ie
2(i 1) 2
dlog(i1)e
=
n
i if i 1 is an exact power of 2
1 o.w.
o
+ 2 2
dlog ie
+ 2
dlog(i1)e
Case 1: i 1 is an exact power of 2.
ˆc
i
= i + 2 2
dlog ie
+ 2
dlog(i1)e
= i + 2 2(i 1) + (i 1) (2
dlog ie
= 2
[log(i1)]+1
)
= 3.
Case 2: i 1 is NOT power of 2.
ˆc
i
= 1 + 2 2
dlog ie
+ 2
dlog(i1)e
(2
dlog ie
= 2
log(i1)
)
= 3.
Thus n insertions cost Θ(n) in the worst case. (Minor caveat: the amortized
cost of the first insertion works out to 2 rather than 3, which does not affect
the bound. A quick numerical check follows.)
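Below is a small Python check (mine, not part of the notes) that evaluates the actual cost c_i and the amortized cost c_i + Φ(D_i) − Φ(D_{i−1}) with Φ(D_i) = 2i − 2^{⌈log i⌉} for the first few insertions; the helper names are made up.

    from math import ceil, log2

    def phi(i):
        # Potential Φ(D_i) = 2i − 2^⌈log i⌉, with Φ(D_0) = 0.
        return 0 if i == 0 else 2 * i - 2 ** ceil(log2(i))

    def actual_cost(i):
        # c_i = i when i − 1 is an exact power of 2 (a doubling happens), else 1.
        return i if i > 1 and (i - 1) & (i - 2) == 0 else 1

    for i in range(1, 17):
        amortized = actual_cost(i) + phi(i) - phi(i - 1)
        print(i, actual_cost(i), amortized)   # amortized is 3 for every i >= 2, and 2 for i = 1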
Summary:
Amortized costs provide a clean abstraction of data-structure performance.
Any of the analysis methods can be used when an amortized analysis is called
for, but each method has situations where it is arguably the simplest or most
precise.
Different potential functions (in the potential method) or accounting costs
(in the accounting method) may yield different bounds.
Figure 30: Self-organizing lists example.
14 Competitive Analysis: Self-organizing Lists
1. Self-organizing lists: a list L of n elements.
Operation Access(x) (access x in the list, by searching or however we
like) costs rank_L(x) = the distance of x from the head of L. (This is
what users can do.)
L can be reordered by transposing adjacent elements, at a cost of 1 per
transpose.
Ex: See Figure 30 for an example list.
(1). Access element 14 costs 4.
(2). Transpose 3 and 50 costs 1.
2. On-line algorithm
Definition:
A sequence S of operations is provided one at a time. For each opera-
tion, an on-line algorithm A must execute the operation immediately
without knowledge of future operations.
Off-line algorithm definition:
May see all of S in advance.
GOAL: Minimize the total cost of the on-line algorithm, C_A(S).
Analysis:
Worst-case analysis.
An adversary always accesses the tail element of L. Then the cost for
an on-line algorithm A is
    C_A(S) = Ω(|S| · n),
where |S| is the number of operations in S and n is the number of elements
in the list.
Average-case analysis.
Suppose element x is accessed with probability p(x) (an a priori
distribution over the elements). Then
    E[C_A(S)] = Σ_{x∈L} p(x) · rank_L(x),
which is minimized when L is sorted in decreasing order with respect
to p.
If the distribution of x is unknown, a heuristic way is: keep a count
of the number of times each element is accessed, and maintain L
in order of decreasing count.
3. In practice, implementers found that the move-to-front (MTF) heuristic
empirically yields good results: after accessing x, move x to the head
of L using transposes, with cost = 2 · rank_L(x). (It works well because when an
element is accessed, it is likely to be accessed again soon; MTF responds
well to locality. A Python sketch of MTF appears at the end of this section.)
4. Competitive analysis.
Definition: An on-line algorithm A is α-competitive if there exists a
constant k such that for any sequence S of operations,
    C_A(S) ≤ α · C_OPT(S) + k,
where C_OPT(S) is the cost of the optimal off-line algorithm (“God’s
algorithm”).
Theorem 12. MTF is 4-competitive for self-organizing lists.
The theorem is the basis of many types of analysis of on-line algorithms (no
need to make statistical assumptions).
Proof. Steps:
Figure 31: Potential function computation example.
Notation
Potential function
What happens on an access?
Compute the costs
1. Notation:
Let L_i be MTF’s list after the i-th access.
Let L*_i be OPT’s list after the i-th access.
c_i = MTF’s cost for the i-th operation = 2 · rank_{L_{i−1}}(x) if it accesses x.
c*_i = OPT’s cost for the i-th operation = rank_{L*_{i−1}}(x) + t_i if OPT
performs t_i transposes.
We compare c_i with c*_i at every point in time.
2. Potential function: count how different the two lists are; the more inversions
there are, the more stored work is available to pay for later operations.
Define Φ : {L_i} → R by
    Φ(L_i) = 2 · |{(x, y) : x ≺_{L_i} y and y ≺_{L*_i} x}| = 2 · (# of inversions),
where ≺_{L_i} is the “precedes” relation in list L_i: if x ≺_{L_i} y, then x is hit
first when accessing list L_i.
Ex: See Figure 31 for two lists; compute the potential of L_i.
Figure 32: Relationship of A, B, C, D.
Φ(L_i) = 2 · |{(E, C), (E, A), (E, D), (E, B), (D, B)}| = 10.
Check the properties of the potential function:
Φ(L_i) ≥ 0 for any i.
Φ(L_0) = 0 if MTF and OPT start with the same list.
Q: How much does Φ change from 1 transpose?
A: A transpose creates or destroys exactly 1 inversion, so ∆Φ = ±2.
3. What happens when operation i accesses x?
Define
    A = {y ∈ L_{i−1} : y ≺_{L_{i−1}} x and y ≺_{L*_{i−1}} x}
    B = {y ∈ L_{i−1} : y ≺_{L_{i−1}} x and y ≻_{L*_{i−1}} x}
    C = {y ∈ L_{i−1} : y ≻_{L_{i−1}} x and y ≺_{L*_{i−1}} x}
    D = {y ∈ L_{i−1} : y ≻_{L_{i−1}} x and y ≻_{L*_{i−1}} x},
see Figure 32 for a picture of the relationship of the sets.
Denote r = rank_{L_{i−1}}(x) and r* = rank_{L*_{i−1}}(x), so
    r = |A| + |B| + 1
    r* = |A| + |C| + 1.
Q: How many inversions are created/destroyed?
A: When MTF moves x to the front, it creates |A| inversions and destroys
|B| inversions.
Each transpose by OPT creates at most 1 inversion. So
    Φ(L_i) − Φ(L_{i−1}) ≤ 2(|A| − |B| + t_i).
4. Compute the costs.
The amortized cost of the i-th operation of MTF with respect to Φ is
    ĉ_i = c_i + Φ(L_i) − Φ(L_{i−1})
        ≤ 2r + 2(|A| − |B| + t_i)          (by the cost definition and the bound above)
        = 2(|A| + |B| + 1) + 2(|A| − |B| + t_i)
        = 4|A| + 2 + 2t_i
        ≤ 4(r* + t_i)                      (since r* = |A| + |C| + 1 ≥ |A| + 1)
        = 4c*_i                            (by the definition of c*_i).
Thus
    C_MTF(S) = Σ_{i=1}^{|S|} c_i
             = Σ_{i=1}^{|S|} [ĉ_i + Φ(L_{i−1}) − Φ(L_i)]
             = [Σ_{i=1}^{|S|} ĉ_i] + Φ(L_0) − Φ(L_{|S|})
             ≤ 4 C_OPT(S)                  (since Φ(L_0) = 0 and Φ(L_{|S|}) ≥ 0).
Addendum:
1. If we count transpositions that move x toward the front as “free” (constant
time), then MTF is 2-competitive.
2. If L_0 ≠ L*_0, then Φ(L_0) might be Θ(n^2) in the worst case (e.g., one list is
the reverse of the other, which gives (n choose 2) inversions). Then
C_MTF(S) ≤ 4 C_OPT(S) + Θ(n^2), which is still 4-competitive, since n^2 is a
constant as |S| → ∞.
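The sketch referenced in item 3 above: a minimal self-organizing list with the move-to-front heuristic (my illustration in Python, not the lecture's code). It charges 1 per position scanned plus 1 per adjacent transpose, so accessing an element of rank r costs 2r − 1 ≤ 2 · rank_L(x), matching the cost model used in the analysis. The class name is made up.

    class MTFList:
        def __init__(self, items):
            self.items = list(items)

        def access(self, x):
            """Access x, move it to the front by adjacent transposes, return the cost."""
            r = self.items.index(x) + 1        # rank of x (1-based): cost r to reach it
            cost = r
            for i in range(r - 1, 0, -1):      # r − 1 adjacent transposes, cost 1 each
                self.items[i], self.items[i - 1] = self.items[i - 1], self.items[i]
                cost += 1
            return cost

    if __name__ == "__main__":
        L = MTFList([12, 3, 50, 14, 17, 4])
        print(L.access(14), L.items)           # 7, [14, 12, 3, 50, 17, 4]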
Figure 33: Longest common subsequence example.
15 Dynamic Programming, Longest Common
Subsequence
Programming here is an old word that means any tabular method for accom-
plishing something.
Dynamic programming is a design technique like other design techniques such
as divide and conquer.
Ex: Longest common subsequence (LCS). See Figure 33.
Given two sequences x[1, ··· , m] and y[1, ··· , n], find a (not “the”) longest
subsequence common to them both.
1. Brute-force LCS algorithm:
Check every subsequence of x[1, ··· , m] to see if it is also a subsequence
of y[1, ··· , n].
Analysis:
Each subsequence of x takes O(n) time to check.
There are 2^m subsequences of x.
So the worst-case running time is O(n · 2^m) (exponential time, slow!).
2. Towards a better algorithm
Simplification:
Look at length of LCS(x, y).
Extend the algorithm to find LCS itself.
Figure 34: When x[i] = y[j] in proof of Theorem 13.
Strategy:
Consider prefixes of x and y.
Define
    c[i, j] = |LCS(x[1, ··· , i], y[1, ··· , j])|,
and then
    c[m, n] = |LCS(x, y)|.
Theorem 13.
    c[i, j] = c[i − 1, j − 1] + 1               if x[i] = y[j],
    c[i, j] = max{c[i − 1, j], c[i, j − 1]}     otherwise.
Proof. Consider the two cases separately.
Case 1: x[i] = y[j]. See Figure 34.
Let z[1, ··· , k] = LCS(x[1, ··· , i], y[1, ··· , j]) with c[i, j] = k. Then
z[k] = x[i] = y[j] (given x[i] = y[j]), and
z[1, ··· , k − 1] is a common subsequence of x[1, ··· , i − 1] and y[1, ··· , j − 1].
Claim: z[1, ··· , k − 1] = LCS(x[1, ··· , i − 1], y[1, ··· , j − 1]).
Proof of claim: Suppose w is a longer common subsequence of x[1, ··· , i − 1]
and y[1, ··· , j − 1], that is, |w| > k − 1. Then appending z[k] to w gives a
common subsequence of x[1, ··· , i] and y[1, ··· , j] of length greater than
(k − 1) + 1 = k. Contradiction.
Hence c[i, j] = c[i − 1, j − 1] + 1.
Case 2 (x[i] ≠ y[j]): similar.
3. Hallmark (characteristic) #1 of dynamic programming: optimal substructure.
Figure 35: Recursion tree with m = 3, n = 4.
An optimal solution to a problem (instance) contains optimal solutions
to subproblems.
Ex: If z = LCS(x, y), then any prefix of z is an LCS of a prefix of x
and a prefix of y.
4. A recursive algorithm for LCS is given in Algo. 13. The worst case is
x[i] ≠ y[j], in which case the algorithm must evaluate two subproblems.
Ex: A recursion tree of LCS with m = 3, n = 4 is shown in Figure 35.
Algorithm 13 LCS(x, y, i, j)
1: if x[i] = y[j] then
2:     c[i, j] ← LCS(x, y, i − 1, j − 1) + 1
3: else c[i, j] ← max{LCS(x, y, i − 1, j), LCS(x, y, i, j − 1)}
4: end if
The height of the recursion tree is m + n, so the running time is potentially exponential.
5. Hallmark #2 of dynamic programming: overlapping subproblems.
A recursive solution contains a “small” number of distinct subproblems
repeated many times. The subproblem space contains m · n distinct LCS
subproblems for two strings of lengths m and n.
Figure 36: Dynamic programming table for the memoized algorithm.
6. The memoization (memo) algorithm in Algo. 14 is the same as Algo. 13 except
for a first step that checks whether c[i, j] has already been computed:
After computing a solution to a subproblem, store it in a table.
Subsequent calls check the table to avoid redoing work.
So the time = Θ(mn) (constant work per table entry). Space is
Θ(mn).
The dynamic programming algorithm computes the table directly (see
Figure 36, and the Python sketch after Algo. 14):
Compute the table bottom-up.
Reconstruct the LCS by tracing backwards through the table.
Algorithm 14 LCS(x, y, i, j)
1: if c[i, j] = NIL then
2:     if x[i] = y[j] then
3:         c[i, j] ← LCS(x, y, i − 1, j − 1) + 1
4:     else c[i, j] ← max{LCS(x, y, i − 1, j), LCS(x, y, i, j − 1)}
5:     end if
6: end if
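The sketch referenced above: a bottom-up Python version (mine, not the course code) of the Theorem 13 recurrence. It fills the (m + 1) × (n + 1) table c and then traces backwards to reconstruct one LCS, using Θ(mn) time and space.

    def lcs(x, y):
        m, n = len(x), len(y)
        c = [[0] * (n + 1) for _ in range(m + 1)]      # c[i][j] = |LCS(x[:i], y[:j])|
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    c[i][j] = c[i - 1][j - 1] + 1
                else:
                    c[i][j] = max(c[i - 1][j], c[i][j - 1])
        # Reconstruct one LCS by walking backwards from c[m][n].
        out, i, j = [], m, n
        while i > 0 and j > 0:
            if x[i - 1] == y[j - 1]:
                out.append(x[i - 1]); i -= 1; j -= 1
            elif c[i - 1][j] >= c[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return "".join(reversed(out))

    if __name__ == "__main__":
        print(lcs("ABCBDAB", "BDCABA"))    # prints a longest common subsequence of length 4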
16 Greedy Algorithms, Minimum Spanning
Trees
Read textbook Appendix B for background information on graphs.
Graphs (review):
1. Definition: directed graph (digraph) G = (V, E) is an ordered pair
consisting of
a set V of vertices
a set E ⊆ V × V of edges.
2. Definition: undirected graph G = (V, E), the edge set E consists of
unordered pairs of vertices.
3. The number of edges (directed or undirected) is |E| = O(V^2) (order of |V|^2;
the vertical bars around V are dropped inside asymptotic notation for simplicity).
4. If G is connected (there is a path from any vertex to any other vertex
in the graph), then |E| ≥ |V| − 1, which together with the upper bound
|E| = O(V^2) gives log |E| = Θ(log V).
Graph representations:
1. The adjacency matrix of G = (V, E), where V = {1, 2, ··· , n}, is the n × n
matrix A given by
    A[i, j] = 1 if (i, j) ∈ E,
    A[i, j] = 0 if (i, j) ∉ E.
Ex: Adjacency matrix example in Figure 37.
Figure 37: Adjacency matrix.
The storage is Θ(n^2), so this is a dense representation. (It is a good representation
when the number of edges is close to the maximum possible number of edges.)
A tree is an example of a connected graph that has no cycles (a free tree).
A complete graph is an example of a dense graph (all 1’s in matrix A).
2. The adjacency list of a vertex v ∈ V is the list Adj[v] of vertices adjacent to v.
Ex: The adjacency lists of the graph in Figure 37 are
    Adj[1] = {2, 3}
    Adj[2] = {3}
    Adj[3] = {}
    Adj[4] = {3}
In undirected graphs, degree(v) = |Adj[v]|.
In directed graphs, out-degree(v) = |Adj[v]|.
Handshaking lemma: for an undirected graph,
    Σ_{v∈V} degree(v) = 2|E|.
Figure 38: MST example.
It follows that adjacency lists use Θ(V + E) storage, a sparse representation
that is often better than the adjacency matrix.
Minimum spanning trees (MST):
One of the world’s most important algorithms, important in distributed sys-
tems.
1. Input: a connected, undirected graph G = (V, E) with weight function
w : E → R.
Assume all edge weights are distinct for simplicity (see the textbook
for the general case).
2. Output: a spanning tree T (a tree that connects all vertices) of
minimum weight:
    w(T) = Σ_{(u,v)∈T} w(u, v),   summing over the edges (u, v) of T.
Ex: MST example in Figure 38.
Figure 39: Optimal substructure example.
Optimal substructure:
MST has a great optimal substructure property.
1. Set-up: consider some MST T of G, as in Figure 39. Removing any edge
(u, v) ∈ T partitions T into two subtrees T_1 and T_2.
2.
Theorem 14. The subtree T_1 is an MST of G_1 = (V_1, E_1), the subgraph
of G induced by the vertices of T_1, that is,
    V_1 = the vertices of T_1,
    E_1 = {(x, y) ∈ E : x, y ∈ V_1}   (the edges of G with both endpoints in V_1).
Similarly for T_2.
Proof. Cut and paste:
    w(T) = w(u, v) + w(T_1) + w(T_2).
Figure 40: “Greedy” algorithm hallmark proof.
Suppose T′_1 were better than T_1 for G_1 (i.e., had lower weight). Then
    T′ = {(u, v)} ∪ T′_1 ∪ T_2
would be a spanning tree of G with lower weight than T, violating the assumption
that T is an MST.
Q: Do we have overlapping subproblems?
A: Yes, dynamic programming may work. BUT, MST exhibits another pow-
erful property which leads to an even more EFFICIENT algorithm.
Greedy algorithm.
Hallmark for “greedy” algorithms: a locally optimal choice is globally
optimal.
Theorem 15. Let T be an MST of G = (V, E), and let A ⊆ V. Suppose
(u, v) ∈ E is the least-weight edge connecting A to V − A. Then (u, v) ∈ T
(it belongs to the MST).
By applying this theorem, one only needs to check the least-weight edge crossing
the cut locally, instead of examining the whole tree.
Proof. Suppose (u, v) ∉ T. Cut and paste.
See Figure 40: (u, v) is the least-weight edge connecting A to V − A.
Consider the unique simple path from u to v in T (textbook Appendix
B.5.1).
Figure 41: Prim’s algorithm.
Swap (u, v) with the first edge on this path (in general there may be several
candidates; just pick the first one) that connects a vertex in A to a vertex in
V − A.
A spanning tree with lower weight than T results, contradicting the
minimality of T.
Prim’s algorithm:
IDEA: Maintain V − A as a priority queue Q. Key each vertex in Q with the
weight of the least-weight edge connecting it to a vertex in A.
The algorithm is in Figure 41. At the end, the edges {(v, π[v])} form the MST.
The Handshaking Lemma implies that there are Θ(E) implicit DECREASE-
KEY operations (the updates key[v] ← w(u, v)).
So the total time is Θ(V) · T_EXTRACT-MIN + Θ(E) · T_DECREASE-KEY.
    Q                T_EXTRACT-MIN    T_DECREASE-KEY    Total
    array            O(V)             O(1)              O(V^2)
    binary heap      O(log V)         O(log V)          O(E log V)
    Fibonacci heap   O(log V)         O(1)              O(E + V log V)
Table 4: Analysis of Prim’s algorithm.
Figure 42: Prim’s algorithm example.
If the graph is dense (E is close to V^2), the array is better; if it is sparse (E is
much smaller than V^2), the binary heap is better. See Table 4.
Ex: Prim’s algorithm example in Figure 42.
Kruskal’s algorithm uses a disjoint-set data structure and runs in O(E log V);
see the textbook. A Python sketch of Prim’s algorithm with a binary heap follows.
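The sketch referenced above (mine, not the course code): Prim's algorithm with Python's heapq. Since heapq has no DECREASE-KEY, stale heap entries are simply skipped ("lazy deletion"), which still gives the O(E log V) binary-heap bound from Table 4. The graph format and names are made up for illustration.

    import heapq

    def prim(adj, s=0):
        """adj: dict u -> list of (v, weight) for an undirected graph; returns MST edges."""
        in_tree = {s}                                 # the set A of the analysis
        mst = []
        heap = [(w, s, v) for v, w in adj[s]]         # candidate edges leaving the tree
        heapq.heapify(heap)
        while heap and len(in_tree) < len(adj):
            w, u, v = heapq.heappop(heap)             # least-weight edge crossing the cut
            if v in in_tree:
                continue                              # stale entry: v already added
            in_tree.add(v)
            mst.append((u, v, w))                     # greedy choice, safe by Theorem 15
            for x, wx in adj[v]:
                if x not in in_tree:
                    heapq.heappush(heap, (wx, v, x))
        return mst

    if __name__ == "__main__":
        graph = {0: [(1, 4), (2, 1)], 1: [(0, 4), (2, 2)], 2: [(0, 1), (1, 2)]}
        print(prim(graph))                            # [(0, 2, 1), (2, 1, 2)]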
17 Shortest Paths I : Properties, Dijkstra’s
Algorithm, Breadth-first Search
Paths review:
1. Paths in graphs:
Consider a digraph G = (V, E) with edge-weight function w : E → R.
The weight of a path p = v_1 → v_2 → ··· → v_k is defined as
    w(p) = Σ_{i=1}^{k−1} w(v_i, v_{i+1}),
the sum of the weights of the edges along the path.
2. A shortest path from u to v is a path of minimum weight from u to v.
The shortest-path weight is
    δ(u, v) = min{w(p) : p is a path from u to v}.
Note:
δ(u, v) = ∞ if no path from u to v exists.
Negative edge weights ⟹ some shortest paths may not exist (for
example, when the total weight of a cycle is negative, one can go
around the cycle arbitrarily many times to obtain ever smaller
weights; in that case δ(u, v) = −∞).
Theorem 16. A subpath of a shortest path is a shortest path.
Proof. Cut and paste: if a subpath were not a shortest path, replacing it with a
shorter one would yield a path shorter than the original shortest path, a contradiction.
Triangle inequality:
Theorem 17. For all vertices u, v, x ∈ V, we have δ(u, v) ≤ δ(u, x) + δ(x, v).
Proof. By picture: a shortest path from u to v is no longer than the particular
path that goes from u to x and then from x to v.
Single-source shortest paths:
A particular version of the shortest-path problem.
1. Problem: given a source vertex s ∈ V, find the shortest-path weights
δ(s, v) for all v ∈ V.
2. Further restriction: assume all edge weights w(u, v) are nonnegative;
then all shortest-path weights exist, with δ(u, v) > −∞.
3. IDEA: greedy.
Maintain a set S of vertices whose shortest-path distances from
s are known (initially s ∈ S).
At each step, add to S the vertex v ∈ V − S whose distance
estimate from s is minimal.
Update the distance estimates of the vertices adjacent to v.
4. Algorithm: Dijkstra’s algorithm (Algorithm 15).
Ex: See Figure 43.
Shortest-path tree: the union of the shortest paths from s. For each vertex,
take the last edge into that vertex whose relaxation set its d value; these
edges form the tree.
Correctness (sketch):
1. Invariant: d[v] ≥ δ(s, v) for all v ∈ V at all times.
2. When a vertex u is added to S, d[u] = δ(s, u) (the greedy choice is safe
because all edge weights are nonnegative).
3. Hence, on termination, d[v] = δ(s, v) for all v ∈ V.
Breadth-first search (BFS) (for unweighted graphs, i.e., w(u, v) = 1 for all
(u, v) ∈ E); see Algorithm 16.
BFS is essentially Dijkstra’s algorithm with two changes:
BFS does not use a priority queue; it uses a FIFO queue instead.
Algorithm 15 Dijkstra’s algorithm
1: d[s] ← 0                        (d is an array indexed by vertices;
2:                                  d[x] is the distance estimate from s to x;
3:                                  d[x] = δ(s, x) once x ∈ S, and this is the output)
4: for each v ∈ V − {s} do
5:     d[v] ← ∞
6: end for
7: S ← ∅
8: Q ← V                           (priority queue keyed on d)
9:                                 (the lines above are initialization)
10: while Q ≠ ∅ do
11:     u ← EXTRACT-MIN(Q)
12:     S ← S ∪ {u}
13:     for each v ∈ Adj[u] do
14:         if d[v] > d[u] + w(u, v) then
15:             (relaxation step, via the triangle inequality; implicit DECREASE-KEY)
16:             d[v] ← d[u] + w(u, v)
17:         end if
18:     end for
19: end while
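A minimal Python sketch of Dijkstra's algorithm (mine, not the course code). Python's heapq has no DECREASE-KEY, so an updated vertex is simply reinserted and stale heap entries are skipped when popped; the graph format and names are made up.

    import heapq

    def dijkstra(adj, s):
        """adj: dict u -> list of (v, w) with w >= 0; returns d[v] for all v."""
        d = {v: float("inf") for v in adj}
        d[s] = 0
        heap = [(0, s)]
        done = set()                                  # the set S of finished vertices
        while heap:
            du, u = heapq.heappop(heap)               # EXTRACT-MIN
            if u in done:
                continue                              # stale entry
            done.add(u)
            for v, w in adj[u]:
                if d[v] > du + w:                     # relaxation step
                    d[v] = du + w
                    heapq.heappush(heap, (d[v], v))   # "decrease-key" by reinserting
        return d

    if __name__ == "__main__":
        g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
        print(dijkstra(g, "s"))                       # {'s': 0, 'a': 1, 'b': 3}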
Algorithm 16 BFS (initialization as described below: d[s] ← 0, d[v] ← ∞ for
v ≠ s, and Q initially contains only s)
1: while Q ≠ ∅ do
2:     u ← DEQUEUE(Q)
3:     for each v ∈ Adj[u] do
4:         if d[v] = ∞ then
5:             d[v] ← d[u] + 1
6:             ENQUEUE(Q, v)
7:         end if
8:     end for
9: end while
Figure 43: Dijkstra’s algorithm example.
To relax an edge (u, v): if d[v] = ∞ (we haven’t visited v yet), we set
d[v] = d[u] + 1, which is v’s shortest-path weight, and add v to the end of the
queue.
We start with the queue containing just the vertex s (the only vertex whose
shortest path we know); the queue holds vertices whose shortest-path weight is
known but whose outgoing edges have not yet been processed. Initially s has
weight zero; all vertices reached by one outgoing edge of s get weight 1 (their
shortest-path weight from the source) and are appended to the end of the queue;
and processing vertices in this order keeps incrementing the d[·] values level by level.
The time is O(V + E) because each edge is examined only once: as soon as
d[v] is set, it never changes again (v is effectively added to S and processed
only once).
By showing that the FIFO queue processes vertices in the same order as the
priority queue would, the correctness argument for Dijkstra’s algorithm carries
over to BFS.
BFS can be used to
find all the vertices reachable from s, and
find the shortest-path weights from s to every other vertex when all edge
weights equal one. A Python sketch of BFS follows.
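The sketch referenced above (mine, not the course code): BFS computing unweighted shortest-path distances from s with a FIFO queue in place of Dijkstra's priority queue, in O(V + E) time. The graph format is made up for illustration.

    from collections import deque

    def bfs(adj, s):
        d = {s: 0}                        # d[v] = shortest-path weight (edge count) from s
        q = deque([s])                    # FIFO queue of discovered but unprocessed vertices
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:            # d[v] = infinity, i.e. not visited yet
                    d[v] = d[u] + 1
                    q.append(v)
        return d

    if __name__ == "__main__":
        g = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
        print(bfs(g, "s"))                # {'s': 0, 'a': 1, 'b': 1, 'c': 2}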
18 Shortest Paths II : Bellman-Ford, Linear
Programming, Difference Constraints
This lecture deals with negative edge weights.
Bellman-Ford algorithm: simpler than Dijkstra’s algorithm but not as fast. It
computes the shortest-path weights δ(s, v) from a source vertex s ∈ V to all
vertices v ∈ V, or
reports that a negative-weight cycle exists. See Algorithm 17.
Ex: See Figures 44 and 45.
Correctness of Bellman-Ford Algorithm:
Figure 44: Bellman-Ford algorithm, pass 1.
Algorithm 17 Bellman-Ford algorithm
1: d[s] ← 0
2: for each v ∈ V − {s} do
3:     d[v] ← ∞
4: end for                                     (initialization)
5: for i ← 1 to |V| − 1 do                     (i is just a counter)
6:     for each edge (u, v) ∈ E do
7:         if d[v] > d[u] + w(u, v) then
8:             d[v] ← d[u] + w(u, v)
9:         end if
10:     end for
11: end for                                    (relaxation step)
12: for each edge (u, v) ∈ E do
13:     if d[v] > d[u] + w(u, v) then
14:         report that a negative-weight cycle exists
15:     end if
16: end for
17: At the end, d[v] = δ(s, v) for all v if there are no negative-weight cycles. The
running time is O(V E).
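A minimal Python sketch of Bellman-Ford (mine, not the course code): |V| − 1 passes that relax every edge, followed by one extra pass to detect a reachable negative-weight cycle, in O(V E) time. The edge-list format is made up for illustration.

    def bellman_ford(vertices, edges, s):
        """edges: list of (u, v, w). Returns d, or None if a negative-weight cycle is reachable."""
        d = {v: float("inf") for v in vertices}
        d[s] = 0
        for _ in range(len(vertices) - 1):         # |V| - 1 relaxation passes
            for u, v, w in edges:
                if d[u] + w < d[v]:
                    d[v] = d[u] + w
        for u, v, w in edges:                      # one more pass: any further
            if d[u] + w < d[v]:                    # improvement means a negative cycle
                return None
        return d

    if __name__ == "__main__":
        V = ["s", "a", "b"]
        E = [("s", "a", 4), ("s", "b", 2), ("b", "a", -3)]
        print(bellman_ford(V, E, "s"))             # {'s': 0, 'a': -1, 'b': 2}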
Figure 45: Bellman-Ford algorithm example, continued.
Theorem 18. If G = (V, E) has no negative-weight cycles, then the Bellman-Ford
algorithm terminates with d[v] = δ(s, v) for all v ∈ V.
Proof. By monotonicity of the d values and correctness part I (d[v] ≥ δ(s, v) always),
it suffices to show that d[v] = δ(s, v) at some time.
Consider a shortest path p (in terms of total weight) from s to v with the minimum
number of edges, so p is a simple path (it repeats no vertices):
    p = v_0 → v_1 → v_2 → ··· → v_k,
where s = v_0 and v = v_k. By optimal substructure,
    δ(s, v_i) = δ(s, v_{i−1}) + w(v_{i−1}, v_i).
Then
    d[v_0] = 0 = δ(s, v_0)        (from initialization, and d[v_0] never drops below δ(s, v_0));
    after pass 1,  d[v_1] = δ(s, v_1);
    after pass 2,  d[v_2] = δ(s, v_2);
    ···
    after pass k,  d[v_k] = δ(s, v_k).
Since there are no negative-weight cycles, p is simple, and a simple path has at
most |V| − 1 edges, so k ≤ |V| − 1 passes suffice.
Corollary: If a value d[v] fails to converge after |V| − 1 passes of the Bellman-Ford
algorithm, then there exists a negative-weight cycle in G reachable from s.
Linear Programming (LP): the father and mother of the shortest-path problem.
Let A be an m × n matrix, b an m-vector, and c an n-vector. Find an
n-vector x that maximizes c^T x subject to Ax ≤ b, or determine that no such
solution exists. See Figure 46.
Many efficient algorithms (and code) to solve LPs.
Figure 46: Linear programming.
Simplex: practical, but exponential time in the worst case.
Ellipsoid: the first algorithm to solve LP in polynomial time; impractical.
Interior-point methods: a mixture (polynomial time and practical).
Random sampling: relatively new, completely randomized, simple.
Corollary: The Bellman-Ford algorithm can solve a system of m difference
constraints on n variables in O(mn) time.
19 Shortest Paths III : All-pairs Shortest
Paths, Matrix Multiplication, Floyd-Warshall,
Johnson
The last two lectures covered single-source shortest paths: find the shortest path
from a source vertex to every other vertex.
Unweighted: BFS, O(V + E).
Non-negative edge weights: Dijkstra, O(E + V log V).
General case: Bellman-Ford, O(V E).
Directed acyclic graph (DAG): topologically sort, then relax all edges once
in topological order (one round of Bellman-Ford), a linear-time algorithm.
In this lecture, we discuss all-pairs shortest paths, where we want to know
the shortest path weight between every pair of vertices.
For all-pairs shortest paths:
Unweighted: run BFS |V| times, O(V E).
Non-negative edge weights: run Dijkstra’s algorithm |V| times,
O(V E + V^2 log V).
General case (the focus of today): running Bellman-Ford |V| times costs O(V^2 E).
Three algorithms will be discussed today.
Definition of all-pairs shortest paths:
Input: a digraph G = (V, E), where V = {1, 2, ··· , n}, with edge-weight
function w : E → R.
Output: an n × n matrix of shortest-path weights δ(i, j) for all i, j ∈ V.
If the graph is dense, that is, E = Θ(V^2), then running Bellman-Ford |V| times
costs O(V^2 E) = O(V^4); the first goal is to beat this time complexity.
Dynamic programming:
Let A be the weighted adjacency matrix, A = (a_ij), where a_ij = w(i, j) if (i, j)
is an edge and a_ij = ∞ if there is no edge (with the convention a_ii = 0).
Define the subproblems: let d^(m)_ij be the weight of a shortest path from i to j
that uses at most m edges. By the Bellman-Ford argument, for m ≥ n − 1,
d^(m)_ij = δ(i, j) if there are no negative-weight cycles.
Claim:
    d^(0)_ij = 0 if i = j,   d^(0)_ij = ∞ if i ≠ j,
and for m ≥ 1,
    d^(m)_ij = min_k { d^(m−1)_ik + a_kj }.
A sketch of this recurrence in code follows.
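The sketch referenced above (mine, not the course code): the d^(m) recurrence is a "matrix multiplication" over (min, +). Applying it n − 2 more times to d^(1) = A gives all-pairs shortest paths in O(n^4); repeated squaring would reduce this to O(n^3 log n).

    INF = float("inf")

    def extend(D, A):
        """One application of d_ij <- min_k (d_ik + a_kj)."""
        n = len(A)
        return [[min(D[i][k] + A[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    def all_pairs(A):
        """A: weighted adjacency matrix with A[i][i] = 0 and INF for missing edges."""
        n = len(A)
        D = A                          # D = d^(1)
        for _ in range(n - 2):         # up to d^(n-1)
            D = extend(D, A)
        return D

    if __name__ == "__main__":
        A = [[0, 3, 8],
             [INF, 0, 2],
             [INF, INF, 0]]
        print(all_pairs(A))            # first row becomes [0, 3, 5]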
20 Advanced Topics: Parallel Algorithms I
For serial algorithms, most people use one basic model, sometimes called the
random-access machine (RAM) model, which is what we have been using for
analysis.
For parallel algorithms there are a huge number of models and no general
agreement on which is best (it varies with machines, configurations, and so on).
1. Dynamic multithreading: appropriate for multicore machines and for
shared-memory programming; NOT appropriate for distributed-memory
programs (where processors cannot directly access each other’s memory).
Ex: Fib(n) in Algorithm 18. Keyword explanation:
Algorithm 18 Fib(n)
1: if n < 2 then return n
2: end if
3: x ← SPAWN Fib(n − 1)            (thread A: lines 1–3)
4: y ← SPAWN Fib(n − 2)            (thread B: line 4)
5: SYNC
6: return (x + y)                  (thread C: lines 5–6)
(a) SPAWN: the subroutine can execute at the same time as its parent.
(b) SYNC: wait until all children are done.
The above expresses logical parallelism, not actual parallelism. A scheduler
determines how to map the dynamically unfolding execution onto whatever
processors are available. (A toy Python illustration of SPAWN/SYNC follows.)
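A toy illustration (mine, not Cilk) of the SPAWN/SYNC semantics in Python: each SPAWN starts a child thread and SYNC joins them. This only models the logical parallelism; a real scheduler, such as Cilk's, maps the unfolding computation onto processors far more efficiently.

    import threading

    def fib(n, result, index):
        if n < 2:
            result[index] = n
            return
        res = [0, 0]
        a = threading.Thread(target=fib, args=(n - 1, res, 0))   # SPAWN Fib(n - 1)
        b = threading.Thread(target=fib, args=(n - 2, res, 1))   # SPAWN Fib(n - 2)
        a.start(); b.start()
        a.join(); b.join()                                        # SYNC
        result[index] = res[0] + res[1]                           # thread C

    if __name__ == "__main__":
        out = [0]
        fib(10, out, 0)
        print(out[0])    # 55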
2. Multithreaded computation: the parallel instruction stream forms a
directed acyclic graph (dag). The vertices of the dag are threads, which are
maximal sequences of instructions not containing parallel control (SPAWN,
SYNC, or return from a SPAWNed procedure). See Figure 47 for a
graphic demonstration of the threads. Three types of edges (spawn,
return, and continuation) are used.
Figure 47: Directed acyclic graph for a multithreaded computation.
3. Performance measures. Let
    T_p = the running time on p processors,
    T_1 = work = the serial running time (the running time on one processor),
    T_∞ = critical-path length = the longest path in the dag.
Ex: For unit-time threads, Fib(4) has
    T_1 = 17   (count the number of small circles in Figure 47, i.e., how many
               instructions there are),
    T_∞ = 9    (the number of threads on a longest path through the dag).
Lower bounds:
    T_p ≥ T_1 / p,   since p processors can do at most p units of work in one step;
    T_p ≥ T_∞,       since adding processors cannot shorten the critical path.
4. Speedup on p processors: T_1 / T_p.
(a) If the speedup is Θ(p), it is linear speedup.
(b) If the speedup is greater than p, it is superlinear speedup (this never
happens in this model, by the lower bound T_p ≥ T_1/p).
(c) The maximum possible speedup, given T_1 and T_∞, is T_1 / T_∞ = the
parallelism = the average amount of work that can be done in parallel
along each step of the critical path, denoted P̄.
5. Scheduling: map the computation onto p processors. This is typically done
by a runtime system: an algorithm that maps the executing program (spawns,
returns, ...) onto the processors of the machine as it executes.
On-line schedulers are complex; here we illustrate the ideas using off-line
scheduling.
Greedy scheduler (p processors): do as much as possible in every step.
There are two types of step:
(a) Complete step: at least p threads are ready to run. Strategy: execute
any p of them.
(b) Incomplete step: fewer than p threads are ready to run. Strategy:
execute all of them.
Theorem 19 (Graham and Brent). A greedy scheduler executes any
computation G with work T_1 and critical-path length T_∞ in time
    T_p ≤ T_1/p + T_∞.
Note: this T_p is within a factor of two of the optimal time, since both T_1/p
and T_∞ are lower bounds on T_p.
Proof. Count the complete steps and the incomplete steps separately.
Complete steps: there are at most ⌊T_1/p⌋ complete steps, since otherwise
more than T_1 work would be done.
Incomplete steps: let G′ be the subgraph of G that remains to be
executed; see Figure 48.
(a) Threads with in-degree zero (no unexecuted predecessor) in G′ are ready
to be executed.
(b) An incomplete step executes all of the in-degree-zero threads, so the
critical-path length of the remaining subgraph decreases by one. (A
complete step need not decrease the critical-path length.)
(c) Hence the number of incomplete steps is at most T_∞ (an amortized
argument).
Adding the two contributions gives T_p ≤ T_1/p + T_∞. This is the fundamental
theorem of scheduling.
Corollary: Linear speedup is achieved when p = O(P̄) (the number of processors
is at most the order of the parallelism).
Proof. P̄ = T_1/T_∞, so p = O(P̄) = O(T_1/T_∞) means T_∞ = O(T_1/p). Then
    T_p ≤ T_1/p + T_∞ = O(T_1/p),
so the speedup is T_1/T_p = Ω(p); combined with T_1/T_p ≤ p, this is linear speedup.
6. Cilk: a dynamic multithreading language based on C.
21 Advanced Topics (cont.)
Figure 48: Subgraph G′ and graph G.