Introduction to Algorithms

Study Notes on Introduction to Algorithms
Lectures by Charles Leiserson and Erik Demaine
Notes by Qianqian Shan
Contents
1 Analysis of Algorithms, Insertion- and Merge-Sort
2 Asymptotic Notation, Recurrences, Substitution, Master Method
3 Divide and Conquer (Divide and Rule): Strassen, Fibonacci, Polynomial Multiplication
4 Quick Sort (Tony Hoare, 1962) and Randomized Quick Sort
5 Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort
6 Order Statistics, Median
7 Hashing, Hash Functions
8 Universal Hashing, Perfect Hashing
9 Relation of BSTs to Quicksort - Analysis of Random BST
10 Red-black Tree, Rotations, Insertions, Deletions
11 Augmenting Data Structures, Dynamic Order Statistics, Interval Trees
12 Skip Lists
13 Amortized Algorithms, Table Doubling, Potential Method
14 Competitive Analysis: Self-organizing Lists
15 Dynamic Programming, Longest Common Subsequence
16 Greedy Algorithms, Minimum Spanning Trees
17 Shortest Paths I: Properties, Dijkstra's Algorithm, Breadth-first Search
18 Shortest Paths II: Bellman-Ford, Linear Programming, Difference Constraints
19 Shortest Paths III: All-pairs Shortest Paths, Matrix Multiplication, Floyd-Warshall, Johnson
20 Advanced Topics: Parallel Algorithms I
21 Advanced Topics (cont.)
Background
The following notes were taken while I studied the open course Introduction to Algorithms, taught by Prof. Charles Leiserson and Prof. Erik Demaine, at https://ocw.mit.edu.
1 Analysis of Algorithms, Insertion- and Merge-Sort
Algorithm analysis is the theoretical study of computer-program performance and resource usage.
Q: In programming, what's more important than performance?
A: Features, modularity, security, user-friendliness, ...
Q: Why study algorithms and performance?
A: 1. Performance draws the line between the feasible and the infeasible. 2. Algorithms give a pervasive language for talking about program behavior.
Problem: Sorting
Input: a sequence <a_1, a_2, ..., a_n> of numbers.
Output: a permutation <a'_1, a'_2, ..., a'_n> of the input such that a'_1 ≤ a'_2 ≤ ··· ≤ a'_n.
Algorithm 1 Insertion-Sort(A, n)
1: for j ← 2 to n do
2:   key ← A[j]
3:   i ← j − 1
4:   while i > 0 and A[i] > key do
5:     A[i + 1] ← A[i]
6:     i ← i − 1
7:   end while
8:   A[i + 1] ← key
9: end for
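For reference, a minimal runnable Python version of the same procedure (my own transcription of Algorithm 1, using 0-based indexing; the function name is mine):

def insertion_sort(a):
    """Sort the list a in place, mirroring Algorithm 1 with 0-based indexing."""
    for j in range(1, len(a)):            # corresponds to j = 2..n
        key = a[j]
        i = j - 1
        while i >= 0 and a[i] > key:      # shift larger elements one slot right
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return a

# insertion_sort([6, 10, 13, 5, 8, 3, 2, 11]) -> [2, 3, 5, 6, 8, 10, 11, 13]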
Running time of the algorithm:
- Depends on the input (e.g., already sorted).
- Depends on the input size (6 elements vs. 6 × 10^9 elements); parameterize the running time by the input size.
- We want upper bounds on the running time, because an upper bound is a guarantee to the user.
Kinds of analysis:
- Worst-case (usually): T(n) = maximum time on any input of size n.
- Average-case (sometimes): T(n) = expected time over all inputs of size n (needs an assumption about the statistical distribution of inputs).
- Best-case (bogus): easy to cheat.
Q: What is insertion sort's worst-case time?
A: It depends on the computer (relative speed on the same machine, absolute speed on different machines).
BIG IDEA! Asymptotic analysis:
1. Ignore machine-dependent constants.
2. Look at the growth of T(n) as n → ∞.
Asymptotic notation:
Θ: drop the low-order terms and ignore the leading constants. For example, 3n^3 + 90n^2 + 6046 = Θ(n^3). As n → ∞, a Θ(n^2) algorithm always beats a Θ(n^3) algorithm.
Insertion sort analysis:
Worst case: input reverse-sorted,
T(n) = Σ_{j=2}^{n} Θ(j) = Θ(n^2),
an arithmetic series. For small n the algorithm is moderately fast, but not at all for large n.
Algorithm 2 Merge-Sort(A, n)
1: If n = 1, done.   Θ(1)
2: Recursively sort A[1..⌈n/2⌉] and A[⌈n/2⌉+1..n].   2T(n/2)
3: Merge the two sorted lists.   Θ(n) (linear-time merge)
The running time is
T(n) = Θ(1) if n = 1 (usually omitted), and T(n) = 2T(n/2) + Θ(n) if n > 1.
Use a recursion tree to solve T(n) = 2T(n/2) + cn with constant c > 0:
T(n) = cn + T(n/2) + T(n/2)
     = cn + c(n/2) + c(n/2) + T(n/4) + T(n/4) + T(n/4) + T(n/4)
     = ···
     = cn·log n + Θ(n) = Θ(n log n).
Note: log is an abbreviation for log_2, not the natural log.
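A short Python sketch of the same recurrence in code (my own helper names; not tuned, just illustrating the divide/conquer/merge structure):

def merge_sort(a):
    if len(a) <= 1:                       # T(1) = Theta(1)
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])            # 2 T(n/2)
    right = merge_sort(a[mid:])
    return merge(left, right)             # Theta(n) linear-time merge

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]     # append the leftover tail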
2 Asymptotic Notation, Recurrences, Substitution, Master Method
Asymptotic notation:
1. Upper bounds.
f(n) = O(g(n)) means there are constants c > 0, n_0 > 0 such that 0 ≤ f(n) ≤ c·g(n) for all sufficiently large n ≥ n_0. For example, 2n^2 = O(n^3).
Set definition of O:
O(g(n)) = {f(n) : there exist c > 0, n_0 > 0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n_0}.
Macro convention: a set in a formula represents an anonymous function in that set.
Ex: f(n) = n^3 + O(n^2) means there is a function h(n) = O(n^2) such that f(n) = n^3 + h(n).
Table 1: Analogies between asymptotic notations and comparison relations: O corresponds to ≤, Ω to ≥, Θ to =, o to <, and ω to >.
Ex: n^2 + O(n) = O(n^2). Here '=' means that everything on the left-hand side of the equation is also on the right-hand side: anything that is n^2 + O(n) is also O(n^2), but not the other way around (the relation is not symmetric). Interpretation: for any f(n) ∈ O(n), there is an h(n) ∈ O(n^2) such that n^2 + f(n) = h(n).
2. Lower bounds.
Ω(g(n)) = {f(n) : there exist constants c > 0, n_0 > 0 s.t. 0 ≤ c·g(n) ≤ f(n) for all n ≥ n_0}.
Ex: √n = Ω(log n): up to constant factors, √n is at least log n for sufficiently large n.
3. Θ(g(n)) = O(g(n)) ∩ Ω(g(n)).
4. Little-o corresponds to <, and ω corresponds to >. Their definitions are similar to O and Ω; the difference is that the relationship between f and g must hold for every constant c > 0.
Ex: 2n^2 = o(n^3), with n_0 = 2/c.
Ex: (1/2)n^2 = Θ(n^2), but (1/2)n^2 ≠ o(n^2).
Solving recurrences
1. Substitution method:
- Guess the form of the solution.
- Verify that the recurrence satisfies this bound by induction.
- Solve for the constants.
Ex: T(n) = 4T(n/2) + n, trivial case T(1) = Θ(1).
(a) Guess T(n) = O(n^3).
(b) Assume T(k) ≤ c·k^3 for k < n.
(c) Then
T(n) = 4T(n/2) + n
     ≤ 4c(n/2)^3 + n
     = (1/2)c·n^3 + n
     = c·n^3 − ((1/2)c·n^3 − n)
     ≤ c·n^3, provided (1/2)c·n^3 − n ≥ 0.
This bound is valid but not tight.
Ex: Prove T(n) = O(n^2).
(a) Assume T(k) ≤ c·k^2 for k < n.
(b) Then
T(n) = 4T(n/2) + n
     ≤ 4c(n/2)^2 + n
     = c·n^2 + n,
which is NOT ≤ c·n^2.
The statement is true but useless for completing the induction. To fix this, assume a strengthened induction hypothesis, T(k) ≤ c_1·k^2 − c_2·k for k < n:
T(n) = 4T(n/2) + n
     ≤ 4[c_1(n/2)^2 − c_2(n/2)] + n
     = c_1·n^2 + (1 − 2c_2)n
     = c_1·n^2 − c_2·n − (c_2 − 1)n
     ≤ c_1·n^2 − c_2·n,
provided (c_2 − 1)n ≥ 0, i.e., c_2 ≥ 1. For the base case, T(1) = Θ(1) ≤ c_1 − c_2 holds if c_1 is sufficiently large w.r.t. c_2.
2. Recursion-tree method: a particular way of adding up a recurrence.
Ex: T(n) = T(n/4) + T(n/2) + n^2, see Figure 1.
Total cost (level by level):
T(n) = (1 + 5/16 + (5/16)^2 + ··· + (5/16)^k + ···)·n^2 ≤ 2n^2 = Θ(n^2),
since Σ_{k=0}^{∞} (5/16)^k < Σ_{k=0}^{∞} (1/2)^k = 2.
Figure 1: Recursion tree of T(n) = T(n/4) + T(n/2) + n^2.
3. Master method: applies only to recurrences of the particular form
T(n) = aT(n/b) + f(n),
where a ≥ 1, b > 1, and f(n) is asymptotically positive (f(n) > 0 for n ≥ n_0).
Compare f(n) with n^{log_b a} (the number of leaves in the recursion tree):
(a) f(n) = O(n^{log_b a − ε}) for some ε > 0  ⟹  T(n) = Θ(n^{log_b a}).
(b) f(n) = Θ(n^{log_b a} (log n)^k) for some k ≥ 0  ⟹  T(n) = Θ(n^{log_b a} (log n)^{k+1}).
(c) f(n) = Ω(n^{log_b a + ε}) for some ε > 0, and a·f(n/b) ≤ (1 − ε')·f(n) for some ε' > 0  ⟹  T(n) = Θ(f(n)).
Ex: T(n) = 4T(n/2) + n, with a = 4, b = 2, and f(n) = n. Then n^{log_b a} = n^2, so case (a) applies and T(n) = Θ(n^2).
Ex: T(n) = 4T(n/2) + n^2: case (b) with k = 0, so T(n) = Θ(n^2 log n).
Ex: T(n) = 4T(n/2) + n^3: case (c), so T(n) = Θ(n^3).
Ex: T(n) = 4T(n/2) + n^2/log n: the master method does not apply (n^2/log n is not polynomially smaller than n^2, and case (b) requires k ≥ 0).
Figure 2: Proof sketch tree of master method.
Proof sketch/intuition, see Figure 2:
Case (a): the per-level costs increase geometrically going down the tree; the total is dominated by the leaves, Θ(n^{log_b a}).
Case (b): the cost at each level is roughly the same, especially at the uppermost levels; the total is about f(n)·log_b n.
Case (c): the per-level costs decrease geometrically going down the tree; the total is dominated by f(n) at the root.
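As a rough illustration only (my own sketch, not part of the lecture): for driving functions that are pure powers, f(n) = n^p, comparing p with log_b a decides the case. The log-factor refinement of case (b) with k > 0 is omitted, and the comparison uses floating point, so this is only meant for small illustrative examples.

import math

def master_case(a, b, p):
    """Classify T(n) = a*T(n/b) + n^p by comparing p with log_b a."""
    crit = math.log(a, b)                 # critical exponent log_b a
    if p < crit:
        return f"case (a): T(n) = Theta(n^{crit:g})"
    if p == crit:
        return f"case (b) with k = 0: T(n) = Theta(n^{crit:g} log n)"
    return f"case (c): T(n) = Theta(n^{p})"   # regularity condition holds for pure powers

# master_case(4, 2, 1) -> case (a);  master_case(4, 2, 2) -> case (b);  master_case(4, 2, 3) -> case (c)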
3 Divide and Conquer (Divide and Rule): Strassen, Fibonacci, Polynomial Multiplication
A joke on divide and conquer: a tried and tested way of conquering a land is to divide it into sections of some kind (for example, different political factions) and make them no longer like each other, like starting a family feud.
Divide and conquer is a basic and powerful algorithm design technique.
Algorithm:
1. Divide the problem (instance) into one or more subproblems.
2. Conquer each subproblem recursively.
3. Combine the solutions.
Quick review of merge sort, phrased in divide-and-conquer steps:
1. Divide the array into two halves (trivial).
2. Recursively sort each subarray.
3. Combine with a linear-time merge.
Running time of merge sort:
T(n) = 2T(n/2) + Θ(n),
where n/2 is the size of the subproblems (two problems of size n/2) and Θ(n) is the non-recursive divide-and-combine cost. This is case (b) of the master method with k = 0, so T(n) = Θ(n log n).
Ex: Binary search
Find x in a sorted array, in the divide-and-conquer paradigm:
1. Divide: compare x with the middle element of the array.
2. Conquer: recurse in one subarray.
3. Combine: trivial.
T(n) = 1·T(n/2) + Θ(1) = Θ(log n): the problem is divided into one subproblem plus one comparison with the middle element.
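A minimal iterative Python version of this search (my own sketch; returns the index of x in a sorted list, or -1 if absent):

def binary_search(a, x):
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # divide: compare with the middle element
        if a[mid] == x:
            return mid
        elif a[mid] < x:
            lo = mid + 1                  # conquer: recurse into the right half
        else:
            hi = mid - 1                  # conquer: recurse into the left half
    return -1                             # combine: trivial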
Ex: Powering a number
Given a number x and an integer n ≥ 0, compute x^n.
Naive algorithm: x · x ··· x = x^n, Θ(n) time due to n multiplications.
Divide and conquer algorithm:
x^n = x^{n/2} · x^{n/2} if n is even;  x^n = x^{(n−1)/2} · x^{(n−1)/2} · x if n is odd.
Since the first and second factors are the same in both cases, only one of them needs to be computed.
T(n) = T(n/2) + Θ(1) = Θ(log n).
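A direct Python transcription of this recursion (a sketch for integer n ≥ 0; Python's built-in pow already does the same thing):

def power(x, n):
    """Compute x**n with Theta(log n) multiplications."""
    if n == 0:
        return 1
    half = power(x, n // 2)               # compute x^(floor(n/2)) only once
    return half * half if n % 2 == 0 else half * half * x

# power(3, 10) == 59049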
Ex: Fibonacci numbers
F_n = 0 if n = 0;  F_n = 1 if n = 1;  F_n = F_{n−1} + F_{n−2} if n ≥ 2.
Naive recursive algorithm: for each n ≥ 2, compute F(n − 1) and F(n − 2) recursively. The time is exponential, Ω(φ^n) with φ = (1 + √5)/2.
Bottom-up algorithm: compute F_0, F_1, ..., F_n in order. The time is linear in n: Θ(n).
Naive recursive squaring algorithm: F_n = φ^n/√5 rounded to the nearest integer. However, this is not allowed here: floating-point round-off can ruin the computation.
Recursive squaring algorithm: time Θ(log n). The problem is similar to powering a number with divide and conquer, but here we power a matrix.
Theorem 1. For the Fibonacci numbers,
[F_{n+1}  F_n; F_n  F_{n−1}] = [1 1; 1 0]^n   (2×2 matrices written row by row),
which implies that the time is Θ(log n) by recursive squaring of the matrix.
Proof. By induction.
Base case, n = 1:
[F_2  F_1; F_1  F_0] = [1 1; 1 0] = [1 1; 1 0]^1.
Inductive step: suppose
[F_n  F_{n−1}; F_{n−1}  F_{n−2}] = [1 1; 1 0]^{n−1};
then
[F_{n+1}  F_n; F_n  F_{n−1}] = [F_n  F_{n−1}; F_{n−1}  F_{n−2}] · [1 1; 1 0] = [1 1; 1 0]^n.
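Combining Theorem 1 with recursive squaring gives a Θ(log n) Fibonacci algorithm, counting each arithmetic operation as constant time. A small Python sketch with a hand-rolled 2×2 matrix product (the representation and names are mine):

def mat_mult(A, B):
    """Multiply 2x2 matrices given as ((a, b), (c, d))."""
    return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
            (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

def mat_pow(A, n):
    """Raise a 2x2 matrix to the n-th power (n >= 1) by recursive squaring."""
    if n == 1:
        return A
    half = mat_pow(A, n // 2)
    sq = mat_mult(half, half)
    return mat_mult(sq, A) if n % 2 else sq

def fib(n):
    if n == 0:
        return 0
    return mat_pow(((1, 1), (1, 0)), n)[0][1]   # F_n is the off-diagonal entry

# [fib(k) for k in range(8)] == [0, 1, 1, 2, 3, 5, 8, 13]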
Ex: Matrix multiplication
C = [c_ij] = A·B, with A = [a_ij] and B = [b_ij] for i, j = 1, 2, ..., n.
Standard algorithm: c_ij = Σ_{k=1}^{n} a_ik·b_kj, i.e., n operations for each of the n^2 entries, so the time is Θ(n^3).
Divide and conquer algorithm:
Idea: view an n × n matrix as a 2 × 2 block matrix of (n/2) × (n/2) sub-matrices:
C = [r s; t u] = A·B = [a b; c d] · [e f; g h],
so r = ae + bg, s = af + bh, t = ce + dg, u = cf + dh. There are 8 recursive multiplications (ae, bg, af, ...) of (n/2) × (n/2) matrices and 4 additions (Θ(n^2)).
T(n) = 8T(n/2) + Θ(n^2) = Θ(n^3),
by case (a) of the master method.
Strassen's algorithm:
Idea: reduce the number of multiplications in the divide-and-conquer algorithm from 8 to 7:
p_1 = a(f − h)
p_2 = (a + b)h
p_3 = (c + d)e
p_4 = d(g − e)
p_5 = (a + d)(e + h)
p_6 = (b − d)(g + h)
p_7 = (a − c)(e + f)
As a result,
r = p_5 + p_4 − p_2 + p_6
s = p_1 + p_2
t = p_3 + p_4
u = p_5 + p_1 − p_3 − p_7
Algorithm:
1. Divide A and B, and compute the terms needed for the products p_1, ..., p_7. Time Θ(n^2).
2. Conquer: recursively compute p_1, ..., p_7.
3. Combine r, s, t, u from the p's. Time Θ(n^2).
The recurrence is T(n) = 7T(n/2) + Θ(n^2) = Θ(n^{log_2 7}) = O(n^{2.81}) (log_2 7 is strictly less than 2.81).
This is not the best algorithm known, but it is the simplest one that beats n^3. The best exponent so far is about 2.376.
Ex: Very Large Scale Integration (VLSI) layout
Problem: embed a complete binary tree with n leaves in a grid with minimum area.
Naive embedding as shown in Figure 3:
Height of the tree: H(n) = H(n/2) + Θ(1) = Θ(log n).
Width of the tree: W(n) = 2W(n/2) + O(1) = Θ(n).
The area is Θ(n log n).
Goal: get W(n) = Θ(√n) and H(n) = Θ(√n), so that the area is Θ(n); for example, a recurrence of the form L(n) = 2L(n/4) + O(1), achieved by the H layout below, solves to Θ(√n).
H layout as shown in Figure 4:
L(n) = 2L(n/4) + Θ(1) = Θ(√n)   (case (a) of the master method).
Figure 3: Naive embedding.
Figure 4: H layout.
4 Quick Sort (Tony Hoare, 1962) and Randomized Quick Sort
1. It is a divide-and-conquer algorithm.
2. It sorts "in place" (i.e., it rearranges the elements within the array).
3. It is very practical (with tuning).
Divide and conquer paradigm:
1. Divide: partition the array into 2 subarrays around a pivot x such that the elements in the lower subarray are ≤ x and the elements in the upper subarray are ≥ x:
[ ≤ x | x | ≥ x ]
2. Conquer: recursively sort the 2 subarrays.
3. Combine: trivial.
The key is the linear-time (Θ(n)) partitioning subroutine shown in Algo. 3.
Table 2: The invariant during partition. The boundaries p and q do not move; A[p] = x is the pivot, A[p+1..i] holds elements ≤ x, A[i+1..j−1] holds elements ≥ x, and A[j..q] is not yet examined. The partition loops through all elements and moves every element smaller than the pivot to the lower side.
The running time of the partition is T(n) = Θ(n) for n elements.
Ex: Array 6, 10, 13, 5, 8, 3, 2, 11, choose pivot x = 6:
6, 5, 13, 10, 8, 3, 2, 11
6, 5, 3, 10, 8, 13, 2, 11
6, 5, 3, 2, 8, 13, 10, 11
After the final swap of the pivot into place we have two subarrays, 2, 5, 3 and 8, 13, 10, 11, that are smaller and bigger than the pivot 6 respectively.
Algorithm 3 Partition(A, p, q), i.e., partition A[p..q]
1: x ← A[p]   (pivot = A[p])
2: i ← p
3: for j ← p + 1 to q do   (loop through all elements)
4:   if A[j] ≤ x then
5:     i ← i + 1
6:     exchange A[i] ↔ A[j]   (swap)
7:   end if
8: end for
9: exchange A[p] ↔ A[i]
10: return i   (the final location of the pivot)
Algorithm 4 Quicksort(A, p, q)
1: if p < q then
2:   r ← Partition(A, p, q)
3:   Quicksort(A, p, r − 1)
4:   Quicksort(A, r + 1, q)
5: end if
Initial call: Quicksort(A, 1, n).
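A runnable Python sketch of Algorithms 3 and 4, with the randomized pivot choice described later in this section (0-based indexing; my own transcription):

import random

def partition(a, p, q):
    """Partition a[p..q] around the pivot a[p]; return the pivot's final index."""
    x, i = a[p], p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def quicksort(a, p=0, q=None):
    if q is None:
        q = len(a) - 1
    if p < q:
        r = random.randint(p, q)          # randomized pivot: swap it to the front
        a[p], a[r] = a[r], a[p]
        m = partition(a, p, q)
        quicksort(a, p, m - 1)
        quicksort(a, m + 1, q)
    return a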
Analysis (assume all elements are distinct):
1. T(n) = worst-case time.
If the input is sorted or reverse-sorted, one side of the partition has no elements:
T(n) = T(0) + T(n − 1) + Θ(n)
     = Θ(1) + T(n − 1) + Θ(n)
     = T(n − 1) + Θ(n)
     = Θ(n^2)   (arithmetic series, as for insertion sort).
The worst-case time is no better than insertion sort.
Recursion tree in Figure 5:
cn + c(n − 1) + ··· + c·1 = Θ(Σ_{k=1}^{n} ck) = Θ(n^2).
Figure 5: Quick sort tree.
The total time is then
T(n) = Θ(n) + Θ(n^2) = Θ(n^2)
(the tree height contributes Θ(n) and the per-level costs sum to Θ(n^2)).
2. For intuition only, a best-case analysis: if we are really lucky, the partition splits the array n/2 : n/2,
T(n) = 2T(n/2) + Θ(n) = Θ(n log n).
Suppose the split is always 1/10 : 9/10; then
T(n) = T(n/10) + T(9n/10) + Θ(n) = Θ(n log n);
see the recursion tree in Figure 6.
3. Suppose we alternate lucky, unlucky, lucky, ...:
L(n) = 2U(n/2) + Θ(n) if lucky,
U(n) = L(n − 1) + Θ(n) if unlucky,
then
Figure 6: (1/10) : (9/10) split tree.
L(n) = 2[L(n/2 − 1) + Θ(n/2)] + Θ(n)
     = 2L(n/2 − 1) + Θ(n)
     = Θ(n log n).
Q: How can we make sure that we are usually lucky?
A: Randomly arrange the elements, or randomly choose the pivot.
Randomized quicksort: pivot on a random element (in code, swap the first element with some other randomly chosen element of the array, then run the ordinary partition).
- The running time is independent of the input ordering.
- It makes no assumptions about the input distribution.
- No specific input elicits the worst-case behavior.
- The worst case is determined only by the random-number generator.
Analysis:
Let T(n) be the random variable for the running time, assuming the random numbers are independent.
For k = 0, 1, ..., n − 1, let
X_k = 1 if the partition generates a k : (n − k − 1) split, and 0 otherwise.
X_k is an indicator random variable with E[X_k] = Pr(X_k = 1) = 1/n.
T(n) = T(0) + T(n − 1) + Θ(n) if the split is 0 : (n − 1),
       T(1) + T(n − 2) + Θ(n) if the split is 1 : (n − 2),
       ...
       T(n − 1) + T(0) + Θ(n) if the split is (n − 1) : 0
     = Σ_{k=0}^{n−1} X_k [T(k) + T(n − k − 1) + Θ(n)].
So the expected value of T(n) is
E[T(n)] = E[ Σ_{k=0}^{n−1} X_k (T(k) + T(n − k − 1) + Θ(n)) ]
        = Σ_{k=0}^{n−1} E[ X_k (T(k) + T(n − k − 1) + Θ(n)) ]
        = Σ_{k=0}^{n−1} E[X_k] · E[T(k) + T(n − k − 1) + Θ(n)]   (independence)
        = (1/n) Σ_{k=0}^{n−1} E[T(k)] + (1/n) Σ_{k=0}^{n−1} E[T(n − k − 1)] + (1/n) Σ_{k=0}^{n−1} Θ(n)
        = (2/n) Σ_{k=0}^{n−1} E[T(k)] + Θ(n)   (the first two sums are equal)   ···(1)
        = (2/n) Σ_{k=2}^{n−1} E[T(k)] + Θ(n).
(1): absorb the k = 0, 1 terms into Θ(n) for technical convenience.
Now prove E[T(n)] ≤ a·n·log n for some constant a > 0.
Proof. Choose a big enough so that a·n·log n ≥ E[T(n)] for small n.
Use the fact
Σ_{k=2}^{n−1} k·log k ≤ (1/2)n^2 log n − (1/8)n^2,
which can be proved either by splitting the summation or by an integral bound.
Use substitution:
E[T(n)] = (2/n) Σ_{k=2}^{n−1} E[T(k)] + Θ(n)
        ≤ (2/n) Σ_{k=2}^{n−1} a·k·log k + Θ(n)
        ≤ (2a/n) [ (1/2)n^2 log n − (1/8)n^2 ] + Θ(n)
        = a·n·log n − ( a·n/4 − Θ(n) )
        ≤ a·n·log n, if a·n/4 − Θ(n) ≥ 0, i.e., a is big enough.
In practice, randomized quicksort is typically three or more times faster than merge sort. It also tends to work well with caches and with virtual memory.
5 Linear-time Sorting: Lower Bounds, Counting Sort, Radix Sort
Q: How fast can we sort?
A: It depends on the computational model, i.e., on what we are allowed to do with the elements. For example: randomized quicksort Θ(n log n), heapsort Θ(n log n), merge sort Θ(n log n), insertion sort Θ(n^2).
Q: Can we do better than Θ(n log n)?
A:
Comparison sorting algorithms (model): only use comparisons to determine the relative order of elements.
Decision-tree model: more general than the comparison model. It is a graphical representation of what the algorithm does.
Ex: Suppose we'd like to sort <a_1, a_2, a_3>; see Figure 7.
In general, with <a_1, ..., a_n> to be sorted, each internal node has a label i : j with i, j ∈ {1, 2, ..., n}, meaning "compare a_i vs. a_j".
Figure 7: Decision tree for 3 elements.
The left subtree shows the subsequent comparisons if a_i ≤ a_j.
The right subtree shows the subsequent comparisons if a_i > a_j.
Each leaf gives a permutation <π(1), π(2), ..., π(n)> such that a_{π(1)} ≤ a_{π(2)} ≤ ··· ≤ a_{π(n)}.
Decision trees model comparison sorts:
- one tree for each input size n;
- view the algorithm as splitting into two branches whenever it compares two elements;
- the tree lists all possible executions of the algorithm (instruction traces);
- the running time (number of comparisons) of one execution is the length of the corresponding root-to-leaf path;
- therefore, the worst-case running time over all inputs of length n is the longest root-to-leaf path, i.e., the height of the tree.
Lower bound for decision-tree sorting:
Theorem 2. Any decision tree that sorts n elements has height Ω(n log n).
Proof. Facts to use:
- The number of leaves must be at least n! to get the right answer for every possible input.
- A binary tree of height h has at most 2^h leaves.
So n! ≤ 2^h; taking log_2 of both sides,
h ≥ log(n!)
  ≥ log[(n/e)^n]   (Stirling)
  = n log n − n log e
  = Ω(n log n).
Corollary: merge sort and heapsort are asymptotically optimal (in terms of growth in n) among comparison sorting algorithms. Randomized quicksort is asymptotically optimal in expectation.
Sorting in linear time:
Two algorithms that sort faster than n log n by stepping outside the comparison model.
1. Counting sort.
Input: A[1..n], where each A[j] ∈ {1, 2, ..., k}.
Output: B[1..n], a sorted permutation of A.
Auxiliary storage: C[1..k], indexed by the range of the input values.
Algorithm 5 Counting sort
1: for i ← 1 to k do
2:   C[i] ← 0   (initialize the counts C)
3: end for
4: for j ← 1 to n do
5:   C[A[j]] ← C[A[j]] + 1   (now C[i] = |{j : A[j] = i}|, the number of elements equal to i)
6: end for
7: for i ← 2 to k do
8:   C[i] ← C[i] + C[i − 1]   (now C[i] = |{j : A[j] ≤ i}|)
9: end for
10: for j ← n downto 1 do
11:   B[C[A[j]]] ← A[j]
12:   C[A[j]] ← C[A[j]] − 1   (distribution step)
13: end for
Figure 8: Radix sort example operation.
Ex: Suppose A = [4, 1, 3, 4, 3]; apply the algorithm for a simple demonstration (a runnable sketch follows below).
(a) The running time of the four loops is Θ(k) + Θ(n) + Θ(k) + Θ(n) = Θ(n + k); when k = O(n), counting sort takes Θ(n) time.
(b) If the numbers being sorted are small, counting sort may work well; however, the auxiliary storage becomes excessive when k is too big.
(c) Counting sort is NOT a comparison sort.
(d) There are no comparisons between elements in counting sort.
(e) An important property of counting sort is stability: it preserves the input order of equal elements.
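The runnable sketch mentioned above, in Python (my own transcription of Algorithm 5; values are assumed to lie in {1, ..., k}):

def counting_sort(a, k):
    """Stable Theta(n + k) sort of a list a with values in {1, ..., k}."""
    c = [0] * (k + 1)
    for x in a:                     # c[i] = number of elements equal to i
        c[x] += 1
    for i in range(2, k + 1):       # prefix sums: c[i] = number of elements <= i
        c[i] += c[i - 1]
    b = [None] * len(a)
    for x in reversed(a):           # scan from the right so equal keys stay in order
        b[c[x] - 1] = x             # c[x] is a 1-based output position
        c[x] -= 1
    return b

# counting_sort([4, 1, 3, 4, 3], 4) -> [1, 3, 3, 4, 4]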
2. Radix sort. It sorts a much larger range of numbers in linear time. It is one of the oldest sorting algorithms [Herman Hollerith, 1890]. It sorts on the least-significant digit first, using an auxiliary stable sort for each digit.
Ex: Figure 8.
(a) Correctness of radix sort: induct on the digit position t.
i. Assume by induction that the input is sorted on the low-order t − 1 digits.
ii. Sort on digit t. If two elements have the same t-th digit, the stable sort keeps them in their current order, so by the induction hypothesis they remain sorted; if two elements have different t-th digits, the sort on digit t puts them in the right order.
(b) Analysis.
i. Use counting sort for each digit.
ii. Say we have n integers, each b bits long (range [0, 2^b − 1]). Split each integer into b/r "digits" of r bits each (think of the number as written in base 2^r). Then b/r is the number of rounds, and the base 2^r plays, in some sense, the role of k in counting sort.
Ex: split a 32-bit word (b = 32) into 4 digits (r = 8).
iii. Running time:
T(n, b) = Θ((b/r) · (n + 2^r)),
i.e., the number of rounds times the time per round.
iv. Choose r to minimize T(n, b), e.g., by differentiating with respect to r and setting the derivative to zero.
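A Python sketch of LSD radix sort on non-negative integers, using r-bit digits; Python's stable built-in sort stands in for the per-digit counting sort (parameter names and defaults are mine):

def radix_sort(a, b=32, r=8):
    """Sort non-negative b-bit integers by r-bit digits, least-significant digit first."""
    mask = (1 << r) - 1
    for shift in range(0, b, r):                    # b/r rounds
        # sorted() is stable, so ties on this digit keep their previous order,
        # which is exactly what the correctness induction above requires.
        a = sorted(a, key=lambda x: (x >> shift) & mask)
    return a

# radix_sort([329, 457, 657, 839, 436, 720, 355]) -> [329, 355, 436, 457, 657, 720, 839]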
6 Order Statistics, Median
Order statistics: given n elements in an array, find the i-th smallest element.
Naive algorithm: sort and return the i-th element.
The median is the element of rank i = ⌊(n + 1)/2⌋ or ⌈(n + 1)/2⌉.
1. Randomized divide-and-conquer order-statistic selection, Algo. 6.
Intuition (all of today's analysis assumes that all elements are distinct):
(a) Lucky case, e.g., a (1/10) : (9/10) split every time; even if the target element is always in the 9/10 part,
T(n) = T(9n/10) + Θ(n) = Θ(n),   by case (c) of the master method.
(b) Unlucky case (worst-case analysis): one side of the partition has no elements,
T(n) = T(n − 1) + Θ(n) = Θ(n^2).
Algorithm 6 RAND-SELECT(A, p, q, i): find the i-th smallest element of A[p..q]
1: if p = q then
2:   return A[p]
3: end if
4: r ← RAND-PARTITION(A, p, q)   (r is the location of the pivot; see Algo. 3)
5: k ← r − p + 1   (k = rank(A[r]) within A[p..q], the number of elements ≤ the pivot)
6: if i = k then
7:   return A[r]   (Case 1)
8: end if
9: if i < k then
10:   return RAND-SELECT(A, p, r − 1, i)   (Case 2)
11: else
12:   return RAND-SELECT(A, r + 1, q, i − k)   (Case 3)
13: end if
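A Python sketch of RAND-SELECT (0-based array, rank i counted from 1; assumes distinct elements, as in the analysis below; my own transcription):

import random

def partition(a, p, q):
    x, i = a[p], p                        # pivot x = a[p], as in Algo. 3
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i

def rand_select(a, i, p=0, q=None):
    """Return the i-th smallest element (1 <= i <= q - p + 1) of a[p..q]."""
    if q is None:
        q = len(a) - 1
    if p == q:
        return a[p]
    r = random.randint(p, q)              # RAND-PARTITION: random pivot swapped to the front
    a[p], a[r] = a[r], a[p]
    m = partition(a, p, q)
    k = m - p + 1                         # rank of the pivot within a[p..q]
    if i == k:
        return a[m]
    if i < k:
        return rand_select(a, i, p, m - 1)
    return rand_select(a, i - k, m + 1, q)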
2. Analysis of the expected time of the randomized divide-and-conquer selection. The analysis is similar to that of randomized quicksort.
(a) Let T(n) be the random variable for the running time of RAND-SELECT on an input of size n. Assume the random variables are independent.
(b) Define the indicator variable X_k for k = 0, 1, ..., n − 1:
X_k = 1 if the partition generates a k : (n − k − 1) split, and 0 otherwise.
(c) Upper bound: assume that the i-th element always falls in the larger side of the partition,
T(n) = Σ_{k=0}^{n−1} X_k [ T(max{k, n − k − 1}) + Θ(n) ],
where T(max{k, n − k − 1}) + Θ(n) is the running time for a k : (n − k − 1) split.
Take expectations of both sides; similarly to the randomized quicksort analysis,
E[T(n)] = E[ Σ_{k=0}^{n−1} X_k (T(max{k, n − k − 1}) + Θ(n)) ]
        = (1/n) Σ_{k=0}^{n−1} E[T(max{k, n − k − 1})] + Θ(n)
        ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + Θ(n).
Claim: E[T(n)] ≤ c·n for some constant c > 0.
Proof. Use substitution:
E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} c·k + Θ(n)
        = (2c/n) Σ_{k=⌊n/2⌋}^{n−1} k + Θ(n)
        ≤ c·n − ( c·n/4 − Θ(n) )   ······(1)
        ≤ c·n, if c is chosen large enough that c·n/4 > Θ(n).
(1): uses the fact Σ_{k=⌊n/2⌋}^{n−1} k ≤ (3/8)n^2.
(d) To summarize, the algorithm works well and fast in practice, but its worst case is bad (Θ(n^2)).
3. Worst-case linear-time order statistics
(a) An algorithm that runs in linear time in worst-case (Blum, Floyd,
Pratt, Rivest and Tarjan [1973]).
(b) Key idea: Generate a good pivot recursively. A graphical descrip-
tion of the algorithm is in Figure 9.
7 Hashing, Hash Functions
1. Direct-access tables in Figure 10.
Figure 9: Worst-case linear-time order statistics.
Algorithm 7 SELECT(i, n): select the i-th smallest element from an array of size n
1: Divide the n elements into groups of 5 and find the median of each 5-element group.   Time Θ(n)
2: Recursively SELECT the median x of the ⌊n/5⌋ group medians to be the pivot.   Time T(⌊n/5⌋)
3: Partition around the pivot x; let k ← rank(x).   Time Θ(n)
4: if i = k then
5:   return x
6: end if
7: if i < k then
8:   recursively SELECT the i-th smallest element in the lower part
9: else
10:   recursively SELECT the (i − k)-th smallest element in the upper part
11: end if   (Time ≤ T(3n/4))
Figure 10: Direct access table.
(a) Operations
- INSERT(S, x): S ← S ∪ {x}.
- DELETE(S, x): S ← S − {x} (it takes the record x, not just a key).
- SEARCH(S, k): return the element x with key[x] = k, or NIL if no such x exists (search for a given key).
A set supporting operations that change it, like INSERT and DELETE, is called a "dynamic set".
(b) Direct-access table: the simplest implementation of these operations. It works when the keys are drawn from a small universe U = {0, 1, ..., m − 1} (assume the keys are distinct). It works by setting up an array T[0..m − 1] to represent the dynamic set S such that
T[k] = x if x ∈ S and key[x] = k, and NIL otherwise.
The operations take constant time Θ(1).
(c) Issues/limitations: m − 1 could be a huge number, and the situation is even worse for character-string keys.
(d) Hashing keeps the table small while still preserving some of these properties.
2. Hashing: resolving collisions by chaining
(a) A hash function h maps keys "randomly" into the slots of table T; see Figure 11 (left).
When a record to be inserted maps to an already occupied slot, a collision occurs.
Resolving collisions by chaining: link the records in the same slot into a list (create a list for each slot and put all the elements that hash to that slot into its list).
Ex: Figure 11 (right).
(b) Analysis.
i. Worst case: every key hashes to the same slot, so the data structure degenerates into a single linked list. Access time is Θ(n) if |S| = n.
ii. Average case, under the assumption of simple uniform hashing: (1) each key k ∈ S is equally likely to be hashed to any slot of T;
Figure 11: Hash map (left); Resolving collisions by chaining (right).
(2) each key is hashed independently of where the other keys are hashed.
Load factor of a hash table:
α = n/m = average number of keys per slot,
where n is the number of keys and m is the number of slots.
Expected time of an unsuccessful search for a record with a given key (looking for something that is not in the table, so the result is NIL):
Θ(1 + α),
where the 1 is the cost of applying the hash function and accessing the slot, and α is the cost of searching the list.
The expected search time is Θ(1) when α = O(1), or equivalently n = O(m) (the number of elements is bounded by a constant times the number of slots).
The expected time of a successful search is also asymptotically Θ(1 + α); see the textbook for the proof.
3. Choosing hash functions
- A hash function should distribute the keys uniformly into the slots.
- Regularity in the key distribution should not affect this uniformity (for example, a regularity you often see is that all the keys being inserted are even numbers).
Division method (a popular way to get a quick hash function):
h(k) = k mod m,
where m is the number of slots in the table.
Don't pick an m with a small divisor d. For example, if d = 2 and all keys are even, then the odd slots are never used. Another extreme example: if m = 2^r, then the hash doesn't depend on all the bits of k; if k = 10101010001110110 (binary) and r = 6, then h(k) = 110110, just the last six bits of k.
A good choice is to pick m to be a prime not too close to a power of 2 or 10.
Multiplication method (usually superior; a heuristic):
Assume the number of slots is m = 2^r and the computer has w-bit words (w = 32 or 64, for example).
Hash function:
h(k) = ((A · k) mod 2^w) right-shifted by (w − r) bits,
where A is an odd integer in the range 2^{w−1} < A < 2^w.
- Don't pick A too close to 2^{w−1} or 2^w.
- This method is fast: multiplication mod 2^w is fast, and a bitwise right shift is fast.
Ex: m = 8 = 2^3 and w = 7. Think of A·k as a fixed-point fractional number; only the bits kept by the mod and the shift matter. See Figure 12. (A small code sketch of both methods follows below.)
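The sketch of both methods, in Python (w, r, and the particular constant A below are example values of my own choosing, not from the lecture):

def hash_division(k, m):
    """Division method: m should be a prime not too close to a power of 2 or 10."""
    return k % m

def hash_multiplication(k, r=3, w=64, A=2654435761 * 2**32 + 1):
    """Multiplication method with m = 2^r slots on a w-bit machine.
    A is an odd integer with 2^(w-1) < A < 2^w; the default is just an example."""
    return ((A * k) % (2 ** w)) >> (w - r)

# hash_division(91, 7) == 0;  hash_multiplication(123456) is a slot in range(8)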
4. Resolving collisions by open addressing
Idea: no storage for links. Probe the table systematically until an empty slot is found (if the key hashes to a slot that is full, just hash again with a different (second) hash function; keep following the probe sequence until an empty slot is found).
The hash function depends on both the key and the probe number:
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1},
where U is the universe of keys, the first {0, 1, ..., m − 1} is the probe number, and the second {0, 1, ..., m − 1} is the slot.
Figure 12: Multiplication method.
The probe sequence <h(k, 0), ..., h(k, m − 1)> should be a permutation of {0, 1, ..., m − 1}.
Issues: the table may fill up, so the number of elements must stay ≤ m, the number of slots. Deletion is also difficult: if you remove a key from the table, a later search whose probe sequence would have passed through that key now finds an empty slot and stops too early. (You can delete items but keep their slots marked; it's messy.)
Ex: Figure 13.
Searches use the same probe sequence, as shown in Figure 13: a successful search finds the record, an unsuccessful search finds NIL.
Probing strategies:
(a) Linear probing:
h(k, i) = (h(k, 0) + i) mod m.
Its problem is primary clustering: regions of the hash table get very full, and anything that hashes into such a region has to scan through all the occupied slots there (long runs of occupied slots build up).
(b) Double hashing:
h(k, i) = (h_1(k) + i · h_2(k)) mod m.
Figure 13: Resolve collisions with open addressing.
This is an excellent method; usually one picks m = 2^r and h_2(k) odd (which guarantees that the probe sequence hits every slot). A small illustrative sketch follows below.
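The illustrative sketch, in Python: an open-addressing table with double hashing (m = 2^r, h_2 forced to be odd; deletion is deliberately omitted since, as noted above, it is messy; all names are mine):

class OpenAddressTable:
    def __init__(self, r=3):
        self.m = 2 ** r
        self.slots = [None] * self.m

    def _probe(self, k, i):
        h1 = k % self.m
        h2 = (k // self.m) | 1            # an odd step size guarantees a full cycle
        return (h1 + i * h2) % self.m

    def insert(self, k):
        for i in range(self.m):           # follow the probe sequence h(k,0), h(k,1), ...
            j = self._probe(k, i)
            if self.slots[j] is None or self.slots[j] == k:
                self.slots[j] = k
                return j
        raise RuntimeError("table is full")

    def search(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is None:     # empty slot: k cannot be in the table
                return None
            if self.slots[j] == k:
                return j
        return None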
Analysis of open addressing:
Assume uniform hashing: each key is equally likely to have any one of the m! permutations as its probe sequence, independently of the other keys.
Theorem 3. If α = n/m < 1, the expected number of probes in an unsuccessful search satisfies
E[# probes] ≤ 1/(1 − α).
Proof. For an unsuccessful search:
- One probe is always necessary.
- With probability n/m, the first probe hits an occupied slot and a second probe is needed.
- With probability (n − 1)/(m − 1), the second probe also collides and a third probe is needed; similarly for the subsequent probes.
Note that (n − i)/(m − i) < n/m = α for i = 1, 2, ..., n − 1.
So the expected number of probes is at most
1 + (n/m)·(1 + ((n − 1)/(m − 1))·(1 + ··· (1 + 1/(m − n + 1)) ···))
≤ 1 + α(1 + α(1 + α(···(1 + α)···)))
≤ 1 + α + α^2 + α^3 + ···
= Σ_{i=0}^{∞} α^i = 1/(1 − α).
See the textbook for a more rigorous proof and for the analysis of successful searches.
- If the table is half full, the expected number of probes is 1/(1 − 0.5) = 2.
- If the table is 90% full, the expected number of probes is 1/(1 − 0.9) = 10. The cost grows dramatically; typically you don't want to let the table get too full.
8 Universal Hashing, Perfect Hashing
Fundamental weakness of hashing: for any fixed choice of hash function, there exists a bad set of keys that all hash to the same slot.
Idea: choose the hash function at random, independently of the keys (universal hashing).
Universal hashing: let U be a universe of keys and let H be a finite collection of hash functions mapping U to {0, 1, ..., m − 1}. H is universal if for all x, y ∈ U with x ≠ y,
|{h ∈ H : h(x) = h(y)}| = |H|/m,
that is, the number of hash functions in H that hash x and y to the same slot is |H|/m. Equivalently, if h is chosen randomly from H, the probability of a collision between x and y is 1/m.
Theorem 4. Choose h randomly from H and suppose h hashes n keys into the m slots of table T. Then, for a given key x,
E[# collisions with x] < n/m = α   (the load factor).
Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and define the indicator
c_{xy} = 1 if h(x) = h(y), and 0 otherwise.
Note that E[c_{xy}] = 1/m and C_x = Σ_{y ∈ T − {x}} c_{xy}. The expected total number of collisions is
E[C_x] = E[ Σ_{y ∈ T − {x}} c_{xy} ]
       = Σ_{y ∈ T − {x}} E[c_{xy}]   (linearity of expectation)
       = Σ_{y ∈ T − {x}} 1/m
       = (n − 1)/m < n/m.
Constructing a universal hash function:
Let m be prime. Decompose any key k into r + 1 digits (the representation of k in base m),
k = <k_0, k_1, ..., k_r>,
where 0 ≤ k_i ≤ m − 1. The random strategy is:
Pick a = <a_0, a_1, ..., a_r>, where each a_i is chosen randomly from {0, 1, ..., m − 1}, and define
h_a(k) = ( Σ_{i=0}^{r} a_i k_i ) mod m.
The number of different hash functions is |H| = m^{r+1} (there are m choices for each a_i).
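A Python sketch of this construction (m must be prime; keys are assumed to fit in r + 1 base-m digits; the random vector a is drawn once when the function is created):

import random

def make_universal_hash(m, r):
    """Return a random member h_a of the universal family H for base-m keys."""
    a = [random.randrange(m) for _ in range(r + 1)]   # a_0, ..., a_r
    def h_a(k):
        total = 0
        for i in range(r + 1):
            k, digit = divmod(k, m)                   # digit k_i of k in base m
            total += a[i] * digit
        return total % m
    return h_a

# h = make_universal_hash(m=101, r=3); h(123456) is a slot in range(101)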
Theorem 5. H is universal.
Proof. Pick two distinct keys x = <x_0, x_1, ..., x_r> and y = <y_0, y_1, ..., y_r>. They must differ in at least one digit; without loss of generality, they differ in position 0. For how many h_a ∈ H do x and y collide?
We must have h_a(x) = h_a(y), that is,
z        1 2 3 4 5 6
z^{−1}   1 4 5 2 3 6
Table 3: Inverses modulo m = 7 (example for Theorem 6).
Σ_{i=0}^{r} a_i x_i ≡ Σ_{i=0}^{r} a_i y_i   (mod m)
⟹ Σ_{i=0}^{r} a_i (x_i − y_i) ≡ 0   (mod m)
⟹ a_0(x_0 − y_0) + Σ_{i=1}^{r} a_i (x_i − y_i) ≡ 0   (mod m)
⟹ a_0(x_0 − y_0) ≡ − Σ_{i=1}^{r} a_i (x_i − y_i)   (mod m)
⟹ a_0 ≡ [ − Σ_{i=1}^{r} a_i (x_i − y_i) ] · (x_0 − y_0)^{−1}   (mod m)   ······(1)
(1): since x_0 ≠ y_0, Theorem 6 guarantees that the inverse (x_0 − y_0)^{−1} exists.
Thus, for any choice of a_1, a_2, ..., a_r, exactly ONE of the m choices for a_0 causes x and y to collide, and there is no collision for the other m − 1 choices of a_0.
The number of h_a's that cause x and y to collide is therefore 1 · m · m ··· m = m^r = |H|/m, so H is universal by definition.
A fact from number theory:
Theorem 6. Let m be prime. For any z ∈ Z_m (the integers mod m) with z ≢ 0, there exists a unique z^{−1} ∈ Z_m such that
z · z^{−1} ≡ 1 (mod m),
i.e., z·z^{−1} mod m equals 1.
Note: this is not true if m is not prime; think about m = 10.
Ex: m = 7, Table 3. For z = 2, 2 · 4 mod 7 = 1.
Figure 14: Perfect hashing.
Perfect hashing:
Given n keys, construct a static hash table of size m = O(n) such that SEARCH takes O(1) time in the worst case.
IDEA: use a two-level scheme with universal hashing at both levels, with NO collisions at level 2. If n_i items hash to level-1 slot i, then use m_i = n_i^2 slots in the level-2 table S_i (so the level-2 tables are very sparse). See Figure 14 for an example.
Analysis of level 2 of perfect hashing:
Theorem 7. If we hash n keys into m = n^2 slots using a random h from a universal family H, then
E[# of collisions] < 1/2.
Proof. The probability that 2 given keys collide under h is 1/m = 1/n^2. There are C(n, 2) = n(n − 1)/2 pairs of keys that could possibly collide, so the expected number of collisions is
E[# of collisions] = C(n, 2) · 1/n^2 = (n − 1)/(2n) < 1/2.
Markov's inequality: for a random variable X ≥ 0 and any t > 0,
Pr(X ≥ t) ≤ E[X]/t.
Corollary: the probability of no collisions is at least 1/2.
Proof. Apply Markov's inequality with t = 1: Pr(X ≥ 1) ≤ E[X] < 1/2, so Pr(X = 0) ≥ 1/2.
To find a good level-2 hash function, just test a few functions from H; we will find one quickly, since at least half of them work.
Analysis of the storage of perfect hashing:
1. Level 1: choose m = n.
2. Level 2: let n_i be the random variable for the number of keys that hash to slot i of T, and use m_i = n_i^2 slots in each level-2 table S_i. The total expected storage is
n + E[ Σ_{i=0}^{m−1} Θ(n_i^2) ] = Θ(n),
by the same analysis as for bucket sort.
9 Relation of BSTs to Quicksort - Analysis of Random BST
Randomly built binary search tree (BST) definition:
Binary-search-tree sort for an array A is given in Algo. 8.
Ex: A = [3, 1, 8, 2, 6, 7, 5]; running time: O(n) for the walk, Ω(n log n) for the n tree-inserts (Figure 15).
Worst case: if the array is already sorted or reverse-sorted, the time is Σ_{x ∈ Tree} depth(x) = Θ(n^2), where x ranges over the nodes.
Lucky case: the tree is balanced, its height is O(log n), and the time is O(n log n).
The running time is the same as quicksort's.
Algorithm 8 Binary-Search-Tree-Sort(A)
1: T ← create an empty BST
2: for i = 1 to n do
3:   TREE-INSERT(T, A[i])   (insert A[i] into the tree)
4: end for
5: Perform an inorder tree walk of T   (inorder traversal)
Figure 15: Inorder tree traversal.
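A Python sketch of BST sort with a plain (unbalanced) binary search tree stored as nested lists (my own representation; TREE-INSERT and the inorder walk follow Algo. 8):

def tree_insert(t, key):
    """Insert key into BST t, where t is None or [key, left, right]."""
    if t is None:
        return [key, None, None]
    if key < t[0]:
        t[1] = tree_insert(t[1], key)
    else:
        t[2] = tree_insert(t[2], key)
    return t

def inorder(t, out):
    if t is not None:
        inorder(t[1], out)
        out.append(t[0])
        inorder(t[2], out)
    return out

def bst_sort(a):
    t = None
    for x in a:                           # n tree-inserts
        t = tree_insert(t, x)
    return inorder(t, [])                 # O(n) inorder walk gives the sorted order

# bst_sort([3, 1, 8, 2, 6, 7, 5]) -> [1, 2, 3, 5, 6, 7, 8]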
Figure 16: Difference between height of tree and expected average depth.
Relation to quicksort: BST sort and quicksort make the same comparisons, just in a different order.
Randomized BST sort:
- Randomly permute A.
- BST-sort(A).
Its time equals the time of randomized quicksort on the same permutation, so E[time] = E[randomized quicksort time] = Θ(n log n). The corresponding tree is called a randomly built BST.
Expected average depth:
E[ (1/n) Σ_{x∈T} depth(x) ] = (1/n) Θ(n log n) = Θ(log n),
which is NOT the height of the tree (the maximum depth of any node). See Figure 16 for an example.
Theorem 8.
E[height of a randomly built BST] = O(log n).
Proof. Outline:
1. Prove Jensen's inequality, f(E[X]) ≤ E[f(X)] for a convex function f.
2. Instead of analyzing X_n, the random variable for the height of a randomly built BST on n nodes, analyze its exponentiation Y_n = 2^{X_n}.
3. Prove that E[Y_n] = O(n^3).
4. Conclude that
2^{E[X_n]} ≤ E[2^{X_n}] = E[Y_n] = O(n^3),
so E[X_n] ≤ log O(n^3) = 3 log n + O(1).
Step 1:
f : R → R is convex if for all α, β ≥ 0 with α + β = 1 and any x, y ∈ R,
f(αx + βy) ≤ αf(x) + βf(y).
Lemma: if f : R → R is convex, x_1, x_2, ..., x_n ∈ R, α_1, ..., α_n ≥ 0 and Σ_{k=1}^{n} α_k = 1, then
f( Σ_{k=1}^{n} α_k x_k ) ≤ Σ_{k=1}^{n} α_k f(x_k).
Prove the lemma by induction, using the definition of convexity.
Jensen's inequality (letting n → ∞):
f(E[X]) = f( Σ_{k=−∞}^{∞} k·Pr(X = k) ) ≤ Σ_{k=−∞}^{∞} f(k)·Pr(X = k) = E[f(X)].
Step 2:
Let X_n be the random variable for the height of a randomly built BST on n nodes, and let Y_n = 2^{X_n} (2^x is a convex function of x). Here rank(k) means the rank in sorted order: for example, if the root of a BST has k − 1 elements to its left (smaller than the root value) and n − k elements to its right, then the rank of the root is k.
If the root r has rank k, then the height of the tree is the maximum of the heights of the two subtrees plus 1 (for the top level):
X_n = 1 + max{X_{k−1}, X_{n−k}},
and therefore
Y_n = 2 · max{Y_{k−1}, Y_{n−k}}.
Step 3:
Define the indicator random variable Z_{nk} as
Z_{nk} = 1 if the root has rank k, and 0 otherwise.
So Pr(Z_{nk} = 1) = E[Z_{nk}] = 1/n, and
Y_n = Σ_{k=1}^{n} Z_{nk} · [ 2 max{Y_{k−1}, Y_{n−k}} ].
The expectation is
E[Y_n] = E[ Σ_{k=1}^{n} Z_{nk} · 2 max{Y_{k−1}, Y_{n−k}} ]
       = Σ_{k=1}^{n} E[ Z_{nk} · 2 max{Y_{k−1}, Y_{n−k}} ]
       = Σ_{k=1}^{n} 2 E[Z_{nk}] · E[ max{Y_{k−1}, Y_{n−k}} ]   (independence of the root's rank and the subtree heights)
       = (2/n) Σ_{k=1}^{n} E[ max{Y_{k−1}, Y_{n−k}} ]
       ≤ (2/n) Σ_{k=1}^{n} E[ Y_{k−1} + Y_{n−k} ]
       = (2/n) ( Σ_{k=1}^{n} E[Y_{k−1}] + Σ_{k=1}^{n} E[Y_{n−k}] )
       = (4/n) Σ_{k=1}^{n} E[Y_{k−1}]
       = (4/n) Σ_{k=0}^{n−1} E[Y_k].
Claim: E[Y_n] ≤ c·n^3 for some constant c.
Prove it by substitution. Base case: n = O(1), true if c is sufficiently large. Induction:
E[Y_n] = (4/n) Σ_{k=0}^{n−1} E[Y_k]
       ≤ (4/n) Σ_{k=0}^{n−1} c·k^3
       ≤ (4c/n) ∫_0^n x^3 dx
       = c·n^3.
Step 4:
Since f(x) = 2^x is convex, Jensen's inequality gives
2^{E[X_n]} ≤ E[2^{X_n}] = E[Y_n] ≤ c·n^3.
Taking log_2 of both sides,
E[X_n] ≤ 3 log n + O(1).
A sharper analysis shows E[X_n] ≈ 2.9882 log n (Luc Devroye, 1986).
10 Red-black Tree, Rotations, Insertions, Deletions
Goal: Get a search tree structure so we can insert, delete and search.
Balanced search trees: search-tree data structures that maintain a dynamic set of n elements using a tree of height O(log n). Balanced search trees are not necessarily binary; this lecture focuses on the binary case. Examples of balanced search trees:
AVL trees
2-3 trees
2-3-4 trees
B-trees
Red-black trees (binary search tree)
Skip lists (more or less a tree)
treaps
Red-black tree: a binary search tree with an extra color field in each node, satisfying the following red-black properties:
1. Every node is either red or black.
Figure 17: Simple paths of red-black tree (property 4).
2. The root and the leaves (NILs) are black.
3. Every red node has a black parent (there can never be two consecutive red nodes on a path).
4. All simple paths (paths that don't repeat any vertices) from a node x to a descendant leaf of x have the same number of black nodes, called black-height(x) (the count does not include x itself). See Figure 17.
Ex: See Figure 18 for an example of a red-black tree.
Height of a red-black tree:
Theorem 9. A red-black tree with n keys has height
h ≤ 2 log(n + 1).
Proof. Proof sketch (see the textbook for a detailed proof by induction).
Merge each red node into its black parent, producing a 2-3-4 tree as in Figure 19:
- every internal node has 2 to 4 children;
- every leaf has the same depth (the black-height of the root), by property 4 of the red-black tree.
Figure 18: Red-black tree example.
Figure 19: Merged red-black tree.
Denote the height of the original tree by h and of the merged tree by h'. First bound h' and then relate it to h.
1. The number of leaves of each tree is n + 1, where n is the number of internal nodes.
2. In the 2-3-4 tree, the number of leaves is in the range [2^{h'}, 4^{h'}]: every node branches at least 2 ways and at most 4 ways.
3. 2^{h'} ≤ n + 1 ≤ 4^{h'} ⟹ h' ≤ log(n + 1).
4. The red nodes make up at most half of any root-to-leaf path, so h ≤ 2h'.
5. Therefore h ≤ 2 log(n + 1).
Corollary: Queries
Queries (SEARCH, MIN, MAX, SUCCESSOR, PREDECESSOR) all run in O(log n) time on a red-black tree with n nodes.
Updates (INSERT, DELETE) must modify the tree; to preserve the red-black properties they use:
- BST operations (tree insert, tree delete),
- color changes of some of the nodes,
- restructuring of links via rotations.
Rotation: see Figure 20 for left and right rotations. A rotation (1) preserves the inorder ordering of the keys: for a ∈ α, b ∈ β, c ∈ γ we have a ≤ A ≤ b ≤ B ≤ c; and (2) can be performed in O(1) time.
RB-INSERT(x):
Idea:
1. Insert x into the binary search tree (search for where x is supposed to go and create a new node hanging off an existing node).
Figure 20: Rotation operations of tree.
2. Color x red (so only the "red node has a black parent" property can be violated, since x's parent might be red); move the violation up the tree by recoloring until it can be fixed with rotations and recoloring.
Ex: Insert 15 into the previous example red-black tree, as in Figure 21.
Algorithm 9 RB-Insert(T, x)
1: TREE-INSERT(T, x)
2: color[x] ← RED
3: while x ≠ root[T] and color[p[x]] = RED do   (p[x] is the parent of x)
4:   if p[x] = left[p[p[x]]] then   (Case A: p[x] is a left child of x's grandparent)
5:     y ← right[p[p[x]]]   (y is x's uncle/aunt, the right child of the grandparent)
6:     if color[y] = RED then
7:       Case 1
8:     else if x = right[p[x]] then   (Case 2: x is a right child of its parent)
9:       Case 2, then Case 3
10:     else   (Case 3: x is a left child of its parent)
11:       Case 3
12:     end if
13:   else   (Case B: symmetric, swap "left" and "right" above)
14:   end if
15: end while
16: color[root[T]] ← BLACK
The three cases are in Figure 22.
Figure 21: RB-INSERTION of 15.
Summary of RB-INSERT:
1. RB-INSERT adds x to the dynamic set and preserves red-blackness.
2. Case 1 only recolors; cases 2 and 3 perform 1 or 2 rotations and then terminate.
3. Case 1 can occur at most O(log n) times, while cases 2 and 3 occur at most once each, so there are at most 2 rotations in total.
4. See the textbook for RB-DELETE.
11 Augmenting Data Structures, Dynamic Order Statistics, Interval Trees
This lecture starts the design phase of the class: how to design efficient data structures and efficient algorithms for various problems (see the textbook, Appendix B).
Dynamic order statistics
OS-SELECT(i, S): returns the i-th smallest element of the dynamic set S.
OS-RANK(x, S): returns the rank of x ∈ S in the sorted order of S's elements.
Idea: use a red-black tree for the set S and keep the size of each subtree in its root node. See Figure 23 for an example OS-tree; the size of x is
size[x] = size[left[x]] + size[right[x]] + 1,
where left[x] is the left child of x.
Note: the size of NIL is zero, and in practice accessing size[NIL] through a null pointer is a common source of errors. The implementation trick is to use a sentinel (dummy record) with size[NIL] = 0: instead of a null pointer anywhere NIL would appear in the tree, use a special whole record called NIL whose size field is 0. This is a common programming trick to simplify code.
Figure 22: Three cases in Algo 9. RB-INSERT.
Figure 23: OS-tree example.
The code for OS-SELECT is given in Algo. 10.
Algorithm 10 OS-SELECT(x, i): return the i-th smallest element in the subtree rooted at x
1: k ← size[left[x]] + 1   (k is the rank of x within its subtree)
2: if i = k then
3:   return x
4: end if
5: if i < k then
6:   return OS-SELECT(left[x], i)
7: else
8:   return OS-SELECT(right[x], i − k)
9: end if
Ex: OS-SELECT(root, 5) returns a pointer to the node whose key is H. Try it.
Q: Why not keep the ranks themselves in the nodes instead of subtree sizes?
A: Ranks are hard to maintain when the red-black tree is modified by INSERT and DELETE.
Strategy: update the subtree sizes when inserting or deleting.
Ex: Insert an element with key K after key H and update the subtree sizes as shown
Figure 24: OS-tree insertion example.
in Figure 24. But we must also handle the rebalancing of the tree: (1) red-black color changes have no effect on subtree sizes; (2) rotations: look at the children and fix up the sizes in O(1) time per rotation.
So INSERT and DELETE still take O(log n) time.
The data-structure augmentation methodology, illustrated with order-statistics trees. Checklist:
1. Choose an underlying data structure (red-black tree).
2. Determine the additional information to be stored in the data structure (subtree sizes).
3. Verify that this information CAN be maintained by the modifying operations (RB-INSERT, RB-DELETE; don't forget rotations).
4. Develop new operations that use the stored information (OS-SELECT, OS-RANK).
Figure 25: Interval tree example.
There are usually interactions between the above steps.
Ex: Interval trees
Maintain a dynamic set of intervals (e.g., time intervals).
Query: find an interval in the set that overlaps a given query interval. See Figure 25.
Methodology:
1. Choose an underlying data structure (a red-black tree keyed on the low/left endpoint).
2. Store in each node x the largest endpoint value m[x] in the subtree rooted at x:
m[x] = max{high[int[x]], m[left[x]], m[right[x]]}.
3. Modifying operations:
INSERT: fix the m's on the way down.
ROTATIONS: fixing up the m's during a rotation takes O(1) time.
So the total INSERT time is O(log n); DELETE is similar.
Algorithm 11 INTERVAL-SEARCH(i): find an interval that overlaps interval i
1: x ← root
2: while x ≠ NIL and (low[i] > high[int[x]] or low[int[x]] > high[i]) do
3:   (i and the current int[x] don't overlap; move x down to a subtree)
4:   if left[x] ≠ NIL and low[i] ≤ m[left[x]] then
5:     x ← left[x]
6:   else
7:     x ← right[x]
8:   end if
9: end while
10: return x
4. New operations: INTERVAL-SEARCH(i). See Figure 25 and Algo. 11.
Analysis:
(a) The time is O(log n): the search walks down the tree, taking time proportional to the height of the tree.
(b) To list all overlapping intervals (suppose there are k of them): search, list, delete, and repeat, k times; this takes O(k log n) time (an output-sensitive bound).
(c) The best bound to date is O(k + log n), using a different structure.
Correctness of Algo. 11:
Theorem 10. Let L be the set of intervals in the left subtree of x and R the set of intervals in the right subtree.
- If the search goes right, then {i' ∈ L : i' overlaps i} = ∅.
- If the search goes left, then {i' ∈ L : i' overlaps i} = ∅ ⟹ {i' ∈ R : i' overlaps i} = ∅.
That is, it is always safe to examine only 1 of the 2 children: we will either find an overlapping interval or there was none to be found.
Proof. (a) Suppose the search goes right. If left[x] = NIL, we are done.
Otherwise, Algorithm 11 went right because low[i] > m[left[x]], and m[left[x]] = high[j] is the high endpoint of some interval j ∈ L; by definition no other interval in L has a larger high endpoint than high[j]. Hence no interval in L can overlap i, i.e., {i' ∈ L : i' overlaps i} = ∅.
(b) Suppose the search goes left and {i' ∈ L : i' overlaps i} = ∅. By Algo. 11, low[i] ≤ m[left[x]] = high[j] for some j ∈ L. Since j ∈ L does not overlap i, we must have high[i] < low[j]. By the binary-search-tree property (the tree is keyed on low endpoints), low[j] ≤ low[i'] for all i' ∈ R. Hence high[i] < low[i'] for all i' ∈ R, so no interval in R overlaps i either.
12 Skip Lists
Skip lists are another balanced search structure that maintains a dynamic set subject to insertion, deletion, and search. (Other efficient structures include treaps, red-black trees, and B-trees.)
- a dynamic search structure (William Pugh, 1989)
- a randomized structure
- simple
- the running time of each operation is O(log n) in expectation and with high probability (≥ 1 − 1/n^α), which is a strong bound on the tail of the distribution.
1. Starting from scratch
A sorted linked list is the simplest data structure; searches take Θ(n) time in the worst case.
2. Two sorted linked lists
Ex: What is this sequence? 14, 23, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110, 116, 125.
It is the stations of the NYC 7th Avenue Line. The idea: two subway lines, an express line connecting a few of the stations and a local line connecting all stations, with links between the two lines at the common stations. See Figure 26.
Searching in two linked lists (Algo. 12):
Q: Which keys should go in L_1 to get the best worst case?
A: It is best to spread them out uniformly.
Analysis:
Figure 26: Two linked lists.
Algorithm 12 SEARCH(x)
1: Walk right in the top linked list L_1 until going right would go too far.
2: Walk down to the bottom linked list L_2.
3: Walk right in L_2 until x is found (or some value greater than x, if x is not in the list).
The cost of a search is about
|L_1| + |L_2|/|L_1|   (keys of L_1 uniformly spread).
Minimize the cost of a search: denote |L_2| = n. Up to constant factors, the minimum is attained when
|L_1| = n/|L_1|, i.e., |L_1| = √n.
The search cost is then roughly 2√n.
3. log n linked lists
For k sorted linked lists, the cost is roughly k · n^{1/k}.
For k = log n: log n · n^{1/log n} = 2 log n (since n^{1/log n} = 2).
Ideal skip list: the log n linked-list structure. See Figure 27.
4. INSERTION
A skip list maintains this structure only roughly under updates (insertion and deletion); it cannot be maintained exactly. To insert an element x into a skip list:
Figure 27: log n linked lists.
- SEARCH(x) to see where x fits in the bottom list.
- Always insert x into the bottom list (invariant: the bottom list contains all elements).
- Issue: many insertions could cluster around a certain node and break the balance of the lists. To which other lists should we also add x?
A: Flip a (fair) coin; on HEADS, promote x to the next level up and flip again.
On average:
- 1/2 of the elements are promoted 0 levels,
- 1/4 of the elements are promoted 1 level,
- 1/8 of the elements are promoted 2 levels,
- ···
Try building a skip list from scratch by repeated insertion using a real coin.
A small change to the previous structure: add −∞ to every list, so that search can be implemented with the same algorithm.
5. To summarize, a skip list is the result of insertions (and deletions) from an initially empty structure (containing only −∞). Skip lists are (almost always) good in terms of speed and balance.
Theorem 11. With high probability, every search in an n-element skip list costs O(log n).
Definition (with high probability, w.h.p.): event E occurs w.h.p. if, for any α ≥ 1, there is a suitable choice of constants for which E occurs with probability at least 1 − O(1/n^α). (This is an informal definition.)
Boole's inequality / union bound: for any random events E_1, E_2, ..., E_k,
Pr(E_1 ∪ E_2 ∪ ··· ∪ E_k) ≤ Pr(E_1) + Pr(E_2) + ··· + Pr(E_k).
If k = n^{O(1)} and each E_i occurs w.h.p., then so does E_1 ∩ ··· ∩ E_k.
Proof. Warm up:
Lemma: with high probability, an n-element skip list has O(log n) levels.
Proof: for any constant c,
Pr(more than c log n levels)
≤ n · Pr(element x is promoted at least c log n times)   (Boole's inequality)
= n · (1/2)^{c log n}
= n · (1/n^c)
= 1/n^{c−1}.
This failure probability is polynomially small, and c can be chosen arbitrarily large.
COOL IDEA: analyze the search backward (from leaf to root).
- The search starts (ends) at a node in the bottom list.
- At each node visited: if the node was not promoted higher, we go left; if the node was promoted higher, we go up.
- The search stops (starts) at the root (or −∞).
Proof of the theorem:
- The number of "up" moves < the number of levels ≤ c log n (w.h.p., by the lemma).
- So w.h.p., the total number of moves ≤ the number of coin flips needed to get c log n HEADS (up moves).
- The number of coin flips needed to get c log n HEADS is Θ(log n) w.h.p. (see the proof in the textbook).
13 Amortized Algorithms, Table Doubling, Potential Method
Motivating question: how large should a hash table be?
A: As large as possible so that searching is cheap, but as small as possible so that space is not wasted. A happy medium is to make it Θ(n), where n is the number of items stored.
Q: What if we don't know the proper size in advance?
A: Dynamic tables.
IDEA: whenever the table overflows (gets too full), "grow" it:
- allocate a new, larger table (malloc in C or new in Java);
- move all items from the old table to the new one;
- free the storage of the old table.
Ex: Dynamic table example in Figure 28. Whenever there is overflow, create
a table of twice the size.
Analysis of a sequence of n INSERT operations:
- Worst-case cost of 1 insert = Θ(n).
- Worst-case cost of n inserts = n · Θ(n) = Θ(n^2)? This analysis is wrong: n inserts in fact take Θ(n) time in total.
Correct analysis:
Let c_i be the cost of the i-th insertion:
c_i = i if i − 1 is an exact power of 2, and 1 otherwise.
See Figure 29 for an example of the c_i values.
Figure 28: Dynamic table example.
Figure 29: Amortized costs.
The cost of n insertions is
Σ_{i=1}^{n} c_i ≤ n + Σ_{j=0}^{⌊log(n−1)⌋} 2^j ≤ 3n = Θ(n)
(the second term is a geometric series).
Thus, the average cost of each dynamic-table operation is Θ(n)/n = Θ(1): this is amortized analysis.
Amortized analysis: analyze a sequence of operations to show that the average cost per operation is small, even though a single operation in the sequence may be expensive. Probability is NOT involved, even though we take an average.
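A tiny Python experiment (my own illustration) that mimics table doubling and records the per-insert costs c_i, so the Θ(n) total and Θ(1) average can be checked numerically:

def doubling_costs(n):
    """Costs c_i for n inserts into a doubling table: 1 per insert, plus one
    unit per item moved when the table overflows and is doubled."""
    costs, size, used = [], 1, 0
    for _ in range(n):
        if used == size:                  # overflow: allocate a table twice as large
            cost = used + 1               # move 'used' old items, then do the insert
            size *= 2
        else:
            cost = 1
        used += 1
        costs.append(cost)
    return costs

c = doubling_costs(1000)
print(sum(c), sum(c) / len(c))            # total is Theta(n); the average is below 3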
Types of amortized analysis:
1. Aggregate method (just seen; less precise).
2. Accounting method (more precise, because it assigns a specific amortized cost to each operation).
- Charge the i-th operation a fictitious amortized cost ĉ_i ($1 pays for 1 unit of work).
- The fee is consumed to perform the operation.
- Any unused amount is stored in a "bank" for use by later operations.
- The bank balance must NOT go negative. That is,
Σ_{i=1}^{n} c_i ≤ Σ_{i=1}^{n} ĉ_i   (equivalently, Σ_{i=1}^{n} ĉ_i − Σ_{i=1}^{n} c_i ≥ 0)
for all n.
Back to the dynamic-table example shown in Figure 29:
- Charge ĉ_i = $3 for the i-th insert: $1 pays for the immediate insert and $2 is stored for the next table doubling.
- When the table doubles, $1 pays to move a recent item (inserted since the last doubling) and $1 pays to move an old item.
3. Potential method (also more precise, again assigning a specific amortized cost to each operation).
The "bank account" is viewed as the potential energy of the dynamic set.
Framework:
- Start with an initial data structure D_0.
- Operation i transforms D_{i−1} into D_i.
- The cost of operation i is c_i.
- Define a potential function Φ : {D_i} → R such that Φ(D_0) = 0 and Φ(D_i) ≥ 0 for all i.
Analysis:
Define the amortized cost ĉ_i with respect to Φ as
ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}),
where ΔΦ_i = Φ(D_i) − Φ(D_{i−1}) is the potential difference.
- If ΔΦ_i > 0, then ĉ_i > c_i: operation i stores work in the data structure for later use.
- If ΔΦ_i < 0, then ĉ_i < c_i: the data structure delivers up stored work to help pay for operation i.
The total amortized cost of n operations is
Σ_{i=1}^{n} ĉ_i = Σ_{i=1}^{n} [ c_i + Φ(D_i) − Φ(D_{i−1}) ]
             = ( Σ_{i=1}^{n} c_i ) + Φ(D_n) − Φ(D_0)
             ≥ Σ_{i=1}^{n} c_i   (by the definition of the potential function).
So the total amortized cost is an upper bound on the total true cost.
Table doubling potential analysis.
Define the potential of the table after the i
th
insertion (assume
2
dlog 0e
= 0)
Φ(D
i
) = 2i 2
dlog ie
,
which satisfies the definition of potential function (non-negative).
Amortized csot of i
th
insert.
ˆc
i
= c
i
+ Φ(D
i
) Φ(D
i1
)
=
n
i if i 1 is an exact power of 2
1 o.w.
o
+
2i 2
dlog ie
2(i 1) 2
dlog(i1)e
=
n
i if i 1 is an exact power of 2
1 o.w.
o
+ 2 2
dlog ie
+ 2
dlog(i1)e
Case 1: i 1 is an exact power of 2.
ˆc
i
= i + 2 2
dlog ie
+ 2
dlog(i1)e
= i + 2 2(i 1) + (i 1) (2
dlog ie
= 2
[log(i1)]+1
)
= 3.
Case 2: i 1 is NOT power of 2.
ˆc
i
= 1 + 2 2
dlog ie
+ 2
dlog(i1)e
(2
dlog ie
= 2
log(i1)
)
= 3.
Thus n insertions cost Θ(n) in the worst case. (Minor caveat: the amortized
cost of the first insertion works out to 2 rather than 3, which does not affect
the bound. A quick numerical check follows.)
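Below is a small Python check (mine, not part of the notes) that evaluates the actual cost c_i and the amortized cost c_i + Φ(D_i) − Φ(D_{i−1}) with Φ(D_i) = 2i − 2^{⌈log i⌉} for the first few insertions; the helper names are made up.

    from math import ceil, log2

    def phi(i):
        # Potential Φ(D_i) = 2i − 2^⌈log i⌉, with Φ(D_0) = 0.
        return 0 if i == 0 else 2 * i - 2 ** ceil(log2(i))

    def actual_cost(i):
        # c_i = i when i − 1 is an exact power of 2 (a doubling happens), else 1.
        return i if i > 1 and (i - 1) & (i - 2) == 0 else 1

    for i in range(1, 17):
        amortized = actual_cost(i) + phi(i) - phi(i - 1)
        print(i, actual_cost(i), amortized)   # amortized is 3 for every i >= 2, and 2 for i = 1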
Summary:
Amortized costs provide a clean abstraction of data-structure performance.
Any of the analysis methods can be used when an amortized analysis is called
for, but each method has situations where it is arguably the simplest or most
precise.
Different potential functions (in the potential method) or accounting costs
(in the accounting method) may yield different bounds.
Figure 30: Self-organizing lists example.
14 Competitive Analysis: Self-organizing Lists
1. Self-organizing lists: a list L of n elements.
Operation Access(x) (access x in the list, by searching or however we
like) costs rank_L(x) = the distance of x from the head of L. (This is
what users can do.)
L can be reordered by transposing adjacent elements, at a cost of 1 per
transpose.
Ex: See Figure 30 for an example list.
(1). Access element 14 costs 4.
(2). Transpose 3 and 50 costs 1.
2. On-line algorithm
Definition:
A sequence S of operations is provided one at a time. For each opera-
tion, an on-line algorithm A must execute the operation immediately
without knowledge of future operations.
Off-line algorithm definition:
May see all of S in advance.
GOAL: Minimize the total cost of the on-line algorithm, C_A(S).
Analysis:
Worst-case analysis.
An adversary always accesses the tail element of L. Then the cost for
an on-line algorithm A is
    C_A(S) = Ω(|S| · n),
where |S| is the number of operations in S and n is the number of elements
in the list.
Average-case analysis.
Suppose element x is accessed with probability p(x) (an a priori
distribution over the elements). Then
    E[C_A(S)] = Σ_{x∈L} p(x) · rank_L(x),
which is minimized when L is sorted in decreasing order with respect
to p.
If the distribution of x is unknown, a heuristic way is: keep a count
of the number of times each element is accessed, and maintain L
in order of decreasing count.
3. In practice, implementers found that the move-to-front (MTF) heuristic
empirically yields good results: after accessing x, move x to the head
of L using transposes, with cost = 2 · rank_L(x). (It works well because when an
element is accessed, it is likely to be accessed again soon; MTF responds
well to locality. A Python sketch of MTF appears at the end of this section.)
4. Competitive analysis.
Definition: An on-line algorithm A is α-competitive if there exists a
constant k such that for any sequence S of operations,
    C_A(S) ≤ α · C_OPT(S) + k,
where C_OPT(S) is the cost of the optimal off-line algorithm (“God’s
algorithm”).
Theorem 12. MTF is 4-competitive for self-organizing lists.
The theorem is the basis of many types of analysis of on-line algorithms (no
need to make statistical assumptions).
Proof. Steps:
Figure 31: Potential function computation example.
Notation
Potential function
What happens on an access?
Compute the costs
1. Notation:
Let L_i be MTF’s list after the i-th access.
Let L*_i be OPT’s list after the i-th access.
c_i = MTF’s cost for the i-th operation = 2 · rank_{L_{i−1}}(x) if it accesses x.
c*_i = OPT’s cost for the i-th operation = rank_{L*_{i−1}}(x) + t_i if OPT
performs t_i transposes.
We compare c_i with c*_i at every point in time.
2. Potential function: count how different the two lists are; the more inversions
there are, the more stored work is available to pay for later operations.
Define Φ : {L_i} → R by
    Φ(L_i) = 2 · |{(x, y) : x ≺_{L_i} y and y ≺_{L*_i} x}| = 2 · (# of inversions),
where ≺_{L_i} is the “precedes” relation in list L_i: if x ≺_{L_i} y, then x is hit
first when accessing list L_i.
Ex: See Figure 31 for two lists; compute the potential of L_i.
Figure 32: Relationship of A, B, C, D.
Φ(L_i) = 2 · |{(E, C), (E, A), (E, D), (E, B), (D, B)}| = 10.
Check the properties of the potential function:
Φ(L_i) ≥ 0 for any i.
Φ(L_0) = 0 if MTF and OPT start with the same list.
Q: How much does Φ change from 1 transpose?
A: A transpose creates or destroys exactly 1 inversion, so ∆Φ = ±2.
3. What happens when operation i accesses x?
Define
    A = {y ∈ L_{i−1} : y ≺_{L_{i−1}} x and y ≺_{L*_{i−1}} x}
    B = {y ∈ L_{i−1} : y ≺_{L_{i−1}} x and y ≻_{L*_{i−1}} x}
    C = {y ∈ L_{i−1} : y ≻_{L_{i−1}} x and y ≺_{L*_{i−1}} x}
    D = {y ∈ L_{i−1} : y ≻_{L_{i−1}} x and y ≻_{L*_{i−1}} x},
see Figure 32 for a picture of the relationship of the sets.
Denote r = rank_{L_{i−1}}(x) and r* = rank_{L*_{i−1}}(x), so
    r = |A| + |B| + 1
    r* = |A| + |C| + 1.
Q: How many inversions are created/destroyed?
A: When MTF moves x to the front, it creates |A| inversions and destroys
|B| inversions.
Each transpose by OPT creates at most 1 inversion. So
    Φ(L_i) − Φ(L_{i−1}) ≤ 2(|A| − |B| + t_i).
4. Compute the costs.
The amortized cost of the i-th operation of MTF with respect to Φ is
    ĉ_i = c_i + Φ(L_i) − Φ(L_{i−1})
        ≤ 2r + 2(|A| − |B| + t_i)          (by the cost definition and the bound above)
        = 2(|A| + |B| + 1) + 2(|A| − |B| + t_i)
        = 4|A| + 2 + 2t_i
        ≤ 4(r* + t_i)                      (since r* = |A| + |C| + 1 ≥ |A| + 1)
        = 4c*_i                            (by the definition of c*_i).
Thus
    C_MTF(S) = Σ_{i=1}^{|S|} c_i
             = Σ_{i=1}^{|S|} [ĉ_i + Φ(L_{i−1}) − Φ(L_i)]
             = [Σ_{i=1}^{|S|} ĉ_i] + Φ(L_0) − Φ(L_{|S|})
             ≤ 4 C_OPT(S)                  (since Φ(L_0) = 0 and Φ(L_{|S|}) ≥ 0).
Addendum:
1. If we count transpositions that move x toward the front as “free” (constant
time), then MTF is 2-competitive.
2. If L_0 ≠ L*_0, then Φ(L_0) might be Θ(n^2) in the worst case (e.g., one list is
the reverse of the other, which gives (n choose 2) inversions). Then
C_MTF(S) ≤ 4 C_OPT(S) + Θ(n^2), which is still 4-competitive, since n^2 is a
constant as |S| → ∞.
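The sketch referenced in item 3 above: a minimal self-organizing list with the move-to-front heuristic (my illustration in Python, not the lecture's code). It charges 1 per position scanned plus 1 per adjacent transpose, so accessing an element of rank r costs 2r − 1 ≤ 2 · rank_L(x), matching the cost model used in the analysis. The class name is made up.

    class MTFList:
        def __init__(self, items):
            self.items = list(items)

        def access(self, x):
            """Access x, move it to the front by adjacent transposes, return the cost."""
            r = self.items.index(x) + 1        # rank of x (1-based): cost r to reach it
            cost = r
            for i in range(r - 1, 0, -1):      # r − 1 adjacent transposes, cost 1 each
                self.items[i], self.items[i - 1] = self.items[i - 1], self.items[i]
                cost += 1
            return cost

    if __name__ == "__main__":
        L = MTFList([12, 3, 50, 14, 17, 4])
        print(L.access(14), L.items)           # 7, [14, 12, 3, 50, 17, 4]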
Figure 33: Longest common subsequence example.
15 Dynamic Programming, Longest Common
Subsequence
Programming here is an old word that means any tabular method for accom-
plishing something.
Dynamic programming is a design technique like other design techniques such
as divide and conquer.
Ex: Longest common subsequence (LCS). See Figure 33.
Given two sequences x[1, ··· , m] and y[1, ··· , n], find a (not “the”) longest
subsequence common to them both.
1. Brute-force LCS algorithm:
Check every subsequence of x[1, ··· , m] to see if it is also a subsequence
of y[1, ··· , n].
Analysis:
Each subsequence of x takes O(n) time to check.
There are 2^m subsequences of x.
So the worst-case running time is O(n · 2^m) (exponential time, slow!).
2. Towards a better algorithm
Simplification:
Look at length of LCS(x, y).
Extend the algorithm to find LCS itself.
Figure 34: When x[i] = y[j] in proof of Theorem 13.
Strategy:
Consider prefixes of x and y.
Define
    c[i, j] = |LCS(x[1, ··· , i], y[1, ··· , j])|,
and then
    c[m, n] = |LCS(x, y)|.
Theorem 13.
    c[i, j] = c[i − 1, j − 1] + 1               if x[i] = y[j],
    c[i, j] = max{c[i − 1, j], c[i, j − 1]}     otherwise.
Proof. Consider the two cases separately.
Case 1: x[i] = y[j]. See Figure 34.
Let z[1, ··· , k] = LCS(x[1, ··· , i], y[1, ··· , j]) with c[i, j] = k. Then
z[k] = x[i] = y[j] (given x[i] = y[j]), and
z[1, ··· , k − 1] is a common subsequence of x[1, ··· , i − 1] and y[1, ··· , j − 1].
Claim: z[1, ··· , k − 1] = LCS(x[1, ··· , i − 1], y[1, ··· , j − 1]).
Proof of claim: Suppose w is a longer common subsequence of x[1, ··· , i − 1]
and y[1, ··· , j − 1], that is, |w| > k − 1. Then appending z[k] to w gives a
common subsequence of x[1, ··· , i] and y[1, ··· , j] of length greater than
(k − 1) + 1 = k. Contradiction.
Hence c[i, j] = c[i − 1, j − 1] + 1.
Case 2 (x[i] ≠ y[j]): similar.
3. Hallmark (characteristic) #1 of dynamic programming: optimal substructure.
Figure 35: Recursion tree with m = 3, n = 4.
An optimal solution to a problem (instance) contains optimal solutions
to subproblems.
Ex: If z = LCS(x, y), then any prefix of z is an LCS of a prefix of x
and a prefix of y.
4. A recursive algorithm for LCS is given in Algo. 13. The worst case is
x[i] ≠ y[j], in which case the algorithm must evaluate two subproblems.
Ex: A recursion tree of LCS with m = 3, n = 4 is shown in Figure 35.
Algorithm 13 LCS(x, y, i, j)
1: if x[i] = y[j] then
2:     c[i, j] ← LCS(x, y, i − 1, j − 1) + 1
3: else c[i, j] ← max{LCS(x, y, i − 1, j), LCS(x, y, i, j − 1)}
4: end if
The height of the recursion tree is m + n, so the running time is potentially exponential.
5. Hallmark #2 of dynamic programming: overlapping subproblems.
A recursive solution contains a “small” number of distinct subproblems
repeated many times. The subproblem space contains m · n distinct LCS
subproblems for two strings of lengths m and n.
Figure 36: Dynamic programming table for the memoized algorithm.
6. The memoization (memo) algorithm in Algo. 14 is the same as Algo. 13 except
for a first step that checks whether c[i, j] has already been computed:
After computing a solution to a subproblem, store it in a table.
Subsequent calls check the table to avoid redoing work.
So the time = Θ(mn) (constant work per table entry). Space is
Θ(mn).
The dynamic programming algorithm computes the table directly (see
Figure 36, and the Python sketch after Algo. 14):
Compute the table bottom-up.
Reconstruct the LCS by tracing backwards through the table.
Algorithm 14 LCS(x, y, i, j)
1: if c[i, j] = NIL then
2:     if x[i] = y[j] then
3:         c[i, j] ← LCS(x, y, i − 1, j − 1) + 1
4:     else c[i, j] ← max{LCS(x, y, i − 1, j), LCS(x, y, i, j − 1)}
5:     end if
6: end if
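The sketch referenced above: a bottom-up Python version (mine, not the course code) of the Theorem 13 recurrence. It fills the (m + 1) × (n + 1) table c and then traces backwards to reconstruct one LCS, using Θ(mn) time and space.

    def lcs(x, y):
        m, n = len(x), len(y)
        c = [[0] * (n + 1) for _ in range(m + 1)]      # c[i][j] = |LCS(x[:i], y[:j])|
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:
                    c[i][j] = c[i - 1][j - 1] + 1
                else:
                    c[i][j] = max(c[i - 1][j], c[i][j - 1])
        # Reconstruct one LCS by walking backwards from c[m][n].
        out, i, j = [], m, n
        while i > 0 and j > 0:
            if x[i - 1] == y[j - 1]:
                out.append(x[i - 1]); i -= 1; j -= 1
            elif c[i - 1][j] >= c[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return "".join(reversed(out))

    if __name__ == "__main__":
        print(lcs("ABCBDAB", "BDCABA"))    # prints a longest common subsequence of length 4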
16 Greedy Algorithms, Minimum Spanning
Trees
Read textbook Appendix B for background information on graphs.
Graphs (review):
1. Definition: directed graph (digraph) G = (V, E) is an ordered pair
consisting of
a set V of vertices
a set E ⊆ V × V of edges.
2. Definition: undirected graph G = (V, E), the edge set E consists of
unordered pairs of vertices.
3. The number of edges (directed or undirected) is |E| = O(V^2) (order of |V|^2;
the vertical bars around V are dropped inside asymptotic notation for simplicity).
4. If G is connected (there is a path from any vertex to any other vertex
in the graph), then |E| ≥ |V| − 1, which together with the upper bound
|E| = O(V^2) gives log |E| = Θ(log V).
Graph representations:
1. The adjacency matrix of G = (V, E), where V = {1, 2, ··· , n}, is the n × n
matrix A given by
    A[i, j] = 1 if (i, j) ∈ E,
    A[i, j] = 0 if (i, j) ∉ E.
Ex: Adjacency matrix example in Figure 37.
Figure 37: Adjacency matrix.
The storage is Θ(n^2), so this is a dense representation. (It is a good representation
when the number of edges is close to the maximum possible number of edges.)
A tree is an example of a connected graph that has no cycles (a free tree).
A complete graph is an example of a dense graph (all 1’s in matrix A).
2. The adjacency list of a vertex v ∈ V is the list Adj[v] of vertices adjacent to v.
Ex: The adjacency lists of the graph in Figure 37 are
    Adj[1] = {2, 3}
    Adj[2] = {3}
    Adj[3] = {}
    Adj[4] = {3}
In undirected graphs, degree(v) = |Adj[v]|.
In directed graphs, out-degree(v) = |Adj[v]|.
Handshaking lemma: for an undirected graph,
    Σ_{v∈V} degree(v) = 2|E|.
Figure 38: MST example.
It follows that adjacency lists use Θ(V + E) storage, a sparse representation
that is often better than the adjacency matrix.
Minimum spanning trees (MST):
One of the world’s most important algorithms, important in distributed sys-
tems.
1. Input: a connected, undirected graph G = (V, E) with weight function
w : E → R.
Assume all edge weights are distinct for simplicity (see the textbook
for the general case).
2. Output: a spanning tree T (a tree that connects all vertices) of
minimum weight:
    w(T) = Σ_{(u,v)∈T} w(u, v),   summing over the edges (u, v) of T.
Ex: MST example in Figure 38.
Figure 39: Optimal substructure example.
Optimal substructure:
MST has a great optimal substructure property.
1. Set-up: consider some MST T of G, as in Figure 39. Removing any edge
(u, v) ∈ T partitions T into two subtrees T_1 and T_2.
2.
Theorem 14. The subtree T_1 is an MST of G_1 = (V_1, E_1), the subgraph
of G induced by the vertices of T_1, that is,
    V_1 = the vertices of T_1,
    E_1 = {(x, y) ∈ E : x, y ∈ V_1}   (the edges of G with both endpoints in V_1).
Similarly for T_2.
Proof. Cut and paste:
    w(T) = w(u, v) + w(T_1) + w(T_2).
Figure 40: “Greedy” algorithm hallmark proof.
Suppose T′_1 were better than T_1 for G_1 (i.e., had lower weight). Then
    T′ = {(u, v)} ∪ T′_1 ∪ T_2
would be a spanning tree of G with lower weight than T, violating the assumption
that T is an MST.
Q: Do we have overlapping subproblems?
A: Yes, dynamic programming may work. BUT, MST exhibits another pow-
erful property which leads to an even more EFFICIENT algorithm.
Greedy algorithm.
Hallmark for “greedy” algorithms: a locally optimal choice is globally
optimal.
Theorem 15. Let T be an MST of G = (V, E), and let A ⊆ V. Suppose
(u, v) ∈ E is the least-weight edge connecting A to V − A. Then (u, v) ∈ T
(it belongs to the MST).
By applying this theorem, one only needs to check the least-weight edge crossing
the cut locally, instead of examining the whole tree.
Proof. Suppose (u, v) ∉ T. Cut and paste.
See Figure 40: (u, v) is the least-weight edge connecting A to V − A.
Consider the unique simple path from u to v in T (textbook Appendix
B.5.1).
Figure 41: Prim’s algorithm.
Swap (u, v) with the first edge on this path (in general there may be several
candidates; just pick the first one) that connects a vertex in A to a vertex in
V − A.
A spanning tree with lower weight than T results, contradicting the
minimality of T.
Prim’s algorithm:
IDEA: Maintain V − A as a priority queue Q. Key each vertex in Q with the
weight of the least-weight edge connecting it to a vertex in A.
The algorithm is in Figure 41. At the end, the edges {(v, π[v])} form the MST.
The Handshaking Lemma implies that there are Θ(E) implicit DECREASE-
KEY operations (the updates key[v] ← w(u, v)).
So the total time is Θ(V) · T_EXTRACT-MIN + Θ(E) · T_DECREASE-KEY.
    Q                T_EXTRACT-MIN    T_DECREASE-KEY    Total
    array            O(V)             O(1)              O(V^2)
    binary heap      O(log V)         O(log V)          O(E log V)
    Fibonacci heap   O(log V)         O(1)              O(E + V log V)
Table 4: Analysis of Prim’s algorithm.
Figure 42: Prim’s algorithm example.
If the graph is dense (E is close to V^2), the array is better; if it is sparse (E is
much smaller than V^2), the binary heap is better. See Table 4.
Ex: Prim’s algorithm example in Figure 42.
Kruskal’s algorithm uses a disjoint-set data structure and runs in O(E log V);
see the textbook. A Python sketch of Prim’s algorithm with a binary heap follows.
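The sketch referenced above (mine, not the course code): Prim's algorithm with Python's heapq. Since heapq has no DECREASE-KEY, stale heap entries are simply skipped ("lazy deletion"), which still gives the O(E log V) binary-heap bound from Table 4. The graph format and names are made up for illustration.

    import heapq

    def prim(adj, s=0):
        """adj: dict u -> list of (v, weight) for an undirected graph; returns MST edges."""
        in_tree = {s}                                 # the set A of the analysis
        mst = []
        heap = [(w, s, v) for v, w in adj[s]]         # candidate edges leaving the tree
        heapq.heapify(heap)
        while heap and len(in_tree) < len(adj):
            w, u, v = heapq.heappop(heap)             # least-weight edge crossing the cut
            if v in in_tree:
                continue                              # stale entry: v already added
            in_tree.add(v)
            mst.append((u, v, w))                     # greedy choice, safe by Theorem 15
            for x, wx in adj[v]:
                if x not in in_tree:
                    heapq.heappush(heap, (wx, v, x))
        return mst

    if __name__ == "__main__":
        graph = {0: [(1, 4), (2, 1)], 1: [(0, 4), (2, 2)], 2: [(0, 1), (1, 2)]}
        print(prim(graph))                            # [(0, 2, 1), (2, 1, 2)]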
17 Shortest Paths I : Properties, Dijkstra’s
Algorithm, Breadth-first Search
Paths review:
1. Paths in graphs:
Consider a digraph G = (V, E) with edge-weight function w : E → R.
The weight of a path p = v_1 → v_2 → ··· → v_k is defined as
    w(p) = Σ_{i=1}^{k−1} w(v_i, v_{i+1}),
the sum of the weights of the edges along the path.
2. A shortest path from u to v is a path of minimum weight from u to v.
The shortest-path weight is
    δ(u, v) = min{w(p) : p is a path from u to v}.
Note:
δ(u, v) = ∞ if no path from u to v exists.
Negative edge weights ⟹ some shortest paths may not exist (for
example, when the total weight of a cycle is negative, one can go
around the cycle arbitrarily many times to obtain ever smaller
weights; in that case δ(u, v) = −∞).
Theorem 16. A subpath of a shortest path is a shortest path.
Proof. Cut and paste: if a subpath were not a shortest path, replacing it with a
shorter one would yield a path shorter than the original shortest path, a contradiction.
Triangle inequality:
Theorem 17. For all vertices u, v, x ∈ V, we have δ(u, v) ≤ δ(u, x) + δ(x, v).
Proof. By picture: a shortest path from u to v is no longer than the particular
path that goes from u to x and then from x to v.
Single-source shortest paths:
A particular version of the shortest-path problem.
1. Problem: given a source vertex s ∈ V, find the shortest-path weights
δ(s, v) for all v ∈ V.
2. Further restriction: assume all edge weights w(u, v) are nonnegative;
then all shortest-path weights exist, with δ(u, v) > −∞.
3. IDEA: greedy.
Maintain a set S of vertices whose shortest-path distances from
s are known (initially s ∈ S).
At each step, add to S the vertex v ∈ V − S whose distance
estimate from s is minimal.
Update the distance estimates of the vertices adjacent to v.
4. Algorithm: Dijkstra’s algorithm (Algorithm 15).
Ex: See Figure 43.
Shortest-path tree: the union of the shortest paths from s. For each vertex,
take the last edge into that vertex whose relaxation set its d value; these
edges form the tree.
Correctness (sketch):
1. Invariant: d[v] ≥ δ(s, v) for all v ∈ V at all times.
2. When a vertex u is added to S, d[u] = δ(s, u) (the greedy choice is safe
because all edge weights are nonnegative).
3. Hence, on termination, d[v] = δ(s, v) for all v ∈ V.
Breadth-first search (BFS) (for unweighted graphs, i.e., w(u, v) = 1 for all
(u, v) ∈ E); see Algorithm 16.
BFS is essentially Dijkstra’s algorithm with two changes:
BFS does not use a priority queue; it uses a FIFO queue instead.
Algorithm 15 Dijkstra’s algorithm
1: d[s] ← 0                        (d is an array indexed by vertices;
2:                                  d[x] is the distance estimate from s to x;
3:                                  d[x] = δ(s, x) once x ∈ S, and this is the output)
4: for each v ∈ V − {s} do
5:     d[v] ← ∞
6: end for
7: S ← ∅
8: Q ← V                           (priority queue keyed on d)
9:                                 (the lines above are initialization)
10: while Q ≠ ∅ do
11:     u ← EXTRACT-MIN(Q)
12:     S ← S ∪ {u}
13:     for each v ∈ Adj[u] do
14:         if d[v] > d[u] + w(u, v) then
15:             (relaxation step, via the triangle inequality; implicit DECREASE-KEY)
16:             d[v] ← d[u] + w(u, v)
17:         end if
18:     end for
19: end while
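A minimal Python sketch of Dijkstra's algorithm (mine, not the course code). Python's heapq has no DECREASE-KEY, so an updated vertex is simply reinserted and stale heap entries are skipped when popped; the graph format and names are made up.

    import heapq

    def dijkstra(adj, s):
        """adj: dict u -> list of (v, w) with w >= 0; returns d[v] for all v."""
        d = {v: float("inf") for v in adj}
        d[s] = 0
        heap = [(0, s)]
        done = set()                                  # the set S of finished vertices
        while heap:
            du, u = heapq.heappop(heap)               # EXTRACT-MIN
            if u in done:
                continue                              # stale entry
            done.add(u)
            for v, w in adj[u]:
                if d[v] > du + w:                     # relaxation step
                    d[v] = du + w
                    heapq.heappush(heap, (d[v], v))   # "decrease-key" by reinserting
        return d

    if __name__ == "__main__":
        g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
        print(dijkstra(g, "s"))                       # {'s': 0, 'a': 1, 'b': 3}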
Algorithm 16 BFS (initialization as described below: d[s] ← 0, d[v] ← ∞ for
v ≠ s, and Q initially contains only s)
1: while Q ≠ ∅ do
2:     u ← DEQUEUE(Q)
3:     for each v ∈ Adj[u] do
4:         if d[v] = ∞ then
5:             d[v] ← d[u] + 1
6:             ENQUEUE(Q, v)
7:         end if
8:     end for
9: end while
Figure 43: Dijkstra’s algorithm example.
To relax an edge (u, v): if d[v] = ∞ (we haven’t visited v yet), we set
d[v] = d[u] + 1, which is v’s shortest-path weight, and add v to the end of the
queue.
We start with the queue containing just the vertex s (the only vertex whose
shortest path we know); the queue holds vertices whose shortest-path weight is
known but whose outgoing edges have not yet been processed. Initially s has
weight zero; all vertices reached by one outgoing edge of s get weight 1 (their
shortest-path weight from the source) and are appended to the end of the queue;
and processing vertices in this order keeps incrementing the d[·] values level by level.
The time is O(V + E) because each edge is examined only once: as soon as
d[v] is set, it never changes again (v is effectively added to S and processed
only once).
By showing that the FIFO queue processes vertices in the same order as the
priority queue would, the correctness argument for Dijkstra’s algorithm carries
over to BFS.
BFS can be used to
find all the vertices reachable from s, and
find the shortest-path weights from s to every other vertex when all edge
weights equal one. A Python sketch of BFS follows.
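The sketch referenced above (mine, not the course code): BFS computing unweighted shortest-path distances from s with a FIFO queue in place of Dijkstra's priority queue, in O(V + E) time. The graph format is made up for illustration.

    from collections import deque

    def bfs(adj, s):
        d = {s: 0}                        # d[v] = shortest-path weight (edge count) from s
        q = deque([s])                    # FIFO queue of discovered but unprocessed vertices
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:            # d[v] = infinity, i.e. not visited yet
                    d[v] = d[u] + 1
                    q.append(v)
        return d

    if __name__ == "__main__":
        g = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
        print(bfs(g, "s"))                # {'s': 0, 'a': 1, 'b': 1, 'c': 2}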
18 Shortest Paths II : Bellman-Ford, Linear
Programming, Difference Constraints
This lecture deals with negative edge weights.
Bellman-Ford algorithm: simpler than Dijkstra’s algorithm but not as fast. It
computes the shortest-path weights δ(s, v) from a source vertex s ∈ V to all
vertices v ∈ V, or
reports that a negative-weight cycle exists. See Algorithm 17.
Ex: See Figures 44 and 45.
Correctness of Bellman-Ford Algorithm:
Figure 44: Bellman-Ford algorithm, pass 1.
Algorithm 17 Bellman-Ford algorithm
1: d[s] ← 0
2: for each v ∈ V − {s} do
3:     d[v] ← ∞
4: end for                                     (initialization)
5: for i ← 1 to |V| − 1 do                     (i is just a counter)
6:     for each edge (u, v) ∈ E do
7:         if d[v] > d[u] + w(u, v) then
8:             d[v] ← d[u] + w(u, v)
9:         end if
10:     end for
11: end for                                    (relaxation step)
12: for each edge (u, v) ∈ E do
13:     if d[v] > d[u] + w(u, v) then
14:         report that a negative-weight cycle exists
15:     end if
16: end for
17: At the end, d[v] = δ(s, v) for all v if there are no negative-weight cycles. The
running time is O(V E).
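A minimal Python sketch of Bellman-Ford (mine, not the course code): |V| − 1 passes that relax every edge, followed by one extra pass to detect a reachable negative-weight cycle, in O(V E) time. The edge-list format is made up for illustration.

    def bellman_ford(vertices, edges, s):
        """edges: list of (u, v, w). Returns d, or None if a negative-weight cycle is reachable."""
        d = {v: float("inf") for v in vertices}
        d[s] = 0
        for _ in range(len(vertices) - 1):         # |V| - 1 relaxation passes
            for u, v, w in edges:
                if d[u] + w < d[v]:
                    d[v] = d[u] + w
        for u, v, w in edges:                      # one more pass: any further
            if d[u] + w < d[v]:                    # improvement means a negative cycle
                return None
        return d

    if __name__ == "__main__":
        V = ["s", "a", "b"]
        E = [("s", "a", 4), ("s", "b", 2), ("b", "a", -3)]
        print(bellman_ford(V, E, "s"))             # {'s': 0, 'a': -1, 'b': 2}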
Figure 45: Bellman-Ford algorithm example, continued.
Theorem 18. If G = (V, E) has no negative-weight cycles, then the Bellman-Ford
algorithm terminates with d[v] = δ(s, v) for all v ∈ V.
Proof. By monotonicity of the d values and correctness part I (d[v] ≥ δ(s, v) always),
it suffices to show that d[v] = δ(s, v) at some time.
Consider a shortest path p (in terms of total weight) from s to v with the minimum
number of edges, so p is a simple path (it repeats no vertices):
    p = v_0 → v_1 → v_2 → ··· → v_k,
where s = v_0 and v = v_k. By optimal substructure,
    δ(s, v_i) = δ(s, v_{i−1}) + w(v_{i−1}, v_i).
Then
    d[v_0] = 0 = δ(s, v_0)        (from initialization, and d[v_0] never drops below δ(s, v_0));
    after pass 1,  d[v_1] = δ(s, v_1);
    after pass 2,  d[v_2] = δ(s, v_2);
    ···
    after pass k,  d[v_k] = δ(s, v_k).
Since there are no negative-weight cycles, p is simple, and a simple path has at
most |V| − 1 edges, so k ≤ |V| − 1 passes suffice.
Corollary: If a value d[v] fails to converge after |V| − 1 passes of the Bellman-Ford
algorithm, then there exists a negative-weight cycle in G reachable from s.
Linear Programming (LP): the father and mother of the shortest-path problem.
Let A be an m × n matrix, b an m-vector, and c an n-vector. Find an
n-vector x that maximizes c^T x subject to Ax ≤ b, or determine that no such
solution exists. See Figure 46.
Many efficient algorithms (and code) to solve LPs.
Figure 46: Linear programming.
Simplex: practical, but exponential time in the worst case.
Ellipsoid: the first algorithm to solve LP in polynomial time; impractical.
Interior-point methods: a mixture (polynomial time and practical).
Random sampling: relatively new, completely randomized, simple.
Corollary: The Bellman-Ford algorithm can solve a system of m difference
constraints on n variables in O(mn) time.
19 Shortest Paths III : All-pairs Shortest
Paths, Matrix Multiplication, Floyd-Warshall,
Johnson
The last two lectures covered single-source shortest paths: find the shortest path
from a source vertex to every other vertex.
Unweighted: BFS, O(V + E).
Non-negative edge weights: Dijkstra, O(E + V log V).
General case: Bellman-Ford, O(V E).
Directed acyclic graph (DAG): topologically sort, then relax all edges once
in topological order (one round of Bellman-Ford), a linear-time algorithm.
In this lecture, we discuss all-pairs shortest paths, where we want to know
the shortest path weight between every pair of vertices.
For all-pairs shortest paths:
Unweighted: run BFS |V| times, O(V E).
Non-negative edge weights: run Dijkstra’s algorithm |V| times,
O(V E + V^2 log V).
General case (the focus of today): running Bellman-Ford |V| times costs O(V^2 E).
Three algorithms will be discussed today.
Definition of all-pairs shortest paths:
Input: a digraph G = (V, E), where V = {1, 2, ··· , n}, with edge-weight
function w : E → R.
Output: an n × n matrix of shortest-path weights δ(i, j) for all i, j ∈ V.
If the graph is dense, that is, E = Θ(V^2), then running Bellman-Ford |V| times
costs O(V^2 E) = O(V^4); the first goal is to beat this time complexity.
Dynamic programming:
Let A be the weighted adjacency matrix, A = (a_ij), where a_ij = w(i, j) if (i, j)
is an edge and a_ij = ∞ if there is no edge (with the convention a_ii = 0).
Define the subproblems: let d^(m)_ij be the weight of a shortest path from i to j
that uses at most m edges. By the Bellman-Ford argument, for m ≥ n − 1,
d^(m)_ij = δ(i, j) if there are no negative-weight cycles.
Claim:
    d^(0)_ij = 0 if i = j,   d^(0)_ij = ∞ if i ≠ j,
and for m ≥ 1,
    d^(m)_ij = min_k { d^(m−1)_ik + a_kj }.
A sketch of this recurrence in code follows.
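The sketch referenced above (mine, not the course code): the d^(m) recurrence is a "matrix multiplication" over (min, +). Applying it n − 2 more times to d^(1) = A gives all-pairs shortest paths in O(n^4); repeated squaring would reduce this to O(n^3 log n).

    INF = float("inf")

    def extend(D, A):
        """One application of d_ij <- min_k (d_ik + a_kj)."""
        n = len(A)
        return [[min(D[i][k] + A[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    def all_pairs(A):
        """A: weighted adjacency matrix with A[i][i] = 0 and INF for missing edges."""
        n = len(A)
        D = A                          # D = d^(1)
        for _ in range(n - 2):         # up to d^(n-1)
            D = extend(D, A)
        return D

    if __name__ == "__main__":
        A = [[0, 3, 8],
             [INF, 0, 2],
             [INF, INF, 0]]
        print(all_pairs(A))            # first row becomes [0, 3, 5]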
20 Advanced Topics: Parallel Algorithms I
For serial algorithms, most people use one basic model, sometimes called the
random-access machine (RAM) model, which is what we have been using for
analysis.
For parallel algorithms there are a huge number of models and no general
agreement on which is best (it varies with machines, configurations, and so on).
1. Dynamic multithreading: appropriate for multicore machines and for
shared-memory programming; NOT appropriate for distributed-memory
programs (where processors cannot directly access each other’s memory).
Ex: Fib(n) in Algorithm 18. Keyword explanation:
Algorithm 18 Fib(n)
1: if n < 2 then return n
2: end if
3: x ← SPAWN Fib(n − 1)            (thread A: lines 1–3)
4: y ← SPAWN Fib(n − 2)            (thread B: line 4)
5: SYNC
6: return (x + y)                  (thread C: lines 5–6)
(a) SPAWN: the subroutine can execute at the same time as its parent.
(b) SYNC: wait until all children are done.
The above expresses logical parallelism, not actual parallelism. A scheduler
determines how to map the dynamically unfolding execution onto whatever
processors are available. (A toy Python illustration of SPAWN/SYNC follows.)
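A toy illustration (mine, not Cilk) of the SPAWN/SYNC semantics in Python: each SPAWN starts a child thread and SYNC joins them. This only models the logical parallelism; a real scheduler, such as Cilk's, maps the unfolding computation onto processors far more efficiently.

    import threading

    def fib(n, result, index):
        if n < 2:
            result[index] = n
            return
        res = [0, 0]
        a = threading.Thread(target=fib, args=(n - 1, res, 0))   # SPAWN Fib(n - 1)
        b = threading.Thread(target=fib, args=(n - 2, res, 1))   # SPAWN Fib(n - 2)
        a.start(); b.start()
        a.join(); b.join()                                        # SYNC
        result[index] = res[0] + res[1]                           # thread C

    if __name__ == "__main__":
        out = [0]
        fib(10, out, 0)
        print(out[0])    # 55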
2. Multithreaded computation: the parallel instruction stream forms a
directed acyclic graph (dag). The vertices of the dag are threads, which are
maximal sequences of instructions not containing parallel control (SPAWN,
SYNC, or return from a SPAWNed procedure). See Figure 47 for a
graphic demonstration of the threads. Three types of edges (spawn,
return, and continuation) are used.
Figure 47: Directed acyclic graph for a multithreaded computation.
3. Performance measures. Let
    T_p = the running time on p processors,
    T_1 = work = the serial running time (the running time on one processor),
    T_∞ = critical-path length = the longest path in the dag.
Ex: For unit-time threads, Fib(4) has
    T_1 = 17   (count the number of small circles in Figure 47, i.e., how many
               instructions there are),
    T_∞ = 9    (the number of threads on a longest path through the dag).
Lower bounds:
    T_p ≥ T_1 / p,   since p processors can do at most p units of work in one step;
    T_p ≥ T_∞,       since adding processors cannot shorten the critical path.
4. Speedup on p processors: T_1 / T_p.
(a) If the speedup is Θ(p), it is linear speedup.
(b) If the speedup is greater than p, it is superlinear speedup (this never
happens in this model, by the lower bound T_p ≥ T_1/p).
(c) The maximum possible speedup, given T_1 and T_∞, is T_1 / T_∞ = the
parallelism = the average amount of work that can be done in parallel
along each step of the critical path, denoted P̄.
5. Scheduling: map the computation onto p processors. This is typically done
by a runtime system: an algorithm that maps the executing program (spawns,
returns, ...) onto the processors of the machine as it executes.
On-line schedulers are complex; here we illustrate the ideas using off-line
scheduling.
Greedy scheduler (p processors): do as much as possible in every step.
There are two types of step:
(a) Complete step: at least p threads are ready to run. Strategy: execute
any p of them.
(b) Incomplete step: fewer than p threads are ready to run. Strategy:
execute all of them.
Theorem 19 (Graham and Brent). A greedy scheduler executes any
computation G with work T_1 and critical-path length T_∞ in time
    T_p ≤ T_1/p + T_∞.
Note: this T_p is within a factor of two of the optimal time, since both T_1/p
and T_∞ are lower bounds on T_p.
Proof. Count the complete steps and the incomplete steps separately.
Complete steps: there are at most ⌊T_1/p⌋ complete steps, since otherwise
more than T_1 work would be done.
Incomplete steps: let G′ be the subgraph of G that remains to be
executed; see Figure 48.
(a) Threads with in-degree zero (no unexecuted predecessor) in G′ are ready
to be executed.
(b) An incomplete step executes all of the in-degree-zero threads, so the
critical-path length of the remaining subgraph decreases by one. (A
complete step need not decrease the critical-path length.)
(c) Hence the number of incomplete steps is at most T_∞ (an amortized
argument).
Adding the two contributions gives T_p ≤ T_1/p + T_∞. This is the fundamental
theorem of scheduling.
Corollary: Linear speedup is achieved when p = O(P̄) (the number of processors
is at most the order of the parallelism).
Proof. P̄ = T_1/T_∞, so p = O(P̄) = O(T_1/T_∞) means T_∞ = O(T_1/p). Then
    T_p ≤ T_1/p + T_∞ = O(T_1/p),
so the speedup is T_1/T_p = Ω(p); combined with T_1/T_p ≤ p, this is linear speedup.
6. Cilk: a dynamic multithreading language based on C.
21 Advanced Topics (cont.)
Figure 48: Subgraph G′ and graph G.