Basic idea: Maintain a set which contains all words seen so far. For each word, check if its mirror image is a member of the set.
allocate a set, implemented as a hash table with separate chaining,
    where the chaining is implemented using linked lists
do the following for each word w in the file:
    if reverse(w) exists in the set
        print "yes"
        exit the program
    else
        insert w into the set
print "no"
The analysis below assumes a uniform cost model.
Let us assume that the hash function does constant work per character, i.e. that the time needed to compute a hash value for the word w is Θ(∣w∣) time units, where ∣w∣ is the length of w.
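Such a hash function could, for instance, be a simple polynomial hash along the following lines (a sketch; the base and modulus are arbitrary choices, not prescribed by the text):

    def poly_hash(w, base=31, mod=(1 << 61) - 1):
        # One multiplication, one addition and one remainder per
        # character, so computing the hash value takes Θ(∣w∣) time.
        h = 0
        for ch in w:
            h = (h * base + ord(ch)) % mod
        return h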
The number of characters in the file is denoted by N.
Let us now assume that we have a perfect hash function, and consider an iteration of the loop body:
* We find the next word w. This operation takes O(∣w∣ + s) time units, where s is the size of any surrounding white space that is traversed.
* We reverse the word.
* We look up the reversed word.
* We may also insert w into the set.
The last three operations take O(∣w∣) time units (amortised in the case of insertion, due to potential resizing/rehashing). We get that the loop body is amortised linear in the number of processed characters.
Each character in the file is processed at most a constant number of times, so the total time complexity is O(N). In the worst case the program must read all the input, so this bound is tight; we get the worst-case time complexity Θ(N).
Note that in this scenario most of the time is likely to be spent reading the program's input.
Now assume instead that all words hash to the same bucket. Then looking up the word w takes O(i∣w∣) time units, where i is the number of words currently stored in the set. (Note that comparing two words w₁ and w₂ for equality takes O(min(∣w₁∣, ∣w₂∣)) time units.) The other operations have the same time complexity as in the previous case, so the loop body's time complexity is O(i∣w∣ + s) (amortised).
Assume now that we have a worst-case scenario where the file does not contain any duplicated words and no two words are mirror images of each other. (It is not immediately obvious that such an input can be constructed for all alphabet sizes and all N, but this assumption will at least give us an upper bound on the time complexity.) In this case the worst-case time complexity is
O(∑_{i=0}^{n−1} (i∣wᵢ∣ + sᵢ)) = O(cn² + N),
where n is the number of words, wᵢ is the i-th word, sᵢ is the white space traversed around it, and every word has at most c characters. If we have both c = Θ(N) and n = Θ(N), then we get the time complexity O(N³).
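Spelled out, the bound follows from ∣wᵢ∣ ≤ c together with ∑ sᵢ ≤ N:

    ∑_{i=0}^{n−1} (i∣wᵢ∣ + sᵢ) ≤ c ∑_{i=0}^{n−1} i + ∑_{i=0}^{n−1} sᵢ ≤ cn(n−1)/2 + N = O(cn² + N).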
Is the scenario above reasonable? Can we find a constant alphabet size A such that, for all (large) file sizes N, there is a file without duplicated or mirrored words, satisfying c = Θ(N) and n = Θ(N), and requiring Θ(c) character comparisons in order to see that any two distinct words are not equal? No: if Θ(N) character comparisons are required, then every word has Θ(N) characters, and because the file only contains N characters it can then only hold O(1) such words, so we cannot have Θ(N) words.
Challenge: Improve the analysis above (either prove that the bound is tight, or come up with a better bound).
Let us represent graphs as follows:
* One hash table mapping node identifiers (strings) to node objects.
* Node objects, each with two fields:
    identifier: The node's identifier.
    adjacent: An adjacency list, containing (pointers to) node objects.

Note that this representation ensures that one can find the successors of a node without having to look up strings.
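As a sketch, in Python such node objects might look as follows (the class and field names follow the description above):

    class Node:
        # A graph node: an identifier plus an adjacency list holding
        # direct references to other Node objects, so that successors
        # can be found without any string lookups.
        def __init__(self, identifier):
            self.identifier = identifier
            self.adjacent = []      # list of Node objects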
Basic idea: Make two passes over the file: first create one node object per line, then fill in the adjacency lists using hash table lookups.
Program:

l ← parse the file into a list of lists of strings, where
      * the outer list contains one inner list per line in the file, and
      * every inner list contains one node identifier (string) per word
        on the line, in the order in which the words occurred
h ← an empty hash table mapping strings (node identifiers) to node objects
for each inner list l′ of l do
    id ← the first element of l′
    n ← new node object: {identifier = id, adjacent = empty list}
    h.insert(id, n)
for each inner list l′ of l do
    id ← the first element of l′
    adj ← h.lookup(id).adjacent
    for each element id′ of l′, except for the first, do
        adj.insert(h.lookup(id′))
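A runnable Python version of this program, reusing the Node class sketched above (the whitespace-based parsing, and the requirement that every identifier occurring in the file also starts some line, are assumptions of the sketch):

    def build_graph(text):
        # Two passes: first create one node object per line, then
        # fill in the adjacency lists via hash table lookups.
        l = [ws for ws in (line.split() for line in text.splitlines()) if ws]
        h = {}                           # hash table: identifier -> Node
        for inner in l:                  # first loop: create the nodes
            h[inner[0]] = Node(inner[0])
        for inner in l:                  # second loop: add the edges
            adj = h[inner[0]].adjacent
            for id2 in inner[1:]:
                adj.append(h[id2])       # pointer to a node, not a string
        return h

    g = build_graph("a b c\nb a\nc\n")
    print([n.identifier for n in g["a"].adjacent])   # prints ['b', 'c']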
Basic idea: Add a new node to the graph, connected to all the other nodes, and return the result of running the Hamiltonian cycle algorithm on the extended graph.
Denote the node added to the input graph by v.
The algorithm is correct if we can prove that the input graph has a Hamiltonian path iff the extended graph has a Hamiltonian cycle.
The only if direction:
Assume that the input graph has a Hamiltonian path v₁, v₂, …, vₙ. Note that all these nodes are distinct from each other and from v. We get that the extended graph has a Hamiltonian cycle v, v₁, v₂, …, vₙ, v.
The if direction:
Assume that the extended graph has a Hamiltonian cycle. We can assume, without loss of generality, that the cycle starts and ends in v: v, v₁, v₂, …, vₙ, v. The edges between consecutive nodes vᵢ and vᵢ₊₁ do not involve v, so they belong to the input graph, and hence the nodes v₁, v₂, …, vₙ form a Hamiltonian path in the input graph.
If the input graph is G = (V, E), then it takes O(∣V∣) time units to add the extra node (if a suitable adjacency-list representation is used). The linear-time Hamiltonian cycle algorithm is then applied to the extended graph, which has ∣V∣ + 1 nodes and ∣E∣ + ∣V∣ edges, so this step takes O((∣V∣ + 1) + (∣E∣ + ∣V∣)) = O(∣V∣ + ∣E∣) time units. Hence the total running time is also O(∣V∣ + ∣E∣), provided that our assumption about the graph's representation is correct (perhaps the Hamiltonian cycle algorithm uses some non-standard graph representation).
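As a sketch on top of the representation above (has_hamiltonian_cycle stands in for the assumed linear-time algorithm and is not defined here; the fresh identifier is assumed not to occur in the graph):

    def has_hamiltonian_path(h, has_hamiltonian_cycle):
        # Add a fresh node v, connect it to all other nodes in both
        # directions (O(∣V∣) work), and ask for a Hamiltonian cycle
        # in the extended graph. Modifies the graph h in place.
        v = Node("__v__")
        for node in h.values():
            v.adjacent.append(node)
            node.adjacent.append(v)
        h[v.identifier] = v
        return has_hamiltonian_cycle(h)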