I would expect you to find that the "efficient" implementation is much faster than the quadratic one for large inputs, and that the optimised library implementation is quite a bit faster than your own "efficient" implementation, but not wildly so.
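A minimal timing sketch along these lines, in pure Python, with insertion sort standing in for the quadratic implementation and the built-in sorted as the library implementation (the function names and input sizes are purely illustrative), should make the gap visible: the quadratic stand-in roughly quadruples its running time each time the input doubles, while the library routine barely moves.

```python
import random
import time

def insertion_sort(xs):
    """Stand-in for the quadratic implementation (illustrative only)."""
    xs = list(xs)
    for i in range(1, len(xs)):
        j = i
        while j > 0 and xs[j - 1] > xs[j]:
            xs[j - 1], xs[j] = xs[j], xs[j - 1]
            j -= 1
    return xs

def time_sort(sort, xs):
    """Time a single run of sort on a copy of xs."""
    start = time.perf_counter()
    sort(list(xs))
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000, 8_000):
        xs = [random.randrange(n) for _ in range(n)]
        print(f"n = {n:5d}   quadratic: {time_sort(insertion_sort, xs):7.3f} s"
              f"   library: {time_sort(sorted, xs):7.3f} s")
```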
As for testing, a good approach is to check that the three implementations always give the same result (up to reordering of duplicates).
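Here is a minimal sketch of such a check (the function name check_agreement, the (key, payload) representation and the parameters are just illustrative). Rather than comparing the outputs pairwise, it checks each output against the specification: the output must be a permutation of the input whose keys are in non-decreasing order, which implies agreement up to reordering of duplicates.

```python
import random
from collections import Counter

def check_agreement(implementations, trials=500, max_len=100):
    """Check that every implementation returns a permutation of the input
    whose keys are in non-decreasing order (agreement up to reordering
    of duplicates)."""
    for _ in range(trials):
        n = random.randrange(max_len)
        # (key, payload) pairs with a small key range, so duplicates are common
        xs = [(random.randrange(10), i) for i in range(n)]
        for sort in implementations:
            result = sort(list(xs))
            assert Counter(result) == Counter(xs)   # same multiset of elements
            keys = [k for k, _ in result]
            assert keys == sorted(keys)             # keys in non-decreasing order

if __name__ == "__main__":
    # The library implementation; the quadratic and "efficient"
    # implementations would be added to this list.
    check_agreement([lambda xs: sorted(xs, key=lambda p: p[0])])
    print("all trials passed")
```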
Two possible answers:
Radix sort. See exercise 5c from last year's ordinary exam. However, this answer is rather theoretical. (A minimal sketch is given after the second answer below.)
Quicksort or mergesort. LaMarca and Ladner (The Influence of Caches on the Performance of Sorting, Journal of Algorithms, 31(1):66-104, 1999) implement optimised variants of quicksort, heapsort, mergesort and radix sort, and run actual experiments. While radix sort is claimed to have the lowest instruction count, cache effects make it roughly 50% slower than quicksort or mergesort for sorting one million 64-bit integers on the given hardware. See the tables on page 23 (or page 98 of the journal version of the paper) for a quick summary.
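For concreteness, here is a minimal sketch of least-significant-digit radix sort for 64-bit unsigned keys (not the variant from exercise 5c, and not the optimised version from the paper; the function name and the one-byte-per-pass choice are assumptions for illustration):

```python
import random

def radix_sort_u64(xs):
    """LSD radix sort for 64-bit unsigned integers: eight stable
    bucket passes, one byte of the key per pass."""
    for shift in range(0, 64, 8):
        buckets = [[] for _ in range(256)]
        for x in xs:
            buckets[(x >> shift) & 0xFF].append(x)
        xs = [x for bucket in buckets for x in bucket]
    return xs

if __name__ == "__main__":
    data = [random.getrandbits(64) for _ in range(100_000)]
    assert radix_sort_u64(data) == sorted(data)
    print("ok")
```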
If the input is likely to be partially sorted, then one can also consider using an adaptive sorting algorithm.
Assume that the input consists of n elements, out of which n / c have the same key k. Then quicksort using this partitioning strategy (and no optimisations, such as switching to other sorting algorithms) requires Ω(n² / c²) steps.
Proof: Whenever a subset of the input is partitioned, all of its elements with key k end up in the same part (S₁ or S₂), except perhaps for one (the pivot). Quicksort does not terminate until every element has been selected as a pivot once, so the elements with key k give rise to at least n / c partitioning steps. Counted from the last one, the i-th of these steps is applied to a subset containing at least i elements with key k, and computing a partition requires examining every element of the subset, so these steps take at least
∑_{i=1}^{n / c} i = Ω(n² / c²)
steps.
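A small experiment can make the bound visible (a sketch only; the partitioning rule used here, namely that everything strictly smaller than the pivot goes to one part and everything else to the other, as well as the choice c = 4, are assumptions for illustration). With n / c equal keys, the comparison count should grow roughly quadratically in n / c, i.e. roughly quadruple each time n doubles.

```python
import random
import sys

comparisons = 0

def quicksort(xs):
    """Quicksort with a partitioning strategy that sends every element
    equal to the pivot into the same part (here: the >= side)."""
    global comparisons
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    comparisons += len(rest)            # one (conceptual) comparison per non-pivot element
    smaller = [x for x in rest if x < pivot]
    greater_equal = [x for x in rest if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(greater_equal)

if __name__ == "__main__":
    sys.setrecursionlimit(20_000)       # the duplicate keys force deep recursion
    c = 4
    for n in (1_000, 2_000, 4_000, 8_000):
        comparisons = 0
        # n/c copies of one key, the remaining elements distinct
        xs = [0] * (n // c) + random.sample(range(1, 10 * n), n - n // c)
        random.shuffle(xs)
        quicksort(xs)
        print(f"n = {n:5d}: {comparisons:10d} comparisons")
```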