This assembly implementation is marginally faster than
the non-size-optimized C version for large copies, and
is around half the code size.

Unaligned loads/stores will be used on platforms that
support them: though slower than aligned accesses, they
are still faster than copying byte-by-byte and have the
advantage of simplicity and small code size.
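
As a rough illustration of the unaligned-access approach (a C
sketch only, not the assembly added by this change), the bulk of
the copy moves one word at a time without aligning either pointer
first, and a byte loop handles the tail. The name sketch_memcpy
and the 8-byte word size are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration; not the routine in this change. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Bulk copy: one 8-byte word per iteration, with no alignment fix-up.
     * The fixed-size memcpy calls are the portable way to express an
     * unaligned access; compilers typically lower each to a single
     * load or store on targets that allow unaligned accesses. */
    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof(w));
        memcpy(d, &w, sizeof(w));
        s += sizeof(w);
        d += sizeof(w);
        n -= sizeof(w);
    }

    /* Tail: fewer than 8 bytes remain, copy them one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }

    return dst;
}
```

Skipping the alignment prologue is what keeps the routine short:
there is one bulk loop and one tail loop, with no separate cases
for each source/destination alignment.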

Change-Id: Ieee73d7557318d510601583f190ef3aa018c9121