Thanks!

```scala
import scala.collection.TraversableLike

implicit class TraversableExtras[A, Repr <: TraversableOnce[A]](t: TraversableLike[A, Repr]) {
  def nonEmptyTails: Iterator[Repr] = {
    val bufferedInits = t.inits.buffered
    new Iterator[Repr] {
      // The last element of `inits` is always empty; stop just before it.
      // Note: hasNext must be parameterless to implement Iterator#hasNext.
      def hasNext = bufferedInits.head.nonEmpty
      def next() = bufferedInits.next()
    }
  }
}
```

TraversableLike has a second type parameter (called Repr here), which is the type of the underlying collection itself, so we can return values of that type rather than a plain Traversable and preserve the concrete collection type in the result. Note that this also means we have to constrain Repr to be a subtype of TraversableOnce, so the compiler can prove that the nonEmpty operation is available on it.
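To illustrate what preserving Repr buys, here is a usage sketch. It assumes Scala 2.12.x (TraversableLike was removed from the standard library in 2.13) and reproduces the implicit class so the snippet is self-contained:

```scala
import scala.collection.TraversableLike

// The implicit class from the comment, repeated for self-containment.
implicit class TraversableExtras[A, Repr <: TraversableOnce[A]](t: TraversableLike[A, Repr]) {
  def nonEmptyTails: Iterator[Repr] = {
    val bufferedInits = t.inits.buffered
    new Iterator[Repr] {
      // The last element of `inits` is always empty; stop just before it.
      def hasNext = bufferedInits.head.nonEmpty
      def next() = bufferedInits.next()
    }
  }
}

// Repr is inferred as List[Int], so we get Lists back, not a plain
// Traversable[Int]; the type annotation below compiles only because
// the concrete collection type survived.
val prefixes: List[List[Int]] = List(1, 2, 3).nonEmptyTails.toList
// prefixes == List(List(1, 2, 3), List(1, 2), List(1))
```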

Thanks!

Tim

http://stackoverflow.com/a/15297616/679628

Pick a prime p such that i = sqrt(-1) is not in Fp (equivalently, p = 3 mod 4).

Work over Fp^2, representing each element x in Fp^2 as x = a + i*b with a, b in Fp.

If a, b, c, d are 16-bit numbers, then the product (a + i*b)(c + i*d) over Fp^2 is:

(ac - bd) + i*(bc + ad).

This can be done efficiently in parallel: for example, pack the operands so that ac and bd come out of one 64-bit multiplication. Or use SIMD.

The biggest advantage of Fp^2 is that taking an inverse (for division) reduces to an inverse over Fp: (a + i*b)^-1 = (a - i*b) / (a^2 + b^2), and the norm a^2 + b^2 lies in Fp. This can be accelerated in a number of ways, since Fp is only 16 bits.
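A minimal sketch of this arithmetic in Scala. Assumptions not in the comment above: the concrete prime p = 65519 (a 16-bit prime with p mod 4 = 3, so x^2 = -1 has no solution in Fp), inputs already reduced mod p, and plain scalar code; the packed 64-bit / SIMD multiplication is not shown.

```scala
object Fp2 {
  val p = 65519L // 16-bit prime, p % 4 == 3, so -1 has no square root in Fp

  // (a + i*b) * (c + i*d) = (ac - bd) + i*(bc + ad), reduced mod p.
  def mul(a: Long, b: Long, c: Long, d: Long): (Long, Long) =
    (((a * c - b * d) % p + p) % p, (b * c + a * d) % p)

  // Square-and-multiply; used for the single inverse over Fp below.
  def modPow(base: Long, exp: Long, m: Long): Long = {
    var result = 1L; var b = base % m; var e = exp
    while (e > 0) {
      if ((e & 1L) == 1L) result = result * b % m
      b = b * b % m
      e >>= 1
    }
    result
  }

  // (a + i*b)^-1 = (a - i*b) / (a^2 + b^2): the norm a^2 + b^2 lies in
  // Fp, so only one inverse over Fp is needed (Fermat: n^-1 = n^(p-2)).
  def inv(a: Long, b: Long): (Long, Long) = {
    val n = (a * a + b * b) % p
    val nInv = modPow(n, p - 2, p)
    ((a * nInv) % p, ((p - b) % p * nInv) % p)
  }
}
```

Because p = 3 mod 4, a^2 + b^2 = 0 mod p only when a = b = 0, so every nonzero element of Fp^2 is invertible.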

I was sort of expecting there to be a final example where tmp1/2/3 are created before the loop and only summed together after the loop. In your code you create vectors of length 3 (because every tmp gets assigned 3 times, not because there are 3 tmps). If you move the tmp creation and final summation outside the loop, you don't flush the floating-point pipeline after every 3 operations.

Maybe it gives no improvement, but I’m curious. Maybe the biggest gain is to be had from the SIMD instructions (obviously, a factor of 4) rather than the pipelining, but I’d be curious to see for sure.
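For what it's worth, the suggested variant might look like the sketch below (hypothetical: the name dot3 and the dot-product workload are illustrative, not the post's actual kernel). The three accumulators have no dependence on one another inside the loop, so consecutive floating-point adds can overlap in the pipeline; they are combined exactly once, after the loop:

```scala
// Three independent accumulators kept alive across the whole loop,
// summed together only once at the end.
def dot3(a: Array[Double], b: Array[Double]): Double = {
  var s0 = 0.0; var s1 = 0.0; var s2 = 0.0
  var i = 0
  val n = a.length - a.length % 3
  while (i < n) {
    s0 += a(i) * b(i)         // each statement feeds a different
    s1 += a(i + 1) * b(i + 1) // accumulator, so there is no chain of
    s2 += a(i + 2) * b(i + 2) // dependent adds inside the loop body
    i += 3
  }
  while (i < a.length) { s0 += a(i) * b(i); i += 1 } // leftover elements
  s0 + s1 + s2 // single combine after the loop
}
```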

Also, can you update the post with performance numbers for the cblas implementation?

BTW, great work here. Very interesting demonstration.
