# Fast vector arithmetic over $\mathbb{F}_{3}$ 

K. Coolsaet


#### Abstract

We show how binary machine instructions can be used to implement fast vector operations over the finite field $\mathbb{F}_{3}$. Apart from the standard operations of addition, subtraction and dot product, we also consider combined addition and subtraction, weight, Hamming distance, and iteration over all vectors of a given length.

Tests show that our implementation can be as much as 10 times faster than the standard method of using modular arithmetic on arrays of bytes. For computing the Hamming distance even a factor of 33 can sometimes be reached, provided a recent CPU is used.


## 1 Introduction

It is common knowledge that the fastest way to work with vectors over the field $\mathbb{F}_{2}$ of two elements is to represent such vectors as bit vectors and use binary CPU instructions to process them, in particular 'exclusive or' for addition, and 'and' for multiplication. The inherent bit parallelism of a 64-bit CPU allows vectors of $\mathbb{F}_{2}^{64}$ to be processed as fast as single integers.

It is less well known that a similar technique can be used to parallelize vector operations over the field $\mathbb{F}_{3}$ of three elements, representing every element of the field as a pair of bits and emulating the field operations of $\mathbb{F}_{3}$ as combinations of standard binary machine instructions.

In [3] Kawahara et al. use such a representation to perform fast vector addition and subtraction needing six binary operations for each ternary operation. They also show that fewer operations do not suffice.

In this paper we shall additionally present fast binary methods for computing the dot product of two vectors and determining the Hamming distance between two vectors. We also show that if we need both the sum and the difference of the same pair of vectors, then this can be done considerably faster than adding and subtracting them separately. (This is for instance useful when generating all linear combinations of a given set of vectors.) We also describe a fast method for iterating through all vectors of a given length. All of these techniques should prove very effective for computer applications in ternary codes and finite geometries.

We implemented several benchmark programs to estimate how fast our operations are in comparison to the standard way of representing vectors as arrays of bytes with modular arithmetic to add, subtract or multiply them. We ran these tests on various types of CPU and also compared several variants of binary representations. One slightly unexpected result was that an implementation of addition and subtraction using seven operations ran at exactly the same speed as the implementation of Kawahara et al., which uses only six. The number of steps in which operations can be performed in parallel seems to be a better predictor for the actual execution time than simply the number of operations.

In summary, for vectors of length 64, addition, subtraction and dot product can be done up to 10 times faster using our methods instead of the standard implementations. For computing the Hamming distance we even managed to improve the running time by a factor of 33 (provided the CPU is sufficiently recent).

In Section 2 we present and prove the formulas that form the basis of our implementation (cf. Theorem 1). In Section 3.1 we describe how vector addition, subtraction and a combination of those can be done by representing a vector of $\mathbb{F}_{3}^{n}, n \leq 64$, as two 64-bit words. Weight and Hamming distance are treated in Section 3.2, the dot product in Section 3.3 and iteration over all elements of $\mathbb{F}_{3}^{n}$ in Section 3.4. In Section 3.5 we describe an alternative representation which uses three 64-bit words for each vector, and one which needs only a single 64-bit word, provided $n \leq 32$. The results of our benchmarks are presented in Section 4 and in Section 5 we end with some final remarks.

## 2 Basic operations

Although in this text we will be working at the same time with elements of both $\mathbb{F}_{2}$ and $\mathbb{F}_{3}$, it will always be clear from context which is which and therefore we use the same standard mathematical notations for addition, subtraction and multiplication, irrespective of the field in which we are working.

Recall that binary addition is the same as binary 'exclusive or' and multiplication is the same as binary 'and'. We shall use the notation ' $\mid$ ' for binary 'or' and ' $\neg$ ' for binary 'not'. In terms of field operations we have $x \mid y=x y+x+y$ and $\neg x=x+1$. The following properties are easily derived :

$$
\begin{equation*}
x|y=x|(x+y)=y|(x+y), \quad x|(y+z)=(x \mid y)+(x \mid z)+x \tag{1}
\end{equation*}
$$

for all $x, y, z \in \mathbb{F}_{2}$.

For $d \in \mathbb{F}_{3}$ we define $d_{0}, d_{1}, d_{2} \in \mathbb{F}_{2}$ according to the following table, which we refer to as (2) :

| $d$ | $d_{0}$ | $d_{1}$ | $d_{2}$ |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 0 |

Note that $d_{0}+d_{1}+d_{2}=0$ and that in all cases $d_{0} \mid d_{1}=d_{1} \mid d_{2}=d_{2} \mid d_{0}=1$, or equivalently $d_{2}=d_{0}+d_{1}=d_{0} d_{1}+1$, $d_{0}=d_{1}+d_{2}=d_{1} d_{2}+1$, $d_{1}=d_{2}+d_{0}=d_{2} d_{0}+1$. The permutation $d \mapsto d+1$ in $\mathbb{F}_{3}$ translates to $d_{0} \mapsto d_{1} \mapsto d_{2} \mapsto d_{0}$ in $\mathbb{F}_{2}$. The involution $d \leftrightarrow-d$ corresponds to the interchange $d_{1} \leftrightarrow d_{2}$.

In Section 3 we shall discuss several representations of (vectors of) ternary digits. In most of the cases we shall represent one digit $d$ by a pair of bits $\left(d_{1}, d_{2}\right)$, although also a 3-bit representation $\left(d_{0}, d_{1}, d_{2}\right)$ shall be considered.
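To make the two-bit representation concrete, the following C sketch packs an array of ternary digits into two machine words and reads a single digit back. The type `f3vec` and the helpers `f3_pack` and `f3_digit` are hypothetical names introduced here for illustration only. Filling unused positions with the encoding $(1,1)$ of the digit 0 is an assumption of this sketch, not something prescribed by the text.

```c
#include <stdint.h>

/* Hypothetical container: bit i of v1 (resp. v2) holds d_1 (resp. d_2)
   of the digit at position i, with (d_1, d_2) taken from the table:
   0 -> (1,1), 1 -> (0,1), 2 -> (1,0). */
typedef struct { uint64_t v1, v2; } f3vec;

/* Pack n <= 64 digits (each 0, 1 or 2) into two words.
   Positions >= n keep the encoding of 0 (assumption of this sketch). */
static f3vec f3_pack(const unsigned char *digits, int n)
{
    f3vec r = { ~UINT64_C(0), ~UINT64_C(0) };            /* all digits 0 */
    for (int i = 0; i < n; i++) {
        if (digits[i] == 1) r.v1 &= ~(UINT64_C(1) << i); /* d_1 = 0 */
        if (digits[i] == 2) r.v2 &= ~(UINT64_C(1) << i); /* d_2 = 0 */
    }
    return r;
}

/* Recover the digit at position i: 1 if d_1 = 0, 2 if d_2 = 0, else 0. */
static int f3_digit(f3vec v, int i)
{
    unsigned d1 = (unsigned)((v.v1 >> i) & 1);
    unsigned d2 = (unsigned)((v.v2 >> i) & 1);
    return d1 ? (d2 ? 0 : 2) : 1;
}
```

With this convention the all-ones pair of words represents the zero vector, so a freshly initialised vector needs no further masking.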

The following theorem serves as the basis for the rest of this paper.

Theorem 1. Let $v, w \in \mathbb{F}_{3}$. Then

$$
\begin{align*}
& n=-v \Leftrightarrow\left\{\begin{array}{l}
n_{1}=v_{2} \\
n_{2}=v_{1}
\end{array}\right.  \tag{3}\\
& s=v+w \Leftrightarrow\left\{\begin{aligned}
s_{1} & =\left(v_{0}+w_{1}\right) \mid\left(v_{1}+w_{0}\right)=\left(v_{0}+w_{1}\right) \mid\left(v_{2}+w_{2}\right) \\
& =\left(v_{2}+w_{2}\right) \mid\left(\left(v_{1}+w_{1}\right)+v_{2}\right) \\
& =\left(\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)\right)+v_{2} w_{2} \\
s_{2} & =\left(v_{0}+w_{2}\right) \mid\left(v_{2}+w_{0}\right)=\left(v_{0}+w_{2}\right) \mid\left(v_{1}+w_{1}\right) \\
& =\left(v_{1}+w_{1}\right) \mid\left(\left(v_{2}+w_{2}\right)+v_{1}\right) \\
& =\left(\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)\right)+v_{1} w_{1}
\end{aligned}\right.  \tag{4}\\
& d=v-w \Leftrightarrow\left\{\begin{aligned}
d_{1} & =\left(v_{0}+w_{2}\right) \mid\left(v_{1}+w_{0}\right)=\left(v_{0}+w_{2}\right) \mid\left(v_{2}+w_{1}\right) \\
& =\left(v_{2}+w_{1}\right) \mid\left(\left(v_{1}+w_{2}\right)+v_{2}\right) \\
& =\left(\left(v_{1}+w_{2}\right) \mid\left(v_{2}+w_{1}\right)\right)+v_{2} w_{1} \\
d_{2} & =\left(v_{0}+w_{1}\right) \mid\left(v_{2}+w_{0}\right)=\left(v_{0}+w_{1}\right) \mid\left(v_{1}+w_{2}\right) \\
& =\left(v_{1}+w_{2}\right) \mid\left(\left(v_{2}+w_{1}\right)+v_{1}\right) \\
& =\left(\left(v_{1}+w_{2}\right) \mid\left(v_{2}+w_{1}\right)\right)+v_{1} w_{2}
\end{aligned}\right.  \tag{5}\\
& p=v w \Leftrightarrow\left\{\begin{aligned}
p_{1} & =v_{1} w_{2} \mid v_{2} w_{1}=\left(v_{1} \mid w_{1}\right)\left(v_{2} \mid w_{2}\right) \\
p_{2} & =v_{1} w_{1} \mid v_{2} w_{2}=\left(v_{1} \mid w_{2}\right)\left(v_{2} \mid w_{1}\right) \\
p_{0} & =v_{0} w_{0} \\
\neg p_{2} & =v_{0} w_{0}\left(v_{1}+w_{1}\right)=v_{0} w_{0}\left(v_{2}+w_{2}\right) \\
& =\left(v_{1}+w_{1}\right)\left(v_{2}+w_{2}\right)=\left(v_{1} \mid w_{2}\right)+\left(v_{2} \mid w_{1}\right)
\end{aligned}\right. \tag{6}
\end{align*}
$$

Proof. The result for the negation $n$ is a trivial consequence of the definition (2). The results for the difference $d$ follow from those of the sum $s$ by substituting $-w$ for $w$ (i.e., interchanging $w_{1} \leftrightarrow w_{2}$ ).

Now consider $s_{1}$. Using the first of the identities (1) twice, we find

$$
\begin{aligned}
\left(v_{0}+w_{1}\right) \mid\left(v_{1}+w_{0}\right) & =\left(v_{0}+w_{1}\right) \mid\left(v_{1}+w_{0}+v_{0}+w_{1}\right) \\
& =\left(v_{0}+w_{1}\right) \mid\left(v_{2}+w_{2}\right) \\
& =\left(v_{0}+w_{1}+v_{2}+w_{2}\right) \mid\left(v_{2}+w_{2}\right) \\
& =\left(v_{1}+w_{1}+w_{2}\right) \mid\left(v_{2}+w_{2}\right) .
\end{aligned}
$$

Again by (1) we obtain

$$
\begin{aligned}
\left(\left(v_{1}+w_{1}\right)+w_{2}\right) \mid\left(v_{2}+w_{2}\right)= & \left(\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)\right)+\left(w_{2} \mid\left(v_{2}+w_{2}\right)\right) \\
& +\left(v_{2}+w_{2}\right) \\
= & \left(\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)\right)+\left(v_{2} \mid w_{2}\right)+\left(v_{2}+w_{2}\right) \\
= & \left(\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)\right)+v_{2} w_{2} .
\end{aligned}
$$

This proves that the three formulas for $s_{1}$ in (4) are indeed equivalent. The same holds for $s_{2}$, which can be obtained from $s_{1}$ by interchanging $v_{1} \leftrightarrow v_{2}$ and $w_{1} \leftrightarrow w_{2}$.

Note that the formulas for $s_{1}$ and $s_{2}$ are invariant under the transformations $v \mapsto v+1, w \mapsto w-1$, i.e., $v_{0} \mapsto v_{1} \mapsto v_{2} \mapsto v_{0}, w_{0} \mapsto w_{2} \mapsto w_{1} \mapsto w_{0}$. It is therefore sufficient to prove the validity of these formulas in the special case $v=$ $w$. And indeed, in that case the first formula in (4) reduces to $s_{1}=v_{0}+v_{1}=v_{2}$ and similarly $s_{2}=v_{0}+v_{2}=v_{1}$. Therefore $s=-v=2 v$, as expected.

This leaves (6). We have $v w=0$ if and only if $v=0$ or $w=0$ if and only if $v_{0}=0$ or $w_{0}=0$ if and only if $v_{0} w_{0}=0$. This proves the value for $p_{0}$. Similarly, $v w=1$ if and only if $v=w=1$ or $v=w=2$, i.e., if and only if $v_{1}=w_{1}=0$ or $v_{2}=w_{2}=0$. And this is equivalent to $\left(\neg v_{1}\right)\left(\neg w_{1}\right) \mid\left(\neg v_{2}\right)\left(\neg w_{2}\right)=1$, or by De Morgan's laws, $\left(v_{1} \mid w_{1}\right)\left(v_{2} \mid w_{2}\right)=0$, yielding the second formula for $p_{1}$. By distributivity of 'or' over 'and' we also find

$$
v_{1} w_{2} \mid v_{2} w_{1}=\left(v_{1} \mid v_{2}\right)\left(v_{1} \mid w_{1}\right)\left(w_{2} \mid v_{2}\right)\left(w_{2} \mid w_{1}\right)=\left(v_{1} \mid w_{1}\right)\left(v_{2} \mid w_{2}\right)
$$

which proves the equivalence of the first and second expression for $p_{1}$. The expressions for $p_{2}$ can be proved in a similar way.

Again by De Morgan's laws, we find

$$
\neg p_{2}=\neg\left(\left(v_{1} w_{1}\right) \mid\left(v_{2} w_{2}\right)\right)=\left(v_{1} w_{1}+1\right)\left(v_{2} w_{2}+1\right)=\left(v_{1}+w_{1}\right)\left(v_{2}+w_{2}\right) .
$$

Also

$$
\begin{aligned}
\left(v_{1} \mid w_{2}\right)+\left(v_{2} \mid w_{1}\right) & =v_{1} w_{2}+v_{1}+w_{2}+v_{2} w_{1}+v_{2}+w_{1} \\
& =v_{1} w_{2}+v_{2} w_{1}+\left(v_{1}+v_{2}\right)+\left(w_{1}+w_{2}\right) \\
& =v_{1} w_{2}+v_{2} w_{1}+v_{1} v_{2}+w_{1} w_{2}=\left(v_{1}+w_{1}\right)\left(v_{2}+w_{2}\right)
\end{aligned}
$$

Finally

$$
\begin{aligned}
v_{0} w_{0}\left(v_{2}+w_{2}\right) & =\left(v_{0} v_{2}\right) w_{0}+\left(w_{0} w_{2}\right) v_{0}=\left(v_{1}+1\right) w_{0}+\left(w_{1}+1\right) v_{0} \\
& =\left(v_{1}+1\right)\left(w_{1}+w_{2}\right)+\left(w_{1}+1\right)\left(v_{1}+v_{2}\right) \\
& =v_{1} w_{2}+w_{1}+w_{2}+w_{1} v_{2}+v_{1}+v_{2}
\end{aligned}
$$

which is $\left(v_{1}+w_{1}\right)\left(v_{2}+w_{2}\right)$ as before.
(An alternative proof of this theorem consists of trying all 9 possible combinations of $v, w$ and checking each result. This can be automated and provides a good test for any implementation of these operations.)
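The exhaustive check mentioned above can be sketched in a few lines of C. The helper names `bit1`, `bit2` and `theorem1_mismatches` are hypothetical. Only one expression per component is tested here (the second and fifth line of (4), the third and sixth line of (5), and the second expressions for $p_{1}$ and $p_{2}$ in (6)); the remaining expressions can be checked in the same way.

```c
/* Bits d_1 and d_2 of a ternary digit d, as in the table of Section 2. */
static unsigned bit1(unsigned d) { return d != 1; }
static unsigned bit2(unsigned d) { return d != 2; }

/* Returns the number of pairs (v, w) for which one of the tested formulas
   of Theorem 1 disagrees with arithmetic modulo 3. */
static int theorem1_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 3; v++) {
        for (unsigned w = 0; w < 3; w++) {
            unsigned v1 = bit1(v), v2 = bit2(v);
            unsigned w1 = bit1(w), w2 = bit2(w);
            unsigned s = (v + w) % 3, d = (v + 3 - w) % 3, p = (v * w) % 3;

            /* (4), second and fifth line */
            unsigned s1 = (v2 ^ w2) | ((v1 ^ w1) ^ v2);
            unsigned s2 = (v1 ^ w1) | ((v2 ^ w2) ^ v1);
            /* (5), third and sixth line */
            unsigned d1 = ((v1 ^ w2) | (v2 ^ w1)) ^ (v2 & w1);
            unsigned d2 = ((v1 ^ w2) | (v2 ^ w1)) ^ (v1 & w2);
            /* (6), second expression for p_1 and p_2 */
            unsigned p1 = (v1 | w1) & (v2 | w2);
            unsigned p2 = (v1 | w2) & (v2 | w1);

            bad += (s1 != bit1(s)) || (s2 != bit2(s));
            bad += (d1 != bit1(d)) || (d2 != bit2(d));
            bad += (p1 != bit1(p)) || (p2 != bit2(p));
        }
    }
    return bad;
}
```

A return value of 0 means that all nine combinations agree with the theorem.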

## 3 Vector arithmetic

Consider an $n$-tuple $V=\left(v^{(1)}, \ldots, v^{(n)}\right) \in \mathbb{F}_{3}^{n}$, i.e., a vector of length $n$ with elements in $\mathbb{F}_{3}$. We shall write $V_{i}$ for the bit vector $V_{i} \stackrel{\text { def }}{=}\left(v_{i}^{(1)}, \ldots, v_{i}^{(n)}\right) \in \mathbb{F}_{2}^{n}$. If $V, W$ are vectors of the same length, then we write $V+W, V-W, V W$ for the elementwise sum, difference and product of $V$ and $W$ (both in $\mathbb{F}_{2}^{n}$ and $\mathbb{F}_{3}^{n}$ ). We also write $V_{i} \mid W_{j}$ for the elementwise binary 'or' of two vectors $V_{i}, W_{j} \in \mathbb{F}_{2}^{n}$. The elementwise multiplication $V W$ should not be confused with the dot product $V \cdot W \stackrel{\text { def }}{=} V^{(1)} W^{(1)}+\cdots+V^{(n)} W^{(n)} \in \mathbb{F}_{3}$, equal to the sum of the elements of the vector $V W$. We denote the weight of $V$ by $\|V\|$, i.e., the number of elements of $V$ that are different from zero. The Hamming distance $\|V-W\|$ counts the number of positions $i$ for which $V^{(i)}$ and $W^{(i)}$ differ.

We have used Theorem 1 in three different implementations of vector arithmetic over $\mathbb{F}_{3}$ which we discuss below (including some additional variants). We present fast methods for computing $V+W, V-W, V \cdot W,\|V\|$ and $\|V-W\|$ largely in terms of the binary vector operations $V_{i} W_{j}, V_{i}+W_{j}$ and $V_{i} \mid W_{j}$ which, when $n \leq 64$, correspond to single CPU instructions on standard 64-bit microprocessors. For the dot product, weight and the Hamming distance we also use 64-bit remainder, multiplication and shifts and a special 'population count' instruction, if available.

In what follows we will always assume that $n \leq 64$. Vectors of larger dimension can still be handled by partitioning them into blocks of size $\leq 64$.

### 3.1 Two words for each vector

The most direct way to apply Theorem 1 is to store a vector $V$ in two separate machine words that correspond directly to $V_{1}$ and $V_{2}$. Negation, addition, subtraction and elementwise multiplication can then be implemented as straight translations of the formulas of the theorem.

For addition (and similarly, subtraction) we may use the second and fifth line of (4) to obtain an implementation in as few as six operations:

$$
\begin{array}{rlr}
T_{1} & \leftarrow V_{1}+W_{1} & T_{2} \leftarrow V_{2}+W_{2} \\
U_{1} & \leftarrow T_{1}+V_{2} & U_{2} \leftarrow T_{2}+V_{1}  \tag{7}\\
S_{1} & \leftarrow T_{2} \mid U_{1} & S_{2} \leftarrow T_{1} \mid U_{2}
\end{array}
$$

( $S$ contains the result, $T$ and $U$ are auxiliary variables.)
This is the same implementation discussed in [3], where it is also proved that at least six operations are needed. However, on modern CPUs the number of operations is not always the best measure of speed. We have also tested the following implementation, based on the third and sixth line of (4), which needs seven operations:

$$
\begin{array}{llll}
T_{1} \leftarrow V_{1}+W_{1} & T_{2} \leftarrow V_{2}+W_{2} & U_{1} \leftarrow V_{1} W_{1} & U_{2} \leftarrow V_{2} W_{2} \\
U_{*} \leftarrow T_{1} \mid T_{2} & & & \\
S_{1} \leftarrow U_{*}+U_{2} & S_{2} \leftarrow U_{*}+U_{1} & & \tag{8}
\end{array}
$$

It turns out that our benchmark tests (cf. Section 4) show no significant difference in speed on modern CPUs between the 6- and 7-op versions of addition and subtraction. (For the somewhat less recent AMD Opteron 2212, the 7-op version is slower by a factor of $\approx 1.05$.) The reason for this is probably that the CPU manages to execute several operations in parallel, and needs only three parallel steps in both cases.
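As a minimal sketch, the six-operation scheme (7) and the seven-operation variant based on the third and sixth line of (4) might be written in C as follows. The type `f3vec` and the function names are hypothetical; `^`, `&` and `|` play the roles of $+$, multiplication and $\mid$ on bit vectors.

```c
#include <stdint.h>

typedef struct { uint64_t v1, v2; } f3vec;  /* hypothetical pair (V_1, V_2) */

/* Six-operation addition, following (7). */
static f3vec f3_add6(f3vec v, f3vec w)
{
    uint64_t t1 = v.v1 ^ w.v1, t2 = v.v2 ^ w.v2;
    uint64_t u1 = t1 ^ v.v2,   u2 = t2 ^ v.v1;
    f3vec s = { t2 | u1, t1 | u2 };
    return s;
}

/* Seven-operation addition, following the third and sixth line of (4). */
static f3vec f3_add7(f3vec v, f3vec w)
{
    uint64_t t1 = v.v1 ^ w.v1, t2 = v.v2 ^ w.v2;
    uint64_t u1 = v.v1 & w.v1, u2 = v.v2 & w.v2;
    uint64_t u  = t1 | t2;
    f3vec s = { u ^ u2, u ^ u1 };
    return s;
}

/* Subtraction: add the negation of w, which by (3) swaps W_1 and W_2. */
static f3vec f3_sub(f3vec v, f3vec w)
{
    f3vec n = { w.v2, w.v1 };
    return f3_add6(v, n);
}
```

In both addition variants the statements on one source line are mutually independent, which matches the three parallel steps mentioned above.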

Where both $V+W$ and $V-W$ are needed at the same time, it is possible to shave off another 2 operations. There are various ways to accomplish this. For example, based on the first and fourth lines of (4-5), we may write

$$
\begin{array}{lllllll}
V_{0} \leftarrow V_{1}+V_{2} & W_{0} \leftarrow W_{1}+W_{2} & & & \\
T_{1} \leftarrow V_{0}+W_{1} & T_{2} \leftarrow V_{0}+W_{2} & U_{1} \leftarrow W_{0}+V_{1} & U_{2} \leftarrow W_{0}+V_{2} \\
S_{1} \leftarrow T_{1} \mid U_{1} & S_{2} \leftarrow T_{2} \mid U_{2} & D_{1} \leftarrow T_{2} \mid U_{1} & D_{2} \leftarrow T_{1} \mid U_{2} \tag{9}
\end{array}
$$

yielding 10 (instead of 12) operations in 3 parallel steps. Because we use the same number of parallel steps, one might even expect that (9) and (7) take the same execution time, and indeed on recent CPUs this is almost the case, cf. Section 4.
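The combined scheme (9) translates almost literally to C. The sketch below uses a hypothetical type `f3vec` and function name `f3_addsub`, and returns the sum and the difference through pointers.

```c
#include <stdint.h>

typedef struct { uint64_t v1, v2; } f3vec;  /* hypothetical pair (V_1, V_2) */

/* Computes sum = V + W and diff = V - W in 10 operations, following (9). */
static void f3_addsub(f3vec v, f3vec w, f3vec *sum, f3vec *diff)
{
    uint64_t v0 = v.v1 ^ v.v2, w0 = w.v1 ^ w.v2;   /* step 1 */
    uint64_t t1 = v0 ^ w.v1,   t2 = v0 ^ w.v2;     /* step 2 */
    uint64_t u1 = w0 ^ v.v1,   u2 = w0 ^ v.v2;
    sum->v1  = t1 | u1;  sum->v2  = t2 | u2;       /* step 3 */
    diff->v1 = t2 | u1;  diff->v2 = t1 | u2;
}
```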

### 3.2 Weight and Hamming distance

Theorem 2. Let $v, w \in \mathbb{F}_{3}$. Then

$$
\begin{aligned}
v \neq w & \Longleftrightarrow\left(v_{1}+w_{1}\right) \mid\left(v_{2}+w_{2}\right)=1, \\
v \neq 0 & \Longleftrightarrow v_{1}+v_{2}=1 .
\end{aligned}
$$

Proof. We have $v_{1}+w_{1}=1$ if and only if $v_{1} \neq w_{1}$, and $v_{2}+w_{2}=1$ if and only if $v_{2} \neq w_{2}$. Because $v \neq w$ if and only if $v_{1} \neq w_{1}$ or $v_{2} \neq w_{2}$, the first part of the theorem follows. The second part follows from the fact that $v$ is zero precisely when $v_{0}$ is zero.

This theorem can be used to compute both the Hamming distance $\|V-W\|$ between two vectors and the weight $\|V\|$ of a vector. These problems are now reduced to determining the weight of the binary vectors $\left(V_{1}+W_{1}\right) \mid\left(V_{2}+W_{2}\right)$ and $V_{1}+V_{2}$.

For older computers there are lots of well-known tricks to compute the weight $\|\beta\|$ of a binary vector $\beta$. The following method (tailored to 64 bits) seems to be the fastest at the time of writing [1]. We give a version in the programming language C, which we refer to as (10) :

```
    /* beta and t are unsigned 64-bit integers */
    t = beta - ((beta >> 1) & 0x5555555555555555L);               /* 2-bit sums */
    t = (t & 0x3333333333333333L) + ((t >> 2) & 0x3333333333333333L);
    t = (t + (t >> 4)) & 0x0f0f0f0f0f0f0f0fL;                     /* byte sums  */
    weight = (t * 0x0101010101010101L) >> 56;                     /* add bytes  */
```

Fortunately, recent CPUs (e.g., the Nehalem-based Intel Xeon processors) have a 'population count' machine instruction that computes $\|\beta\|$ directly and is much faster than (10).
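As an illustration, Theorem 2 combined with a population count gives the following C sketch. It relies on the builtin `__builtin_popcountll`, which is specific to GCC and Clang (with GCC it is only compiled to the hardware instruction when the target supports it, e.g. with `-mpopcnt`); the type `f3vec` and the function names are hypothetical. The sketch assumes that unused bit positions are equal in $V_{1}$ and $V_{2}$, so that they do not contribute to the count.

```c
#include <stdint.h>

typedef struct { uint64_t v1, v2; } f3vec;  /* hypothetical pair (V_1, V_2) */

/* Weight ||V||: number of positions where v_1 + v_2 = 1. */
static int f3_weight(f3vec v)
{
    return __builtin_popcountll(v.v1 ^ v.v2);
}

/* Hamming distance ||V - W||: positions where (v_1 + w_1) | (v_2 + w_2) = 1. */
static int f3_distance(f3vec v, f3vec w)
{
    return __builtin_popcountll((v.v1 ^ w.v1) | (v.v2 ^ w.v2));
}
```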

### 3.3 The dot product

An implementation of the elementwise multiplication $V W$ in 6 operations can be obtained directly from (6). This operation does not occur very often in practice and the only reason it is treated here is because we can use it to compute the dot product $V \cdot W$, which is the sum of all entries of $V W$, computed modulo 3 .

With $P=V W$, the dot product satisfies

$$
\begin{equation*}
V \cdot W=\left\|P_{2}\right\|-\left\|P_{1}\right\| \quad(\bmod 3), \tag{11}
\end{equation*}
$$

for indeed, every ternary digit 0 in $P$ will now contribute $1-1$ to the result, every 1 contributes $1-0$ and every 2 contributes $0-1$.

If your CPU supports the 'population count' instruction, formula (11) is almost the fastest implementation of the dot product, although it is even better to implement it as $\left\|P_{2}\right\|+2\left\|P_{1}\right\|(\bmod 3)$ because then you avoid taking remainders of negative numbers. We can improve this further by using the following identity instead :

$$
\begin{equation*}
V \cdot W=\left\|P_{0}\right\|+\left\|\neg P_{2}\right\| \quad(\bmod 3) \tag{12}
\end{equation*}
$$

Note that $P_{0}$ and $\neg P_{2}$ can be computed together in only five operations.

For older processors we again have to resort to tricks. We first establish a fast method of computing $\|\beta\|$ modulo 3 (with $\beta \in \mathbb{F}_{2}^{n}$), faster than first computing $\|\beta\|$ using (10) and then reducing the result modulo 3.

Write $\beta_{0}, \beta_{1}, \ldots$ for the successive bits of $\beta$ considered as elements of $\mathbb{Z}$ (least significant bit first). We want to compute $\|\beta\| \bmod 3$. Consider the nonnegative integer $[[\beta]]$ that has $\beta$ as its binary representation, i.e.,

$$
[[\beta]] \stackrel{\text { def }}{=} \beta_{0}+\beta_{1} 2+\beta_{2} 2^{2}+\beta_{3} 2^{3}+\beta_{4} 2^{4}+\beta_{5} 2^{5}+\cdots=\sum_{i=0}^{n-1} \beta_{i} 2^{i}
$$

(with additions and multiplications in $\mathbb{Z}$ ). Transform $[[\beta]]$ to a new number $[[\gamma]]$ by shifting the bits one position towards the lower significant end and zeroing the bits on odd positions, i.e.,

$$
\gamma=\left(\beta_{1}, 0, \beta_{3}, 0, \beta_{5}, 0, \ldots\right), \quad[[\gamma]]=\beta_{1}+\beta_{3} 2^{2}+\beta_{5} 2^{4}+\cdots .
$$

Subtracting $[[\gamma]]$ from $[[\beta]]$ then yields

$$
[[\beta]]-[[\gamma]]=\left(\beta_{0}+\beta_{1}\right)+\left(\beta_{2}+\beta_{3}\right) 2^{2}+\left(\beta_{4}+\beta_{5}\right) 2^{4}+\cdots
$$

Finally, because $2^{2 k}=4^{k}=1(\bmod 3)$, we have

$$
[[\beta]]-[[\gamma]]=\sum_{i=0}^{n-1} \beta_{i}=\|\beta\| \quad(\bmod 3)
$$

the required result.

In the notation of the programming language C this translates to

```
( beta - ((beta >> 1) & 0x5555555555555555L) ) % 3
```

which needs only 4 operations. Note the similarity to the first line of (10). By means of this 'weight modulo 3' operation, formula (12) translates to

$$
\begin{equation*}
V \cdot W=\left(\left\|P_{0}\right\| \bmod 3+\left\|\neg P_{2}\right\| \bmod 3\right) \bmod 3 . \tag{14}
\end{equation*}
$$

The additional remainder operation is needed because we want results to be either 0,1 or 2 . This is unfortunate because now we need three modulo operations in total, and these are slow in comparison to the other machine operations.

We can avoid this problem by postponing taking the remainder until the very last moment. This leads to the following C expression to compute the dot product

```
( (p0 - ((p0 >> 1) & 0x5555555555555555L)) +
    (notp2 - ((notp2 >> 1) & 0x5555555555555555L)) ) % 3
```

Sadly, because the addition in this expression could lead to an overflow, this method only works when the length of the vectors is strictly smaller than 64.
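The two ways of computing the dot product described in this section can be sketched in C as follows. The first function uses identity (12) with the GCC/Clang builtin `__builtin_popcountll`; the second postpones the remainder as described above. The type `f3vec` and the function names are hypothetical, and both functions assume that unused bit positions are equal in $V_{1}$ and $V_{2}$ (so that $V_{0}$ is zero there).

```c
#include <stdint.h>

typedef struct { uint64_t v1, v2; } f3vec;  /* hypothetical pair (V_1, V_2) */

/* Dot product via (12), using a population count. */
static int f3_dot_popcount(f3vec v, f3vec w)
{
    uint64_t p0  = (v.v1 ^ v.v2) & (w.v1 ^ w.v2);  /* P_0 = V_0 W_0    */
    uint64_t np2 = p0 & (v.v1 ^ w.v1);             /* not P_2, see (6) */
    return (__builtin_popcountll(p0) + __builtin_popcountll(np2)) % 3;
}

/* Dot product without population count, taking the remainder only once.
   Valid for vectors of length < 64 only (otherwise the sum may overflow). */
static int f3_dot_mod3(f3vec v, f3vec w)
{
    const uint64_t m = UINT64_C(0x5555555555555555);
    uint64_t p0  = (v.v1 ^ v.v2) & (w.v1 ^ w.v2);
    uint64_t np2 = p0 & (v.v1 ^ w.v1);
    return (int)(((p0 - ((p0 >> 1) & m)) + (np2 - ((np2 >> 1) & m))) % 3);
}
```

Both functions spend five bit operations on $P_{0}$ and $\neg P_{2}$; they differ only in how the two weights are reduced modulo 3.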

### 3.4 Iteration

Sometimes it is necessary to iterate through all possible vectors of a given length $n$. In the case of $\mathbb{F}_{2}^{n}$ this is straightforward : we simply iterate through all integers $0,1, \ldots, 2^{n}-1$ in their bit representation. Something similar, but slightly more complicated, works for $\mathbb{F}_{3}^{n}$.

For $\beta \in \mathbb{F}_{2}^{n}, \beta \neq(0, \ldots, 0)$, define pred $\beta$ to be the unique element of $\mathbb{F}_{2}^{n}$ that satisfies

$$
[[\operatorname{pred} \beta]]=[[\beta]]-1 \quad(\text { subtraction in } \mathbb{Z}) .
$$

Computing pred $\beta$ translates to a simple 64-bit integer subtraction by 1.

For $V \in \mathbb{F}_{3}^{n}, V \neq(2, \ldots, 2)$, define succ $V$ to be the unique element $W$ of $\mathbb{F}_{3}^{n}$ that satisfies

$$
W=\operatorname{succ} V \Longleftrightarrow\left\{\begin{array}{l}
W_{1}=\left(\text { pred } V_{2}\right) \mid \neg V_{1}  \tag{16}\\
W_{2}=V_{1}
\end{array}\right.
$$

It takes only three machine operations to compute succ $V$ from $V$. Note that this operation is well defined : $W_{1}^{(i)}$ and $W_{2}^{(i)}$ cannot be zero at the same time for any $i$, and hence $\left(W_{1}, W_{2}\right)$ is always a valid binary representation of some ternary vector.

Theorem 3. Consider the sequence $\mathcal{S}$ of vectors of $\mathbb{F}_{3}^{n}$ defined by $\mathcal{S}_{1} \stackrel{\text { def }}{=}(0, \ldots, 0)$, $\mathcal{S}_{i+1} \stackrel{\text { def }}{=} \operatorname{succ} \mathcal{S}_{i}$, for all $i, i=1, \ldots, 3^{n}-1$. Then

- $\mathcal{S}$ contains every element of $\mathbb{F}_{3}^{n}$ exactly once,
- $\mathcal{S}_{3^{n}}=(2, \ldots, 2)$ and hence $\mathcal{S}$ is well defined.

Proof. Let us first rephrase the 'succ' operation directly in terms of $\mathbb{F}_{3}$. Because $V \neq(2, \ldots, 2)$ we may always find $d \leq n$ such that $V$ can be written as

$$
\begin{equation*}
V=\left(2, \ldots, 2, V^{(d)}, V^{(d+1)}, \ldots, V^{(n)}\right) \text { with } V^{(d)} \neq 2 . \tag{17}
\end{equation*}
$$

We claim that

$$
\begin{equation*}
\operatorname{succ} V=\left(0, \ldots, 0, V^{(d)}+1,-V^{(d+1)}, \ldots,-V^{(n)}\right) \tag{18}
\end{equation*}
$$

Indeed, if $V$ is of the form (17), then $V_{2}$ is of the form

$$
V_{2}=\left(0, \ldots, 0,1, V_{2}^{(d+1)}, \ldots, V_{2}^{(n)}\right)
$$

and then

$$
\text { pred } V_{2}=\left(1, \ldots, 1,0, V_{2}^{(d+1)}, \ldots, V_{2}^{(n)}\right)
$$

and

$$
\left(\text { pred } V_{2}\right) \mid \neg V_{1}=\left(1, \ldots, 1, \neg V_{1}^{(d)}, V_{2}^{(d+1)} \mid \neg V_{1}^{(d+1)}, \ldots, V_{2}^{(n)} \mid \neg V_{1}^{(n)}\right) .
$$

Now, in general for $v \in \mathbb{F}_{3}$, we have $v_{2} \mid \neg v_{1}=v_{2} \mid\left(v_{1}+1\right)=v_{2}\left(v_{1}+1\right)+v_{2}+v_{1}+1=v_{2}$, hence

$$
\left(\text { pred } V_{2}\right) \mid \neg V_{1}=\left(1, \ldots, 1, \neg V_{1}^{(d)}, V_{2}^{(d+1)}, \ldots, V_{2}^{(n)}\right)
$$

Also, if $V^{(d)}=0$ then $\left(\neg V_{1}^{(d)}, V_{1}^{(d)}\right)=(0,1)$ which represents $1 \in \mathbb{F}_{3}$, and if $V^{(d)}=1$ then $\left(\neg V_{1}^{(d)}, V_{1}^{(d)}\right)=(1,0)$ which represents 2. Together with $W_{2}=V_{1}$, this proves (18).

Now, consider $V, W \in \mathbb{F}_{3}^{n}$ such that $V^{(1)}=W^{(1)}, V^{(2)}=W^{(2)}, \ldots, V^{(k)}=W^{(k)}$ for some $k \leq n$, i.e., such that $V$ and $W$ have the same $k$-prefix. We claim that also succ $V$ and succ $W$ must have the same $k$-prefix. Indeed, if $\left(V^{(1)}, \ldots, V^{(k)}\right) \neq(2, \ldots, 2)$ then (18) applies to both $V$ and $W$ for the same value of $d \leq k$, and then the required property is immediate. If on the other hand $V^{(1)}=\cdots=V^{(k)}=W^{(1)}=\cdots=W^{(k)}=2$, the $k$-prefix of both succ $V$ and succ $W$ will be equal to $(0, \ldots, 0)$. (In that case we apply (18) for possibly different values of $d$, but both for $V$ and $W$ we have $d>k$.)

It follows from this that the sequence of $k$-prefixes of $\mathcal{S}$ must be periodic. We will prove by induction on $k$ that the period length of this repetition is $3^{k}$ and that the $k$-prefix of $\mathcal{S}_{3^{k}}$ has all entries equal to 2.

The first few terms of $\mathcal{S}$ are easily computed to be the following :

$$
(0,0, \ldots, 0),(1,0, \ldots, 0),(2,0, \ldots, 0),(0,1,0, \ldots, 0)
$$

so the 1-prefixes have a period length of 3 and $\mathcal{S}_{3}$ has the required form. This proves the base case $k=1$ of our induction.

Now assume the properties hold for a given prefix length $k$. We claim that during one fixed period of $k$-prefixes, the $k+1$-th entries of subsequent vectors repeatedly change sign (i.e., $(\operatorname{succ} V)^{(k+1)}=-V^{(k+1)}$ ). Indeed, for every vector except the last one of a period, we may apply (18) with a value of $d$ that is at most $k$.

As a consequence, the first $3^{k}$ elements $V$ of $\mathcal{S}$ will have $V^{(k+1)}=0$. Element $\mathcal{S}_{3^{k}}$ has $k+1$-prefix equal to $(2, \ldots, 2,0)$ and hence, by (18), element $\mathcal{S}_{3^{k}+1}$ has $k+1$-prefix equal to $(0, \ldots, 0,1)$. In the second period, $k+1$-th entries alternate between 1 and $-1=2$, and because the length of a period is odd, we find that $\mathcal{S}_{2 \cdot 3^{k}}$ must have $k+1$-prefix equal to $(2, \ldots, 2,1)$. Similarly, it follows that the $k+1$-prefix of $\mathcal{S}_{2 \cdot 3^{k}+1}$ is $(0, \ldots, 0,2)$ and that of $\mathcal{S}_{3^{k+1}}$ is $(2, \ldots, 2,2)$. This proves that the first 3 periods of $k$-prefixes all yield different $k+1$-prefixes, and hence the period length for $k+1$-prefixes must be three times that of the $k$-prefixes (for it cannot be larger).

Finally, it follows that the period of $n$-prefixes is $3^{n}$, and hence that the elements of $\mathcal{S}$ are all different.

The sequence $\mathcal{S}$ has a somewhat unnatural ordering. For example, the 9 elements of the sequence for $n=2$ are

$$
(0,0),(1,0),(2,0),(0,1),(1,2),(2,1),(0,2),(1,1),(2,2) .
$$
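The iteration based on (16) can be sketched in C as follows. The names `f3vec`, `f3_succ`, `f3_digit` and `f3_enumerate` are hypothetical. The sketch assumes $1 \leq n<64$ and encodes all positions beyond $n$ as the digit 0, so that the $n$-prefix of the 64-bit vector runs through $\mathbb{F}_{3}^{n}$; the loop stops as soon as the first $n$ digits are all equal to 2.

```c
#include <stdint.h>

typedef struct { uint64_t v1, v2; } f3vec;  /* hypothetical pair (V_1, V_2) */

/* succ as in (16): subtraction by 1, 'not' and 'or'. */
static f3vec f3_succ(f3vec v)
{
    f3vec w = { (v.v2 - 1) | ~v.v1, v.v1 };
    return w;
}

/* Digit at position i: 1 if d_1 = 0, 2 if d_2 = 0, else 0. */
static int f3_digit(f3vec v, int i)
{
    unsigned d1 = (unsigned)((v.v1 >> i) & 1);
    unsigned d2 = (unsigned)((v.v2 >> i) & 1);
    return d1 ? (d2 ? 0 : 2) : 1;
}

/* Writes all vectors of length n (1 <= n < 64, in practice small) to out,
   n digits per vector, in the order of the sequence S.
   Returns the number of vectors written. */
static long f3_enumerate(int n, unsigned char *out)
{
    f3vec v = { ~UINT64_C(0), ~UINT64_C(0) };       /* zero vector */
    uint64_t mask = (UINT64_C(1) << n) - 1;         /* first n positions */
    long count = 0;
    for (;;) {
        for (int i = 0; i < n; i++)
            out[count * n + i] = (unsigned char)f3_digit(v, i);
        count++;
        if ((v.v2 & mask) == 0) break;              /* reached (2, ..., 2) */
        v = f3_succ(v);
    }
    return count;
}
```

For $n=2$ this reproduces the nine vectors listed above, in the same order.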

### 3.5 Other representations

A second way of representing a vector $V \in \mathbb{F}_{3}^{n}$ in computer memory is to store three machine words $V_{0}, V_{1}, V_{2}$ instead of just two. The advantage of this representation is that we now need only 5 operations to compute $V W$ :

$$
\begin{align*}
& P_{0} \leftarrow V_{0} W_{0} \quad T_{1} \leftarrow V_{1} \mid W_{1} \quad T_{2} \leftarrow V_{2} \mid W_{2} \\
& P_{1} \leftarrow T_{1} T_{2}  \tag{19}\\
& P_{2} \leftarrow P_{0}+P_{1}
\end{align*}
$$

and only 3 to compute $P_{0}$ and $\neg P_{2}$ in preparation for the dot product (cf. Section 3.3).
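In C, the three-word representation and the product (19) might look as follows. The type `f3vec3` and the function names are hypothetical, and `__builtin_popcountll` is a GCC/Clang builtin. The second function shows the three operations that prepare the dot product via (12).

```c
#include <stdint.h>

typedef struct { uint64_t v0, v1, v2; } f3vec3;  /* hypothetical (V_0, V_1, V_2) */

/* Elementwise product in 5 operations, following (19). */
static f3vec3 f3_mul3(f3vec3 v, f3vec3 w)
{
    f3vec3 p;
    p.v0 = v.v0 & w.v0;
    p.v1 = (v.v1 | w.v1) & (v.v2 | w.v2);
    p.v2 = p.v0 ^ p.v1;
    return p;
}

/* Dot product via (12): 3 operations yield P_0 and not P_2. */
static int f3_dot3(f3vec3 v, f3vec3 w)
{
    uint64_t p0  = v.v0 & w.v0;
    uint64_t np2 = p0 & (v.v1 ^ w.v1);
    return (__builtin_popcountll(p0) + __builtin_popcountll(np2)) % 3;
}
```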

Addition (and subtraction) need 7 operations: compute $S_{1}, S_{2}$ as before and then finish with $S_{0} \leftarrow S_{1}+S_{2}$. Combined addition and subtraction can be done in 10 operations : drop the first two statements of (9) and add $S_{0} \leftarrow S_{1}+S_{2}$, $D_{0} \leftarrow D_{1}+D_{2}$ at the end.

If $n \leq 32$ you can represent a vector $V \in \mathbb{F}_{3}^{n}$ in a single 64-bit word : store $V_{1}$ in one half word and $V_{2}$ in the other (we shall denote the resulting word by $V_{1}: V_{2}$ ). In some cases it is now possible to combine two 32-bit operations into a single 64-bit operation. For example, the product $P=V W$ can now be computed as follows :

$$
\begin{align*}
& T_{1}: T_{2} \leftarrow\left(V_{1}: V_{2}\right)\left(W_{2}: W_{1}\right) \quad U_{1}: U_{2} \leftarrow\left(V_{1}: V_{2}\right)\left(W_{1}: W_{2}\right)  \tag{20}\\
& P_{1}: P_{2} \leftarrow\left(T_{1}: U_{1}\right) \mid\left(T_{2}: U_{2}\right)
\end{align*}
$$

so that $P_{1}=V_{1} W_{2} \mid V_{2} W_{1}$ and $P_{2}=V_{1} W_{1} \mid V_{2} W_{2}$, as in (6). This looks like a 3-op implementation, but note that we need a half word swap of $W$ and that we need to split and recombine two 64-bit words on the way, something which admittedly can be done fast.

Addition (and similarly, subtraction) also needs few operations. The following implementation is derived directly from (7) :

$$
\begin{align*}
T_{1}: T_{2} & \leftarrow\left(V_{1}: V_{2}\right)+\left(W_{1}: W_{2}\right) \\
U_{1}: U_{2} & \leftarrow\left(T_{1}: T_{2}\right)+\left(V_{2}: V_{1}\right)  \tag{21}\\
S_{1}: S_{2} & \leftarrow\left(T_{2}: T_{1}\right) \mid\left(U_{1}: U_{2}\right),
\end{align*}
$$

good for 3 standard operations and two half word 'swaps' $\left(X_{1}: X_{2} \rightarrow X_{2}: X_{1}\right)$ each of which can be encoded as a single machine instruction.
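A C sketch of the packed addition (21) is given below. It assumes that $V_{1}$ occupies the upper and $V_{2}$ the lower 32 bits of the word; this layout and the function names `swap32` and `f3_add_packed` are choices made for this illustration only.

```c
#include <stdint.h>

/* Half word swap X_1:X_2 -> X_2:X_1; compilers typically emit a rotate. */
static uint64_t swap32(uint64_t x)
{
    return (x << 32) | (x >> 32);
}

/* Addition of two packed vectors V_1:V_2 and W_1:W_2, following (21). */
static uint64_t f3_add_packed(uint64_t v, uint64_t w)
{
    uint64_t t = v ^ w;            /* T_1:T_2                            */
    uint64_t u = t ^ swap32(v);    /* U_1:U_2 = (T_1:T_2) + (V_2:V_1)    */
    return swap32(t) | u;          /* S_1:S_2 = (T_2:T_1) | (U_1:U_2)    */
}
```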

## 4 Benchmarks

We have implemented and measured the speed of the operations discussed in the previous sections in various settings. We used the following test programs :

1. To test vector addition and subtraction we computed the echelon forms of 200000 square $n \times n$ matrices.
2. To test combined addition and subtraction, we computed all 6561 elements of the vector space generated by 8 vectors of length $n$ (and did this 5000 times).
3. We determined the Hamming distance between every pair of vectors in a set of 10000 vectors of length $n$.
4. Likewise, we computed the dot product of every such pair.

We did not measure the speed of iteration (Section 3.4) because it is mostly irrelevant : it is not the iteration itself that will determine the final running time of a program, but the action that is performed at every iteration.

We ran the tests above for consecutive values of $n$. The running time of each test was compared to that of a reference implementation in which vectors are represented as arrays of bytes equal to either 0,1 or 2 , and modulo 3 arithmetic was used for all operations. We did our best to use reasonably efficient code also for the reference implementations. For example, the dot product was first computed over $\mathbb{Z}$ and the remainder was only carried out at the end. Not only does this reduce the operation count, but it also allows the processor to make better use of its SIMD (i.e., vector processing) capabilities.
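For comparison, a reference implementation of the kind described above could look like the following C sketch. This is only an assumption about its general shape (arrays of bytes, arithmetic modulo 3, remainder of the dot product taken once at the end); the function names are hypothetical and the actual test programs may differ.

```c
#include <stddef.h>

/* Elementwise sum modulo 3 of two byte arrays with entries 0, 1 or 2. */
static void ref_add(const unsigned char *v, const unsigned char *w,
                    unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] = (unsigned char)((v[i] + w[i]) % 3);
}

/* Dot product: accumulate over the integers, reduce modulo 3 at the end. */
static int ref_dot(const unsigned char *v, const unsigned char *w, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)v[i] * w[i];
    return (int)(sum % 3);
}

/* Hamming distance: number of positions where the entries differ. */
static int ref_distance(const unsigned char *v, const unsigned char *w,
                        size_t n)
{
    int dist = 0;
    for (size_t i = 0; i < n; i++)
        dist += (v[i] != w[i]);
    return dist;
}
```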

The test programs were written in C and compiled with an optimizing compiler (GNU GCC). The source code of our test programs is available from http://caagt.ugent.be/fast/.

Spot checks on the generated assembly code convinced us that the compiler managed to use single machine instructions also in those cases where they did not have a direct C equivalent. For example, `(v << 32) | (v >> 32)` was indeed translated to a single 'rotate right by 32' instruction. It turned out to be important to use a recent version of the compiler : with the newer versions the standard implementations made better use of the SIMD instructions of the CPU, making the overall speed gain of our new methods a little less pronounced.

We compiled and ran the tests on 6 different types of 64-bit CPU:

| Type | Release date (dd/mm/yyyy) |
| :--- | :--- |
| AMD Opteron 2212 | 15/08/2006 |
| Intel Xeon X5355 | 14/11/2006 |
| Intel Core2 E8500 | 20/01/2008 |
| Intel Core2 Q9550 | 25/03/2008 |
| Intel Xeon X5570 | 30/03/2009 |
| Intel Xeon X6560 | 16/03/2010 |

In general, the older the processor, the larger the speed gain of our new methods. This sounds a bit counterintuitive, but the reason is that our reference implementation runs slower on older machines, because their SIMD support is not as good. For the results listed in the following pages we used the timings of the most recent processor (the X6560).

Figure 1: Addition and subtraction of vectors

We have implemented and compared 5 different representations of vectors of $\mathbb{F}_{3}^{n}:$

1. The representation of Section 3.1 where we use two 64-bit words for each vector.
2. The same representation but using two 32-bit words when $n \leq 32$. On recent CPUs this makes no significant difference in speed, although using only half the memory might be of advantage in some applications.
3. A representation that uses three 64-bit words for each vector (cf. Section 3.5).
4. A representation that packs two vectors $V_{1}, V_{2} \in \mathbb{F}_{2}^{n}$ (with $n \leq 32$) into a single 64-bit word $V_{1}: V_{2}$ (cf. Section 3.5).
5. A variant of this, which stores both $V_{1}: V_{2}$ and $V_{2}: V_{1}$ to avoid 'half word swap' operations. It turns out that this variant always performs worse than the previous one, and we shall not discuss it further.

Let us now turn to the results. In the graphs we plot the 'speed gain' of different methods against the lengths $n$ of the vectors considered. Speed gain is defined as the ratio between the running time of the reference implementation and that of the new implementation.

In Figure 1 we display the results for addition and subtraction, measured by computing the echelon form of a square matrix. We show the results for the six- and seven-operation versions in the two word representation, cf. (7) and (8), and also the seven-operation version in the three word representation of Section 3.5. It turns out that there is hardly any speed difference between the three implementations.

Figure 2: Combined vs. separate addition and subtraction of vectors

Figures 2 and 3 display the results of the subspace generation benchmark. The first figure shows that combining addition and subtraction by means of (9) is more than 1.7 times faster than executing both operations separately. (This is at least so for the most recent CPUs. For older types the factor is closer to 1.2, reflecting the $12/10$ ratio in number of operations.)

This benchmark is the only one in which there is a significant difference between the 1-, 2- and 3-word implementations (cf. Figure 3). However, because the same phenomenon appears when the addition and subtraction are not combined, this is probably a side effect of the generation algorithm itself rather than of the specific implementation of the ternary operations. It does however clearly illustrate that using more memory may significantly degrade performance, as more data needs to be moved around and the chance of cache misses becomes higher.

The most spectacular of our results is the Hamming distance benchmark where speed gains of up to 33 are reached, at least on modern CPUs that have a 'population count' instruction (cf. Section 3.2). But even on older computers a factor of 10 can still be obtained. (See Figure 4.)
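
To illustrate why a 'population count' instruction helps so much here, the following sketch computes a Hamming distance in a two word representation. It is not necessarily the formula of Section 3.2: it assumes only that every element of $\mathbb{F}_{3}$ is encoded by a distinct pair of bits, so that two coordinates differ exactly when their bit pairs differ. The type name `tern64` and the function names are hypothetical, and `__builtin_popcountll` is a GCC/Clang builtin (a portable loop is used as fallback).

```c
#include <stdint.h>

/* Hypothetical two word representation: bit i of v1 and bit i of v2 together
 * encode coordinate i of a vector of length at most 64. */
typedef struct { uint64_t v1, v2; } tern64;

/* Number of one bits in a 64-bit word. */
static int popcount64(uint64_t x)
{
#if defined(__GNUC__)
    return __builtin_popcountll(x);          /* GCC/Clang builtin */
#else
    int c = 0;
    while (x) { x &= x - 1; c++; }           /* clear the lowest set bit */
    return c;
#endif
}

/* Hamming distance: count the positions where the bit pairs differ. */
int hamming_distance(tern64 a, tern64 b)
{
    uint64_t diff = (a.v1 ^ b.v1) | (a.v2 ^ b.v2);
    return popcount64(diff);
}
```

The whole computation takes three binary operations and one population count, independently of $n \leq 64$, whereas a byte-array implementation has to compare the coordinates one by one (or in SIMD batches).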

Our last benchmark is used to test the dot product. In Figure 5 we compare three versions. The first uses the 'population count' instruction, the second is based on formula (15) and the third and slowest one uses formula (14) which requires three remainders to be taken. As mentioned before, the second method can only be used for $n<64$, hence on an old computer, with $n=64$ only the last version is available. The strange nonlinearity of the graphs is not a peculiarity of our implementation but is the effect of a good optimizing compiler and a recent CPU on the reference program. Apparently on our test CPU SIMD instructions allow the standard dot product to be computed 16 elements at a time.

Figure 3: Combined addition and subtraction of vectors

Figure 4: Computing the Hamming distance with and without 'population count' instruction

Figure 5: The dot product in three versions

As was mentioned at the start of Section 3.5, one of the advantages of the three word representation is a faster multiplication. Figure 6 shows that the dot product can indeed be computed slightly faster, by a factor of $\approx 1.09$.

## 5 Final remarks

Our methods yield the best results for vectors of length exactly 64, but in practice $n$ will often be smaller. One way to handle a vector $V \in \mathbb{F}_{3}^{n}$ of shorter length is to extend it silently with zeroes. This means that the corresponding bit vectors $V_{1}, V_{2}$ (and $V_{0}$) should be extended to 64-bit words by adding ones, which is a bit counterintuitive and easily leads to mistakes. Note however that none of the formulas in Theorem 1 use a binary 'not'. As a consequence, applying these formulas to the (invalid) pairs $\left(v_{1}, v_{2}\right)=(0,0)$ and $\left(w_{1}, w_{2}\right)=(0,0)$ will always yield the result $(0,0)$. In other words, if you extend the bit vectors with zeroes instead of ones you will not really run into trouble. Also weight and Hamming distance remain correct. The only exception is the iteration of Section 3.4.

The question naturally arises whether techniques similar to those of this paper would also be useful for vector arithmetic over other small algebraic structures, like $\mathbb{F}_{4}, \mathbb{F}_{5}$ or $\mathbb{Z} / 4 \mathbb{Z}$. Because any Boolean function can be implemented in terms of binary operations, in principle there is no reason why this would be impossible. Only experiments will tell whether the resulting implementations will be sufficiently fast.

In the case of $\mathbb{F}_{4}$ Bouyukliev and Bakoev have already done some preliminary work by implementing addition and subtraction and multiplication with a scalar [2]. It would be useful to extend this work with dot products, weights and Hamming distances as we did for $\mathbb{F}_{3}$, and with fast implementations of $V+\alpha W$ and $V+\alpha^{2} W$, where $\alpha, \alpha^{2}$ are the elements of $\mathbb{F}_{4}$ different from 0 and 1. These last two operations could then be used for computing the echelon form of a matrix. (Bouyukliev and Bakoev also considered $\mathbb{F}_{3}$, but their implementation is slower than ours.)

Figure 6: The dot product in 2 and 3 word representations

## References

[1] S. E. Anderson, Counting bits set, in parallel, Bit Twiddling Hacks, URL http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel (2011)
[2] I. Bouyukliev, V. Bakoev, Efficient computing of some vector operations over GF(3) and GF(4), Serdica J. Computing 2(2) (2008), 101-108
[3] Y. Kawahara, K. Aoki, T. Takagi, Faster Implementation of $\eta_{T}$ Pairing over $\mathrm{GF}(3^{m})$ Using Minimum Number of Logical Instructions for GF(3)-Addition, Pairing 2008, Lecture Notes in Computer Science 5209 (2008), 282-296, Springer-Verlag Berlin Heidelberg

Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, B-9000 Gent, Belgium Kris.Coolsaet@UGent.be


[^0]:    Received by the editors June 2012.
    Communicated by J. Doyen.
    2000 Mathematics Subject Classification: 65Y04, 12E30, 12-04.
    Key words and phrases: Fast vector arithmetic, GF(3), 64-bit operations, Hamming distance, dot product.

