Indexed or unindexed columns

One important feature of the Table container is the ability to compute arithmetic operations with and without taking the index into account. Using the indexed or the unindexed operation is left as a choice for the user. This notebook tries to explain the differences, and the consequences of using each option. We’ll start by creating a sparse Table.

from gtable import Table
t = Table()
t.add_column('a', [1,2,3,4,5,6])
t.add_column('b', [1,2,3], align="bottom")

If we access the column by attribute, we’ll get a type called Column, that includes information about the index

t.a
<Column[ int64 ] object at 0x7f5348736c50>

Accessing by key is a shortcut to the data stored within the Table, and has no information about how the table is indexed.

t['a']
array([1, 2, 3, 4, 5, 6])

The column a is not a particularly good example, but b is. The data stored in the latter column has only three elements. Where those elements are actually placed within the table is stored in the index.

t['b']
array([1, 2, 3])

The easiest and safest way to operate with columns is to take the index into account

c = t.a + t.b
c.values
array([ 2.,  3.,  4.])

See that, since the b column only had three elements, the result of the addition with the a column has only three elements. There are no NaNs or NAs. However, using the index to perform arithmetic operations has some cost, particularly in the case of large dense columns. Assume that we want to scale the a column by the last element of b. We can do that either accessing the full column or by accessing the raw data

c = t.a * t.b[-1]
c.values
array([ 3,  6,  9, 12, 15, 18])

Using columns is more convenient, since in many cases arithmetic operations do what they are supposed to do, but they have an important caveat: performance:

%%timeit
t.a * t.b[-1]
26.2 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
t['a'] * t['b'][-1]
6.19 µs ± 88.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But since the data of each column has a different length, using the raw data or the colum will have different outcomes

t.a + t.b
<Column[ float64 ] object at 0x7f531fc411d0>
t['a'] + t['b']
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-12-2fb36be086ed> in <module>()
----> 1 t['a'] + t['b']


ValueError: operands could not be broadcast together with shapes (6,) (3,)

A caveat of columns is that they are designed to perform fast operations using the column as a whole, and in consequence, accessing individual item of a column is O(N).

Another important difference is that we can create new columns by attribute, but not by index

t.c = t.a + t.b
t
<Table[ a[6] <int64>, b[3] <int64>, c[3] <float64> ] object at 0x7f5348736fd0>
t['d'] = t['a']
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-16-46490dc9bd03> in <module>()
----> 1 t['d'] = t['a']


~/projects/gtable/gtable/table.py in __setitem__(self, key, value)
    158
    159     def __setitem__(self, key, value):
--> 160         self.data[self.keys.index(key)] = value
    161
    162     def __delitem__(self, key):


ValueError: 'd' is not in list