Gtable and Pandas

Let’s start by creating a table

from gtable import Table
import numpy as np
import pandas as pd
t = Table()
t.a = np.random.rand(10)
t.b = pd.date_range('2000-01-01', freq='M', periods=10)
t.c = np.array([1, 2])
t.add_column('d', np.array([1, 2]), align='bottom')

You can create a column by assigning to an attribute, or use the add_column method if the default alignment is not the one you want. The usual representation of the table reports the actual length of each column and its type.

t
<Table[ a[10] <float64>, b[10] <object>, c[2] <int64>, d[2] <int64> ] object at 0x7f4f0eae68d0>
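
Attribute assignment always uses the default alignment, so add_column is only needed when you want something else, like the bottom alignment of d above. A quick sketch on a throwaway table (assuming 'top' is the default alignment, as the placement of c suggests):

s = Table()
s.x = np.random.rand(5)
s.y = np.array([1, 2])                               # top-aligned by default
s.add_column('z', np.array([1, 2]), align='bottom')  # pinned to the last two rows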

You can translate the table into a Pandas DataFrame by just calling the to_pandas method, and leverage the DataFrame's excellent notebook visualization

df = t.to_pandas()
df
          a          b    c    d
0  0.772970 2000-01-31  1.0  NaN
1  0.863153 2000-02-29  2.0  NaN
2  0.112185 2000-03-31  NaN  NaN
3  0.319948 2000-04-30  NaN  NaN
4  0.657329 2000-05-31  NaN  NaN
5  0.367910 2000-06-30  NaN  NaN
6  0.264345 2000-07-31  NaN  NaN
7  0.172011 2000-08-31  NaN  NaN
8  0.007853 2000-09-30  NaN  1.0
9  0.705190 2000-10-31  NaN  2.0

Now that we have the same data stored as a Table and as a DataFrame, let's look at some of the differences between them. The first is that, while the DataFrame has an index (here just an integer keeping the order), the Table is simply indexed by the order of its records

df.index.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
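
Because the DataFrame index is a real labeling structure, it can be swapped for something other than positions; the Table has no counterpart for this. In plain Pandas, for contrast:

df2 = df.set_index('b')  # rows are now labeled by their dates
df2.loc['2000-03-31']    # label-based lookup, not positional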

Another important difference is how data is stored in each container.

df.c.values
array([  1.,   2.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
t.c.values
array([1, 2])
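
The footprint difference is easy to quantify on the arrays themselves; the sizes below follow directly from the dtypes in the repr

df.c.values.nbytes  # 80 bytes: ten dense float64 values
t.c.values.nbytes   # 16 bytes: only the two stored int64 values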

While Pandas relies on NaN to mark empty values, the Table uses a bitmap index, which differentiates between a missing element and an actual NaN

t.index
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=uint8)
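
Each row of the bitmap belongs to one column, in the order shown by the repr, and a 1 marks a position where that column actually stores a value. Densifying a column is then just a masked fill; here is a minimal numpy sketch of roughly what to_pandas has to do for c (assuming, as above, that row 2 of the bitmap corresponds to c):

bitmap_c = t.index[2]                      # the bitmap row for column 'c'
dense = np.full(bitmap_c.size, np.nan)     # start from an all-NaN dense array
dense[bitmap_c.astype(bool)] = t.c.values  # scatter the stored payload into place
dense
array([  1.,   2.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])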

This bitmap index is the mechanism for tracking NAs, and like any bitmap index it has pros and cons. One of the interesting pros is that computations with sparse data are significantly faster, while the data stays indexed.

The main benefit of the Table class is that both assignment and computation with sparse data are significantly faster: it operates on less data, and it does not have to deal with the index

%%timeit
2*t['c']
1.63 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
2*df['c']
73.6 µs ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
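
The asymmetry shows up in what each operation returns: the Table touches only the stored payload, while Pandas works over the full dense column and its index

2*t['c']   # array([2, 4]): only the two stored values are multiplied
2*df['c']  # a length-10 float64 Series, NaNs and index included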

The DataFrame's feature set dwarfs the Table's, but that does not mean that the Table is feature-less, or that its features are slow. The Table allows you to filter data in a similar fashion to the DataFrame, with slightly better performance.

%%timeit
df[df.c>0]
474 µs ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df[df.c>0]
          a          b    c    d
0  0.772970 2000-01-31  1.0  NaN
1  0.863153 2000-02-29  2.0  NaN
%%timeit
t.filter(t.c > 0)
131 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
t.filter(t.c > 0).to_pandas()
          a          b  c
0  0.772970 2000-01-31  1
1  0.863153 2000-02-29  2

Note that, since the filter produced no results for the fourth column, the generated DataFrame omits that column.
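
The symmetric filter makes the rule clearer: selecting on d keeps only the last two records, where c has no stored values, so this time c should be the column that drops out

t.filter(t.d > 0).to_pandas()  # columns a, b and d; c is omitted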

One consequence of the Table's indexing mechanism is that data cannot be accessed through the index: there is no such thing as the DataFrame's iloc. However, if we extract the data of a column and assign a value to one of its existing items, the table sees the change.

t['c'][1] = 3
t.filter(t.c > 0).to_pandas()
          a          b  c
0  0.772970 2000-01-31  1
1  0.863153 2000-02-29  3

But we cannot assign an element that does not exist

# t['c'][9] = 1  # would raise IndexError: index 9 is out of bounds for axis 0 with size 2

The data of that column holds only two elements

t['c']
array([1, 3])

Up to this point we have created the DataFrame from the Table, but we can also make the conversion the other way round

t1 = Table.from_pandas(df)
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[10] <float64>, d[10] <float64> ] object at 0x7f4ee2ae1c18>

See that some datatypes have changed and the sparsity of the table has been lost, since Pandas cannot distinguish between NA and NaN. Note also that an extra column has been added with the index information. If we already know that every NaN is in fact an NA, we can recover the sparse structure with

t1.dropnan()
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[2] <float64>, d[2] <float64> ] object at 0x7f4ee2ae1c18>

We can recover the original types by casting the columns, which are plain numpy arrays. To restore the original set of columns, we can also delete the index

t1['c'] = t1['c'].astype(np.int64)
t1['d'] = t1['d'].astype(np.int64)
t1.del_column('idx')
t1
<Table[ a[10] <float64>, b[10] <datetime64[ns]>, c[2] <int64>, d[2] <int64> ] object at 0x7f4ee2ae1c18>