Gtable and Pandas

Let’s start by creating a table

from gtable import Table
import numpy as np
import pandas as pd
t = Table()
t.a = np.random.rand(10)
t.b = pd.date_range('2000-01-01', freq='M', periods=10)
t.c = np.array([1, 2])
t.add_column('d', np.array([1, 2]), align='bottom')

You can create a column by assigning to an attribute, or use the add_column method if the default alignment is not the one you want. The usual representation of the table reports the actual length of each column and its type.

t
<Table[ a[10] <float64>, b[10] <object>, c[2] <int64>, d[2] <int64> ] object at 0x7f4f0eae68d0>
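
Attribute assignment always uses the default alignment, so add_column is only needed when you want something else, like the bottom alignment of d above. A quick sketch on a throwaway table (assuming 'top' is the default alignment, as the placement of c suggests):

s = Table()
s.x = np.random.rand(5)
s.y = np.array([1, 2])                               # top-aligned by default
s.add_column('z', np.array([1, 2]), align='bottom')  # pinned to the last two rows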

You can translate the table into a Pandas DataFrame by just calling the to_pandas method, and leverage the DataFrame's excellent notebook visualization

df = t.to_pandas()
df
          a          b    c    d
0  0.772970 2000-01-31  1.0  NaN
1  0.863153 2000-02-29  2.0  NaN
2  0.112185 2000-03-31  NaN  NaN
3  0.319948 2000-04-30  NaN  NaN
4  0.657329 2000-05-31  NaN  NaN
5  0.367910 2000-06-30  NaN  NaN
6  0.264345 2000-07-31  NaN  NaN
7  0.172011 2000-08-31  NaN  NaN
8  0.007853 2000-09-30  NaN  1.0
9  0.705190 2000-10-31  NaN  2.0

Now that we have the same data stored as a Table and as a DataFrame, let's look at some of the differences between them. The first is that, while the DataFrame has an index (here just an integer keeping the order), the Table is simply indexed by the order of its records

df.index.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
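
Because the DataFrame index is a real labeling structure, it can be swapped for something other than positions; the Table has no counterpart for this. In plain Pandas, for contrast:

df2 = df.set_index('b')  # rows are now labeled by their dates
df2.loc['2000-03-31']    # label-based lookup, not positional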

Another important difference is how data is stored in each container.

df.c.values
array([  1.,   2.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
t.c.values
array([1, 2])
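
The footprint difference is easy to quantify on the arrays themselves; the sizes below follow directly from the dtypes in the repr

df.c.values.nbytes  # 80 bytes: ten dense float64 values
t.c.values.nbytes   # 16 bytes: only the two stored int64 values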

While Pandas relies on NaN to mark empty values, the Table uses a bitmap index, which differentiates between a missing element and an actual NaN

t.index
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=uint8)
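
Each row of the bitmap belongs to one column, in the order shown by the repr, and a 1 marks a position where that column actually stores a value. Densifying a column is then just a masked fill; here is a minimal numpy sketch of roughly what to_pandas has to do for c (assuming, as above, that row 2 of the bitmap corresponds to c):

bitmap_c = t.index[2]                      # the bitmap row for column 'c'
dense = np.full(bitmap_c.size, np.nan)     # start from an all-NaN dense array
dense[bitmap_c.astype(bool)] = t.c.values  # scatter the stored payload into place
dense
array([  1.,   2.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])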

This bitmap index is the mechanism for tracking NAs, and like any bitmap index it has pros and cons. One of the interesting pros is that computations with sparse data are significantly faster, while the data stays indexed.

The main benefit of the Table class is that both assignment and computation with sparse data are significantly faster: it operates on less data, and it does not have to deal with the index

%%timeit
2*t['c']
1.63 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
2*df['c']
73.6 µs ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
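
The asymmetry shows up in what each operation returns: the Table touches only the stored payload, while Pandas works over the full dense column and its index

2*t['c']   # array([2, 4]): only the two stored values are multiplied
2*df['c']  # a length-10 float64 Series, NaNs and index included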

The DataFrame's feature set dwarfs the Table's, but that does not mean that the Table is feature-less, or that its features are slow. The Table allows you to filter data in a similar fashion to the DataFrame, with slightly better performance.

%%timeit
df[df.c>0]
474 µs ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df[df.c>0]
          a          b    c    d
0  0.772970 2000-01-31  1.0  NaN
1  0.863153 2000-02-29  2.0  NaN
%%timeit
t.filter(t.c > 0)
131 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
t.filter(t.c > 0).to_pandas()
          a          b  c
0  0.772970 2000-01-31  1
1  0.863153 2000-02-29  2

Note that, since the filter produced no results for the fourth column, the generated DataFrame omits that column.
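
The symmetric filter makes the rule clearer: selecting on d keeps only the last two records, where c has no stored values, so this time c should be the column that drops out

t.filter(t.d > 0).to_pandas()  # columns a, b and d; c is omitted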

One consequence of the Table's indexing mechanism is that data cannot be accessed through the index: there is no such thing as the DataFrame's iloc. However, if we extract the data of a column and assign a value to one of its existing items, the table sees the change.

t['c'][1] = 3
t.filter(t.c > 0).to_pandas()
          a          b  c
0  0.772970 2000-01-31  1
1  0.863153 2000-02-29  3

But we cannot assign an element that does not exist

# t['c'][9] = 1  # would raise IndexError: index 9 is out of bounds for axis 0 with size 2

The data of that column holds only two elements

t['c']
array([1, 3])

Up to this point we have created the DataFrame from the Table, but we can also make the conversion the other way round

t1 = Table.from_pandas(df)
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[10] <float64>, d[10] <float64> ] object at 0x7f4ee2ae1c18>

See that some datatypes have changed and the sparsity of the table has been lost, since Pandas cannot distinguish between NA and NaN. Note also that an extra column has been added with the index information. If we already know that every NaN is in fact an NA, we can recover the sparse structure with

t1.dropnan()
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[2] <float64>, d[2] <float64> ] object at 0x7f4ee2ae1c18>

We can recover the original types by casting the columns, which are plain numpy arrays. To restore the original set of columns, we can also delete the index

t1['c'] = t1['c'].astype(np.int64)
t1['d'] = t1['d'].astype(np.int64)
t1.del_column('idx')
t1
<Table[ a[10] <float64>, b[10] <datetime64[ns]>, c[2] <int64>, d[2] <int64> ] object at 0x7f4ee2ae1c18>