Gtable and Pandas
Let’s start by creating a table:
from gtable import Table
import numpy as np
import pandas as pd
t = Table()
t.a = np.random.rand(10)
t.b = pd.date_range('2000-01-01', freq='M', periods=10)
t.c = np.array([1, 2])
t.add_column('d', np.array([1, 2]), align='bottom')
You can create a column by assignment to an attribute. You can also use the add_column method if the default alignment is not the one you want. The usual representation of the table gives information about the actual length of each column and its type.
t
<Table[ a[10] <float64>, b[10] <object>, c[2] <int64>, d[2] <int64> ] object at 0x7f4f0eae68d0>
You can translate the table to a Pandas DataFrame by just calling the to_pandas method, and leverage the notebook's rich visualization of the DataFrame:
df = t.to_pandas()
df
|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.772970 | 2000-01-31 | 1.0 | NaN |
| 1 | 0.863153 | 2000-02-29 | 2.0 | NaN |
| 2 | 0.112185 | 2000-03-31 | NaN | NaN |
| 3 | 0.319948 | 2000-04-30 | NaN | NaN |
| 4 | 0.657329 | 2000-05-31 | NaN | NaN |
| 5 | 0.367910 | 2000-06-30 | NaN | NaN |
| 6 | 0.264345 | 2000-07-31 | NaN | NaN |
| 7 | 0.172011 | 2000-08-31 | NaN | NaN |
| 8 | 0.007853 | 2000-09-30 | NaN | 1.0 |
| 9 | 0.705190 | 2000-10-31 | NaN | 2.0 |
Now that we have the same data stored as a Table and as a DataFrame, let’s look at some of the differences between them. The first is that while the DataFrame has an index (in this case an integer that just keeps the order), the Table is simply a table, trivially indexed by the order of the records:
df.index.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Another important difference is how data is stored in each container:
df.c.values
array([ 1., 2., nan, nan, nan, nan, nan, nan, nan, nan])
t.c.values
array([1, 2])
While Pandas relies on NaN to store empty values, the Table uses a bitmap index to differentiate between a missing element and a NaN:
t.index
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=uint8)
The mechanism for tracking NAs is this bitmap index. A bitmap index has pros and cons, of course; one interesting pro is that computations with sparse data are significantly faster, while the data remains indexed.
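As a sketch of the idea (using plain numpy, not gtable's actual internals, which may differ), a dense NaN-backed column like the one Pandas stores can be reconstructed from the compact values plus one row of the bitmap:

```python
import numpy as np

# Stored data for column c and its bitmap row, as shown above.
values = np.array([1, 2])
bitmap = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)

# Rebuild the dense, NaN-padded view that Pandas would hold:
# start with all-missing, then scatter values where the bitmap is 1.
dense = np.full(bitmap.shape, np.nan)
dense[bitmap.astype(bool)] = values
# dense -> [1., 2., nan, nan, nan, nan, nan, nan, nan, nan]
```

The compact representation stores only two floats plus ten bits, while the dense one stores ten floats with no way to tell a missing value from a genuine NaN.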
The main benefit of the Table class is that both assignment and computation with sparse data are significantly faster: it operates on less data, and it does not have to deal with the index.
%%timeit
2*t['c']
1.63 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
2*df['c']
73.6 µs ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
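The timings above are consistent with how much data each container touches. As an illustration with plain numpy (not gtable itself), doubling the sparse column only visits the two stored values, while the NaN-backed dense column visits all ten slots, propagating NaN through the empty ones:

```python
import numpy as np

sparse_values = np.array([1, 2])                     # what the Table stores for c
dense_values = np.array([1.0, 2.0] + [np.nan] * 8)   # what the DataFrame stores

doubled_sparse = 2 * sparse_values   # arithmetic over 2 elements
doubled_dense = 2 * dense_values     # arithmetic over 10 elements; NaNs propagate
```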
The number of features of the DataFrame dwarfs those present in the Table, but that does not mean that the Table is completely feature-less, or that its features are slow. The Table allows filtering the data in a similar fashion to the DataFrame, with slightly better performance.
%%timeit
df[df.c>0]
474 µs ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df[df.c>0]
|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.772970 | 2000-01-31 | 1.0 | NaN |
| 1 | 0.863153 | 2000-02-29 | 2.0 | NaN |
%%timeit
t.filter(t.c > 0)
131 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
t.filter(t.c > 0).to_pandas()
|   | a | b | c |
|---|---|---|---|
| 0 | 0.772970 | 2000-01-31 | 1 |
| 1 | 0.863153 | 2000-02-29 | 2 |
Note that, since the filter leaves no results for the fourth column, the generated DataFrame omits that column.
One consequence of the Table’s indexing mechanism is that data cannot be accessed through the index: there is no such thing as the DataFrame’s iloc. However, if we extract the data of a column and assign a value to one of its existing items, we get the result we want:
t['c'][1] = 3
t.filter(t.c > 0).to_pandas()
|   | a | b | c |
|---|---|---|---|
| 0 | 0.772970 | 2000-01-31 | 1 |
| 1 | 0.863153 | 2000-02-29 | 3 |
But we cannot assign to an element that does not exist:
# t['c'][9] = 1  # raises IndexError
Since the data of that column only has two elements:
t['c']
array([1, 3])
Up to this point we have created the DataFrame from the Table, but we can also make the conversion the other way round:
t1 = Table.from_pandas(df)
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[10] <float64>, d[10] <float64> ] object at 0x7f4ee2ae1c18>
Note that some datatypes have changed, and the sparsity of the table is lost, since Pandas cannot distinguish between NA and NaN. Note also that another column, idx, has been added with the index information. If we already know that every NaN is in fact an NA, we can recover the sparse structure with
t1.dropnan()
t1
<Table[ idx[10] <int64>, a[10] <float64>, b[10] <datetime64[ns]>, c[2] <float64>, d[2] <float64> ] object at 0x7f4ee2ae1c18>
We can recover the types by casting the columns, which are plain numpy arrays. To restore the original columns we can also delete the index:
t1['c'] = t1['c'].astype(np.int64)
t1['d'] = t1['d'].astype(np.int64)
t1.del_column('idx')
t1
<Table[ a[10] <float64>, b[10] <datetime64[ns]>, c[2] <int64>, d[2] <int64> ] object at 0x7f4ee2ae1c18>