Tables introduction

The astropy Table class provides an extension of NumPy structured arrays for storing and manipulating heterogeneous tables of data. A few notable features of this package are:

  • Initialize a table from a wide variety of input data structures and types.
  • Modify a table by adding or removing columns, changing column names, or adding new rows of data.
  • Handle tables containing missing values.
  • Include table and column metadata as flexible data structures.
  • Specify a description, units and output formatting for columns.
  • Perform operations like database joins, concatenation, and grouping.
  • Manipulate multidimensional columns.
  • Read and write Table objects to and from files.
  • Integrate with Astropy units and quantities.

Why not Pandas

The Pandas package provides a powerful, high-performance table object via the DataFrame class. Unfortunately, a few issues prevent its use as a generalized table object in astronomy. The most crucial is the lack of support for multidimensional table columns, which are commonly used in standard FITS data products, for instance the Chandra energy response matrix used to analyze spectral data.

Documentation

For more information about the features presented below, you can read the astropy.table docs.



Tutorial

In [1]:
from __future__ import print_function, division
from astropy.table import Table
import numpy as np

Creating tables

There is a great deal of flexibility in the way that a table can be initially constructed:

  • Read an existing table from a file or web URL
  • Add columns of data one by one
  • Add rows of data one by one
  • From an existing data structure in memory:
      • List of data columns
      • Dict of data columns
      • List of row dicts
      • NumPy homogeneous array or structured array
      • List of row records (new in 0.4)

See the documentation section on Constructing a table for the gory details and plenty of examples.

In [2]:
t = Table()
t['name'] = ['source 1', 'source 2', 'source 3', 'source 4']
t['flux'] = [1.2, 2.2, 3.1, 4.3]
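
The same table could also have been built in a single step from any of the structures listed above. For instance (a sketch, assuming a recent astropy; the values are the same as in the cell above):

t_from_dict = Table({'name': ['source 1', 'source 2', 'source 3', 'source 4'],
                     'flux': [1.2, 2.2, 3.1, 4.3]})   # dict of data columns
t_from_rows = Table(rows=[('source 1', 1.2), ('source 2', 2.2),
                          ('source 3', 3.1), ('source 4', 4.3)],
                    names=('name', 'flux'))           # list of row records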

Looking at your table

In the IPython notebook, showing a table produces a nice HTML representation of the table:

In [3]:
t
Out[3]:
  name   flux
-------- ----
source 1  1.2
source 2  2.2
source 3  3.1
source 4  4.3

If you did the same in a terminal session you get a different view that isn't as pretty but does give a bit more information about the table:

>>> t
<Table rows=4 names=('name','flux')>
array([('source 1', 1.2), ('source 2', 2.2), ('source 3', 3.1),
       ('source 4', 4.3)], 
      dtype=[('name', 'S8'), ('flux', '<f8')])

To get a plain view, which is the same in the notebook and the terminal, use print():

In [4]:
print(t)
  name   flux
-------- ----
source 1  1.2
source 2  2.2
source 3  3.1
source 4  4.3

Get the table column names and data types using the colnames and dtype properties:

In [5]:
t.colnames
Out[5]:
['name', 'flux']
In [6]:
t.dtype
Out[6]:
dtype([('name', 'S8'), ('flux', '<f8')])

Accessing parts of the table

We can access the columns and rows as for numpy structured arrays. Notice that the outputs are Column, Row or Table objects depending on the context.

In [7]:
t['flux']  # Flux column (notice meta attributes)
Out[7]:
<Column name='flux' unit=None format=None description=None>
array([ 1.2,  2.2,  3.1,  4.3])
In [8]:
t['flux'][1]  # Row 1 of flux column
Out[8]:
2.2000000000000002
In [9]:
t[1]  # Row object with row 1 values
Out[9]:
<Row 1 of table
 values=('source 2', 2.2)
 dtype=[('name', 'S8'), ('flux', '<f8')]>
In [10]:
t[1]['flux']  # Flux column of row 1
Out[10]:
2.2000000000000002
In [11]:
t[1:3]  # 2nd and 3rd rows in a new table
Out[11]:
  name   flux
-------- ----
source 2  2.2
source 3  3.1
In [12]:
t[[1, 3]]
Out[12]:
  name   flux
-------- ----
source 2  2.2
source 4  4.3

One of the most powerful concepts is using boolean selection masks to filter tables:

In [13]:
mask = t['flux'] > 3.0  # Define boolean mask for all flux values > 3
t[mask]  # Create a new table with only the "high flux" sources
Out[13]:
  name   flux
-------- ----
source 3  3.1
source 4  4.3
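
Masks built from different conditions can be combined elementwise with & and | (a quick sketch; numpy requires the parentheses around each comparison):

mask = (t['flux'] > 2.0) & (t['flux'] < 4.0)  # sources with 2 < flux < 4
t[mask]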

Modifying the table

Once the table exists with defined columns there are a number of ways to modify the table in place. These are fully documented in the section Modifying a Table.

To give a couple of simple examples, you can add rows with the add_row() method or add new columns using dict-style assignment:

In [14]:
t.add_row(('source 5', 10.1))  # Add a new source at the end
t['logflux'] = np.log10(t['flux'])  # Compute the log10 of the flux
t
Out[14]:
  name    flux     logflux
-------- ----- ---------------
source 1   1.2 0.0791812460476
source 2   2.2  0.342422680822
source 3   3.1  0.491361693834
source 4   4.3   0.63346845558
source 5  10.1   1.00432137378

Notice that the logflux column really has too many output digits given the precision of the input values. We can fix this by setting the format using normal Python formatting syntax:

In [15]:
t['flux'].format = '%.2f'
t['logflux'].format = '%.2f'
t
Out[15]:
  name    flux logflux
-------- ----- -------
source 1  1.20    0.08
source 2  2.20    0.34
source 3  3.10    0.49
source 4  4.30    0.63
source 5 10.10    1.00
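
Other in-place changes follow the same pattern; for example, columns can be renamed or removed (a sketch, not run on t above, so the session is unchanged):

t.rename_column('logflux', 'log_flux')  # rename an existing column
t.remove_column('log_flux')             # remove a column entirely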

Converting the table to numpy

Sometimes you may not want or be able to use a Table object and prefer to work with a plain numpy array. This is easily done by passing the table to the np.array() constructor.

This makes a copy of the data. If you have a huge table and don't want to waste memory, supply copy=False to the constructor, but be warned that changing the output numpy array will change the original table.

In [16]:
np.array(t)
Out[16]:
array([('source 1', 1.2, 0.07918124604762482),
       ('source 2', 2.2, 0.3424226808222063),
       ('source 3', 3.1, 0.4913616938342727),
       ('source 4', 4.3, 0.6334684555795865),
       ('source 5', 10.1, 1.0043213737826426)], 
      dtype=[('name', 'S8'), ('flux', '<f8'), ('logflux', '<f8')])
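
The zero-copy variant mentioned above looks like this (a sketch; any change to arr would also change t):

arr = np.array(t, copy=False)  # shares memory with the table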

Masked tables

Astropy tables support missing values via masking. Create a masked table by passing masked=True:

In [17]:
t2 = Table([['x', 'y', 'z'], 
            [1.1, 2.2, 3.3]],
           names=['name', 'value'],
           masked=True)
t2
Out[17]:
name value
---- -----
   x   1.1
   y   2.2
   z   3.3
In [18]:
t2['value'].mask = [False, True, False]
In [19]:
print(t2)
name value
---- -----
   x   1.1
   y    --
   z   3.3

In [20]:
t2['value'].fill_value = -99
print(t2.filled())
name value
---- -----
   x   1.1
   y -99.0
   z   3.3

High-level table operations

So far we've just worked with one table at a time and viewed that table as a monolithic entity. Astropy also supports high-level Table operations that manipulate multiple tables or view one table as a collection of sub-tables (groups).

Documentation       Description
------------------  --------------------------------------
Grouped operations  Group tables and columns by keys
Stack vertically    Concatenate input tables along rows
Stack horizontally  Concatenate input tables along columns
Join                Database-style join of two tables

Here we'll just introduce the join operation but go into more detail on the others in the exercises.
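
For a quick taste of the stacking operations, vstack() concatenates tables row-wise (a minimal sketch; vstack also lives in astropy.table, and the input tables must have consistent columns):

from astropy.table import vstack
vstack([t, t])  # t stacked on itself: twice the rows, same columns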

In [21]:
from astropy.table import join

Now recall our original table t:

In [22]:
t
Out[22]:
  name    flux logflux
-------- ----- -------
source 1  1.20    0.08
source 2  2.20    0.34
source 3  3.10    0.49
source 4  4.30    0.63
source 5 10.10    1.00

Now suppose that we obtained some additional flux values from a different reference, for a different but overlapping sample of sources:

In [23]:
t2 = Table()
t2['name'] = ['source 1', 'source 3', 'source 8']
t2['flux2'] = [1.4, 3.5, 8.6]

Now we can get a master table of flux measurements, joined by matching values in the name column. This includes every row from each of the two tables, which is known as an outer join:

In [24]:
t3 = join(t, t2, keys=['name'], join_type='outer')
print(t3)
  name    flux logflux flux2
-------- ----- ------- -----
source 1  1.20    0.08   1.4
source 2  2.20    0.34    --
source 3  3.10    0.49   3.5
source 4  4.30    0.63    --
source 5 10.10    1.00    --
source 8   nan     nan   8.6

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/ma/core.py:3847: UserWarning: Warning: converting a masked element to nan.
  warnings.warn("Warning: converting a masked element to nan.")

In [25]:
np.mean(t3['flux2'])
Out[25]:
4.5

Alternatively, we could choose to keep only rows where both tables had a valid measurement, using an inner join:

In [26]:
join(t, t2, keys=['name'], join_type='inner')
Out[26]:
  name   flux logflux flux2
-------- ---- ------- -----
source 1 1.20    0.08   1.4
source 3 3.10    0.49   3.5

Writing data

In [27]:
t3.write('test.fits', overwrite=True)
In [28]:
t3.write('test.vot', format='votable', overwrite=True)
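
The same write() method supports many other formats via the unified I/O registry; for instance a CSV file (a sketch, assuming a recent astropy):

t3.write('test.csv', format='ascii.csv', overwrite=True)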

Reading data

You can read data using the Table.read() method:

In [29]:
t4 = Table.read('test.fits')
t4
Out[29]:
  name    flux     logflux     flux2
-------- ----- --------------- -----
source 1   1.2 0.0791812460476   1.4
source 2   2.2  0.342422680822   nan
source 3   3.1  0.491361693834   3.5
source 4   4.3   0.63346845558   nan
source 5  10.1   1.00432137378   nan
source 8   nan             nan   8.6

Practical Exercises

In these exercises you will work with data from the Chandra Source Catalog. In particular we will analyze the properties of sources from a series of 21 Chandra Deep Field South observations taken between 2000 and 2007.

The data sets used in these exercises were generated using the code in the supplemental/chandra_deep_field_south notebook to query the CSC via the web API. You can try this yourself (but maybe not all at once!) if you have the requests module installed. More information on querying the CSC can be found in the command line interface documentation.

Read the data

To start with, read in the two data files representing the master source list and observations source list. The fields for the two tables are documented in the sections below.

In [30]:
master_sources = Table.read('data/cdfs_master_sources.fits')
obs_sources = Table.read('data/cdfs_obs_sources.fits')

master_sources

Each distinct X-ray source identified on the sky is represented in the catalog by a single "master source" entry and one or more "source observation" entries, one for each observation in which the source has been detected. The master source entry records the best estimates of the properties of a source, based on the data extracted from the set of observations in which the source has been detected. The subset of fields in our exercise table file is:

Name  Description
----  -----------
msid  Master source ID
name  Source name in the Chandra catalog
ra    Source RA (deg)
dec   Source Dec (deg)

obs_sources

The individual source entries record all of the properties about a detection extracted from a single observation, as well as associated file-based data products, which are observation-specific. The subset of fields in our exercise table file is:

Name             Description
---------------  -----------
obsid            Observation ID
obi              Observation interval
targname         Target name
gti_obs          Observation date
flux_aper_b      Broad band (0.5 - 7 keV) flux (erg/cm2/sec)
src_cnts_aper_b  Broad band source counts
ra_b             Source RA (deg)
dec_b            Source Dec (deg)
livetime         Observation duration (sec)
posid            Position ID
theta            Off-axis angle (arcmin)
msid             Master source ID

Level 1

Exploring the tables

Do the following to explore the two tables:

  • Display the data for each table in the IPython notebook using the normal way of showing the value of a variable.
  • Get a list of the column names for each table. Hint: use <TAB> completion to easily discover all the attributes and methods, e.g. type master_sources. and then hit the <TAB> key.
  • Find the length of each table.
  • Find the column datatypes for each table.

Normally one displays a table in the IPython notebook by entering the variable name in a cell and pressing shift-Enter. In a terminal session the default method is something like print(my_table). In both cases the Table object prefers to display only a screenful of data to prevent having a zillion lines of output if the table is huge. If you really want to see all the data, use the Table.pprint() method; its documentation describes the relevant arguments, and a short sketch follows the list below.

  • Display all the rows of the master_sources table using its pprint() method.
  • If you are working in a regular terminal window (not IPython notebook), try the more() method as well.
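
For reference, showing every row can look like this (a sketch; max_lines=-1 disables the row limit):

master_sources.pprint(max_lines=-1)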

Modifying tables

For our analysis we don't actually need the obi (observation interval) column in the obs_sources table.

  • Remove the obi column from the obs_sources table.

The gti_obs column name is a bit obscure (GTI is a good time interval, FWIW).

  • Rename the gti_obs column to obs_date.

It would be nice to have a count rate in addition to the source counts.

  • Add a new column src_rate_aper_b which is the source counts divided by observation duration in sec.

Note that some of the sources have a negative net flux in the broad band.
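
One possible approach to the three modifications above (a sketch using the column names from the field table):

obs_sources.remove_column('obi')                  # drop the unneeded column
obs_sources.rename_column('gti_obs', 'obs_date')  # clearer name
obs_sources['src_rate_aper_b'] = (obs_sources['src_cnts_aper_b'] /
                                  obs_sources['livetime'])  # counts per sec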

Level 2

Looking at the observation source data

For each source detected in an individual observation (in the obs_sources table), let's look at the source flux values.

  • Use the matplotlib hist() function to make a histogram of the source fluxes. Since the fluxes vary by orders of magnitude, use numpy.log10 to put the fluxes in log space (a sketch follows this list).

HINT: if you did not start the notebook with --pylab=inline, then try:

import matplotlib.pyplot as plt
%matplotlib inline
  • Now make the same plot but using only sources within 4 arcmin of the center. HINT: use a boolean mask to select values of theta that are less than 4.0.
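
One possible approach to the histogram (a sketch; flux_aper_b is the broad-band flux column, and non-positive fluxes are dropped first):

import matplotlib.pyplot as plt
flux = obs_sources['flux_aper_b']
ok = flux > 0  # log10 is undefined for the negative net fluxes noted earlier
plt.hist(np.log10(flux[ok]), bins=50)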

Join the master_sources and obs_sources tables

The master_sources and obs_sources tables share a common msid column. What we now want is to join the master RA and Dec positions and master source names with the individual observations table.

  • Use the table.join() function to make a single table called sources that has the master RA, Dec, and name included for each observation source.

HINT: the defaults for keys and join_type='inner' are correct in this case, so the simplest possible call to join() will work (see the sketch after this list)!

  • Intermediate: Is the length of the new sources the same as obs_sources? What happened?

  • Advanced: Make a scatter plot of the RA (x-axis) and Dec (y-axis) difference between the master source position and the observation source position. You'll need to use coordinates!
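
The basic join can be as simple as this (a sketch; with no keys argument, join() uses the columns common to both tables, here msid):

sources = join(master_sources, obs_sources)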

Level 3

Grouped properties of sources

Finally, we can look at the variability properties of sources in the CDFS using the group_by() functionality.

This method makes a new table in which all the sources with identical master ID are next to each other.

  • Make a new table g_sources which is the sources table grouped by the msid key using the group_by() method.

The g_sources table is just a regular table with all the sources in a particular order. The attribute g_sources.groups is an object that provides access to the msid sub-groups. You can access the i-th group with g_sources.groups[i].

In addition, the g_sources.groups.indices attribute is an array with the indices of the group boundaries; a sketch follows the exercises below.

  • Using np.diff() find the number of repeat observations of each master source. HINT: use the indices, Luke.
  • Print the 50th group and note which columns are the same for all group members and which are different. Does this make sense? In these few observations how many different target names were provided by observers?
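
A sketch of the grouping steps (assuming the sources table from the join above):

g_sources = sources.group_by('msid')       # rows with the same msid adjacent
n_obs = np.diff(g_sources.groups.indices)  # observations per master source
print(g_sources.groups[50])                # inspect the 50th group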

Aggregation

The real power of grouping comes in the ability to create aggregate values for each of the groups, for instance the mean flux for each unique source. This is done with the aggregate() method, which takes a function reference as its input. This function must take as input an array of values and return a single value.

Aggregate returns a new table that has a length equal to the number of groups; a sketch follows the exercises below.

  • Compute the mean of all columns for each unique source (i.e. each group) using aggregate and the np.mean function. Call this table g_sources_mean.
  • Notice that aggregation cannot form a mean for certain columns and these are dropped from the output. Use the join() function to restore the master_sources information to g_sources_mean.
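
A sketch of the aggregation steps (assuming g_sources from Level 3 above):

g_sources_mean = g_sources.groups.aggregate(np.mean)  # one row per group
g_sources_mean = join(g_sources_mean, master_sources,
                      keys='msid')  # restore the dropped master info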