Handling ASCII table files

An ASCII data table refers to a tabular data set where the values are encoded in ASCII and are written in a row-oriented format where each table row is (effectively) on one line. Frequently these data files are somewhat free format in nature and do not conform to precisely definition standards such as FITS.

ASCII data tables are a fact of life for scientists and you need to be able to read and write them effectively.

astropy.io.ascii

The astropy.io.ascii package provides methods for reading and writing a wide range of ASCII data table formats from basic character-separated tables to astronomy-specific or even application-specific formats.

This package isn't the fastest ASCII table reader/writer available, but it is one of the most flexible and most of the time it should just work out of the box. [About speed, there is a Google Summer of Code 2014 student working on making it fast!]

The following shows a few of the ASCII formats that are available, while the section on Supported formats contains the full list.

Format Description
Basic basic table with customizable delimiters and header configurations
CDS CDS format table (also Vizier and ApJ machine readable tables)
DAOphot table from the IRAF DAOphot package
Fixed width table with fixed-width columns (see also Fixed-width Gallery)
IPAC IPAC format table
LaTex LaTeX table with data in the tabular environment
SExtractor SExtractor format table

Choice of interface

There are two equally valid ways of reading and writing ASCII tables within astropy:

Internally there is no difference since they both end up calling the astropy.io.ascii routines. It depends somewhat on your taste:

Functional interface

>>> from astropy.io import ascii
>>> pets = ascii.read(data)
>>> ascii.write(pets, 'pets.tex', format='latex')  # format='<format_name>'

Table interface

>>> from astropy.table import Table
>>> pets = Table.read(data, format='ascii')  # format='ascii'
>>> pets.write('pets.tex', format='ascii.latex')  # format='ascii.<format_name>'

Overall we recommend using the Table interface, because this also lets you read and write many tables in FITS or VOTable format using the same syntax. Note that for ASCII tables there is absolutely no functional difference between the interfaces, while for FITS and VOTable the direct interfaces (astropy.io.fits and astropy.io.votable) have more functionality than the Table interface.

Reading tables

First example:

In [1]:
import numpy as np
from astropy.table import Table
In [2]:
lines = """animal        number
           "siamese cat"    1   
             dog            3   """

pets = Table.read(lines, format='ascii')
pets
Out[2]:
animalnumber
siamese cat1
dog3

Arguments to read()

The online documentation on Parameters for reading has more information.

table argument -- Show me the table

The required first argument for every table read call is table, which provides access to the actual table data. It can be provided several different formats:

  • Name of a file (string)
  • Single string containing all table lines separated by newlines
  • File-like object with a callable read() method
  • List of strings where each list element is a table line
format argument -- What's the format Kenneth?

This specifies the file format, with available options describe in Supported Formats.

  • Table.read(): specifying format='ascii' is the generic "try to guess the format" option, or format='ascii.<format_name>'.
  • ascii.read(): do not provide format for the generic "guess the format" option, or format='<format_name>'.
In [3]:
lines = [r'\begin{table}',
         r'\begin{tabular}{ l c r }',
         r'X & Y & Z \\',
         r'1 & 2 & 3 \\',
         r'4 & 5 & 6 \\',
         r'7 & 8 & 9 \\',
         r'\end{tabular}',
         r'\end{table}']
t = Table.read(lines, format='ascii.latex')
t
Out[3]:
XYZ
123
456
789
guess argument -- To guess or not to guess

The default behavior of ascii.read() is that it tries to guess the table format. This often works:

In [4]:
Table.read(['# a | b',  # Commented header contains column names
            '1 | 2'], 
           format='ascii')
Out[4]:
ab
12

The reader uses the Will it blend strategy in reverse: throw the input data into each format reader and see if something nice comes out.

  • Most of the time the reader gives garbage or an exception, but the first time a reader returns a nice looking table then guessing stops and that table is returned.
  • The order of input formats is carefully tuned to maximize the likelihood of guessing the right format.
  • The concept of what makes a "nice looking table" also relies on some assumptions that may not always be the case.

Guessing doesn't always work:

In [5]:
lines = ['year & event',
         '2015 & Pigs fly']
try:
    Table.read(lines, format='ascii')
except Exception as err:
    print(err)

ERROR: Unable to guess table format with the guesses listed below:
Reader:Basic fill_values: [('', '0')] strict_names: True
Reader:Rdb fill_values: [('', '0')] strict_names: True
Reader:Tab fill_values: [('', '0')] strict_names: True
Reader:Cds fill_values: [('', '0')] strict_names: True
Reader:Daophot fill_values: [('', '0')] strict_names: True
Reader:SExtractor fill_values: [('', '0')] strict_names: True
Reader:Ipac fill_values: [('', '0')] strict_names: True
Reader:Latex fill_values: [('', '0')] strict_names: True
Reader:AASTex fill_values: [('', '0')] strict_names: True
Reader:HTML fill_values: [('', '0')] strict_names: True
Reader:CommentedHeader delimiter: '|' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:CommentedHeader delimiter: '|' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:CommentedHeader delimiter: ',' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:CommentedHeader delimiter: ',' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:CommentedHeader delimiter: ' ' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:CommentedHeader delimiter: ' ' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:CommentedHeader delimiter: '\\s' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:CommentedHeader delimiter: '\\s' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:Basic delimiter: '|' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:Basic delimiter: '|' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:Basic delimiter: ',' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:Basic delimiter: ',' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:Basic delimiter: ' ' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:Basic delimiter: ' ' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:Basic delimiter: '\\s' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:Basic delimiter: '\\s' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:NoHeader delimiter: '|' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:NoHeader delimiter: '|' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:NoHeader delimiter: ',' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:NoHeader delimiter: ',' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:NoHeader delimiter: ' ' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:NoHeader delimiter: ' ' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:NoHeader delimiter: '\\s' fill_values: [('', '0')] quotechar: '"' strict_names: True
Reader:NoHeader delimiter: '\\s' fill_values: [('', '0')] quotechar: "'" strict_names: True
Reader:Basic fill_values: [('', '0')]
ERROR: Unable to guess table format with the guesses listed above.
Check the table and try with guess=False and appropriate arguments to read()


When this happens you need to give ascii.read a bit of help. The first step is to turn off guessing and provide what you think is the right format. In this case this is a character-delimited table (using &).

In [6]:
try:
    Table.read(lines, format='ascii.basic', guess=False)
except Exception as err:
    print(err)
Number of header columns (3) inconsistent with data columns (4) at data line 0
Header values: ['year', '&', 'event']
Data values: ['2015', '&', 'Pigs', 'fly']

In [7]:
Table.read(lines, format='ascii.basic', delimiter='&', guess=False)
Out[7]:
yearevent
2015Pigs fly

delimiter, comment, and quotechar -- Define table characteristics

These arguments mostly apply to the basic format:

  • delimiter: the characters <SPACE>, '|', ',', '\t', and '\s' (tab or space) are defaults.
  • comment: if this regular expression matches the beginning of a table line then that line will be discarded from header or data processing.
  • quotechar: This specifies the quote character and will typically be either the single or double quote character.
In [8]:
lines = ['a * b',  # Don't ever make a table like this!
         '^hello * world^ * 5',
         '! Comment line',
         'more * values']
Table.read(lines, format='ascii.basic', delimiter='*', comment='!', quotechar='^')
Out[8]:
ab
hello * world5
morevalues

header_start, data_start, and data_end -- Locate the header and data

The header is defined as the part of the table which contains column names and any other information such as column types and other metadata. Frequently the header is a single line of column names, but it maybe long or might not exist.

The data section is the part which has the data values.

  • header_start : This specifies in the line index where the header begins. Comment lines are not included in this count and the counting starts from 0 (first non-comment line has index=0). If set to None this indicates that there is no header line and the column names will be auto-generated. The default is dependent on the format.
  • data_start: This specifies in the line index where the data lines begin where the counting starts from 0 and does not include comment lines. The default is dependent on the format.
  • data_end: If this is not None then it allows for excluding lines at the end that are not valid data lines. A negative value means to count from the end, so -1 would exclude the last line, -2 the last two lines, and so on.
In [9]:
lines = ['Some irrelevant stuff here',    # Line 0
         '',                              #   Blank (only whitespace) => Not included in count
         '# Properly commented out line', #   Comment => Not included
         ' col0  col1',                   # Line 1 (header start)
         '',                              #   Blank
         'More junk from table creator',  # Line 2
         ' 10   20 ',                     # Line 3 (data start)
         '# comment line',                #   Comment
         ' 20   40 ',                     # Line 4 (-2, data end)
         '/ Why do people do this?',      # Line 5 (-1)
         '/ Just to make life hard.'      # Line 6 (-0)
    ]
In [10]:
Table.read(lines, format='ascii', header_start=1, data_start=3, data_end=-2, guess=False)
Out[10]:
col0col1
1020
2040

converters argument -- Give me the right types

Sometimes the automatic type inference doesn't do you wanted.

In [11]:
lines = """
char float
"1" 1
"2" 2
"3" 3
"4" 4
"""
In [12]:
t = Table.read(lines, format='ascii')
t.dtype
Out[12]:
dtype([('char', '<i8'), ('float', '<i8')])
In [13]:
from astropy.io import ascii
converters = {'char': [ascii.convert_numpy(np.str)],
              'float': [ascii.convert_numpy(np.float)]}
In [14]:
t = Table.read(lines, format='ascii', converters=converters)
t.dtype
Out[14]:
dtype([('char', 'S1'), ('float', '<f8')])

names, include_names, exclude_names -- What to call the columns and what you find interesting

Some tables arrive nameless, that is to say there is no definition of the column names within the table. When that is the case ascii.read() will automatically assign names like col0, col1 and so on. These aren't very descriptive and the names argument gives you a way to fix that on reading the table.

In [15]:
lines = """
1 2 3
4 5 6
"""
Table.read(lines, format='ascii', names=['a', 'b', 'c'])
Out[15]:
abc
123
456

Maybe someone has given you a table with tons of columns that are mostly uninteresting to you:

In [16]:
lines = """
a b c d e f g h i j k planet color
1 2 3 4 5 6 7 8 1 2 3 mars reddish
9 8 7 6 5 4 3 2 1 0 1 earth blue-ish
"""
Table.read(lines, format='ascii', include_names=['planet', 'color'])
Out[16]:
planetcolor
marsreddish
earthblue-ish

fill_values, fill_include_names, fill_exclude_names -- Handling missing or bad values

This can be used to fill missing values in the table or replace strings with special meaning. See the Bad or missing values section for more information and examples.

The default is that any blank table values are treated as missing.

Unfortunately there isn't time here for the joy and madness of fill values!

Writing tables

Once you understand reading tables then writing is pretty simple and comparitively boring. The online documentation on Parameters for write() describes the arguments, most of which should be familiar by now.

Instead of going through the parameters exhaustively we'll just show a few examples here.

In [17]:
lines = """
X Y Z
1 2.0022 'hello world'
-99 4.01234 'so long world'
"""
t = ascii.read(lines)  # For variety we show the functional interface
t['Y'].unit = 'm/s'
In [18]:
import sys
t.write(sys.stdout, format='ascii')
X Y Z
1 2.0022 "hello world"
-99 4.01234 "so long world"

In [19]:
t.write(sys.stdout, format='ascii', delimiter=',')  # Or just format='ascii.csv' in astropy 0.4 and later
X,Y,Z
1,2.0022,hello world
-99,4.01234,so long world

In [20]:
ascii.write(t, format='aastex')  # default functional interface output is sys.stdout
\begin{deluxetable}{ccc}
\tablehead{\colhead{X} & \colhead{Y} & \colhead{Z}}
\startdata
1 & 2.0022 & hello world \\
-99 & 4.01234 & so long world \\
\enddata
\end{deluxetable}

In [21]:
t.write(sys.stdout, format='ascii.ipac', formats={'Y': '%.4f'})
|   X|     Y|            Z|
|long|double|         char|
|    | m / s|             |
|null|  null|         null|
    1 2.0022   hello world 
  -99 4.0123 so long world 

In [22]:
t.write(sys.stdout, format='ascii.ipac', names=['A', 'B', 'C'])
|   A|      B|            C|
|long| double|         char|
|    |  m / s|             |
|null|   null|         null|
    1  2.0022   hello world 
  -99 4.01234 so long world 

In [23]:
t.write(sys.stdout, format='ascii.latex', fill_values=[('-99', '...')])
\begin{table}
\begin{tabular}{ccc}
X & Y & Z \\
1 & 2.0022 & hello world \\
... & 4.01234 & so long world \\
\end{tabular}
\end{table}

Practical exercises

Level 1-2: Basic reading

First familiarize yourself with the available supported formats:

  • Open up the Supported formats page and click through the documentation for each format.
  • Which formats document support for table metadata like column units, comments or additional descriptions?

Read the following files in the data/ascii/ sub-directory and show the contents for each one. Try to use the minimum effort and number of non-default options possible.

The pattern will be Table.read('data/ascii/<filename>', format='...', ...).

HINT: You can examine each file with %cat <filename> in IPython notebook (e.g. %cat data/ascii/basic.dat).

  • 'basic.dat'
  • 'ipac.dat' -- from NASA/IPAC Infrared Science Archive
  • 'daophot.dat.gz' -- Hint: unzipping is automatic
  • 'cds/lhs2065.dat' -- Hint: read the docs! Note that Vizier catalog results are in CDS format.
  • 'fixed_width.dat' -- Hint: check out the Fixed width gallery.

Level 2: Metadata

For the table from 'ipac.dat', there is table metadata that is in an ordered dictionary in the table meta attribute. If you read the table into a variable named dat then you would be looking at dat.meta.

  • Print the keys of dat.meta. What are they?
  • Print each of the keywords and their values from dat.meta, one per line in the format keyword: value.
  • For each column in the table print the column name and unit. HINT remember colnames.

Level 3: Harder reading

Read two slightly challenging tables:

  • hard1.dat -- This is part of a real data file produced by a Sybase SQL query.
  • hard2.dat -- Your neighbor asked you to help computerize his garden and handed you this data file to read. You'll need to study up on fill_values.

Level 1: Writing

Start from the table from 'ipac.dat' and make a new table that has just first 2 lines of the first 3 columns. This is done with:

>>> dat3 = dat['designation', 'ra', 'dec'][:2]

Find the output format output that will create each of the following three outputs from the table t3. You can use the pattern:

import sys
dat3.write(sys.stdout, format='...', ...)
Format 1
designation,ra,dec,sigra,sigdec
J095542.55+690421.2,148.9273263,69.0725621,0.0401,0.039
J095540.45+690439.1,148.9185759,69.0775361,0.0519,0.0494
Format 2
        designation          ra        dec
------------------- ----------- ----------
J095542.55+690421.2 148.9273263 69.0725621
J095540.45+690439.1 148.9185759 69.0775361
Format 3
\begin{table}
\begin{tabular}{ccc}
designation & ra & dec \\
J095542.55+690421.2 & 148.9273263 & 69.0725621 \\
J095540.45+690439.1 & 148.9185759 & 69.0775361 \\
\end{tabular}
\end{table}

Level 2: Writing IPAC

  • Now write the dat3 table back out in IPAC format. Does it look like the output has the same units and metadata as the original?