Gadfly: SQL Relational Database in Python
=========================================

:Version: $Revision: 1.2 $


.. contents::

What is it?
-----------

Gadfly is a collection of Python modules that provides relational
database functionality implemented entirely in Python. It supports a
subset of the intergalactic standard RDBMS query language, SQL.

One of the most compelling aspects of Gadfly is that it runs wherever
Python runs, and it supports client/server operation on any platform
with the standard Python socket interface. Even the file formats used
by Gadfly for storage are cross-platform: a Gadfly database directory
can be moved from Win95 to Linux with a binary copy and Gadfly will
read and run the database.

It  supports  persistent  databases  consisting  of  a  collection  of
structured  tables  with  indices,  and  a  large  subset  of  SQL for
accessing and modifying those tables. It supports a log based recovery
protocol  which  allows  committed  operations  of  a  database  to be
recovered  even  if  the database was not shut down in a proper manner
(ie, in the event of a CPU or software crash, [but not in the event of
a  disk  crash]).  It  also supports a TCP/IP Client/Server mode where
remote  clients  can  access  a  Gadfly database over a TCP/IP network
(such as the Internet) subject to configurable security mechanisms.

Because it lacks (at this time) true concurrency control and
file-system based indexing, it is not appropriate for very large
multiprocess transaction-based systems.

Since Gadfly depends intimately on the kwParsing package, it is
distributed as part of the kwParsing package, under the same generous
copyright.

Why?
----

Gadfly gives Python programs a convenient way to store, retrieve and
query tabular data without having to rely on any external database
engine or package. That is, Gadfly provides a simple, easy, and
relatively efficient in-memory relational database style engine for
Python programs, complete with a notion of a "committed, recoverable
transaction" and "aborts".

Looking at the /etc directory on Unix, at the Registry under Win32, or
at the buzzillions of configuration files one finds sitting around
file systems, it becomes apparent that modern programs depend heavily
on tabular data. Furthermore, as memory prices continue to drop and
inexpensive machines gain larger and larger memory capacities, it is
clear that more and more database-style work can be done on largish
data sets in memory. Hence a simple in-memory SQL implementation like
Gadfly may be useful for serious work.

Gadfly  uses  relational  style  representations  and  the  SQL  query
language primarily because these are widely understood and familiar to
many  programmers.  SQL  can't  do everything, but it is successful in
part  because  it  can  do  a lot of important things easily and well.
(Python can do everything else...)

Use
---

The  main  "gadfly"  module  attempts  to faithfully adhere to Greg
Stein's  Python Database API, as discussed and certified by the Python
DB-SIG. That said, some parts of the API that I didn't really
understand are not implemented yet. Please look at gadfly.py to
determine exactly which parts of the interface are implemented or
stubbed.

Concurrent  database  updates  are  not supported. The "databases" are
currently designed to be written/modified by one process in isolation.
Multiple processes can access a Gadfly database when accesses are
arbitrated by a TCP/IP Gadfly server process.

Creating a new database
~~~~~~~~~~~~~~~~~~~~~~~

Unlike most Python/database-engine interfaces, you must create a
Gadfly database using Python itself (whereas with Oracle, for example,
you'd use other tools). To accomplish this, use::

  import gadfly
  connection = gadfly.gadfly()

with no arguments, and then start up a database using the startup
method::

  connection.startup("mydatabase", "mydirectory")

Here  "mydirectory"  must be a directory which exists and which can be
written  to  in  order  to  store the database files. The startup will
create  some  files  in  "mydirectory".  This  will have the effect of
clobbering  any  existing  Gadfly  database called "mydatabase" in the
directory  "mydirectory". Gadfly will prevent you from starting up the
same connection twice, however.

Note  that  the  first  "import  gadfly" reads in and initializes some
rather  large  data structures used for parsing SQL, and thus may take
longer than other module imports.

Now  with  your new database you can create tables, populate them, and
commit the result when you are happy::

  cursor = connection.cursor()
  cursor.execute("create table ph (nm varchar, ph varchar)")
  cursor.execute("insert into ph(nm, ph) values ('arw', '3367')")
  cursor.execute("select * from ph")
  for x in cursor.fetchall():
    print x
  # prints ('arw', '3367')
  connection.commit()

Reconnecting to an existing database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once a database exists you can reconnect to it as follows::

  import gadfly
  connection = gadfly.gadfly("mydatabase", "mydirectory")

This will read in the database tables with the most recently committed
values. The initialized database may now be queried and updated::

  cursor = connection.cursor()
  cursor.execute("update ph set nm='aaron' where nm='arw'")
  cursor.execute("select * from ph")
  for x in cursor.fetchall():
    print x
  # prints ('aaron', '3367')

If  you  do  not  wish  to commit updates you may simply not execute a
commit  on the connection object (which writes out the tables). If you
wish to restore the old values from the existing database use::

  connection.abort()

Updates  are  only  stored  upon  a connection.commit(). [Actually, if
autocheckpoint  is disabled, updates are only stored to table files on
checkpoint -- see the documentation on the recovery mechanism.]
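
For example, an uncommitted update can be thrown away with abort (a
sketch, assuming the "ph" table created earlier)::

  cursor = connection.cursor()
  cursor.execute("update ph set ph='9999' where nm='aaron'")
  connection.abort()   # discard the uncommitted update
  cursor.execute("select * from ph")
  print cursor.pp()    # shows the last committed values, not '9999'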

Use::

  print cursor.pp()

to  "pretty  print"  the result of any evaluation (which might be None
for a non-select).

Features
--------

In  this  version all tables are read into memory upon "connecting" to
the  database  and  "touched"  tables are written out upon checkpoint.
Each  table  is  represented  as  a  separate  file in the destination
directory,  and  there  is a "data definition" file as well (a list of
data definition declarations). During active use a log file appears in
the active directory as well, and if the process crashes this log file
is used to recover committed operations.

The SELECT statement
~~~~~~~~~~~~~~~~~~~~

At  this  point  Gadfly  supports  quite  a  lot  of the SQL semantics
requested in the ODBC 2.0 specification. Please see the SQL constructs
page  for  a  more  detailed  presentation.  SQL  statements supported
include the SELECT::

  SELECT [DISTINCT|ALL] expressions or *
  FROM tables
  [WHERE condition]
  [GROUP BY group-expressions]
  [HAVING aggregate-condition]
  [union-clause]
  [ORDER BY columns]

This statement is quite powerful. It reads intuitively as follows:

1) Make all combinations of rows from the tables (FROM line)
2) Eliminate those combinations not satisfying condition (WHERE line)
3) (if GROUP present) form aggregate groups that match on group-expressions
4) (if HAVING present) eliminate aggregate groups that don't satisfy
   the aggregate-condition.
5) compute the columns to keep (SELECT line).
6) (if union-clause present) combine (union, difference, intersect)
   the result with the result of another select statement.
7) if DISTINCT, throw out redundant entries.
8) (if ORDER present) order the result by the columns (ascending
   or descending as specified, with precedence as listed).

The actual implementation in Gadfly is far more efficient than this
intuitive reading suggests, particularly at steps 1 and 2 (which are
combined via optimizing transformations and hash join algorithms).
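
For example, the following statement (a sketch against the "frequents"
table used later in this document) exercises most of these clauses::

  SELECT bar, SUM(perweek)
  FROM frequents
  WHERE perweek > 0
  GROUP BY bar
  HAVING SUM(perweek) > 2
  ORDER BY bar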

Conditions  may  include  equalities, and inequalities of expressions.
Conditions  may  also  be  combined  using  AND,  OR, NOT. Expressions
include  column  names,  constants, and standard arithmetic operations
over them.

Embedded  queries  supported  include  subquery  expressions,  expr IN
(subselect),   quantified  comparisons,  and  the  EXISTS  (subselect)
predicate.
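
Hedged sketches of each of these forms, using the drinker/bar tables
from the test suite::

  select * from frequents
  where drinker in (select drinker from likes)

  select * from frequents
  where perweek > any (select perweek from frequents where bar='cheers')

  select * from likes
  where exists (select * from frequents
                where frequents.drinker = likes.drinker)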

Aggregate  tests  and  computations  can  only  be  applied  after the
GROUPing  and before the columns are selected (steps 3,4,5). Aggregate
operations   include   COUNT(*),  COUNT(expression),  AVG(expression),
SUM(expression),    MAX(expression),    MIN(expression),    and    the
non-standard MEDIAN(expression). These may be applied to DISTINCT
values (throwing out redundancies, as in COUNT(DISTINCT drinker)). If
no GROUPing is present the aggregate computations apply to the entire
result after step 2.
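
For example (sketches against the "frequents" table)::

  select avg(perweek) from frequents

  select drinker, count(distinct bar), sum(perweek)
  from frequents
  group by drinker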

There  is much more to know about the SELECT statement. The test suite
``test/test_gadfly.py`` gives numerous examples of SELECT statements.

Please  examine  gadfly/grammar.py  for a precise definition of the supported
syntax.  Please  find any of the 500 books on SQL for a description of
the  meaning of these constructs. Please inform me if any of them give
the wrong result when executed in Gadfly!

Table creation and "data types"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create tables using the CREATE TABLE statement::

  CREATE TABLE name (colname datatype [, colname datatype...])

Data types currently "supported" are integer, float, and varchar. They
are ignored by the implementation; anything that is hashable and
marshallable can currently go in any column (but that is likely to
change). For example::

  create table frequents
  (drinker varchar,
  bar varchar,
  perweek integer)

At present you can put tuples, complex numbers, or anything else into
a column specified as "varchar". Don't count on that always being
true, please.

Other supported statements
~~~~~~~~~~~~~~~~~~~~~~~~~~

Gadfly also supports the searched DELETE and UPDATE; INSERT VALUES and
INSERT  subselect;  CREATE/DROP  INDEX, and DROP TABLE. These have the
informal syntax::

  DELETE FROM table WHERE condition

  UPDATE table SET col=expr [, col=expr...] WHERE condition

  INSERT INTO table [(column [, column...])] values (value [, value...])

  INSERT INTO table [(column [, column...])] subselect

  CREATE [UNIQUE] INDEX name ON table (column [, column...])

  DROP TABLE table

  DROP INDEX name

Again, see the test/test_gadfly.py file for examples like::

  delete from templikes where be='rollingrock'

  update templikes set dr='norman' where dr='norm'

  insert into ph(nm,ph) values ('nan', '0356')

  insert into templikes(dr, be)
  select drinker, beer from likes

  create index sbb on serves (beer, bar)

  drop table templikes

  drop index tdindex

Multiple  statements  may  be  executed  in  one  cursor.execute(S) by
separating  the  statements  with semicolons in S, for example S might
have the string value::

  drop index tdindex;
  drop table templikes

(no final semicolon please!).
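
A sketch of issuing such a compound statement from Python::

  cursor.execute("drop index tdindex; drop table templikes")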

Please see test/test_gadfly.py for examples of most of these. Remember
that SQL is case insensitive (capitalization of keywords doesn't
matter). Please see gadfly/grammar.py for a precise definition of all
supported constructs.

Dynamic Values
~~~~~~~~~~~~~~

Expressions  also  include  the special expression '?' (the ODBC-style
dynamic expression) as in::

  insertstat = "insert into ph(nm,ph) values (?, ?)"
  cursor.execute(insertstat, ('nan', "0356"))
  cursor.execute(insertstat, ('bill', "2356"))
  cursor.execute(insertstat, ('tom', "4356"))

Dynamic values allow the cursor to use the same parsed expression many
times for a similar operation. Above, insertstat is parsed and bound
to the database only once. Using dynamic attributes should speed up
accesses, so the above should run much faster than the equivalent::

  cursor.execute("insert into ph(nm,ph) values ('nan', '0356')")
  cursor.execute("insert into ph(nm,ph) values ('bill', '2356')")
  cursor.execute("insert into ph(nm,ph) values ('tom', '4356')")

Dynamic attributes can appear in other statements containing
expressions as well, such as SELECTs, UPDATEs, and DELETEs.

For  SELECT,  UPDATE,  and DELETE the dynamic expression substitutions
must consist of a single tuple, as in::

  stat = "select * from ph where nm=?"
  cursor.execute(stat, ("nan",))
  ...
  cursor.execute(stat, ("bob",))
  ...

Since  the  dynamic  substitution  eliminates the need for parsing and
binding  (expensive  operations!) the above should run faster than the
equivalent::

  cursor.execute("select * from ph where nm='nan'")
  ...
  cursor.execute("select * from ph where nm='bob'")
  ...

If  you  repeat several similar queries multiple times, associate each
query  "template  string"  with  a  unique  cursor object so that each
template must be parsed and bound only once. Note that some relatively
complex queries from the test suite run 2 to 3 times faster after they
have  been  parsed and bound, even without the kjbuckets builtin. With
kjbuckets the same ran 5 to 10 times faster.
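
A minimal sketch of that pattern (the cursor and template names here
are illustrative)::

  insert_cursor = connection.cursor()
  select_cursor = connection.cursor()
  insert_template = "insert into ph(nm,ph) values (?, ?)"
  select_template = "select * from ph where nm=?"
  for (name, number) in [('nan', '0356'), ('bill', '2356')]:
    # each cursor re-uses its own parsed and bound template
    insert_cursor.execute(insert_template, (name, number))
    select_cursor.execute(select_template, (name,))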

Multiple Batch Inserts and Dynamic Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For  the  special  case of INSERT VALUES a list of substitution tuples
allows  the  query  engine  to  perform the inserts in optimized batch
mode.  Thus the fastest way to perform the three inserts given earlier
is::

  data = [('nan', "0356"), ('bill', "2356"), ('tom', "4356")]
  stat = "insert into ph(nm,ph) values (?, ?)"
  cursor.execute(stat, data)

...and  it  would be even faster if the cursor had previously executed
the  stat  with different data (since then no parsing or binding would
occur).

Introspection
~~~~~~~~~~~~~

By default a gadfly database now automatically includes
"introspective" tables which allow a gadfly query to "query the shape
of the database" -- for example to view the names of tables and the
names of columns in tables::

  >>> g = gadfly()
  >>> g.startup("dbtest", "dbtest")
  >>> c = g.cursor()
  >>> c.execute("select * from __table_names__")
  >>> print c.pp()
  IS_VIEW | TABLE_NAME
  =========================
  1       | __TABLE_NAMES__
  1       | DUAL
  1       | __DATADEFS__
  1       | __COLUMNS__
  1       | __INDICES__
  1       | __INDEXCOLS__

Here DUAL is a standard one row/one column test table (from the Oracle
tradition) and the other tables provide information about the database
schema::

  >>> c.execute("create table t1 (a varchar, b varchar)")
  >>> c.execute("create table t2 (b varchar, c varchar)")
  >>> c.execute("create unique index t1a on t1(a)")
  >>> c.execute("create index t1b on t1(b)")

  >>> c.execute("select * from __table_names__")
  >>> print c.pp()
  IS_VIEW | TABLE_NAME
  =========================
  0       | T1
  1       | __DATADEFS__
  1       | __INDICES__
  0       | T2
  1       | __TABLE_NAMES__
  1       | __COLUMNS__
  1       | DUAL
  1       | __INDEXCOLS__

  >>> c.execute("select * from __columns__")
  >>> print c.pp()
  COLUMN_NAME | TABLE_NAME
  =============================
  A           | T1
  B           | T1
  NAME        | __DATADEFS__
  DEFN        | __DATADEFS__
  INDEX_NAME  | __INDICES__
  TABLE_NAME  | __INDICES__
  IS_UNIQUE   | __INDICES__
  TABLE_NAME  | __TABLE_NAMES__
  IS_VIEW     | __TABLE_NAMES__
  B           | T2
  C           | T2
  COLUMN1     | DUAL
  TABLE_NAME  | __COLUMNS__
  COLUMN_NAME | __COLUMNS__
  INDEX_NAME  | __INDEXCOLS__
  COLUMN_NAME | __INDEXCOLS__

  >>> c.execute("select * from __indices__")
  >>> print c.pp()
  IS_UNIQUE | TABLE_NAME | INDEX_NAME
  ===================================
  0         | T1         | T1B
  1         | T1         | T1A

  >>> c.execute("select * from __indexcols__")
  >>> print c.pp()
  COLUMN_NAME | INDEX_NAME
  ========================
  B           | T1B
  A           | T1A

  >>> c.execute("select * from dual")
  >>> print c.pp()
  COLUMN1
  =======
  0
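
The introspective tables can be joined like ordinary tables. For
example, a sketch pairing each index with the columns it covers
(qualified names disambiguate the shared INDEX_NAME column)::

  c.execute("select table_name, __indexcols__.index_name, column_name"
            " from __indices__, __indexcols__"
            " where __indices__.index_name = __indexcols__.index_name")
  print c.pp()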


Interactive testing
~~~~~~~~~~~~~~~~~~~

After installation, you may interactively test the created database
from the same directory using the interactive interpreter, for
example::

  Python 2.1.3 (#1, Apr 30 2002, 19:37:40)
  [GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-96)] on linux2
  Type "copyright", "credits" or "license" for more information.
  >>>
  >>> from gadfly import gadfly
  >>> connection = gadfly("test", "dbtest")
  >>> cursor = connection.cursor()
  >>> cursor.execute("select * from frequents")
  >>> cursor.description
  (('DRINKER', None, None, None, None, None, None), ('PERWEEK', None, None,
  None, None, None, None), ('BAR', None, None, None, None, None, None))
  >>> print cursor.pp()
  DRINKER | PERWEEK | BAR
  ============================
  adam    | 1       | lolas
  woody   | 5       | cheers
  sam     | 5       | cheers
  norm    | 3       | cheers
  wilt    | 2       | joes
  norm    | 1       | joes
  lola    | 6       | lolas
  norm    | 2       | lolas
  woody   | 1       | lolas
  pierre  | 0       | frankies
  >>>

Architecture
------------

The SQL grammar is described in grammar.py, the binding of the grammar
constructs to semantic objects is performed in bindings.py, and the
semantic objects and their execution strategies are defined in
semantics.py. The semantics use a lot of classical and non-classical
logic (cylindric logic, to be precise) as well as optimization
heuristics to define a relatively efficient and hopefully correct
implementation of SQL. I recommend the brave have a look at
semantics.py for a glimpse of the 12 years of research into databases,
logic, and programming languages that contributed bits to this work.
The underlying logic (in a different framework) is given in A.
Watters, "Interpreting a Reconstructed Relational Calculus", ACM
SIGMOD Proceedings, 1993, Washington DC, pp. 367-376.

The  most  basic  data  structures  of the implementation are given in
either  kjbuckets0.py or the faster kjbucketsmodule.c, which implement
the same data type signatures in Python and in a C extension to Python
respectively.

The  database.py  module  is  a  simple wrapper that provides a standard
DBAPI  interface  to  the system.

The test suite test/test_gadfly.py attempts to provide a regression test
and  a  demonstration  of the system.

The SQL parser also requires the kwParsing parser generation package,
which consists of a number of additional Python modules.


Comments
--------

Please find bugs and report them to us at http://gadfly.sf.net/.

The query engine should run faster if you have the builtin (C) module
kjbuckets installed. Otherwise it will use a "Python imitation",
kjbuckets0.py. In one test the test suite ran twice as fast using
kjbuckets. I suspect the payoff will be higher for larger data sets.
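
Presumably the choice between the two happens via the usual fallback
idiom (a sketch, not the actual gadfly code)::

  try:
    import kjbuckets                  # fast C extension, if installed
  except ImportError:
    import kjbuckets0 as kjbuckets    # pure-Python fallback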

