Pythonic SQL with SQLAlchemy

During my Applied Databases course in Spring 2019, I gave my students a choice of which language to use to interact with SQL and relational databases. They had already learned core SQL, and I gave them only three options: C++, R, and Python. The choices of R and Python are natural given my data science interests and experience. Last year I showed them only R and got some complaints on evaluations (some people don’t think R is a “real language”). The option of C++ had two motivations: every student had at least one semester (many had two) of the intro programming sequence in C++, and we were using SQLite, which is written in C. Fortunately, the majority picked Python, and these are some examples we went through to start learning SQLAlchemy.

Here’s the basic process of working with a database via SQLAlchemy:

  • Create an engine via: eng = create_engine('dialect+driver://db-location')
  • Create a connection via: conn = eng.connect()
  • Create an object to hold metadata: md = MetaData()
  • Create Python objects for the tables you’ll work with: pyTable = Table('name', md, autoload=True, autoload_with=eng)
  • Build queries - anything SQL can do, SQLAlchemy supports.
  • Execute queries: proxy = conn.execute(query)
  • Fetch results: results = proxy.fetchall()
  • Work with the results (any Python code you need)
  • When finished, close the connection: conn.close()
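
Put together, the steps look like this. This is a minimal, self-contained sketch against a throwaway in-memory SQLite database (the pets table is made up for illustration; note that recent SQLAlchemy spells reflection as autoload_with=eng and takes select(table) without the list):

```python
import sqlalchemy as sa

# 1-2. engine and connection (in-memory SQLite)
eng = sa.create_engine('sqlite://')
conn = eng.connect()

# a throwaway table so there is something to query
conn.execute(sa.text("CREATE TABLE pets (name TEXT, species TEXT)"))
conn.execute(sa.text("INSERT INTO pets VALUES ('Rex', 'dog'), ('Tom', 'cat')"))

# 3-4. metadata plus a Python object reflecting the table
md = sa.MetaData()
pets = sa.Table('pets', md, autoload_with=eng)

# 5-7. build, execute, and fetch a query
query = sa.select(pets).where(pets.columns.species == 'dog')
results = conn.execute(query).fetchall()
print(results)   # [('Rex', 'dog')]

# 8-9. work with the results, then close up
conn.close()
```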

Creating an Engine

First we need an engine to handle communication with the database. This requires passing the SQL dialect and driver information as well as the location of the database. For SQLite this is very simple, but client-server RDBMSs will require additional connection information (user, password, host, and port).
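
For example (the credentials below are placeholders, and the PostgreSQL/MySQL lines assume those drivers are installed, so they are shown only as comments):

```python
import sqlalchemy as sa

# SQLite: just 'sqlite:///' plus a file path (no user, host, or port)
eng = sa.create_engine('sqlite:///countries.sqlite')

# Client-server databases use the general form
#   dialect+driver://user:password@host:port/database
# for example (placeholder credentials):
#   sa.create_engine('postgresql+psycopg2://user:secret@localhost:5432/countries')
#   sa.create_engine('mysql+pymysql://user:secret@localhost:3306/countries')

print(eng.dialect.name)   # sqlite
```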

Once we create the engine, we’ll ask for the table names to see which tables are in the database (it should be empty in this case).

import sqlalchemy as sa

eng = sa.create_engine('sqlite:///../../static/files/countries.sqlite')
eng.table_names()
## []

Pandas to Bulk Load Data

We could run SQL INSERT commands, but pandas can read common data files and export to SQL. Here we’ll put some city data into our database:

import pandas as pd
cityDF = pd.read_csv('../../static/files/cities.csv')
print(cityDF.head())
##           name country_code  city_proper_pop  metroarea_pop  urbanarea_pop
## 0      Abidjan          CIV          4765000            NaN        4765000
## 1    Abu Dhabi          ARE          1145000            NaN        1145000
## 2        Abuja          NGA          1235880      6000000.0        1235880
## 3        Accra          GHA          2070463      4010054.0        2070463
## 4  Addis Ababa          ETH          3103673      4567857.0        3103673
cityDF.to_sql('cities', eng)

eng.table_names()
## ['cities']

pandas also provides ways to query a database as though it were a typical dataframe. If you already know pandas and only need some basic queries, this is a natural path. Instead, we’re going to use SQLAlchemy to write queries in a nice Pythonic way. I won’t get into the full object-relational mapper (ORM) that SQLAlchemy provides, but that is what to reach for if you want to store complex objects in a relational database.
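
The pandas route looks like this: pd.read_sql takes a query and an engine and hands back an ordinary DataFrame (sketched here against a small in-memory stand-in for the cities table):

```python
import pandas as pd
import sqlalchemy as sa

eng = sa.create_engine('sqlite://')    # in-memory stand-in for the real database
demo = pd.DataFrame({'name': ['Abidjan', 'Abuja'],
                     'country_code': ['CIV', 'NGA']})
demo.to_sql('cities', eng, index=False)

# a plain SQL string in, an ordinary DataFrame out
nigeria = pd.read_sql("SELECT * FROM cities WHERE country_code = 'NGA'", eng)
print(nigeria)
```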

SQLAlchemy Queries

Warning about SQL injection

SQLAlchemy allows you to send raw query strings, but don’t do this. If you’re doing interactive data analysis, raw query strings are quick and easy. If you’re building the query from user input, raw query strings are the easiest way to expose your database to SQL injection attacks.
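
If you do need raw SQL with user-supplied values, sa.text with bound parameters lets the driver do the escaping for you (a minimal sketch; the users table here is made up):

```python
import sqlalchemy as sa

eng = sa.create_engine('sqlite://')
conn = eng.connect()
conn.execute(sa.text("CREATE TABLE users (name TEXT)"))
conn.execute(sa.text("INSERT INTO users VALUES ('alice'), ('bob')"))

user_input = "alice"   # imagine this arrived from a web form

# BAD: splicing input into the query string yourself invites injection
#   "SELECT * FROM users WHERE name = '%s'" % user_input

# GOOD: a named bound parameter; the value never touches the SQL text
stmt = sa.text("SELECT * FROM users WHERE name = :name")
rows = conn.execute(stmt, {"name": user_input}).fetchall()
print(rows)   # [('alice',)]
conn.close()
```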

Now to the queries. We need a metadata object to hold table metadata, then a table object to select from. Other SQL clauses are chained as methods on the select statement:

md = sa.MetaData()
cities = sa.Table('cities', md, autoload=True, autoload_with=eng)

stmt = sa.sql.select([cities]).where(
  cities.columns.country_code=='USA'
  ).order_by(
    cities.columns.city_proper_pop.desc()
    )

results = eng.execute(stmt).fetchall()
print(results[0:10])
## [(153, 'New York City', 'USA', 8550405, 20182305.0, 8550405), (124, 'Los Angeles', 'USA', 3884307, 15058000.0, 3884307), (45, 'Chicago', 'USA', 2695598, 9156000.0, 2695598), (87, 'Houston', 'USA', 2489558, 6490180.0, 2489558), (165, 'Philadelphia', 'USA', 1567872, 6069875.0, 1567872), (167, 'Phoenix', 'USA', 1563025, 4574531.0, 1563025), (186, 'San Antonio', 'USA', 1469845, 2454061.0, 1469845), (187, 'San Diego', 'USA', 1394928, 3095313.0, 1394928), (55, 'Dallas', 'USA', 1317929, 7233323.0, 1317929)]

After building the query stmt we obtain results by (1) executing the query on the engine and (2) fetching all the results. As you can see from the printed output, what’s returned is a list of tuple-like rows.
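
Individual rows can be used like tuples, or their columns can be read by name (a small self-contained sketch; attribute access by column name works across SQLAlchemy versions):

```python
import sqlalchemy as sa

eng = sa.create_engine('sqlite://')
conn = eng.connect()
conn.execute(sa.text("CREATE TABLE cities (name TEXT, country_code TEXT)"))
conn.execute(sa.text("INSERT INTO cities VALUES ('New York City', 'USA')"))

row = conn.execute(sa.text("SELECT * FROM cities")).fetchall()[0]
print(row[0])      # positional access -> 'New York City'
print(row.name)    # by column name -> 'New York City'
conn.close()
```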

Pretty Print Results

Since a list of tuples is a mess to read (but fairly easy to work with), here’s a function to display results in a more human-readable format.

# requires import pandas as pd
def query_results(query, engine, number):
    """Run `query` on `engine` and print the first `number` rows as a DataFrame."""
    proxy = engine.execute(query)
    # take the column names from the result itself (also safe when no rows match)
    resDF = pd.DataFrame(proxy.fetchmany(number), columns=proxy.keys())
    print(resDF)

Joins

You can’t do much in a real database without joins, so let’s do an example. We’ll load the JACS database, containing data on papers published in the Journal of the American Chemical Society. This database has 3 tables: Papers, Authors, and Paper_Authors. The last one is a join table: a relational database can’t store a list of authors inside each paper’s row, so each row of Paper_Authors links one paper to one author, resolving the many-to-many relationship.

Here’s the initial set-up for the new database.

# engine
JACSeng = sa.create_engine('sqlite:///../../static/files/jacs.sqlite')
# connection
conn = JACSeng.connect()
# metadata and tables
md = sa.MetaData()
authors = sa.Table('Authors', md, autoload=True, autoload_with=JACSeng)
papers = sa.Table('Papers', md, autoload=True, autoload_with=JACSeng)
pa = sa.Table('Paper_Authors', md, autoload=True, autoload_with=JACSeng)

Now we’ll write a query to get the titles and authors’ names of all papers in volume 2 of issue 118 (pretty arbitrary, but a realistic problem). First we select the columns, then add a select_from that contains the join(s). Finally, we use our function to print the results.

query118v2 = sa.select([papers.columns.title, authors.columns.surname, authors.columns.forename])
query118v2 = query118v2.select_from(
  papers.join(
    pa, pa.columns.paperID==papers.columns.paperID
    ).join(
      authors, pa.columns.authorID==authors.columns.authorID
      )
    )
query118v2 = query118v2.where(sa.and_(papers.columns.issue==118, papers.columns.volume==2))

query_results(query118v2, conn, 5)
##                                                title   surname  forename
## 0  Evidence for Concert in the Thermal Unimolecul...  Gajewski    Joseph
## 1  Evidence for Concert in the Thermal Unimolecul...     Olson      Leif
## 2  Evidence for Concert in the Thermal Unimolecul...  Willcott        M.
## 3  Orthogonal Ligation of Unprotected Peptide Seg...       Liu  Chuan-Fa
## 4  Orthogonal Ligation of Unprotected Peptide Seg...       Rao     Chang

End

We can use similar commands to run any SQL query we want, including DDL commands to create tables.
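
DDL in SQLAlchemy usually means defining Table objects in Python and asking the metadata to create them (a sketch against an in-memory database; the Reviews table and its columns are made up for illustration):

```python
import sqlalchemy as sa

eng = sa.create_engine('sqlite://')          # throwaway in-memory database
md = sa.MetaData()

# describe the table in Python rather than writing CREATE TABLE by hand
reviews = sa.Table('Reviews', md,
                   sa.Column('reviewID', sa.Integer, primary_key=True),
                   sa.Column('paperID', sa.Integer),
                   sa.Column('score', sa.Float))

md.create_all(eng)                           # emits CREATE TABLE for missing tables
print(sa.inspect(eng).get_table_names())     # ['Reviews']
```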
