Turbodbc - Turbocharged database access for data scientists


Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. Its primary target audience are data scientists who use databases for which no efficient native Python drivers are available.

For maximum compatibility, turbodbc complies with the Python Database API Specification 2.0 (PEP 249). For maximum performance, turbodbc offers built-in NumPy support and internally relies on batched data transfer instead of single-record communication as other popular ODBC modules do.

Turbodbc is free to use (MIT license), open source (GitHub), works with Python 2.7 and Python 3.4+, and is available for Linux, OSX, and Windows.

Turbodbc is routinely tested with MySQL, PostgreSQL, EXASOL, and MSSQL, but probably also works with your database.

Introduction

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. In addition to complying with the Python Database API Specification 2.0, turbodbc offers built-in NumPy support. Don’t wait minutes for your results, just blink.

Features

  • Bulk retrieval of result sets
  • Built-in NumPy conversion
  • Bulk transfer of query parameters
  • Asynchronous I/O for result sets
  • Automatic conversion of decimal type to integer, float, and string as appropriate
  • Supported data types for both result sets and parameters: int, float, str, bool, datetime.date, datetime.datetime
  • Also provides a high-level C++11 database driver under the hood
  • Tested with Python 2.7, 3.4, 3.5, and 3.6
  • Tested on 64 bit versions of Linux, OSX, and Windows (Python 3.5+).

Why should I use turbodbc instead of other ODBC modules?

Short answer: turbodbc is faster.

Slightly longer answer: turbodbc is faster, much faster if you want to work with NumPy.

Medium-length answer: I have tested turbodbc and pyodbc (probably the most popular Python ODBC module) with various databases (Exasol, PostgreSQL, MySQL) and corresponding ODBC drivers. I found turbodbc to be consistently faster.

For retrieving result sets, I found speedups between 1.5 and 7 when retrieving plain Python objects. For inserting data, I found speedups of up to 100.

Is this completely scientific? Not at all. I have not told you which hardware I used, which operating systems, drivers, database versions, network bandwidth, database layout, or SQL queries were involved, what was measured, or how it was measured.

All I can tell you is that when I exchanged pyodbc for turbodbc, my benchmarks took less time, with improvements often approaching one (reading) or two (writing) orders of magnitude. Give it a spin yourself, and tell me whether you liked it.

Smooth. What is the trick?

Turbodbc exploits buffering.

  • Turbodbc implements both sending parameters and retrieving result sets using buffers of multiple rows/parameter sets. This avoids round trips to the ODBC driver and (depending on how well the ODBC driver is written) to the database.
  • Multiple buffers are used for asynchronous I/O. This allows turbodbc to interleave Python object conversion and direct database interaction (see performance options below).
  • Buffers contain binary representations of data. NumPy arrays contain binary representations of data. Good thing they are often the same, so instead of converting we can just copy data.
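The last point can be illustrated with a small sketch (plain NumPy, no database involved): a buffer of raw bytes laid out like a C array of 64-bit integers can be reinterpreted as a NumPy array without converting individual values.

```python
import numpy as np

# An ODBC result buffer is essentially a C array of binary values.
# Simulate a column buffer of three 64-bit integers as raw bytes:
raw_buffer = np.array([1, 2, 3], dtype=np.int64).tobytes()

# Because NumPy arrays share this binary layout, the buffer can be
# reinterpreted directly -- a copy, not a per-value conversion:
column = np.frombuffer(raw_buffer, dtype=np.int64)
print(column)  # [1 2 3]
```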

Supported environments

  • 64 bit operating systems (32 bit not supported)
  • Linux (successfully built on Ubuntu 12, Ubuntu 14, Debian 7, Debian 8)
  • OSX (successfully built on Sierra a.k.a. 10.12 and El Capitan a.k.a. 10.11)
  • Windows (successfully built on Windows 10)
  • Python 2.7, 3.4, 3.5, 3.6
  • More environments probably work as well, but these are the versions that are regularly tested on Travis or local development machines.

Supported databases

Turbodbc uses suites of unit and integration tests to ensure quality. Every time turbodbc’s code is updated on GitHub, turbodbc is automatically built from scratch and tested with the following databases:

  • PostgreSQL (Linux, OSX, Windows)
  • MySQL (Linux, OSX, Windows)
  • MSSQL (Windows, with official MS driver)

During development, turbodbc is tested with the following database:

  • Exasol (Linux, OSX)

Releases will not be made if any (implemented) test fails for any of the databases listed above. The following databases/driver combinations are tested on an irregular basis:

  • MSSQL with FreeTDS (Linux, OSX)
  • MSSQL with Microsoft’s official ODBC driver (Linux)

There is a good chance that turbodbc will work with other, totally untested databases as well. There is, however, an equally good chance that you will encounter compatibility issues. If you encounter one, please take the time to report it so turbodbc can be improved to work with more real-world databases. Thanks!

Getting started

Installation

Linux and OSX

To install turbodbc on Linux and OSX, please use the following command:

pip install turbodbc

This will trigger a source build that requires compiling C++ code. Please make sure the following prerequisites are met:

Requirement                 | Linux (apt-get install) | OSX (brew install)
C++11 compiler              | G++-4.8 or higher       | clang with OSX 10.9+
Boost library + headers (1) | libboost-all-dev        | boost
ODBC library + headers      | unixodbc-dev            | unixodbc
Python headers              | python-dev              | use pyenv to install

Please pip install numpy before installing turbodbc, because turbodbc will search for the numpy Python package at installation/compile time. If NumPy is not installed, turbodbc will not compile the optional NumPy support features.

(1) The minimum viable Boost setup requires the libraries variant, optional, datetime, and locale.

Windows

To install turbodbc on Windows, please use the following command:

pip install turbodbc

This will download and install a binary wheel, no compilation required. You still need to meet the following prerequisites, though:

Requirement | Windows
OS Bitness  | 64-bit
Python      | 3.5 or 3.6, 64-bit
Runtime     | MSVS 2015 Update 3 Redistributable, 64 bit

If you require NumPy support, please run

pip install numpy

at any time after installing turbodbc.

Basic usage

Turbodbc follows the specification of the Python database API v2 (PEP 249). Here is a short summary, including the parts not specified by the PEP.

Establish a connection with your database

All ODBC applications, including turbodbc, use connection strings to establish connections with a database. If you know what the connection string for your database looks like, use the following lines to establish a connection:

>>> from turbodbc import connect
>>> connection = connect(connection_string='Driver={PostgreSQL};Server=IP address;Port=5432;Database=myDataBase;Uid=myUsername;Pwd=myPassword;')

If you do not specify the connection_string keyword argument, turbodbc will create a connection string based on the keyword arguments you pass to connect:

>>> from turbodbc import connect
>>> connection = connect(dsn='My data source name as defined by your ODBC configuration')

The dsn is the data source name of your connection. Data source names uniquely identify connection settings that shall be used to connect with a database. Data source names are part of your ODBC configuration and you need to set them up yourself. Once set up, however, all ODBC applications can use the same data source name to refer to the same set of connection options, typically including the host, database, driver settings, and sometimes even credentials. If your ODBC environment is set up properly, just using the dsn option should be sufficient.

You can add extra options besides the dsn to overwrite or add settings:

>>> from turbodbc import connect
>>> connection = connect(dsn='my dsn', user='my user has precedence')
>>> connection = connect(dsn='my dsn', username='field names depend on the driver')

Last but not least, you can also do without a dsn and just specify all required configuration options directly:

>>> from turbodbc import connect
>>> connection = connect(driver="PostgreSQL",
...                      server="hostname",
...                      port="5432",
...                      database="myDataBase",
...                      uid="myUsername",
...                      pwd="myPassword")

Executing SQL queries and retrieving results

To execute a query, you need to create a cursor object first:

>>> cursor = connection.cursor()

This cursor object lets you execute SQL commands and queries. Here is how to execute a SELECT query:

>>> cursor.execute('SELECT 42')

You have multiple options to retrieve the generated result set. For example, you can iterate over the cursor:

>>> for row in cursor:
...     print(row)
[42L]

Alternatively, you can fetch all results as a list of rows:

>>> cursor.fetchall()
[[42L]]

You can also retrieve result sets as NumPy arrays, see Advanced usage.

Executing manipulating SQL queries

As before, you need to create a cursor object first:

>>> cursor = connection.cursor()

You can now execute a basic INSERT query:

>>> cursor.execute("INSERT INTO my_integer_table VALUES (42, 17)")

This will insert two values, 42 and 17, in a single row of the table my_integer_table. Inserting values like this is impractical, because it requires putting the values into the actual SQL string.

To avoid this, you can pass parameters to execute():

>>> cursor.execute("INSERT INTO my_integer_table VALUES (?, ?)",
...                [42, 17])

Please note the question marks ? in the SQL string, which mark the two parameter positions. Adding single rows at a time is not efficient, though. You can efficiently add more than a single row to a table by using executemany():

>>> parameter_sets = [[42, 17],
...                   [23, 19],
...                   [314, 271]]
>>> cursor.executemany("INSERT INTO my_integer_table VALUES (?, ?)",
...                    parameter_sets)

Transactions

Turbodbc currently does not support autocommit. To commit your changes to the database, please use the following command:

>>> connection.commit()

If you want to roll back your changes, use the following command:

>>> connection.rollback()

Supported data types

Turbodbc supports the most common data types data scientists are interested in. The following table shows which database types are converted to which Python types:

Database type(s)                  | Python type
Integers, DECIMAL(<19,0)          | int
DOUBLE, DECIMAL(x, >0)            | float
BIT, boolean-like                 | bool
TIMESTAMP, TIME                   | datetime.datetime
DATE                              | datetime.date
VARCHAR, strings, DECIMAL(>18, 0) | unicode (str)

When using parameters with execute() and executemany(), the table is basically reversed. The first type in the “database type(s)” column denotes the type used to transfer back data. For integers, 64-bit integers are transferred. For strings, the length of the transferred VARCHAR depends on the length of the transferred strings.

Advanced usage

Performance, compatibility, and behavior options

Turbodbc offers a way to adjust its behavior to tune performance and to achieve compatibility with your database. The basic usage is this:

>>> from turbodbc import connect, make_options
>>> options = make_options()
>>> connect(dsn="my_dsn", turbodbc_options=options)

This will connect with your database using the default options. To use non-default options, supply keyword arguments to make_options():

>>> from turbodbc import Megabytes
>>> options = make_options(read_buffer_size=Megabytes(100),
...                        parameter_sets_to_buffer=1000,
...                        use_async_io=True,
...                        prefer_unicode=True,
...                        autocommit=True)

Read buffer size

read_buffer_size affects how many result set rows are retrieved per batch of results. Set the attribute to turbodbc.Megabytes(42) to have turbodbc determine the optimal number of rows per batch so that the total buffer amounts to 42 MB. This is recommended for most users and databases. You can also set the attribute to turbodbc.Rows(13) if you would like to fetch results in batches of 13 rows. By default, turbodbc fetches results in batches of 20 MB.

Please note that sometimes a single row of a result set may exceed the specified buffer size. This can happen if large fields such as VARCHAR(8000000) or TEXT are part of the result set. In this case, results are fetched in batches of single rows that exceed the specified size.
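To get a feeling for the arithmetic, here is a small illustrative sketch (not turbodbc's actual implementation) of how a byte-based buffer size might translate into rows per batch, including the single-row fallback for oversized rows:

```python
def rows_per_batch(buffer_bytes, bytes_per_row):
    """Approximate how many result rows fit into a read buffer.

    Falls back to batches of a single row when one row already
    exceeds the requested buffer size (e.g. huge VARCHAR fields).
    """
    return max(1, buffer_bytes // bytes_per_row)

# The 20 MB default buffer with rows of, say, 200 bytes each:
print(rows_per_batch(20 * 1024 * 1024, 200))   # 104857

# A single VARCHAR(8000000) row exceeding the buffer:
print(rows_per_batch(20 * 1024 * 1024, 32 * 1024 * 1024))  # 1
```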

Buffered parameter sets

Similarly, parameter_sets_to_buffer changes the number of parameter sets which are transferred per batch of parameters (e.g., as sent with executemany()). Please note that it is not (yet) possible to use the Megabytes and Rows classes here.
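The effect of parameter_sets_to_buffer can be pictured with a simple sketch (illustrative only, not turbodbc internals): the parameter sets passed to executemany() are sliced into fixed-size batches before transfer.

```python
def batched(parameter_sets, batch_size):
    """Yield successive fixed-size batches of parameter sets."""
    for start in range(0, len(parameter_sets), batch_size):
        yield parameter_sets[start:start + batch_size]

parameter_sets = [[i, i * i] for i in range(5)]
for batch in batched(parameter_sets, batch_size=2):
    print(batch)
# [[0, 0], [1, 1]]
# [[2, 4], [3, 9]]
# [[4, 16]]
```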

Asynchronous input/output

If you set use_async_io to True, turbodbc will use asynchronous I/O operations (limited to result sets for the time being). Asynchronous I/O means that while the main thread converts result set rows retrieved from the database to Python objects, another thread fetches a new batch of results from the database in the background. This may yield a speedup of 2 if retrieving and converting are similarly fast operations.
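The idea behind asynchronous I/O can be sketched in plain Python (an illustration of double buffering, not turbodbc's C++ implementation): a background thread fills buffers with fetched batches while the main thread consumes and converts them.

```python
import queue
import threading

def fetch_batches():
    """Stand-in for the background thread fetching from the database."""
    for batch in ([1, 2], [3, 4], [5]):
        yield batch

def produce(buffers):
    for batch in fetch_batches():
        buffers.put(batch)  # blocks once both buffers are in flight
    buffers.put(None)       # sentinel: no more batches

buffers = queue.Queue(maxsize=2)  # two buffers in flight
threading.Thread(target=produce, daemon=True).start()

results = []
while True:
    batch = buffers.get()
    if batch is None:
        break
    # The main thread converts this batch to Python objects while the
    # next batch is being fetched in the background.
    results.extend(batch)
print(results)  # [1, 2, 3, 4, 5]
```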

Note

Asynchronous I/O is experimental and has yet to fully prove itself. Do not be afraid to give it a try, though.

Prefer unicode

Set prefer_unicode to True if your database does not fully support the UTF-8 encoding turbodbc prefers. With this option you can tell turbodbc to use two-byte character strings with UCS-2/UTF-16 encoding. Use this option if you try to connect to Microsoft SQL Server (MSSQL).

Autocommit

Set autocommit to True if you want the database to COMMIT your changes automatically after each query or command. By default, autocommit is disabled and users are required to call connection.commit() to persist their changes.

Note

Some databases that do not support transactions may even require this option to be set to True in order to establish a connection at all.

Controlling autocommit behavior at runtime

You can enable and disable autocommit mode after you have established a connection, and you can also check whether autocommit is currently enabled:

>>> from turbodbc import connect
>>> connection = connect(dsn="my DSN")
>>> connection.autocommit = True

[... more things happening ...]

>>> if not connection.autocommit:
...     connection.commit()

NumPy support

Note

Turbodbc’s NumPy support requires the numpy package to be installed. For all source builds, NumPy needs to be installed before installing turbodbc. Please check the installation instructions for more details.

Obtaining NumPy result sets all at once

Here is how to use turbodbc to retrieve the full result set in the form of NumPy masked arrays:

>>> cursor.execute("SELECT A, B FROM my_table")
>>> cursor.fetchallnumpy()
OrderedDict([('A', masked_array(data = [42 --],
                                mask = [False True],
                                fill_value = 999999)),
             ('B', masked_array(data = [3.14 2.71],
                                mask = [False False],
                                fill_value = 1e+20))])
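To see how masked entries behave, a structure like the one returned above can be reproduced by hand with plain NumPy (values chosen to match the example output; no database involved):

```python
from collections import OrderedDict
import numpy.ma as ma

# NULL database values appear as masked entries:
result = OrderedDict([
    ('A', ma.masked_array(data=[42, 0], mask=[False, True])),
    ('B', ma.masked_array(data=[3.14, 2.71], mask=[False, False])),
])

print(result['A'])                  # [42 --]
print(result['A'][1] is ma.masked)  # True: the NULL entry is masked
```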

Obtaining NumPy result sets in batches

You can also fetch NumPy result sets in batches using an iterable:

>>> cursor.execute("SELECT A, B FROM my_table")
>>> batches = cursor.fetchnumpybatches()
>>> for batch in batches:
...     print(batch)
OrderedDict([('A', masked_array(data = [42 --],
                                mask = [False True],
                                fill_value = 999999)),
             ('B', masked_array(data = [3.14 2.71],
                                mask = [False False],
                                fill_value = 1e+20))])

The size of the batches depends on the read_buffer_size attribute set in the performance options.

Notes regarding NumPy support

  • NumPy results are returned as an OrderedDict of column name/value pairs. The column order is the same as in your query.
  • The column values are of type MaskedArray. Any NULL values you have in your database will show up as masked entries (NULL values in string-like columns will show up as None objects).
  • NumPy support is currently limited to result sets.

The following table shows how the most common data types data scientists are interested in are converted to NumPy columns:

Database type(s)                  | NumPy type
Integers, DECIMAL(<19,0)          | int64
DOUBLE, DECIMAL(x, >0)            | float64
BIT, boolean-like                 | bool_
TIMESTAMP, TIME                   | datetime64[us]
DATE                              | datetime64[D]
VARCHAR, strings, DECIMAL(>18, 0) | object_

ODBC configuration

ODBC configuration can be a real pain, in particular if you are new to ODBC. So here is a short primer of what ODBC is about.

ODBC basics

ODBC is the abbreviation for open database connectivity, a standard for interacting with relational databases that has been considerably influenced by Microsoft. The aim of the standard is that applications can work with multiple databases with little to no adjustments in code.

This is made possible by combining three components with each other:

  • ODBC applications (such as turbodbc) that implement the actual use case
  • an ODBC driver manager that dispatches calls to the correct driver
  • database-specific ODBC drivers that translate the ODBC API into database commands

Turbodbc makes it easy to build applications that use the ODBC driver manager, but it still requires the driver manager to be configured correctly so that your databases are found.

ODBC concepts

ODBC drivers

ODBC drivers comply with the ODBC API, meaning that they offer a set of about 80 C functions with well-defined behavior that internally use database-specific commands to achieve the desired behavior. There is some wiggle room that allows ODBC drivers to implement certain things differently or even exclude support for some advanced usage patterns. But in essence, all ODBC drivers are born more or less equal.

ODBC drivers are easy to come by. Major database vendors offer ODBC drivers as free downloads (Microsoft SQL Server, Exasol, Teradata, etc.). Open source databases provide ODBC drivers as part of their projects (PostgreSQL, Impala, MongoDB). Many ODBC drivers are also shipped with Linux distributions or are readily available via Homebrew for OSX. Last but not least, commercial ODBC drivers are available from Progress or Easysoft, claiming better performance than their freely available counterparts.

ODBC driver manager

The driver manager is a somewhat odd centerpiece. It is a library that can be used just like any ODBC driver. It provides definitions for various data types, and actual ODBC drivers often rely on these definitions for compilation. The driver manager has a built-in configuration of data sources. A data source has a name (the data source name or DSN), is associated with an ODBC driver, contains configuration options such as the database host or the connection locale, and sometimes it also contains credentials for authentication with the database. Finally, the driver manager typically comes with a tool to edit data sources.

Driver managers are less numerous, but still easily available on all major platforms. Windows comes with a preinstalled ODBC driver manager. On Linux and OSX, there are competing driver managers in unixodbc and iodbc.

Note

Turbodbc is tested with Windows’s built-in driver manager and unixodbc on Linux and OSX.

ODBC applications

Applications finally use the ODBC API and link to the driver manager. Any time they open a connection, they need to specify the data source name that contains connection attributes that relate to the desired database. Alternatively, they can specify all necessary connection options directly.

Linking to the driver manager instead of the ODBC driver directly means that changing to another driver is as simple as exchanging the connection string at runtime instead of tediously linking to a new driver. Linking to the driver manager also means that the driver manager handles many capability and compatibility options by transparently using alternative functions and workarounds as required.

Driver manager configuration

The driver manager needs to know to which databases to connect with which ODBC drivers. This configuration needs to be maintained by the user.

Windows

Windows comes with a preinstalled driver manager that can be configured with the ODBC data source administrator. Please see Microsoft’s official documentation for this. Besides adding your data sources, no special measures need to be taken for your configuration to be found.

Unixodbc (Linux and OSX)

Unixodbc is a different beast. For one thing, you need to install it first. That is usually an easy task involving a simple apt-get install unixodbc (Linux) or brew install unixodbc (OSX with Homebrew).

However, unixodbc can be configured in many ways, both with and without graphical guidance. The official documentation is not always easy to follow, and finding what you are looking for may be more difficult than you planned for and may involve looking into unixodbc’s source code.

The following primer assumes that no graphical tools are used (as is common in server environments). It is not specific to turbodbc.

ODBC configuration files

Unixodbc’s main configuration file is usually called odbc.ini. odbc.ini defines data sources that are available for connecting. It is a simple ini-style text file with the following layout:

[data source name]
Driver = /path/to/driver_library.so
Option1 = Value
Option2 = Other value

[other data source]
Driver = Identifier specified in odbcinst.ini file
OptionA = Value

The sections define data source names that can be used to connect with the respective database. Each section requires a Driver key. The value of this key may either contain the path to the database’s ODBC driver or a key that identifies the driver in unixodbc’s other configuration file odbcinst.ini. Each section may contain an arbitrary number of key-value pairs that specify further connection options. These connection options are driver-specific, so you need to refer to the ODBC driver’s reference for that.
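Since odbc.ini follows the ini convention, standard tools can read it. As an illustration, Python's configparser can list the defined data sources and their drivers (the section name and driver path below are placeholders, not recommendations):

```python
import configparser

# A minimal, hypothetical odbc.ini:
odbc_ini = """\
[my postgres]
Driver = /usr/lib/odbc/psqlodbcw.so
Servername = localhost
Port = 5432
"""

config = configparser.ConfigParser()
config.read_string(odbc_ini)

# Each section is a data source name; each needs a Driver key:
for dsn in config.sections():
    print(dsn, '->', config[dsn]['Driver'])
# my postgres -> /usr/lib/odbc/psqlodbcw.so
```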

As mentioned before, unixodbc features a second (and optional) configuration file usually called odbcinst.ini. This file lists available ODBC drivers and labels them for convenient reference in odbc.ini. The file also follows the ini-style convention:

[driver A]
Driver = /path/to/driver_library.so
Threading = 2
Description = A driver to access ShinyDB databases

[driver B]
Driver = /some/other/driver/library.so

The sections define names that can be used as values for the Driver keys in odbc.ini. Each section needs a Driver key itself, where the value represents the path to the respective ODBC driver. Some additional options are available such as the Threading level (see unixodbc’s source code for details) or a Description field.

Configuration file placement options

Unixodbc has a few places where it looks for its configuration files:

  • Global configuration files are found in /etc/odbc.ini and /etc/odbcinst.ini. Data sources defined in /etc/odbc.ini are available to all users of your computer. Drivers defined in /etc/odbcinst.ini can be used by all users of your computer.

  • Users can define additional data sources by adding the file ~/.odbc.ini to their home directory. It seems that a file called ~/.odbcinst.ini has no effect.

  • Users can add a folder in which to look for configuration files by setting the ODBCSYSINI environment variable:

    > export ODBCSYSINI=/my/folder
    

    This will override the configuration files found at /etc. Place your configuration files at /my/folder/odbc.ini and /my/folder/odbcinst.ini.

  • Users can override the path for the user-specific odbc.ini file by setting the ODBCINI environment variable:

    > export ODBCINI=/full/path/to/odbc.ini
    

    If you set this option, unixodbc will no longer consider ~/.odbc.ini.

    Note

    Do not expect the ODBCINSTINI environment variable to work just like ODBCINI. Instead, ODBCINSTINI specifies the file name of odbcinst.ini relative to the value of the ODBCSYSINI variable. I suggest not using this variable since it is outright confusing.
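The lookup rules above can be summarized in a small sketch (a simplified model of unixodbc's behavior, not its exact implementation):

```python
import os

def unixodbc_config_paths(env):
    """Simplified model of where unixodbc looks for its configuration."""
    # ODBCSYSINI replaces /etc as the folder for the global files:
    sysini = env.get('ODBCSYSINI', '/etc')
    # ODBCINSTINI is a file name *relative to* ODBCSYSINI:
    odbcinst = os.path.join(sysini, env.get('ODBCINSTINI', 'odbcinst.ini'))
    # ODBCINI replaces the user-specific ~/.odbc.ini:
    user_odbc = env.get('ODBCINI', os.path.expanduser('~/.odbc.ini'))
    return {'odbcinst.ini': odbcinst,
            'system odbc.ini': os.path.join(sysini, 'odbc.ini'),
            'user odbc.ini': user_odbc}

print(unixodbc_config_paths({'ODBCSYSINI': '/my/folder'}))
```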

Configuration file placement recommendations

Here are a few typical scenarios:

  • First steps with unixodbc: Create a new folder that contains odbc.ini and odbcinst.ini. Set the ODBCSYSINI variable to this folder.
  • Experimenting with a new database/driver: Create a new folder that contains odbc.ini and odbcinst.ini. Set the ODBCSYSINI variable to this folder.
  • Provision a system with drivers: Place an odbcinst.ini file at /etc/odbcinst.ini. Tell users to configure their databases using ~/.odbc.ini or by setting ODBCINI.
  • Switching between multiple distinct configurations (test/production): Use the ODBCSYSINI variable if the configurations do not share common drivers. Otherwise, use the ODBCINI variable to switch between different odbc.ini files.

Databases configuration and performance

As already outlined in the more general ODBC configuration section, connecting with your database via ODBC can be a real pain. Making matters worse, database performance may significantly depend on the configuration as well.

Well, this section tries to make life just a tad easier by providing recommended configurations for various databases. For some databases, comparisons with other database access modules are provided as well so that you know what kind of performance to expect.

Note

The quality of the ODBC driver for a given database heavily affects performance of all ODBC applications using this driver. Even though turbodbc was built to exploit buffering and what else the ODBC API has to offer, it cannot work wonders when the ODBC driver is not up to the task. In such circumstances, other, non-ODBC technologies may be available that outperform turbodbc by a considerable margin.

Exasol

Exasol is one of turbodbc’s main development databases, and also provided the initial motivation for creating turbodbc. Here are the recommended settings for connecting to an Exasol database via ODBC using the turbodbc module for Python.

PostgreSQL

PostgreSQL is part of turbodbc’s integration databases. That means that each commit in turbodbc’s repository is automatically tested against PostgreSQL to ensure compatibility. Here are the recommended settings for connecting to a PostgreSQL database via ODBC using the turbodbc module for Python.

Note

PostgreSQL’s free ODBC driver is not optimized for performance. Hence, there is not too much turbodbc can do to improve speed. You will achieve much better performance with psycopg2 or asyncpg.

MySQL

MySQL is part of turbodbc’s integration databases. That means that each commit in turbodbc’s repository is automatically tested against MySQL to ensure compatibility. Here are the recommended settings for connecting to a MySQL database via ODBC using the turbodbc module for Python.

Note

You can use the MySQL ODBC driver to connect with databases that use the MySQL wire protocol. Examples for such databases are MariaDB, Amazon Aurora DB, or MemSQL.

Microsoft SQL server (MSSQL)

Microsoft SQL server (MSSQL) is part of turbodbc’s integration databases. That means that each commit in turbodbc’s repository is automatically tested against MSSQL to ensure compatibility. Here are the recommended settings for connecting to a Microsoft SQL database via ODBC using the turbodbc module for Python.
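As a rough illustration (the driver path, server name, and protocol version below are placeholder assumptions, not official recommendations; check your FreeTDS installation), an odbc.ini entry for MSSQL via FreeTDS might look like this:

```ini
[mssql]
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Server = mssql.example.com
Port = 1433
TDS_Version = 7.2
```

When connecting with turbodbc, remember to set prefer_unicode=True as described in the Advanced usage section.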

Troubleshooting

This section contains advice on how to troubleshoot ODBC connections. The advice contained here is not specific to turbodbc, but very related.

Note

This section currently assumes you are on a Linux/OSX machine that uses unixodbc as a driver manager. Windows users may find the contained information useful, but should expect some additional transfer work adjusting the advice to the Windows platform.

Testing your ODBC configuration

You can test your configuration with turbodbc, obviously, by creating a connection. It is preferable, however, to use the tool isql that is shipped together with unixodbc. It is a very simple program that does not try anything fancy and is perfectly suited for debugging. If your configuration does not work with isql, it will not work with turbodbc either.

Note

Before you file an issue with turbodbc, please make sure that you can actually connect your database using isql.

When you have selected an ODBC configuration as outlined above, enter the following command in a shell:

> isql "data source name" user password -v

Specifying user and password is optional. On success, this will present a shell such as this:

+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

You can type in any SQL command you wish to test or leave the shell with the quit command. In case of errors, a hopefully somewhat helpful error message will be displayed.

Common configuration errors

[IM002][unixODBC][Driver Manager]Data source name not found, and no default driver specified
[ISQL]ERROR: Could not SQLConnect

This usually means that the data source name could not be found because the configuration is not active. Troubleshooting:

  • Check for typos in data source names in odbc.ini or your shell.
  • Check if the correct odbc.ini file is used.
  • Check the values of $ODBCSYSINI and $ODBCINI (usually only one should be set).
  • Check if $ODBCINSTINI is set (usually it should not be set). Unset the variable.
  • Check that your data source section has a Driver key.
[01000][unixODBC][Driver Manager]Can't open lib '/path/to/driver.so' : file not found
[ISQL]ERROR: Could not SQLConnect

This means the ODBC driver library cannot be opened. The suggested cause “file not found” may be misleading, however, as this message may be printed even if the file exists. Troubleshooting:

  • Check whether the library exists at the specified location.
  • Check whether you have permission to read the library.
  • Check whether the library depends on other shared libraries that are not present:
      • On Linux, use ldd /path/to/library.so
      • On OSX, use otool -L /path/to/library.dylib
  • Check whether any superfluous non-printable characters are present in your odbc.ini or odbcinst.ini near the Driver line. Been there, done that...

More subtle issues

There are still a few errors to make even when you can successfully establish a connection with your database. Here are a few common ones:

  • Unsuitable locale: Some databases return data in a format dictated by your current locale settings. For example, unicode output may require a locale that supports UTF-8, such as en-US.utf-8. Otherwise, replacement characters appear instead of unicode characters. Set the locale via environment variables such as LC_ALL or check whether your driver supports setting a locale in its connection options.
  • Time zones: ODBC does not feature a dedicated type that is aware of time zones or the distinction between local time and UTC. Some databases, however, feature separate types for, e.g., timestamps with and without time zone information. ODBC drivers now need to find a way to make such information available to ODBC applications. A usual way to do this is to convert (some) values into the session time zone. This may lead to conflicting information when sessions with different time zones access the same database. The recommendation would be to fix the session time zone to UTC whenever possible to keep things consistent.
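The UTC recommendation can be illustrated with plain Python datetimes (illustrative only, independent of any ODBC driver): normalizing values to UTC keeps timestamps from sessions with different time zones comparable.

```python
from datetime import datetime, timezone, timedelta

# A timestamp delivered in a session time zone (here UTC+2):
session_tz = timezone(timedelta(hours=2))
local_value = datetime(2017, 6, 1, 14, 30, tzinfo=session_tz)

# Normalize to UTC before comparing values from different sessions:
utc_value = local_value.astimezone(timezone.utc)
print(utc_value.isoformat())  # 2017-06-01T12:30:00+00:00
```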

Frequently asked questions

Can I use turbodbc together with SQLAlchemy?

Using turbodbc in combination with SQLAlchemy is possible for a (currently) limited number of databases. The corresponding dialect packages are available on PyPI, and there are also more experimental implementations available.

Where would I report issues concerning turbodbc?

In this case, please use turbodbc’s issue tracker on GitHub.

Where can I ask questions regarding turbodbc?

Basically, you can ask them anywhere, chances to get a helpful answer may vary, though. I suggest to ask questions either using turbodbc’s issue tracker on GitHub or by heading over to stackoverflow.

Is there a guided tour through turbodbc’s entrails?

Yes, there is! Check out these blog posts on the making of turbodbc:

I love Twitter! Is turbodbc on Twitter?

Yes, it is! Just follow @turbodbc for the latest turbodbc talk and news about related technologies.

How can I find out more about turbodbc’s latest developments?

There are a few options:

  • Watch the turbodbc project on GitHub. This way, you will get mails for new issues, updated issues, and the like.
  • Periodically read turbodbc’s change log.
  • Follow @turbodbc on Twitter. There will be tweets for new releases.

Contributing

Ways to contribute

Contributions to turbodbc are most welcome! There are many ways you can contribute, and not all of them require you to be an expert programmer:

  • Ask questions and raise issues on GitHub. This may influence turbodbc’s roadmap.
  • If you like turbodbc, star/fork/watch the project on GitHub. This will improve visibility and potentially attract more contributors.
  • Report performance comparisons between turbodbc and other means to access a database.
  • Tell others about turbodbc on your blog, Twitter account, or at the coffee machine at work.
  • Improve turbodbc’s documentation by creating pull requests on GitHub.
  • Improve existing features by creating pull requests on GitHub.
  • Add new features by creating pull requests on GitHub.
  • Implement dialects for SQLAlchemy that connect to databases using turbodbc.

Pull requests

Pull requests are appreciated in general, but not all pull requests will be accepted into turbodbc. Before starting to work on a pull request, please make sure your pull request is aligned with turbodbc’s vision of creating fast ODBC database access for data scientists. The safest way is to ask on GitHub whether a certain feature would be appreciated.

After forking the project and making your modifications, you can create a new pull request on turbodbc’s GitHub page. This will trigger an automatic build and, eventually, a code review. During code reviews, I try to make sure that the added code complies with clean code principles such as single level of abstraction, single responsibility principle, principle of least astonishment, etc.

If you do not know what all of this means, just try to keep functions small (up to five lines) and find meaningful names. If you feel like writing a comment, think about whether the comment would make a nice variable or function name, and refactor your code accordingly.
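For example, a comment that explains a block of code can often be turned into a function name instead. The snippet below is a generic Python illustration of that refactoring, not code from turbodbc:

```python
# Before: the comment carries the meaning.
def process(rows):
    # keep only rows whose second field is not NULL
    return [row for row in rows if row[1] is not None]

# After: the name carries the meaning, and the comment becomes unnecessary.
def drop_rows_with_null_value(rows):
    return [row for row in rows if row[1] is not None]

print(drop_rows_with_null_value([(1, "a"), (2, None)]))  # [(1, 'a')]
```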

I am well aware that the current code falls short of clean code standards in one place or another. Please do not take criticism regarding your code personally. Any comments are purely directed to improve the quality of turbodbc’s code base over its current state.

Development version

For developing new features or just sampling the latest version of turbodbc, do the following:

  1. Make sure your development environment meets the prerequisites mentioned in the getting started guide.

  2. Create a Python virtual environment, activate it, and install the necessary packages numpy, pytest, pytest-cov, and mock:

    pip install numpy pytest pytest-cov mock
    
  3. Make sure you have a recent version of cmake installed. For some operating systems, binary wheels are available in addition to the package your operating system offers:

    pip install cmake
    
  4. Clone turbodbc into the virtual environment somewhere:

    git clone https://github.com/blue-yonder/turbodbc.git
    
  5. cd into the git repo and pull in the pybind11 submodule by running:

    git submodule update --init --recursive
    
  6. Create a build directory somewhere and cd into it.

  7. Execute the following command:

    cmake -DCMAKE_INSTALL_PREFIX=./dist /path/to/turbodbc
    

    where the final path parameter is the path to the turbodbc git repo, specifically the directory containing setup.py. This cmake command will prepare the build directory for the actual build step.

  8. Run make. This will build (compile) the source code.

  9. At this point you can run the test suite. First, make a copy of the relevant JSON documents from the python/turbodbc_test directory of the turbodbc repository; there is one for each database. Then edit your copies to contain the relevant credentials. Next, set the environment variable TURBODBC_TEST_CONFIGURATION_FILES to a comma-separated list of the JSON files you have just copied, and run the test suite as follows:

    export TURBODBC_TEST_CONFIGURATION_FILES="<Postgres json file>,<MySql json file>,<MS SQL json file>"
    ctest --output-on-failure
    
  10. Finally, to create a Python source distribution for pip installation, run the following from the build directory:

    make install
    cd dist
    python setup.py sdist
    

    This will create a turbodbc-x.y.z.tar.gz file locally which can be used by others to install turbodbc with pip install turbodbc-x.y.z.tar.gz.

API reference

turbodbc.connect(dsn=None, turbodbc_options=None, read_buffer_size=None, parameter_sets_to_buffer=None, use_async_io=False, connection_string=None, **kwargs)

Create a connection with the database identified by the dsn or the connection_string.

Parameters:
  • dsn – Data source name as given in the (unix) odbc.ini file or (Windows) ODBC Data Source Administrator tool.
  • turbodbc_options – Options that control how turbodbc interacts with the database. Create such a struct with turbodbc.make_options(), or leave this blank to use the defaults.
  • connection_string – Preformatted ODBC connection string. Specifying this and dsn or kwargs at the same time raises ParameterError.
  • **kwargs – You may specify additional options as you please. These options will go into the connection string that identifies the database. Valid options depend on the specific database you would like to connect with (e.g. user and password, or uid and pwd)
Returns:

A connection to your database

turbodbc.make_options(read_buffer_size=None, parameter_sets_to_buffer=None, prefer_unicode=None, use_async_io=None, autocommit=None)

Create options that control how turbodbc interacts with a database. These options affect performance for the most part, but some options may require adjustment so that all features work correctly with certain databases.

If a parameter is set to None, this means the default value is used.

Parameters:
  • read_buffer_size – Affects performance. Controls the size of batches fetched from the database when reading result sets. Can be either an instance of turbodbc.Megabytes (recommended) or turbodbc.Rows.
  • parameter_sets_to_buffer – Affects performance. Number of parameter sets (rows) which shall be transferred to the server in a single batch when executemany() is called. Must be an integer.
  • use_async_io – Affects performance. Set this option to True if you want to use asynchronous I/O, i.e., while Python is busy converting database results to Python objects, new result sets are fetched from the database in the background.
  • prefer_unicode – May affect functionality and performance. Some databases do not support strings encoded with UTF-8, leading to UTF-8 characters being misinterpreted, misrepresented, or downright rejected. Set this option to True if you want to transfer character data using the UCS-2/UTF-16 encoding, which uses two-byte instead of one-byte code units.
  • autocommit – Affects behavior. If set to True, all queries and commands executed with cursor.execute() or cursor.executemany() are followed by an implicit commit operation, persisting any changes made to the database. If not set or set to False, users have to take care of calling cursor.commit() themselves.
Returns:

An option struct that is suitable to pass to the turbodbc_options parameter of turbodbc.connect()
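Taken together, a typical call might look like the following sketch. This is a configuration fragment only; the DSN name, credentials, and buffer size are placeholders, not recommendations, and the snippet cannot run without an ODBC data source:

```python
import turbodbc

options = turbodbc.make_options(
    read_buffer_size=turbodbc.Megabytes(100),  # placeholder size
    prefer_unicode=True,                       # e.g. if your driver rejects UTF-8
    autocommit=False,
)
connection = turbodbc.connect(
    dsn="MyDatabase",   # placeholder DSN from odbc.ini
    uid="user",         # placeholder credentials passed via **kwargs
    pwd="secret",
    turbodbc_options=options,
)
```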

class turbodbc.connection.Connection(impl)
autocommit

This attribute controls whether changes are automatically committed after each execution or not.

close()

Close the connection and all associated cursors. This will implicitly roll back any uncommitted operations.

commit()

Commits the current transaction

cursor()

Create a new Cursor instance associated with this Connection

Returns:A new Cursor instance
rollback()

Roll back all changes in the current transaction
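Since turbodbc follows PEP 249, the usual commit/rollback pattern applies. The sketch below uses the stdlib sqlite3 module (another PEP 249 implementation) so it runs without an ODBC setup; with turbodbc you would obtain the connection from turbodbc.connect() instead:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x INTEGER)")
connection.commit()

cursor.execute("INSERT INTO t VALUES (1)")
connection.rollback()  # discard the uncommitted insert

cursor.execute("SELECT COUNT(*) FROM t")
print(cursor.fetchone()[0])  # 0
```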

class turbodbc.cursor.Cursor(impl)

This class allows you to send SQL commands and queries to a database and retrieve associated result sets.

close()

Close the cursor.

description

Retrieve a description of the columns in the current result set

Returns:A tuple of seven elements. Only some elements are meaningful:
  • Element #0 is the name of the column
  • Element #1 is the type code of the column
  • Element #6 is true if the column may contain NULL values
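Because this seven-element layout is mandated by PEP 249, it can be illustrated with the stdlib sqlite3 module; turbodbc cursors expose the same attribute, and additionally fill in the type code and nullability elements that sqlite3 leaves as None:

```python
import sqlite3

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("SELECT 42 AS answer")
for column in cursor.description:
    name = column[0]      # element #0: column name
    type_code = column[1] # element #1: type code (None under sqlite3)
    print(name)           # answer
```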
execute(sql, parameters=None)

Execute an SQL command or query

Parameters:
  • sql – A (unicode) string that contains the SQL command or query. If you would like to use parameters, please use a question mark ? at the location where the parameter shall be inserted.
  • parameters – An iterable of parameter values. The number of values must match the number of parameters in the SQL string.
Returns:

The Cursor object to allow chaining of operations.
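The question-mark parameter style and the chaining behavior look like this; the stdlib sqlite3 module (also PEP 249 compliant, also using ? placeholders) stands in here for a cursor obtained from turbodbc.connect():

```python
import sqlite3

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("CREATE TABLE t (x INTEGER)")
cursor.execute("INSERT INTO t VALUES (?)", [17])
# execute() returns the cursor, so calls can be chained:
rows = cursor.execute("SELECT x FROM t WHERE x = ?", [17]).fetchall()
print(rows)  # [(17,)]
```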

executemany(sql, parameters=None)

Execute an SQL command or query with multiple parameter sets.

Parameters:
  • sql – A (unicode) string that contains the SQL command or query. If you would like to use parameters, please use a question mark ? at the location where the parameter shall be inserted.
  • parameters – An iterable of iterable of parameter values. The outer iterable represents separate parameter sets. The inner iterable contains parameter values for a given parameter set. The number of values of each set must match the number of parameters in the SQL string.
Returns:

The Cursor object to allow chaining of operations.
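The nesting of the parameters argument (one inner iterable per parameter set) can be sketched as follows, again with sqlite3 standing in for a turbodbc cursor:

```python
import sqlite3

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("CREATE TABLE t (x INTEGER, y TEXT)")
# One inner iterable per parameter set (row); values match the two ? markers:
parameter_sets = [(1, "one"), (2, "two"), (3, "three")]
cursor.executemany("INSERT INTO t VALUES (?, ?)", parameter_sets)
cursor.execute("SELECT COUNT(*) FROM t")
print(cursor.fetchone()[0])  # 3
```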

fetchall()

Fetches a list of all rows in the active result set generated with execute() or executemany().

Returns:A list of rows
fetchallnumpy()

Fetches all rows in the active result set generated with execute() or executemany().

Returns:An OrderedDict of columns, where the keys of the dictionary are the column names. The columns are of NumPy’s MaskedArray type, where the optimal data type for each result set column is chosen automatically.
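To illustrate the shape of the return value, the snippet below merely constructs an equivalent structure with NumPy directly; no database is involved, and the column names are invented:

```python
from collections import OrderedDict

import numpy.ma

# What a hypothetical two-column result set might come back as:
result = OrderedDict([
    ("id", numpy.ma.MaskedArray([1, 2, 3], mask=[False, False, False])),
    ("score", numpy.ma.MaskedArray([1.5, 2.5, 3.5], mask=[False, True, False])),
])
print(list(result))             # ['id', 'score']
print(result["score"].mask[1])  # True (this value was NULL in the database)
```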
fetchmany(size=None)

Fetches a batch of rows in the active result set generated with execute() or executemany().

Parameters:size – Controls how many rows are returned. The default None means that the value of Cursor.arraysize is used.
Returns:A list of rows
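Repeated calls walk through the result set batch by batch; the sketch uses sqlite3 as a stand-in for a turbodbc cursor:

```python
import sqlite3

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("CREATE TABLE t (x INTEGER)")
cursor.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
cursor.execute("SELECT x FROM t ORDER BY x")

batch = cursor.fetchmany(2)  # first two rows
print(batch)                 # [(0,), (1,)]
print(cursor.fetchmany(2))   # [(2,), (3,)]
```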
fetchnumpybatches()

Returns an iterator over all rows in the active result set generated with execute() or executemany().

Returns:An iterator you can use to iterate over batches of rows of the result set. Each batch consists of an OrderedDict of NumPy MaskedArray instances. See fetchallnumpy() for details.
fetchone()

Returns a single row of a result set. Requires an active result set on the database generated with execute() or executemany().

Returns:Returns None when no more rows are available in the result set
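The None sentinel makes the classic row-at-a-time loop possible (once more with sqlite3 standing in for a turbodbc cursor):

```python
import sqlite3

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("CREATE TABLE t (x INTEGER)")
cursor.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
cursor.execute("SELECT x FROM t ORDER BY x")

rows = []
while True:
    row = cursor.fetchone()
    if row is None:  # result set exhausted
        break
    rows.append(row)
print(rows)  # [(1,), (2,)]
```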
setinputsizes(sizes)

Has no effect since turbodbc automatically picks appropriate input types and sizes. Method exists since PEP-249 requires it.

setoutputsize(size, column=None)

Has no effect since turbodbc automatically picks appropriate output types and sizes. Method exists since PEP-249 requires it.

class turbodbc.exceptions.Error

turbodbc’s basic error class

class turbodbc.exceptions.InterfaceError

An error that is raised whenever you use turbodbc incorrectly

class turbodbc.exceptions.DatabaseError

An error that is raised when the database encounters an error while processing your commands and queries

class turbodbc.exceptions.ParameterError

An error that is raised when you use connection arguments that are supposed to be mutually exclusive
