Getting started¶

Installation¶

Linux and OSX¶

To install turbodbc on Linux and OSX, please use the following command:

pip install turbodbc

This will trigger a source build that requires compiling C++ code. Please make sure the following prerequisites are met:

Requirement	Linux (`apt-get install`)	OSX (`brew install`)
C++11 compiler	G++-4.8 or higher	clang with OSX 10.9+
Boost library + headers (1)	`libboost-all-dev`	`boost`
ODBC library + headers	`unixodbc-dev`	`unixodbc`
Python headers	`python-dev`	use `pyenv` to install

Please pip install numpy before installing turbodbc, because turbodbc will search for the numpy Python package at installation/compile time. If NumPy is not installed, turbodbc will not compile the optional NumPy support features. Similarly, please pip install pyarrow before installing turbodbc if you would like to use the optional Apache Arrow support.

(1) The minimum viable Boost setup requires the libraries variant, optional, datetime, and locale.

Windows¶

To install turbodbc on Windows, please use the following command:

pip install turbodbc

This will download and install a binary wheel, no compilation required. You still need to meet the following prerequisites, though:

Requirement	Windows
OS Bitness	64-bit
Python	3.5 or 3.6, 64-bit
Runtime	MSVS 2015 Update 3 Redistributable, 64 bit

If you require NumPy support, please

pip install numpy

Sometime after installing turbodbc. Apache Arrow support is not yet available on Windows.

Basic usage¶

Turbodbc follows the specification of the Python database API v2 (PEP 249). Here is a short summary, including the parts not specified by the PEP.

Establish a connection with your database¶

All ODBC appications, including turbodbc, use connection strings to establish connections with a database. If you know how the connection string for your database looks like, use the following lines to establish a connection:

>>> from turbodbc import connect
>>> connection = connect(connection_string='Driver={PostgreSQL};Server=IP address;Port=5432;Database=myDataBase;Uid=myUsername;Pwd=myPassword;')

If you do not specify the connection_string keyword argument, turbodbc will create a connection string based on the keyword arguments you pass to connect:

>>> from turbodbc import connect
>>> connection = connect(dsn='My data source name as defined by your ODBC configuration')

The dsn is the data source name of your connection. Data source names uniquely identify connection settings that shall be used to connect with a database. Data source names are part of your ODBC configuration and you need to set them up yourself. Once set up, however, all ODBC applications can use the same data source name to refer to the same set of connection options, typically including the host, database, driver settings, and sometimes even credentials. If your ODBC environment is set up properly, just using the dsn option should be sufficient.

You can add extra options besides the dsn to overwrite or add settings:

>>> from turbodbc import connect
>>> connection = connect(dsn='my dsn', user='my user has precedence')
>>> connection = connect(dsn='my dsn', username='field names depend on the driver')

Last but not least, you can also do without a dsn and just specify all required configuration options directly:

>>> from turbodbc import connect
>>> connection = connect(driver="PostgreSQL",
...                      server="hostname",
...                      port="5432",
...                      database="myDataBase",
...                      uid="myUsername",
...                      pwd="myPassword")

Executing SQL queries and retrieving results¶

To execute a query, you need to create a cursor object first:

>>> cursor = connection.cursor()

This cursor object lets you execute SQL commands and queries. Here is how to execute a SELECT query:

>>> cursor.execute('SELECT 42')

You have multiple options to retrieve the generated result set. For example, you can iterate over the cursor:

>>> for row in cursor:
...     print row
[42L]

Alternatively, you can fetch all results as a list of rows:

>>> cursor.fetchall()
[[42L]]

You can also retrieve result sets as NumPy arrays or Apache Arrow tables, see Advanced usage.

Executing manipulating SQL queries¶

As before, you need to create a cursor object first:

>>> cursor = connection.cursor()

You can now execute a basic INSERT query:

>>> cursor.execute("INSERT INTO TABLE my_integer_table VALUES (42, 17)")

This will insert two values, 42 and 17, in a single row of table my_integer_table. Inserting values like this is impractical, because it requires to put the values into the actual SQL string.

To avoid this, you can pass parameters to execute():

>>> cursor.execute("INSERT INTO TABLE my_integer_table VALUES (?, ?)",
...                [42, 17])

Please note the question marks ? in the SQL string that marks two parameters. Adding single rows at a time is not efficient. You can add more than just a single row to a table in efficiently by using executemany():

>>> parameter_sets = [[42, 17],
...                   [23, 19],
...                   [314, 271]]
>>> cursor.executemany("INSERT INTO TABLE my_integer_table VALUES (?, ?)",
...                    parameter_sets)

If you already have parameters stored as NumPy arrays, check the Using NumPy arrays as query parameters section to use them even more efficiently.

Transactions¶

By default, turbodbc does not enable automatic commits (autocommit). To commit your changes to the database, please use the following command:

>>> connection.commit()

If you want to roll back your changes, use the following command:

>>> connection.rollback()

If you prefer autocommit for your workflow or your database does not support transactions at all, you can use the autocommit option.

Supported data types¶

Turbodbc supports the most common data types data scientists are interested in. The following table shows which database types are converted to which Python types:

Database type(s)	Python type
Integers, `DECIMAL(<19,0)`	`int`
`DOUBLE`, `DECIMAL(<19, >0)`	`float`
`DOUBLE`, `DECIMAL(>18, 0)`	`unicode` (`str`) or `int` *
`DOUBLE`, `DECIMAL(>18, >0)`	`unicode` (`str`) or `float` *
`BIT`, boolean-like	`bool`
`TIMESTAMP`, `TIME`	`datetime.datetime`
`DATE`	`datetime.date`
`VARCHAR`, strings	`unicode` (`str`)

*) The conversion depends on turbodbc’s large_decimals_as_64_bit_types option.

When using parameters with execute() and executemany(), the table is basically reversed. The first type in the “database type(s)” column denotes the type used to transfer back data. For integers, 64-bit integers are transferred. For strings, the length of the transferred VARCHAR depends on the length of the transferred strings.

What to read next¶

Continue with the advanced usage section. Besides general tuning parameters it also discusses how to leverage NumPy or Apache Arrow for even better performance.