Performance, compatibility, and behavior options¶
Turbodbc offers a way to adjust its behavior to tune performance and to achieve compatibility with your database. The basic usage is this:
>>> from turbodbc import connect, make_options >>> options = make_options() >>> connect(dsn="my_dsn", turbodbc_options=options)
This will connect with your database using the default options. To use non-default
options, supply keyword arguments to
>>> from turbodbc import Megabytes >>> options = make_options(read_buffer_size=Megabytes(100), ... parameter_sets_to_buffer=1000, ... varchar_max_character_limit=10000, ... use_async_io=True, ... prefer_unicode=True, ... autocommit=True, ... large_decimals_as_64_bit_types=True, ... limit_varchar_results_to_max=True)
Read buffer size¶
read_buffer_size affects how many result set rows are retrieved per batch
of results. Set the attribute to
turbodbc.Megabytes(42) to have turbodbc determine
the optimal number of rows per batch so that the total buffer amounts to
42 MB. This is recommended for most users and databases. You can also set
the attribute to
turbodbc.Rows(13) if you would like to fetch results in
batches of 13 rows. By default, turbodbc fetches results in batches of 20 MB.
Please note that sometimes a single row of a result set may exceed the specified
buffer size. This can happen if large fields such as
are part of the result set. In this case, results are fetched in batches of single rows
that exceed the specified size. Buffer sizes for large text fields can be controlled
with the VARCHAR(max) character limit and XXX options.
Buffered parameter sets¶
parameter_sets_to_buffer changes the number of parameter sets
which are transferred per batch of parameters (e.g., as sent with
Please note that it is not (yet) possible to use the Megabytes and Rows classes
VARCHAR(max) character limit¶
varchar_max_character_limit specifies the buffer size for result set columns
NVARCHAR(max), or similar types your database supports.
Small values increase the chance of truncation, large ones require more memory. Depending
on your setting of
read_buffer_size, this may increase the total memory consumption
or reduce the number of rows fetched per batch, thus affecting performance.
The default value is
This value does not affect fields of type
n > 0, unless
the option Limit VARCHAR results to MAX is set. Also, this
option does not affect parameters that you may pass to the database.
If you set
True, turbodbc will use asynchronous I/O operations
(limited to result sets for the time being). Asynchronous I/O means that while the
main thread converts result set rows retrieved from the database to Python
objects, another thread fetches a new batch of results from the database in the background. This may yield
a speedup of
2 if retrieving and converting are similarly fast
Asynchronous I/O is experimental and has to fully prove itself yet. Do not be afraid to give it a try, though.
True if your database does not fully support
the UTF-8 encoding turbodbc prefers. With this option you can tell turbodbc
to use two-byte character strings with UCS-2/UTF-16 encoding. Use this option
if you try to connection to Microsoft SQL server (MSSQL).
True if you want the database to
changes automatically after each query or command. By default,
is disabled and users are required to call
cursor.commit() to persist
Some databases that do not support transactions may even require this
option to be set to
True in order to establish a connection at all.
Large decimals as 64 bit types¶
True if you want to retrieve
Numeric types with more than
18 digits as the 64 bit
float numbers. The default is to retrieve such fields
as strings instead.
Please note that this option may lead to overflows or loss of precision. If, however, your data type is much larger than the data it is supposed to hold, this option is very useful to obtain numeric Python objects and NumPy arrays.
Limit VARCHAR results to MAX¶
True if you want to limit all
string-like fields (
NVARCHAR(n), etc. with
n > 0) in
result sets to a maximum of VARCHAR(max) character limit characters.
Please note that enabling this option can lead to truncation of string-like data when retrieving results. Parameters sent to the database are not affected by this option.
If not set or set to
False, string-like result fields with a specific size will
always be retrieved with a sufficiently large buffer so that no truncation occurs.
String-like fields of indeterminate size (
TEXT, etc. on some
databases) are still subject to VARCHAR(max) character limit.
Controlling autocommit behavior at runtime¶
You can enable and disable autocommit mode after you have established a connection, and you can also check whether autocommit is currently enabled:
>>> from turbodbc import connect >>> connection = connect(dsn="my DSN") >>> connection.autocommit = True [... more things happening ...] >>> if not connection.autocommit: ... connection.commit()
Turbodbc’s NumPy support requires the
numpy package to be installed. For all source builds,
Numpy needs to be installed before installing turbodbc.
Please check the installation instructions
for more details.
Obtaining NumPy result sets all at once¶
Here is how to use turbodbc to retrieve the full result set in the form of NumPy masked arrays:
>>> cursor.execute("SELECT A, B FROM my_table") >>> cursor.fetchallnumpy() OrderedDict([('A', masked_array(data = [42 --], mask = [False True], fill_value = 999999)), ('B', masked_array(data = [3.14 2.71], mask = [False False], fill_value = 1e+20))])
Obtaining NumPy result sets in batches¶
You can also fetch NumPy result sets in batches using an iterable:
>>> cursor.execute("SELECT A, B FROM my_table") >>> batches = cursor.fetchnumpybatches() >>> for batch in batches: ... print(batch) OrderedDict([('A', masked_array(data = [42 --], mask = [False True], fill_value = 999999)), ('B', masked_array(data = [3.14 2.71], mask = [False False], fill_value = 1e+20))])
The size of the batches depends on the
read_buffer_size attribute set in
the performance options.
Notes regarding NumPy result sets¶
- NumPy results are returned as an
OrderedDictof column name/value pairs. The column order is the same as in your query.
- The column values are of type
NULLvalues you have in your database will show up as masked entries (
NULLvalues in string-like columns will shop up as
The following table shows how the most common data types data scientists are interested in are converted to NumPy columns:
|Database type(s)||Python type|
*) The conversion depends on turbodbc’s
Using NumPy arrays as query parameters¶
Here is how to use turbodbc to use values stored in NumPy arrays
as query parameters with
>>> from numpy import array >>> from numpy.ma import MaskedArray >>> normal_param = array([1, 2, 3], dtype='int64') >>> masked_param = MaskedArray([3.14, 1.23, 4.56], ... mask=[False, True, False], ... dtype='float64') >>> cursor.executemanycolumns("INSERT INTO my_table VALUES (?, ?)", ... [normal_param, masked_param]) # functionally equivalent, but much faster than: # cursor.execute("INSERT INTO my_table VALUES (1, 3.14)") # cursor.execute("INSERT INTO my_table VALUES (2, NULL)") # cursor.execute("INSERT INTO my_table VALUES (3, 4.56)") >>> cursor.execute("SELECT * FROM my_table").fetchall() [[1L, 3.14], [2L, None], [3L, 4.56]]
- Columns must either be of type
- Each column must contain one-dimensional, contiguous data.
- All columns must have equal size.
dtypeof each column must be supported, see the table below.
MaskedArray``s with and set the ``maskto
Truefor individual elements to use
- Data is transfered in batches (see Buffered parameter sets)
|Supported NumPy type||Transferred as|
Apache Arrow support¶
Turbodbc’s Apache Arrow support requires the
pyarrow package to be installed.
For all source builds, Apache Arrow needs to be installed before installing turbodbc.
Please check the installation instructions
for more details.
Apache Arrow support in turbodbc is still experimental and may not be as efficient as possible yet. Also, Apache Arrow support is not yet available for Windows and has some issues with Unicode fields. Stay tuned for upcoming improvements.
Obtaining Apache Arrow result sets¶
Here is how to use turbodbc to retrieve the full result set in the form of an Apache Arrow table:
>>> cursor.execute("SELECT A, B FROM my_table") >>> table = cursor.fetchallarrow() >>> table pyarrow.Table A: int64 B: string >>> table.to_pylist()  >>> table.to_pylist() [u'hello']
Looking at the data like this is not particularly useful. However, there is some really useful stuff you can do with an Apache Arrow table, for example, convert it to a Pandas dataframe like this:
>>> table.to_pandas() A B 0 42 hello
As a performance optimisation for string columns, you can specify the parameter
strings_as_dictionary. This will retrieve all string columns as Arrow
DictionaryArray. The data will here be split into two arrays, one that stores
all unique string values and one integer array that stores for each row the index
in the dictionary. On converions to Pandas, these columns will be turned into
>>> cursor.execute("SELECT a, b FROM my_other_table") >>> table = cursor.fetchallarrow(strings_as_dictionary=True) >>> table pyarrow.Table a: int64 b: dictionary<values=binary, indices=int8, ordered=0> dictionary: [61, 62] >>> table.to_pandas() a b 0 1 a 1 2 b 2 3 b >>> table.to_pandas().info() <class 'pandas.core.frame.DataFrame'> Int64Index: 3 entries, 0 to 2 Data columns (total 2 columns): a 3 non-null int64 b 3 non-null category dtypes: category(1), int64(1) memory usage: 147.0 bytes