dask.dataframe.to_sql

dask.dataframe.to_sql(df, name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False)[source]

Store Dask Dataframe to a SQL table

An empty table is created based on the “meta” DataFrame (and conforming to the caller’s “if_exists” preference), and then each block calls pd.DataFrame.to_sql (with if_exists=”append”).

Databases supported by SQLAlchemy [1] are supported. Tables can be newly created, appended to, or overwritten.

Parameters
namestr

Name of SQL table.

uristring

Full sqlalchemy URI for the database connection

schemastr, optional

Specify the schema (if database flavor supports this). If None, use default schema.

if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’

How to behave if the table already exists.

  • fail: Raise a ValueError.

  • replace: Drop the table before inserting new values.

  • append: Insert new values to the existing table.

indexbool, default True

Write DataFrame index as a column. Uses index_label as the column name in the table.

index_labelstr or sequence, default None

Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksizeint, optional

Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

dtypedict or scalar, optional

Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

method{None, ‘multi’, callable}, optional

Controls the SQL insertion clause used:

  • None : Uses standard SQL INSERT clause (one per row).

  • ‘multi’: Pass multiple values in a single INSERT clause.

  • callable with signature (pd_table, conn, keys, data_iter).

Details and a sample callable implementation can be found in the section insert method.

computebool, default True

When true, call dask.compute and perform the load into SQL; otherwise, return a Dask object (or array of per-block objects when parallel=True)

parallelbool, default False

When true, have each block append itself to the DB table concurrently. This can result in DB rows being in a different order than the source DataFrame’s corresponding rows. When false, load each block into the SQL DB in sequence.

Raises
ValueError

When the table already exists and if_exists is ‘fail’ (the default).

See also

read_sql

Read a DataFrame from a table.

Notes

Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone.

New in version 0.24.0.

References

1

https://docs.sqlalchemy.org

2

https://www.python.org/dev/peps/pep-0249/

Examples

Create a table from scratch with 4 rows.

>>> import pandas as pd
>>> df = pd.DataFrame([ {'i':i, 's':str(i)*2 } for i in range(4) ])
>>> from dask.dataframe import from_pandas
>>> ddf = from_pandas(df, npartitions=2)
>>> ddf  
Dask DataFrame Structure:
                   i       s
npartitions=2
0              int64  object
2                ...     ...
3                ...     ...
Dask Name: from_pandas, 2 tasks
>>> from dask.utils import tmpfile
>>> from sqlalchemy import create_engine
>>> with tmpfile() as f:
...     db = 'sqlite:///%s' %f
...     ddf.to_sql('test', db)
...     engine = create_engine(db, echo=False)
...     result = engine.execute("SELECT * FROM test").fetchall()
>>> result
[(0, 0, '00'), (1, 1, '11'), (2, 2, '22'), (3, 3, '33')]