How to use python with databases

How to use python with databases

Today we are going to explore some of the ways you can connect to sql databases with python. A few examples of why you would want to do this are:

  • Querying large data sets without the need to store the data in an intermediate step.
  • To combine data from several disparate databases.
  • To use as a wrapper for an application (sqlite, mariadb, etc)
  • Extracting data from an application with a hosted database.

Let’s dive in and explore the different ways of extracting data and show how to store it.

Oracle

The package that I like to use is cx_Oracle, you can install this with the pip installer:

pip install cx_Oracle

Oracle uses JDBC server-side internal driver supports any Java code that runs inside an Oracle database, such as in a Java stored procedures or Enterprise JavaBean, and must access the same database. This driver allows the Java virtual machine (JVM) to communicate directly with the SQL engine.

You will either need to have this driver as an environmental variable (preferred), or explicitly declare it for the python version you are running to acknowledge. If you plan to declare where this driver is located, make sure the version of cx_orcale has the init_oracle_client function. You can check that this is available with a simple statement

from cx_Oracle import init_oracle_client

If you receive an error message update your version of cx_oracle with

pip install cx_Oracle –-upgrade

Once you have the correct version available, declare where your drivers are located: cx_Oracle.init_oracle_client(lib_dir= r"c:\path_to_libraries"), this line will only need to be run once in your script.

To make a connection, create the following objects, and then you can write a query like you would in any sql tool.

dsn = cx_Oracle.makedsn("dbhost.example.com", 1521, service_name="orclpdb1")
connection = cx_Oracle.connect("hr", userpwd, dsn, encoding="UTF-8")
cur = connection.cursor()
query = """
select
user
,employee
,id
from tables
where 1=1
"""

There are a few different ways you can execute your query. I prefer to grab the data as a data frame and use the pandas function pd.read_sql_query(query, connection). This is the easiest formatting for standard data frames. You can also use cur.execute(), and then run .fecthone(), .fetchmany(), or .fetchall().

Finally, make sure to close the connection with connection.close() before you finish your script.

MSSQL

Like Oracle, MSSQL will have a similar setup. I like Pyodbc which simplifies the drivers required, making initial installation faster. For installation:

pip install pyodbc

The connection strings will be a little different, but have a similar structure.

server = 'tcp:myserver.database.windows.net' 
database = 'mydb' 
username = 'myusername' 
password = 'mypassword' 
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()

The cursor will be necessary if you plan to have isolated .execute() or .fetch() statements. Otherwise I suggest just using pd.read_sql_query() to reduce your code and put this directly into a data frame. Once again, you will want to close the connection with cnxn.close().

Sqlite

Connecting to the lighter-weight databases are much easier. There are a number of reasons that this may be used outside of traditional data storage. For example, utilizing sqlite to structure and organize .json formatted data may be helpful. Or you may wish to use a sqlite database within an application. Supporting up to a terabyte of data, this can be a viable option for many different projects.

To connect with a sqlite file you need very few commands, and this adds to the strength of this particular method. The library needed for this will be sqlite3 and is installed with the standard python library set, so no special installation needed.

import sqlite3

conn = sqlite3.connect('my_db.sqlite')
cur = conn.cursor()

With the “cur” you can execute a the standard commands.

cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')

cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))

The above statement is also an example of how to avoid sql injection if you are using sqlite with an application. These commands will allow you to run any sql statement supported, and will work with pd.read_sql_query().

cur.executescript('''
DROP TABLE IF EXISTS User;
DROP TABLE IF EXISTS Member;
DROP TABLE IF EXISTS Course;

CREATE TABLE User (
    id     INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name   TEXT UNIQUE
);

CREATE TABLE Course (
    id     INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title  TEXT UNIQUE
);

CREATE TABLE Member (
    user_id     INTEGER,
    course_id   INTEGER,
    role        INTEGER,
    PRIMARY KEY (user_id, course_id)
)
''')

Mariadb

Finally mariadb has an equally easy process for installation and connection:

pip install mariadb

conn = mariadb.connect(
        user="db_user",
        password="db_user_passwd",
        host="192.0.2.1",
        port=3306,
        database="employees"
    )
cur = conn.cursor()

Combining data from multiple databases to create a single data frame

Let’s create a simple merge of two data frames from different databases where we don’t have a linked server. Here is a great use-case for utilizing python. Here we will use a Oracle database and a MSSQL database.

dsn_1 = cx_Oracle.makedsn("dbhost.example_1.com", 1521, service_name="orclpdb1")
cnxn_oracle = cx_Oracle.connect("hr", userpwd, dsn, encoding="UTF-8")

server = 'tcp:myserver.database.windows.net' 
database = 'mydb' 
username = 'myusername' 
password = 'mypassword' 
cnxn_mssql = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)


oracle_query = """
select employees, id, role, salary from eployees"""

mssql_query = """
select employee, id, start_date, end_date from empl_table """

set1 = pd.read_sql_query(oracle_query, cnxn_oracle)
set2 = pd.read_sql_query(mssql_query, cnxn_mssql)

cnxn_oracle.close()
cnxn_mssql.close()

merge_df = pd.merge(set1, set2, how="inner", left_on = "id", right_on="id")

Once you have this data combined, it is important to create artifacts that you can refer back to. Data may change over time, making a binary file of this set is a simple way to store data for auditing, or to recall later without making the database connections. I prefer to use the python library pickle, but there are a number of others that are equally useful. I like pickle since pandas unpacks the data with a single line of code, and pandas is a library I use daily. Pickle is also part of the standard python library, so another bonus if you use multiple environments and want to simplify package setup.

import pickle

def pickled(filename, data):
	outfile = open(f'folder/{filename}', 'wb')
	pickle.dump(data,outfile)
	outfile.close()

pickled(merge_df.data, merge_df)

To extract this data you can use df = pd.read_pickle('folder/merge_df.data').

Passing data from one query to another

Let’s use our oracle/mssql example again.

dsn_1 = cx_Oracle.makedsn("dbhost.example_1.com", 1521, service_name="orclpdb1")
cnxn_oracle = cx_Oracle.connect("hr", userpwd, dsn, encoding="UTF-8")

server = 'tcp:myserver.database.windows.net' 
database = 'mydb' 
username = 'myusername' 
password = 'mypassword' 
cnxn_mssql = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)

oracle_query = """
select employees, id, role, salary from eployees"""

or_df = pd.read_sql_query(oracle_query, cnxn_oracle)

Now that there is a data frame or_df that contains the employee id, we would like to query the ids another database. To do so, we need to format the structure. I find the following function helpful.

def sql_output(df, column):
	lst = []
	for ind, row in df.iterrows():
		lst.append(row[column])

	count = len(lst)
	if count>1000:
		print('query cannot support parameters greater than 1000 objects')
		return None
	output = ''
	for y in lst:
		if count>1:
			output = output+"'"+str(y)+"'"+","+" "
		else:
			output = output+"'"+str(y)+"'"
		count+=1
	return output

To apply this we simply grab the column, create the parameter and pass it into the next query.

sql_param = sql_output(or_df, "id")

mssql_query = f"""
select employee, id, start_date, end_date 
from empl_table 
where id in {sql_param}
"""
ms_df = pd.read_sql_query(mssql_query, cnxn_mssql)

Finally, don’t forgot to close your connections:

cnxn_oracle.close()
cnxn_mssql.close()

I hope you found these examples helpful, and see how simple it is to interact with databases from a python script. There are many different use-cases where interaction with databases is required. When using python you can eliminate multiple steps to organize and merge data, as well as chain queries together to limit outputs.

2 thoughts on “How to use python with databases

    1. Glad you like it! Hopefully you can make use of this code to start data munging. If you have any questions let me know!

Comments are closed.

Comments are closed.