Grizly is a highly experimental library that generates SQL statements using the Python pandas API.
The main class in grizly is called QFrame. You can load basic table information from a dictionary or an Excel file.
from grizly import QFrame
data = {'select': {
'fields': {
'CustomerId': {'type': 'dim', 'as': 'ID'},
'CustomerName': {'type': 'dim'},
'Country': {'type': 'dim'},
'FiscalYear':{'type': 'dim'},
'Sales': {'type': 'num'}
},
'schema': 'sales_schema',
'table': 'sales_table'
}
}
q = QFrame().from_dict(data)
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
CustomerName,
Country,
FiscalYear,
Sales
FROM sales_schema.sales_table
- Renaming fields
q.rename({'Country': 'Territory', 'FiscalYear': 'FY'})
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
CustomerName,
Country AS Territory,
FiscalYear AS FY,
Sales
FROM sales_schema.sales_table
- Removing fields
q.remove(['CustomerName', 'FiscalYear'])
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
Sales
FROM sales_schema.sales_table
- Adding WHERE clause
q.query("Country IN ('France', 'Germany')")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
- Aggregating fields
q.groupby(['CustomerId', 'Country'])['Sales'].agg('sum')
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory
- Adding expressions
q.assign(type='num', group_by='sum', Sales_Div="Sales/100")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory
q.assign(group_by='group', Sales_Positive="CASE WHEN Sales>0 THEN 1 ELSE 0 END")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
- Adding DISTINCT statement
q.distinct()
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
- Adding ORDER BY statement
q.orderby("Sales")
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
ORDER BY Sales
q.orderby(["Country", "Sales"], False)
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
ORDER BY Territory DESC,
Sales DESC
Now we will load the field information from an Excel file. Your Excel file must contain the following columns:
- column - Name of the column in the table.
- column_type - Type of the column. Options:
  - dim - VARCHAR(500)
  - num - FLOAT
  Every column has to have a specified type. If you want to specify another type, see custom_type.
- expression - Expression, e.g. a CASE statement, a column operation, a CONCAT statement, etc. When using an expression, column should be empty and the alias (name) of the expression should be placed in column_as.
- column_as - Column alias (name).
- group_by - Aggregation type. Options:
  - group - This field will go to the GROUP BY statement.
  - {sum, count, min, max, avg} - This field will be aggregated in the specified way.
  Please make sure that every field placed in the SELECT statement (select != 0) is also placed in GROUP BY. If you don't want to aggregate fields, leave group_by empty.
- select - Set to 0 to remove this field from the SELECT statement.
- custom_type - Custom SQL data type, e.g. DATE.
- schema - Name of the schema. Always in the first row.
- table - Name of the table. Always in the first row.
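The layout described above can be sketched as a small pandas DataFrame. This is a hypothetical reconstruction of what the sales_fields.xlsx sheet used below might contain (the row values mirror the sales example that follows; the exact file contents are an assumption, not part of this document):

```python
import pandas as pd

# Hypothetical contents of the 'sales' sheet in sales_fields.xlsx,
# following the column layout described above.
# schema and table are filled in only on the first row.
fields = pd.DataFrame({
    "column":      ["CustomerId", "TransactionDate", "Sales", None],
    "column_type": ["dim", "dim", "num", "num"],
    "expression":  [None, None, None, "CASE WHEN Sales > 0 THEN 1 ELSE 0 END"],
    "column_as":   ["Id", None, None, "Sales_Positive"],
    "group_by":    ["group", "group", "sum", "count"],
    "select":      [1, 1, 1, 1],
    "custom_type": [None, None, None, None],
    "schema":      ["sales_schema", None, None, None],
    "table":       ["sales_table", None, None, None],
})

print(fields)

# Writing the sheet out requires an Excel engine such as openpyxl:
# fields.to_excel("sales_fields.xlsx", sheet_name="sales", index=False)
```

Note how the last row describes a pure expression: column is empty, the alias lives in column_as, and group_by applies the aggregation (count) around the expression.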
Now let's take a look at the following example.
In this case we also have a scope column, which is used to filter rows in the Excel file; thanks to it we are able to create two different QFrames from sales_table. The first QFrame contains information about the customer:
customer_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'customer'")
customer_qf.get_sql()
print(customer_qf.sql)
SELECT CustomerId AS Id,
LastName || FirstName AS CustomerName,
Email,
Country
FROM sales_schema.sales_table
The second QFrame contains information about the customer's sales:
sales_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'sales'")
sales_qf.get_sql()
print(sales_qf.sql)
SELECT CustomerId AS Id,
TransactionDate,
sum(Sales) AS Sales,
count(CASE
WHEN Sales > 0 THEN 1
ELSE 0
END) AS Sales_Positive
FROM sales_schema.sales_table
GROUP BY Id,
TransactionDate
We will be using chinook.db to visualize the data.
import os

engine_string = "sqlite:///" + os.path.join(os.getcwd(), "chinook.db")
The first table is the tracks table. We will order by Name to mix the rows and take only 10 records for a better visual effect.
tracks = { 'select': {
'fields': {
'TrackId': { 'type': 'dim'},
'Name': {'type': 'dim'},
'AlbumId': {'type': 'dim'},
'Composer': {'type': 'dim'},
'UnitPrice': {'type': 'num'}
},
'table': 'tracks'
}
}
tracks_qf = QFrame(engine=engine_string).from_dict(tracks).orderby('Name').limit(10)
print(tracks_qf.get_sql().sql)
display(tracks_qf.to_df())
SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10
TrackId | Name | AlbumId | Composer | UnitPrice |
---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 |
2918 | "?" | 231 | None | 1.99 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 |
3254 | #9 Dream | 255 | None | 0.99 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 |
The second table is the playlist_track table. It contains more than 8k records; here you can see only a sample of 10 records.
playlist_track = { "select": {
"fields":{
"PlaylistId": {"type" : "dim"},
"TrackId": {"type" : "dim"}
},
"table" : "playlist_track"
}
}
playlist_track_qf = QFrame(engine=engine_string).from_dict(playlist_track)
print(playlist_track_qf.get_sql().sql)
display(playlist_track_qf.to_df().sample(10))
SELECT PlaylistId,
TrackId
FROM playlist_track
PlaylistId | TrackId |
---|---|
8 | 2667 |
1 | 1771 |
5 | 2404 |
1 | 1136 |
1 | 2083 |
8 | 198 |
1 | 1440 |
8 | 1689 |
1 | 2330 |
5 | 2808 |
Now let's join them on TrackId.
from grizly import join

joined_qf = join([tracks_qf, playlist_track_qf], join_type="left join", on="sq1.TrackId=sq2.TrackId")
print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
sq1.Name AS Name,
sq1.AlbumId AS AlbumId,
sq1.Composer AS Composer,
sq1.UnitPrice AS UnitPrice,
sq2.PlaylistId AS PlaylistId
FROM
(SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10) sq1
LEFT JOIN
(SELECT PlaylistId,
TrackId
FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
TrackId | Name | AlbumId | Composer | UnitPrice | PlaylistId |
---|---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 | 1 |
3027 | "40" | 239 | U2 | 0.99 | 8 |
2918 | "?" | 231 | None | 1.99 | 3 |
2918 | "?" | 231 | None | 1.99 | 10 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 1 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 8 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 12 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 15 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 1 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 8 |
3254 | #9 Dream | 255 | None | 0.99 | 1 |
3254 | #9 Dream | 255 | None | 0.99 | 8 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 1 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 8 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 1 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 8 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 1 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 5 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 8 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 1 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 5 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 8 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 1 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 8 |
The third table is the playlists table.
playlists = { "select": {
"fields": {
"PlaylistId": {"type" : "dim"},
"Name": {"type" : "dim"}
},
"table" : "playlists"
}
}
playlists_qf = QFrame(engine=engine_string).from_dict(playlists)
print(playlists_qf.get_sql().sql)
display(playlists_qf.to_df())
SELECT PlaylistId,
Name
FROM playlists
PlaylistId | Name |
---|---|
1 | Music |
2 | Movies |
3 | TV Shows |
4 | Audiobooks |
5 | 90’s Music |
6 | Audiobooks |
7 | Movies |
8 | Music |
9 | Music Videos |
10 | TV Shows |
11 | Brazilian Music |
12 | Classical |
13 | Classical 101 - Deep Cuts |
14 | Classical 101 - Next Steps |
15 | Classical 101 - The Basics |
16 | Grunge |
17 | Heavy Metal Classic |
18 | On-The-Go 1 |
Now, if we want to join the tracks, playlist_track and playlists tables, we can use TrackId and PlaylistId. The problem is that the tracks and playlists tables both contain a column called Name. By default the join function takes all fields from the first QFrame, then all fields from the second QFrame which are not in the first, and so on. If we still want to keep all fields from each QFrame, we have to set unique_col=False.
joined_qf = join(qframes=[tracks_qf, playlist_track_qf, playlists_qf],
                 join_type=['left join', 'left join'],
                 on=['sq1.TrackId=sq2.TrackId', 'sq2.PlaylistId=sq3.PlaylistId'],
                 unique_col=False)
joined_qf.show_duplicated_columns()
joined_qf.remove(['sq2.TrackId', 'sq2.PlaylistId']).rename({'sq1.Name': 'TrackName', 'sq3.Name': 'PlaylistType'})
print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
sq1.Name AS TrackName,
sq1.AlbumId AS AlbumId,
sq1.Composer AS Composer,
sq1.UnitPrice AS UnitPrice,
sq3.PlaylistId AS PlaylistId,
sq3.Name AS PlaylistType
FROM
(SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10) sq1
LEFT JOIN
(SELECT PlaylistId,
TrackId
FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
LEFT JOIN
(SELECT PlaylistId,
Name
FROM playlists) sq3 ON sq2.PlaylistId=sq3.PlaylistId
TrackId | TrackName | AlbumId | Composer | UnitPrice | PlaylistId | PlaylistType |
---|---|---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 | 1 | Music |
3027 | "40" | 239 | U2 | 0.99 | 8 | Music |
2918 | "?" | 231 | None | 1.99 | 3 | TV Shows |
2918 | "?" | 231 | None | 1.99 | 10 | TV Shows |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 1 | Music |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 8 | Music |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 12 | Classical |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 15 | Classical 101 - The Basics |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 1 | Music |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 8 | Music |
3254 | #9 Dream | 255 | None | 0.99 | 1 | Music |
3254 | #9 Dream | 255 | None | 0.99 | 8 | Music |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 1 | Music |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 8 | Music |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 1 | Music |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 8 | Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 1 | Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 5 | 90’s Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 8 | Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 1 | Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 5 | 90’s Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 8 | Music |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 1 | Music |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 8 | Music |
Currently pandas does not support building interactive SQL queries with its API. pandas is a great library with a great API, so why not use the same API to generate SQL statements? This would make the data ecosystem more consistent for analysts and reduce their cognitive load when moving between databases and dataframes. And so here we are.