Grizly is a highly experimental library that generates SQL statements using the Python pandas API.
The main class in grizly is called QFrame. You can load basic table information from a dictionary or an Excel file.
from grizly import QFrame
data = {'select': {
'fields': {
'CustomerId': {'type': 'dim', 'as': 'ID'},
'CustomerName': {'type': 'dim'},
'Country': {'type': 'dim'},
'FiscalYear':{'type': 'dim'},
'Sales': {'type': 'num'}
},
'schema': 'sales_schema',
'table': 'sales_table'
}
}
q = QFrame().from_dict(data)
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
CustomerName,
Country,
FiscalYear,
Sales
FROM sales_schema.sales_table
- Renaming fields
q.rename({'Country': 'Territory', 'FiscalYear': 'FY'})
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
CustomerName,
Country AS Territory,
FiscalYear AS FY,
Sales
FROM sales_schema.sales_table
- Removing fields
q.remove(['CustomerName', 'FiscalYear'])
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
Sales
FROM sales_schema.sales_table
- Adding WHERE clause
q.query("Country IN ('France', 'Germany')")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
- Aggregating fields
q.groupby(['CustomerId', 'Country'])['Sales'].agg('sum')
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory
- Adding expressions
q.assign(type='num', group_by='sum', Sales_Div="Sales/100")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory
q.assign(group_by='group', Sales_Positive="CASE WHEN Sales>0 THEN 1 ELSE 0 END")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
- Adding DISTINCT statement
q.distinct()
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
- Adding ORDER BY statement
q.orderby("Sales")
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
ORDER BY Sales
q.orderby(["Country", "Sales"], False)
q.get_sql()
print(q.sql)
SELECT DISTINCT CustomerId AS Id,
Country AS Territory,
sum(Sales) AS Sales,
sum(Sales/100) AS Sales_Div,
CASE
WHEN Sales>0 THEN 1
ELSE 0
END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
'Germany')
GROUP BY Id,
Territory,
Sales_Positive
ORDER BY Territory DESC,
Sales DESC
Now we will load the field information from an Excel file. Your Excel file must contain the following columns:
- column - Name of the column in the table.
- column_type - Type of the column. Options:
  - dim - VARCHAR(500)
  - num - FLOAT
  Every column has to have a specified type. If you want to specify another type, see custom_type.
- expression - Expression, e.g. a CASE statement, a column operation, a CONCAT statement, etc. When using an expression, column should be empty and the alias (name) of the expression should be placed in column_as.
- column_as - Column alias (name).
- group_by - Aggregation type. Options:
  - group - This field will go to the GROUP BY statement.
  - {sum, count, min, max, avg} - This field will be aggregated in the specified way.
  Please make sure that every field placed in the SELECT statement (select != 0) is also placed in GROUP BY. If you don't want to aggregate fields, leave group_by empty.
- select - Set to 0 to remove this field from the SELECT statement.
- custom_type - Custom SQL data type, e.g. DATE.
- schema - Name of the schema. Always in the first row.
- table - Name of the table. Always in the first row.
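The layout described above can be sketched as a small pandas DataFrame. This is a hypothetical reconstruction of what the sales_fields.xlsx sheet used below might contain (the row values mirror the sales example that follows; the exact file contents are an assumption, not part of this document):

```python
import pandas as pd

# Hypothetical contents of the 'sales' sheet in sales_fields.xlsx,
# following the column layout described above.
# schema and table are filled in only on the first row.
fields = pd.DataFrame({
    "column":      ["CustomerId", "TransactionDate", "Sales", None],
    "column_type": ["dim", "dim", "num", "num"],
    "expression":  [None, None, None, "CASE WHEN Sales > 0 THEN 1 ELSE 0 END"],
    "column_as":   ["Id", None, None, "Sales_Positive"],
    "group_by":    ["group", "group", "sum", "count"],
    "select":      [1, 1, 1, 1],
    "custom_type": [None, None, None, None],
    "schema":      ["sales_schema", None, None, None],
    "table":       ["sales_table", None, None, None],
})

print(fields)

# Writing the sheet out requires an Excel engine such as openpyxl:
# fields.to_excel("sales_fields.xlsx", sheet_name="sales", index=False)
```

Note how the last row describes a pure expression: column is empty, the alias lives in column_as, and group_by applies the aggregation (count) around the expression.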
Now let's take a look at the following example.
In this case we also have a scope column, which is used to filter rows in the Excel file; thanks to it we are able to create two different QFrames from sales_table. The first QFrame contains information about the customer:
customer_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'customer'")
customer_qf.get_sql()
print(customer_qf.sql)
SELECT CustomerId AS Id,
LastName || FirstName AS CustomerName,
Email,
Country
FROM sales_schema.sales_table
The second QFrame contains information about the customer's sales:
sales_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'sales'")
sales_qf.get_sql()
print(sales_qf.sql)
SELECT CustomerId AS Id,
TransactionDate,
sum(Sales) AS Sales,
count(CASE
WHEN Sales > 0 THEN 1
ELSE 0
END) AS Sales_Positive
FROM sales_schema.sales_table
GROUP BY Id,
TransactionDate
We will be using chinook.db to visualize the data.
import os

engine_string = "sqlite:///" + os.path.join(os.getcwd(), "chinook.db")
The first table is the tracks table. We will order by Name to mix the rows and take only 10 records for a better visual effect.
tracks = { 'select': {
'fields': {
'TrackId': { 'type': 'dim'},
'Name': {'type': 'dim'},
'AlbumId': {'type': 'dim'},
'Composer': {'type': 'dim'},
'UnitPrice': {'type': 'num'}
},
'table': 'tracks'
}
}
tracks_qf = QFrame(engine=engine_string).from_dict(tracks).orderby('Name').limit(10)
print(tracks_qf.get_sql().sql)
display(tracks_qf.to_df())
SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10
TrackId | Name | AlbumId | Composer | UnitPrice |
---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 |
2918 | "?" | 231 | None | 1.99 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 |
3254 | #9 Dream | 255 | None | 0.99 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 |
The second table is the playlist_track table. It contains more than 8k records; here you can see only a sample of 10 records.
playlist_track = { "select": {
"fields":{
"PlaylistId": {"type" : "dim"},
"TrackId": {"type" : "dim"}
},
"table" : "playlist_track"
}
}
playlist_track_qf = QFrame(engine=engine_string).from_dict(playlist_track)
print(playlist_track_qf.get_sql().sql)
display(playlist_track_qf.to_df().sample(10))
SELECT PlaylistId,
TrackId
FROM playlist_track
PlaylistId | TrackId |
---|---|
8 | 2667 |
1 | 1771 |
5 | 2404 |
1 | 1136 |
1 | 2083 |
8 | 198 |
1 | 1440 |
8 | 1689 |
1 | 2330 |
5 | 2808 |
Now let's join them on TrackId.
from grizly import join

joined_qf = join([tracks_qf, playlist_track_qf], join_type="left join", on="sq1.TrackId=sq2.TrackId")
print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
sq1.Name AS Name,
sq1.AlbumId AS AlbumId,
sq1.Composer AS Composer,
sq1.UnitPrice AS UnitPrice,
sq2.PlaylistId AS PlaylistId
FROM
(SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10) sq1
LEFT JOIN
(SELECT PlaylistId,
TrackId
FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
TrackId | Name | AlbumId | Composer | UnitPrice | PlaylistId |
---|---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 | 1 |
3027 | "40" | 239 | U2 | 0.99 | 8 |
2918 | "?" | 231 | None | 1.99 | 3 |
2918 | "?" | 231 | None | 1.99 | 10 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 1 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 8 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 12 |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 15 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 1 |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 8 |
3254 | #9 Dream | 255 | None | 0.99 | 1 |
3254 | #9 Dream | 255 | None | 0.99 | 8 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 1 |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 8 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 1 |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 8 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 1 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 5 |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 8 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 1 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 5 |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 8 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 1 |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 8 |
The third table is the playlists table.
playlists = { "select": {
"fields": {
"PlaylistId": {"type" : "dim"},
"Name": {"type" : "dim"}
},
"table" : "playlists"
}
}
playlists_qf = QFrame(engine=engine_string).from_dict(playlists)
print(playlists_qf.get_sql().sql)
display(playlists_qf.to_df())
SELECT PlaylistId,
Name
FROM playlists
PlaylistId | Name |
---|---|
1 | Music |
2 | Movies |
3 | TV Shows |
4 | Audiobooks |
5 | 90’s Music |
6 | Audiobooks |
7 | Movies |
8 | Music |
9 | Music Videos |
10 | TV Shows |
11 | Brazilian Music |
12 | Classical |
13 | Classical 101 - Deep Cuts |
14 | Classical 101 - Next Steps |
15 | Classical 101 - The Basics |
16 | Grunge |
17 | Heavy Metal Classic |
18 | On-The-Go 1 |
Now, if we want to join the tracks, playlist_track and playlists tables, we can use TrackId and PlaylistId. The problem is that the tracks and playlists tables both contain a column called Name. By default the join function takes all fields from the first QFrame, then all fields from the second QFrame which are not in the first, and so on. If we still want to keep all fields from each QFrame, we have to set unique_col=False.
joined_qf = join(qframes=[tracks_qf, playlist_track_qf, playlists_qf],
                 join_type=['left join', 'left join'],
                 on=['sq1.TrackId=sq2.TrackId', 'sq2.PlaylistId=sq3.PlaylistId'],
                 unique_col=False)
joined_qf.show_duplicated_columns()
joined_qf.remove(['sq2.TrackId', 'sq2.PlaylistId']).rename({'sq1.Name': 'TrackName', 'sq3.Name': 'PlaylistType'})
print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
sq1.Name AS TrackName,
sq1.AlbumId AS AlbumId,
sq1.Composer AS Composer,
sq1.UnitPrice AS UnitPrice,
sq3.PlaylistId AS PlaylistId,
sq3.Name AS PlaylistType
FROM
(SELECT TrackId,
Name,
AlbumId,
Composer,
UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10) sq1
LEFT JOIN
(SELECT PlaylistId,
TrackId
FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
LEFT JOIN
(SELECT PlaylistId,
Name
FROM playlists) sq3 ON sq2.PlaylistId=sq3.PlaylistId
TrackId | TrackName | AlbumId | Composer | UnitPrice | PlaylistId | PlaylistType |
---|---|---|---|---|---|---|
3027 | "40" | 239 | U2 | 0.99 | 1 | Music |
3027 | "40" | 239 | U2 | 0.99 | 8 | Music |
2918 | "?" | 231 | None | 1.99 | 3 | TV Shows |
2918 | "?" | 231 | None | 1.99 | 10 | TV Shows |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 1 | Music |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 8 | Music |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 12 | Classical |
3412 | "Eine Kleine Nachtmusik" Serenade In G, K. 525... | 281 | Wolfgang Amadeus Mozart | 0.99 | 15 | Classical 101 - The Basics |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 1 | Music |
109 | #1 Zero | 11 | Cornell, Commerford, Morello, Wilk | 0.99 | 8 | Music |
3254 | #9 Dream | 255 | None | 0.99 | 1 | Music |
3254 | #9 Dream | 255 | None | 0.99 | 8 | Music |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 1 | Music |
602 | 'Round Midnight | 48 | Miles Davis | 0.99 | 8 | Music |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 1 | Music |
1833 | (Anesthesia) Pulling Teeth | 150 | Cliff Burton | 0.99 | 8 | Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 1 | Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 5 | 90’s Music |
570 | (Da Le) Yaleo | 46 | Santana | 0.99 | 8 | Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 1 | Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 5 | 90’s Music |
3045 | (I Can't Help) Falling In Love With You | 241 | None | 0.99 | 8 | Music |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 1 | Music |
3057 | (Oh) Pretty Woman | 242 | Bill Dees/Roy Orbison | 0.99 | 8 | Music |
Currently pandas does not support building interactive SQL queries with its API. pandas is a great library with a great API, so why not use the same API to generate SQL statements? This would make the data ecosystem more consistent for analysts and reduce their cognitive load when moving between databases and dataframes. And so here we are.