Skip to content

malinaa96/grizly

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 

Repository files navigation

Grizly is a highly experimental library to generate SQL statements using the python pandas api.

Getting started!

The main class in grizly is called QFrame. You can load basic table information using a dictionary or an excel file.

from grizly import QFrame

Loading data from dictionary

data = {'select': {
           'fields': {
                      'CustomerId': {'type': 'dim', 'as': 'ID'},
                      'CustomerName': {'type': 'dim'},
                      'Country': {'type': 'dim'},
                      'FiscalYear':{'type': 'dim'},
                      'Sales': {'type': 'num'}
            },
           'schema': 'sales_schema',
           'table': 'sales_table'
           }
          }

q = QFrame().from_dict(data)
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       CustomerName,
       Country,
       FiscalYear,
       Sales
FROM sales_schema.sales_table

SQL manipulation

  • Renaming fields
q.rename({'Country': 'Territory', 'FiscalYear': 'FY'})
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       CustomerName,
       Country AS Territory,
       FiscalYear AS FY,
       Sales
FROM sales_schema.sales_table
  • Removing fields
q.remove(['CustomerName', 'FiscalYear'])
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       Country AS Territory,
       Sales
FROM sales_schema.sales_table
  • Adding WHERE clause
q.query("Country IN ('France', 'Germany')")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       Country AS Territory,
       Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
  • Aggregating fields
q.groupby(['CustomerId', 'Country'])['Sales'].agg('sum')
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       Country AS Territory,
       sum(Sales) AS Sales
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory
  • Adding expressions
q.assign(type='num', group_by='sum', Sales_Div="Sales/100")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       Country AS Territory,
       sum(Sales) AS Sales,
       sum(Sales/100) AS Sales_Div
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory
q.assign(group_by='group', Sales_Positive="CASE WHEN Sales>0 THEN 1 ELSE 0 END")
q.get_sql()
print(q.sql)
SELECT CustomerId AS Id,
       Country AS Territory,
       sum(Sales) AS Sales,
       sum(Sales/100) AS Sales_Div,
       CASE
           WHEN Sales>0 THEN 1
           ELSE 0
       END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory,
         Sales_Positive
  • Adding DISTINCT statement
q.distinct()
SELECT DISTINCT CustomerId AS Id,
                Country AS Territory,
                sum(Sales) AS Sales,
                sum(Sales/100) AS Sales_Div,
                CASE
                    WHEN Sales>0 THEN 1
                    ELSE 0
                END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory,
         Sales_Positive
  • Adding ORDER BY statement
q.orderby("Sales")
SELECT DISTINCT CustomerId AS Id,
                Country AS Territory,
                sum(Sales) AS Sales,
                sum(Sales/100) AS Sales_Div,
                CASE
                    WHEN Sales>0 THEN 1
                    ELSE 0
                END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory,
         Sales_Positive
ORDER BY Sales
q.orderby(["Country", "Sales"], False)
SELECT DISTINCT CustomerId AS Id,
                Country AS Territory,
                sum(Sales) AS Sales,
                sum(Sales/100) AS Sales_Div,
                CASE
                    WHEN Sales>0 THEN 1
                    ELSE 0
                END AS Sales_Positive
FROM sales_schema.sales_table
WHERE Country IN ('France',
                  'Germany')
GROUP BY Id,
         Territory,
         Sales_Positive
ORDER BY Territory DESC,
         Sales DESC

Loading data from excel file

Now we will be loading fields information from excel file. Your excel file must contain following columns:

  • column - Name of the column in table.

  • column_type - Type of the column. Possibilities:

    • dim - VARCHAR(500)
    • num - FLOAT

    Every column has to have specified type. If you want to sepcify another type check custom_type.

  • expression - Expression, eg. CASE statement, column operation, CONCAT statement, ... . In the case of expression column should be empty and the alias (name) of the expression should be placed in column_as.

  • column_as - Column alias (name).

  • group_by - Aggregation type. Possibilities:

    • group - This field will go to GROUP BY statement.
    • {sum, count, min, max, avg} - This field will by aggregated in specified way.

    Please make sure that every field that is placed in SELECT statement (select !=0) is also placed in GROUP BY. If you don't want to aggregate fields leave group_by empty.

  • select - Set 0 to remove this field from SELECT statement.

  • custom_type - Specify custom SQL data type, eg. DATE.

  • schema - Name of the schema. Always in the first row.

  • table - Name of the table. Always in the first row.

Now let's take a look at following example.

alt text

In this case we also have a column scope which is used to filter rows in excel file and in that we are be able to create two different QFrames using sales_table. First QFrame contains information about the customer:

customer_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'customer'")
customer_qf.get_sql()
print(customer_qf.sql)
SELECT CustomerId AS Id,
       LastName || FirstName AS CustomerName,
       Email,
       Country
FROM sales_schema.sales_table

Second QFrame contains information about the customer's sales:

sales_qf = QFrame().read_excel('sales_fields.xlsx', sheet_name='sales', query="scope == 'sales'")
sales_qf.get_sql()
print(sales_qf.sql)
SELECT CustomerId AS Id,
       TransactionDate,
       sum(Sales) AS Sales,
       count(CASE
                 WHEN Sales > 0 THEN 1
                 ELSE 0
             END) AS Sales_Positive
FROM sales_schema.sales_table
GROUP BY Id,
         TransactionDate

Joining data

We will be using chinook.db to visualize data.

engine_string = "sqlite:///" + os.getcwd() + "\\chinook.db"

First table is tracks table. We will order by Name to mix the rows and we will take only 10 records for better visual effect .

tracks = {  'select': {
                'fields': {
                    'TrackId': { 'type': 'dim'},
                    'Name': {'type': 'dim'},
                    'AlbumId': {'type': 'dim'},
                    'Composer': {'type': 'dim'},
                    'UnitPrice': {'type': 'num'}
                },
                'table': 'tracks'
            }
}
tracks_qf = QFrame(engine=engine_string).from_dict(tracks).orderby('Name').limit(10)
print(tracks_qf.get_sql().sql)
display(tracks_qf.to_df())
SELECT TrackId,
       Name,
       AlbumId,
       Composer,
       UnitPrice
FROM tracks
ORDER BY Name
LIMIT 10
TrackId Name AlbumId Composer UnitPrice
3027 "40" 239 U2 0.99
2918 "?" 231 None 1.99
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99
109 #1 Zero 11 Cornell, Commerford, Morello, Wilk 0.99
3254 #9 Dream 255 None 0.99
602 'Round Midnight 48 Miles Davis 0.99
1833 (Anesthesia) Pulling Teeth 150 Cliff Burton 0.99
570 (Da Le) Yaleo 46 Santana 0.99
3045 (I Can't Help) Falling In Love With You 241 None 0.99
3057 (Oh) Pretty Woman 242 Bill Dees/Roy Orbison 0.99

The second table is playlist_track table. It contains more than 8k records, here you can see only sample 10 records.

playlist_track = { "select": {
                        "fields":{
                            "PlaylistId": {"type" : "dim"},
                            "TrackId": {"type" : "dim"}
                        },
                        "table" : "playlist_track"
                    }
                }

playlist_track_qf = QFrame(engine=engine_string).from_dict(playlist_track)
print(playlist_track_qf.get_sql().sql)
display(playlist_track_qf.to_df().sample(10))
SELECT PlaylistId,
       TrackId
FROM playlist_track
PlaylistId TrackId
8 2667
1 1771
5 2404
1 1136
1 2083
8 198
1 1440
8 1689
1 2330
5 2808

Now let's join them on TrackId.

joined_qf = join([tracks_qf,playlist_track_qf], join_type="left join", on="sq1.TrackId=sq2.TrackId")

print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
       sq1.Name AS Name,
       sq1.AlbumId AS AlbumId,
       sq1.Composer AS Composer,
       sq1.UnitPrice AS UnitPrice,
       sq2.PlaylistId AS PlaylistId
FROM
  (SELECT TrackId,
          Name,
          AlbumId,
          Composer,
          UnitPrice
   FROM tracks
   ORDER BY Name
   LIMIT 10) sq1
LEFT JOIN
  (SELECT PlaylistId,
          TrackId
   FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
TrackId Name AlbumId Composer UnitPrice PlaylistId
3027 "40" 239 U2 0.99 1
3027 "40" 239 U2 0.99 8
2918 "?" 231 None 1.99 3
2918 "?" 231 None 1.99 10
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 1
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 8
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 12
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 15
109 #1 Zero 11 Cornell, Commerford, Morello, Wilk 0.99 1
109 #1 Zero 11 Cornell, Commerford, Morello, Wilk 0.99 8
3254 #9 Dream 255 None 0.99 1
3254 #9 Dream 255 None 0.99 8
602 'Round Midnight 48 Miles Davis 0.99 1
602 'Round Midnight 48 Miles Davis 0.99 8
1833 (Anesthesia) Pulling Teeth 150 Cliff Burton 0.99 1
1833 (Anesthesia) Pulling Teeth 150 Cliff Burton 0.99 8
570 (Da Le) Yaleo 46 Santana 0.99 1
570 (Da Le) Yaleo 46 Santana 0.99 5
570 (Da Le) Yaleo 46 Santana 0.99 8
3045 (I Can't Help) Falling In Love With You 241 None 0.99 1
3045 (I Can't Help) Falling In Love With You 241 None 0.99 5
3045 (I Can't Help) Falling In Love With You 241 None 0.99 8
3057 (Oh) Pretty Woman 242 Bill Dees/Roy Orbison 0.99 1
3057 (Oh) Pretty Woman 242 Bill Dees/Roy Orbison 0.99 8

Multiple join

Third table is playlists table.

playlists = { "select": {
                    "fields": {
                        "PlaylistId": {"type" : "dim"},
                        "Name": {"type" : "dim"}
                    },
                    "table" : "playlists"
                }
            }

playlists_qf = QFrame(engine=engine_string).from_dict(playlists)
print(playlists_qf.get_sql().sql)
display(playlists_qf.to_df())
SELECT PlaylistId,
       Name
FROM playlists
PlaylistId Name
1 Music
2 Movies
3 TV Shows
4 Audiobooks
5 90’s Music
6 Audiobooks
7 Movies
8 Music
9 Music Videos
10 TV Shows
11 Brazilian Music
12 Classical
13 Classical 101 - Deep Cuts
14 Classical 101 - Next Steps
15 Classical 101 - The Basics
16 Grunge
17 Heavy Metal Classic
18 On-The-Go 1

Now if we want to join tracks, playlist_track and playlists tables we can use TrackId and PlaylistId. The problem is that in tracks and playlists tables we have the same column Name. By default join function is taking all fields from the first QFrame, then all the fields from the second QFrame which are not in the first and so on. If we still want to keep all fields from each QFrame we have to set unique_col=False.

joined_qf = join(qframes=[tracks_qf, playlist_track_qf, playlists_qf], join_type=
                ['left join', 'left join'], on=[
                 'sq1.TrackId=sq2.TrackId', 'sq2.PlaylistId=sq3.PlaylistId'], unique_col=False)
                 
joined_qf.show_duplicated_columns()

image

joined_qf.remove(['sq2.TrackId', 'sq2.PlaylistId']).rename({'sq1.Name': 'TrackName', 'sq3.Name': 'PlaylistType'})
print(joined_qf.get_sql().sql)
display(joined_qf.to_df())
SELECT sq1.TrackId AS TrackId,
       sq1.Name AS TrackName,
       sq1.AlbumId AS AlbumId,
       sq1.Composer AS Composer,
       sq1.UnitPrice AS UnitPrice,
       sq3.PlaylistId AS PlaylistId,
       sq3.Name AS PlaylistType
FROM
  (SELECT TrackId,
          Name,
          AlbumId,
          Composer,
          UnitPrice
   FROM tracks
   ORDER BY Name
   LIMIT 10) sq1
LEFT JOIN
  (SELECT PlaylistId,
          TrackId
   FROM playlist_track) sq2 ON sq1.TrackId=sq2.TrackId
LEFT JOIN
  (SELECT PlaylistId,
          Name
   FROM playlists) sq3 ON sq2.PlaylistId=sq3.PlaylistId
TrackId TrackName AlbumId Composer UnitPrice PlaylistId PlaylistType
3027 "40" 239 U2 0.99 1 Music
3027 "40" 239 U2 0.99 8 Music
2918 "?" 231 None 1.99 3 TV Shows
2918 "?" 231 None 1.99 10 TV Shows
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 1 Music
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 8 Music
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 12 Classical
3412 "Eine Kleine Nachtmusik" Serenade In G, K. 525... 281 Wolfgang Amadeus Mozart 0.99 15 Classical 101 - The Basics
109 #1 Zero 11 Cornell, Commerford, Morello, Wilk 0.99 1 Music
109 #1 Zero 11 Cornell, Commerford, Morello, Wilk 0.99 8 Music
3254 #9 Dream 255 None 0.99 1 Music
3254 #9 Dream 255 None 0.99 8 Music
602 'Round Midnight 48 Miles Davis 0.99 1 Music
602 'Round Midnight 48 Miles Davis 0.99 8 Music
1833 (Anesthesia) Pulling Teeth 150 Cliff Burton 0.99 1 Music
1833 (Anesthesia) Pulling Teeth 150 Cliff Burton 0.99 8 Music
570 (Da Le) Yaleo 46 Santana 0.99 1 Music
570 (Da Le) Yaleo 46 Santana 0.99 5 90’s Music
570 (Da Le) Yaleo 46 Santana 0.99 8 Music
3045 (I Can't Help) Falling In Love With You 241 None 0.99 1 Music
3045 (I Can't Help) Falling In Love With You 241 None 0.99 5 90’s Music
3045 (I Can't Help) Falling In Love With You 241 None 0.99 8 Music
3057 (Oh) Pretty Woman 242 Bill Dees/Roy Orbison 0.99 1 Music
3057 (Oh) Pretty Woman 242 Bill Dees/Roy Orbison 0.99 8 Music

But why?

Currently Pandas does not support building interactive SQL queries with its api. Pandas is a great library with a great api so why not use the same api to generate SQL statements? This would make the data ecosystem more consistent for analysts and reduce their cognitive load when moving from databases to dataframes. And so here we are.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages