Skip to content

ETL with Python from BOVESPA Web Data, filtering, dropping columns, formatting and creating functions

Notifications You must be signed in to change notification settings

RichardGomesDeAraujo/ETLPython_txt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 

Repository files navigation

Hands On Data

ETL with Python

Web Data, filtering, dropping columns, formatting and creating functions

Data Type: .txt

Code: ETLPython


by Richard Gomes de Araújo - 20/02/2024

Github Badge Linkedin Badge Youtube Badge


Index

  • 1 - Importing Library

  • 2 - Create Columns and Lines with a specific Data Layout

  • 3 - Filtering Under Business Rule

  • 4 - Dropping an Unuseful Column

  • 5- Formatting Date

  • 6 - Formatting Number

  • 7 - Visualizing the Data Types

  • 8 - Counting Values (Grouping by a specific column)

  • 9 - Finding Null Values and Counting Them

  • 10 - Changing Null Values by Zero

  • 11 - Processing the Object Data Type

  • 12 - Calculating Memory Usage of Dataframe 1

  • 13 - Analysing the Content of the Column with Object Data Type

  • 14 - Transforming the Column with Object to Category Data Type

  • 15 - Transforming the df into df2.parquet

  • 16 - Calculating Memory Usage of Dataframe 2

  • 17 - Creating Functions to Automate the Process

  • 18 - Creating Function to Concat All the Files

  • 19 - Creating a Function to Execute the Functions


1

Importing Library:

# importing Library
import pandas as pd
import numpy as np


2

Create Columns and Lines with a specific Data Layout:

# Creating Columns and Lines
columnlayout = [(2,10),
                (10,12),
                (12,24),
                (27,39),
                (56,69),
                (69,82),
                (82,95),
                (108,121),
                (152,170),
                (170,188)
               ]

columnnames = ['DateDealing','CodBdi','ActionSg','ActionName','OpenPrice','MaximumPrice','MinimumPrice','ClosePrice','BusinessQuantity','BusinessVolume']

df = pd.read_fwf('C:/Users/richa/Documents/ProjetosJupyter/COTAHIST_A2022.txt', colspecs = columnlayout, names = columnnames, skiprows=1)
                 
df   


3

Filtering Under Business Rule:

# Filtering a column under a business rule
df = df[ df['CodBdi'] == 2.0]
df


4

Dropping an Unuseful Column:

# Dropping an unuseful column where I need to use the command df.drop with number 1 to flag a column
df = df.drop(['CodBdi'], axis=1)
df


5

Formatting Date:

# Formatting Date
df['DateDealing'] = pd.to_datetime( df['DateDealing'], format='%Y%m%d')
df


6

Formatting Number:

# Formatting Numbers of the columns where I need to divide the number by 100 to obtain two decimals numbers
df['OpenPrice'] = (df['OpenPrice'] / 100).astype(float)
df['MaximumPrice'] = (df['MaximumPrice'] / 100).astype(float)
df['MinimumPrice'] = (df['MinimumPrice'] / 100).astype(float)
df['ClosePrice'] = (df['ClosePrice'] / 100).astype(float)
df['BusinessQuantity'] = df['BusinessQuantity'].astype(int)
df['BusinessVolume'] = df['BusinessVolume'].astype(int)
df


7

Visualizing the Data Types:

df.dtypes

df.info()


8

Counting Values (Grouping by a specific column):

# Counting Values (Group by a specific column) Useful to identify outliers
df['ActionName'].value_counts()


9

Finding Null Values and Counting Them:

# Finding null values and counting them
df.isnull().sum()


10

Changing Null Values by Zero:

# Changing null values by zero where np.nan means null values in numpy library is igual nan and inplace=True to 
# transform the change permanent in the dataframe
df['BusinessVolume'].replace(np.nan, '0', inplace=True)
df


11

Processing the Object Data Type:

# Processing the Object Data Type
# First: Looking the size of each column in the Dataframe
memory = df.memory_usage(deep=True)


12

Calculating Memory Usage of Dataframe 1:

# Sum the total of memory my Dataframe is using dividing per 1024³ to discover the GB
memory.sum() / (1024 ** 3)


13

Analysing the Content of the Column with Object Data Type:

# Counting how many differents datas we have in the object column grouping them
df['ActionSg'].value_counts()


14

Transforming the Column with Object to Category Data Type:

df['ActionSg'] = df['ActionSg'].astype('category')
df['ActionName'] = df['ActionName'].astype('category')


15

Transforming the df into df2.parquet:

# Transforming File Type to .parquet
df.to_parquet('ETLWithPython.parquet')


16

Calculating Memory Usage of Dataframe 2:

# Reading the .parquet file
df2 = pd.read_parquet('ETLWithPython.parquet')


17

Creating Functions to Automate the Process:

# Creating Functions to Automate the Process
# Function 1
# Creating Columns and Lines with the layout and replacing the address file in the pd.read

def readfiles(path, namefile, yeardate, typefile):
    _file = f'{path}{namefile}{yeardate}.{typefile}'

    columnlayout = [(2,10),
                    (10,12),
                    (12,24),
                    (27,39),
                    (56,69),
                    (69,82),
                    (82,95),
                    (108,121),
                    (152,170),
                    (170,188)
                   ]

    columnnames = ['DateDealing','CodBdi','ActionSg','ActionName','OpenPrice','MaximumPrice','MinimumPrice','ClosePrice','BusinessQuantity','BusinessVolume']

    # replacing the address by the function _file
    df3 = pd.read_fwf(_file, colspecs = columnlayout, names = columnnames, skiprows=1)
    
    return df3

# Function 2
def filterstocks(df3):
    # Filtering a column under a business rule
    df3 = df3[ df3['CodBdi'] == 2.0]
    
    # Dropping an unuseful column where I need to use the command df.drop with number 1 to flag a column
    df3 = df3.drop(['CodBdi'], axis=1)
    
    return df3

# Function 3
def parsedate(df3):
    # Formatting Date
    df3['DateDealing'] = pd.to_datetime( df3['DateDealing'], format='%Y%m%d')

    return df3

# Function 4
def parsevalues(df3):
    # Formatting Numbers of the columns where I need to divide the number by 100 to obtain two decimals numbers
    df3['OpenPrice'] = (df3['OpenPrice'] / 100).astype(float)
    df3['MaximumPrice'] = (df3['MaximumPrice'] / 100).astype(float)
    df3['MinimumPrice'] = (df3['MinimumPrice'] / 100).astype(float)
    df3['ClosePrice'] = (df3['ClosePrice'] / 100).astype(float)
    df3['BusinessQuantity'] = df3['BusinessQuantity'].astype(int)
    df3['BusinessVolume'] = df3['BusinessVolume'].astype(int)
    
    return df3


18

Creating Function to Concat All the Files:

# Function 5: Concat all the files
def concatfiles(path, namefile, yeardate, typefile, finalfile):
    # using enumerate to index the i value
    for i, y in enumerate(yeardate):
        df3 = readfiles(path, namefile, y, typefile)
        df3 = filterstocks(df3)
        df3 = parsedate(df3)
        df3 = parsevalues(df3)
        
        if i==0:
            dffinal = df3
        else:
            dffinal = pd.concat([dffinal, df3])
            
    # create a csv file with the result
    dffinal.to_csv(f'{path}//{finalfile}', index=False)


19

Creating a Function to Execute the Functions:

# Function 6: Creating a function to execute the functions
yeardate = ['2022', '2023', '2024']
path = f'C:/Users/richa/Documents/ProjetosJupyter/'
namefile = 'COTAHIST_A'
# Obs.: I'm not using a dot after namefile or before typefile
typefile = 'txt'
finalfile = 'all_bovespa.csv'
concatfiles(path, namefile, yeardate, typefile, finalfile)


About

ETL with Python from BOVESPA Web Data, filtering, dropping columns, formatting and creating functions

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages