Skip to content

Alice0206/Data-Processes-Assignment

Repository files navigation

Data-Processes-Assignment

Table of contents

Collaborators

Alicia Wang

Gabriele Zambra

Giorgio Segalla

Yazan Salti

Fee Lueng

Domain of interest

Why

In this project we are going to work with data from the field of music for the following reasons: We all use different types of services to listen to music regularly including Spotify, SoundCloud, and iTunes. What makes us more passionate about music that some of us are or have been musicians. We think it is really interesting to work and explore data about music and have the opportunity to discover some of the hidden information about music, about the behavior of music providers, music makers and users of platforms.

Three data-driven project example

In the following example, they are collecting data from Music streaming service platforms. Later they use this in a machine-learning algorithm to predict the mood of the user based on his/her playlist of music. Spotify Taste Analysis

In this project, they try to find common patterns among the most popular songs on Spotify, They analyze the audio features of the tracks in Spotify's Top Songs of 2017 playlist. Spotify Project

By analyzing the data of the top 100 artists on Spotify, the following project tries to group the artists into clusters and give a description of each cluster to discover more information about the most successful artists. Music Project

Data-driven questions

  1. What features have TOP100 artists in Europe and China according to popularity?
  2. What are the differences in music features between song I like and song I dislike? A brief analysis of my musical tastes.
  3. How are the features of the song are different in the world?
  4. How have duration, tempo and loudness of songs changed through time?
  5. How do the features of music influence the genre of music? Are they significant?

Datasets

The datasets we have found are saved as .csv in their respective folders and the structure will be the following.

data/
├── dataset_name_1/
|   └── data.csv
|
|    
└── dataset_name_n/
    └── data.csv

The Dataset consists of 1.4 Million musical artists, the dataset is collected through MusicBrainz database and the API of Last fm. The dataset can be found in the directory Data/last_fm/ . 3% of the observation in this dataset contain NA values, therefore we have cleaned and prepared the data to make it easier to work with it. This dataset is downloaded from Kaggle.com. The dataset could be useful to make an analysis of artists from different countries.

Variables of dataset:

  • mbid musicbrainz id
  • artist_mb: artist name according to musicbrainz
  • artist_lastfm: artist name according to last_fm
  • country_mb: artist country according to musicbrainz
  • country_lastfm : artist country, based on last_fm tags
  • tags_mb: artist tags on musicbrainz, separated by semicolon (;)
  • tags_lastfm: artist tags on last_fm, separated by semicolon (;), sorted by frequency decreasing
  • listeners_lastfm: number of listeners on last_fm
  • scrobbles_lastfm: number of scrobbles on last_fm
  • ambiguous_artist: TRUE if more than one artist shares the same last_fm page

This dataset contains the daily ranking of the 200 most listened songs on Spotify in 53 countries from 2017 to 2018. The dataset has more than 2 million observations providing information about 18598 different songs by 6629 different artists. The data can be found in the following directory:

Data/spotify_ranking/.

The dataset is downloaded from Kaggle, It is going to be used to find patterns and similarities among the top songs.

Variables of dataset:

  • Position Position: on charts
  • Track Name: Title of song
  • Artist Name: of musician or group
  • Streams Number: of streams
  • URL
  • Date
  • Region: Country code

This dataset contains the features of different music genres and it has 100 songs for each genre. It has 30 features and 1000 observations, we are going to use this dataset to differences in features We have downloaded the dataset from Kaggle.com.

Variables of the dataset:

  • filename: Filename as given in Marsyas dataset.
  • tempo: The speed at which a passage of music is played
  • beats: Rythmic unit in music
  • chroma_stft: Short-Time Fourier Transform
  • rmse: Root Mean Square Error
  • spectral_centroid: Indicates where the "center of mass" of the spectrum is located.
  • spectral_bandwidth: It is the Wavelength interval in which a radiated spectral quantity is not less than half its maximum value
  • rolloff: Roll-off is the steepness of a transmission function with frequency
  • zero_crossing_rate: The rate at which the signal changes from positive to negative or back
  • mfccs: Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC.
  • label: Contains a string depicting the genre

Deezer API and Rapid API offer a radio station API which can be used to access data about the music being played at the moment or was played recently (short term data). We are going to use the dataset look for trendy music and analyze their feature to find out and answer the question of what are the characteristics of songs which go viral?

Variables of the dataset:

  • Radio search (i: Radio station unique id; d: Genre id; n: Radio name; c: Country 2-letter code; g: Genre name; u: Stream URL; l: Radio logo)
  • Countries list (code: 2-letter code; name: Name)
  • Now Playing (artist_song: artist name; title_song: song name; radio_name: radio name; radio_id: radio id; radio_stream: radio stream URL; radio_logo: radio logo; date: date GMT+1 Paris)
  • Radio station Infos (i: Radio station unique id; d: Genre id; n: Radio name; c: Country 2-letter code; g: Genre name; u: Stream URL; l: Logo)
  • Weekly Charts (artist_song: artist name; title_song: song name; occurrences: occurrences number)
  • Monthly Charts (artist_song: artist name; title_song: song name; occurrences: occurrences number)
  • Music genres list (i: Genre id; c: Name)
  • Daily Charts (artist_song: artist name; title_song: song name; occurrences: occurrences number)

In this dataset each row represents a song. There are 16 columns. 13 of which are song attributes, one column for song name, one for artist, and a column called "target" which is the label for the song. Here are the 13 track attributes: acousticness, danceability, duration_ms, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence. Information on what those traits mean can be found here: https://developer.spotify.com/web-api/get-audio-features/

  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms: duration of the song in ms.
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • time_signature: An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Described as following on the source page of the dataset. "The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks." I will bes uing only a small subset of the dataset which I found here..

Variables of dataset:

  • track_id
  • artist_name
  • title: title of the song
  • loudness: general loudness of the track
  • tempo: tempo in BPM according to The Echo Nest
  • time_signal: time signature of the song according to The Echo Nest, i.e. usual number of beats per bar
  • key: estimation of the key the song is in by The Echo Nest
  • duration: in seconds
  • song_hotttnesss: according to The Echo Nest, when downloaded (in December 2010), this song had a 'hotttnesss' of 0.8 (on a scale of 0 and 1)
  • year: year when this song was released, according to musicbrainz.org

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published