
!!! Do not forget to clean up your AWS resources, otherwise you will be charged for services you are not actually using. To do so, terminate the EMR cluster and clean up your S3 buckets.
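If you prefer the command line, a minimal cleanup sketch with the AWS CLI could look like this (the cluster id and bucket name are placeholders, substitute your own):

```bash
# Terminate the EMR cluster (placeholder cluster id -- use your own).
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX

# Empty the S3 output bucket, then remove it (placeholder bucket name).
aws s3 rm s3://my-output-bucket --recursive
aws s3 rb s3://my-output-bucket
```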
Files in this repo:

- etl.py : the Python script you will run on your EMR cluster (cluster setup is described in the next section). !! YOU NEED TO CHANGE the "output_file" to your own S3 output path.
- dl.cfg : the configuration file that contains the IAM user credentials (keep the file's name & structure I use and replace the blanks with your own credentials).
- scp-method.txt : how to copy your local files onto your host machine.

How to run:

Create a new file called dl.cfg, copy and paste the content of the dl.cfg I provided and replace the blanks with your own credentials. !! DO NOT write your credentials in plain text inside etl.py; do it my way and let configparser do the job (see the sketch below). Alternatively, you can SCP the files from your local machine to the host machine (please refer to the txt file in the repo for the copy command). Submit your etl.py using the command (spark-submit --master yarn etl.py). Depending on your EMR cluster configuration, the execution could take a while (12 minutes in my case). You can monitor your job with the Spark UI (you must set up dynamic port forwarding first; for more info please check the AWS documentation on port forwarding). For data quality checks you can launch a Jupyter notebook on the EMR cluster and start playing with the dimensional tables created in this work.
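Here is a minimal sketch of how etl.py can pick up the credentials from dl.cfg via configparser instead of hard-coding them; the section and key names are assumptions, so match them to the dl.cfg provided in the repo:

```python
import configparser
import os

# Read the IAM user credentials from dl.cfg instead of writing them in etl.py.
# The [AWS] section and key names below are assumptions -- keep them in sync
# with the dl.cfg file in the repo.
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```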
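For the Spark UI, the dynamic port forwarding mentioned above is typically an SSH SOCKS tunnel to the master node; a sketch, with the key file, port and DNS name as placeholders:

```bash
# Open a SOCKS proxy on local port 8157 through the master node so the browser
# can reach the Spark UI (see the AWS docs on dynamic port forwarding).
ssh -i spark-cluster.pem -N -D 8157 hadoop@<master-public-dns>
```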
There are several ways to configure and launch an EMR cluster, either programmatically or simply using the AWS Console. For the sake of simplicity I will use the AWS Console to set up an EMR cluster with 3 EC2 instances (1 master and 2 slaves) of type m5.xlarge, 16 GiB memory and 64 GiB EBS storage each. You can choose whatever configuration you want, but your application's running time will depend on the EMR configuration you set and on many other factors (like your internet connection speed). Important note: I set up the EMR cluster and S3 in the same region, which saves time.

1. Create an IAM user with the programmatic access type, attach existing policies and set the permission to AdministratorAccess, then choose Next: Tags, skip the tags page, choose Next: Review and choose Create user. Finally, save the credentials in a safe place.
2. From the AWS console, click on Services and type 'EC2' to go to the EC2 console. Choose Key Pairs under Network & Security on the left panel => choose Create key pair, type a name for the key pair, pick File format: pem => choose Create key pair. The pem file will be downloaded to your computer automatically and will be used later.
3. Launch an EMR cluster with the configuration you choose (either using the AWS Console or the AWS CLI; see the sketch after this list) and assign the key pair you created before (the pem file).
4. Edit the inbound rules of the cluster's security group and allow SSH access from your computer's IP address.
5. Open up a terminal and SSH to your cluster (please refer to the summary page of your EMR cluster => Connect to the Master Node Using SSH, under Master public DNS).
6. Once connected to the cluster via your terminal, create a new file (nano etl.py) and copy-paste the code from the etl.py I provided in this work.
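If you go the CLI route instead of the console, a minimal sketch could look like the following; the key pair name, cluster name, region and EMR release label are placeholders/assumptions, not values from this project:

```bash
# Create a key pair and save the private key locally.
aws ec2 create-key-pair --key-name spark-cluster --query 'KeyMaterial' --output text > spark-cluster.pem
chmod 400 spark-cluster.pem

# Launch a 3-node Spark cluster (1 master, 2 core nodes) of type m5.xlarge.
aws emr create-cluster \
    --name spark-data-lake \
    --release-label emr-5.33.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=spark-cluster \
    --region us-west-2
```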
In this case we are dealing with raw data in JSON format located in two different S3 buckets. The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here is a filepath to a file in this dataset: song_data/A/B/C/TRABCEI128F424C983.json

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings. The log files in the dataset we will be using are partitioned by year and month. For example, here is a filepath to a file in this dataset: log_data/20-11-12-events.json

Database schema

Please refer to this repo to get the schema as it is the same data schema as before.
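To make the inputs concrete, here is a minimal PySpark sketch for reading both datasets; the bucket name and glob patterns are assumptions based on the partitioning described above, so adjust them to your own paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Assumed layouts: song_data is nested three directory levels deep (first three
# letters of the track ID), log_data is partitioned by year and month.
song_df = spark.read.json("s3a://<input-bucket>/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://<input-bucket>/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```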
This work aims to design an ETL data pipeline using Spark that extracts data from S3 and stores the output files back in S3 as a data lake. The purpose of this project is to acquire new insights by manipulating the easy-to-use data residing in parquet files in S3. The solution is to create a star schema of fact and dimension tables optimized for queries on song play analysis. To do that, we need to extract the data which resides in S3 buckets, load it into Spark dataframes on an EMR cluster, transform it into dimensional tables and finally store them back into S3 as parquet files.
As Sparkify continues to gain more and more users, it would be very beneficial to migrate from a data warehouse to a data lake (please check the data modeling series in my GitHub repos so you can follow along).
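As a closing illustration of the pipeline shape described above, here is a minimal sketch that builds one dimension table and writes it back to S3 as partitioned parquet; the bucket names, column list and partitioning choice are illustrative assumptions, and the real transformations live in etl.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Extract: read the raw song data from S3 (placeholder bucket and assumed layout).
song_df = spark.read.json("s3a://<input-bucket>/song_data/*/*/*/*.json")

# Transform: build a songs dimension table (assumed column names).
songs_table = (song_df
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates(["song_id"]))

# Load: store the dimension table back to S3 as partitioned parquet files.
(songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://<output-bucket>/songs/"))
```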