Data Collection and Preparation / Database and SQL

Mutual Fund Data

Overview

This project consisted of collecting, cleaning and preparing data from three different sources: a flat file, a website (collected by scraping the website), and an API. The cleaning of the data utilized a pandas DataFrame. The final product included joining all three sources on the same key, storing them in a SQLite database and executing SQL commands to create simple visualizations.

Tools:

  • Pandas
  • BeautifulSoup
  • SQL
  • Sqlite3
  • Notebooks

Part 1 – Flat File Source

The first part of this project was to load a flat file such as a CSV and clean the data.

Source: Mutual fund dataset located at: flat file

• Demonstrated the following data preparation techniques:

  • Find problematic data and fix data types and formats of data values
  • Rename columns
  • Convert date formats
  • Identify and remove duplicate values
  • Replace missing values
  • Identify and remove erroneous outlier information

Part 2 – Website Source

The second part of the project was to use BeautifulSoup and scrape a website to retrieve data.

Source: Mutual fund financial performance located at: website

• Demonstrated the following data preparation techniques:

  • Rename columns to be make consistent for merging with other data frames
  • Strip string representation of numbers of dollar signs and letters
  • Replace symbols with integers
  • Convert data types from strings to floats
  • Convert date formats

Part 3 – API Source

The last data source came from connecting to an API to pull the data.

Source: FMP Finance API: API

• Demonstrated the following data preparation techniques:

  • Rename columns to be make consistent for merging with other data frames
  • Change problematic index
  • Convert date datatype to datetime timestamp
  • Drop columns
  • Replace missing values
  • Create new Series with calculated values

Part 4 – Database and SQL

• Demonstrated the following:

  • Creating a SQLite database
  • Creating three separate tables
  • Performing a left outer join to combine all three tables
  • Executing SQL commands to create simple visualizations

The Complete Project: here.

Updated: