Data ETL and Methods of evaluation

This report relies upon data that has been analyzed and preprocessed in order to maximize information extraction. The original data was generously provided to us by Ubineer, a fintech start-up based out of the University of Toronto. This page goes in depth into the methods we used to obtain, clean, and preprocess our data. The methods used for plotting and evaluating the results are also described below.

The data

As mentioned above, we obtained our data from Ubineer, which stores it in Google's BigQuery platform. They've already performed a lot of leg work processing gigabytes' worth of annual reports, extracting text fields such as the Business Description, Business Risk, and Management's Discussion and Analysis of Financial Condition and Results of Operations. In this report, we've focused on the Business Description data.

Schema

| Column | Description |
| --- | --- |
| accessionNumber | NA |
| filingDate | When the report was filed with the SEC |
| reportingDate | The date which the report was prepared for |
| financialEntity | Contains the CIK, specifying which company the report belongs to |
| htmlFile | Link to the report |
| coDescription | The Business Description section of the report |

An example row of the data is below:

| accessionNumber | filingDate | reportingDate | financialEntity | htmlFile | coDescription |
| --- | --- | --- | --- | --- | --- |
| 0001144204-09-017197 | 2009-03-31 10:22:32 UTC | 2008-12-31 05:00:00 UTC | financialEntities/params;cik=1140028 | https://www.sec.gov/Archives/edgar/data/1140028/000114420409017197/v143585_10k.htm | `Item 1Business4<a href="#item1a"` |

Notice how the Business Description column (coDescription) also contains some HTML noise.

Extraction

Within the BigQuery interface we were able to query the data, but due to its sheer size (2 GB+) we were unable to extract it efficiently for analysis. Our supervisor, Professor Sotirious Damouras, assisted us not only in extracting the coDescription data, but also in linking each filing with a company name, its respective SIC code (identifying the company's operating industry), and the country and city in which it is headquartered. This data only contains filings from 2018, as the team agreed it would be best to avoid the years affected by COVID while still having the most up-to-date information. As mentioned previously, the sheer size of the data available prevented us from extracting all company filings, so we decided to keep only companies from the top five industries (based on number of companies). That gave us 1127 unique filings to analyze.

Here is a snippet:

```python
import pandas as pd

data = pd.read_json("../data/bq_2018_top5SIC.json", lines=True)
data.head()
```

```
accessionNumber filingDate reportingDate financialEntity htmlFile coDescription CIK name countryinc cityma SIC SIC_desc
0 0001441816-18-000028 2018-03-30 20:12:23 UTC 2018-02-01 financialEntities/params;cik=1441816 https://www.sec.gov/Archives/edgar/data/144181... Item 1. BusinessOverviewMongoDB is the leading... 1441816 MONGODB, INC. US NEW YORK 7372 Prepackaged Software (mass reproduction of sof...
1 0001108524-18-000011 2018-03-09 22:01:46 UTC 2018-02-01 financialEntities/params;cik=1108524 https://www.sec.gov/Archives/edgar/data/110852... ITEM 1. BUSINESSOverviewSalesforce is a global... 1108524 SALESFORCE COM INC US SAN FRANCISCO 7372 Prepackaged Software (mass reproduction of sof...
2 0001564590-18-006986 2018-03-28 21:27:30 UTC 2018-02-01 financialEntities/params;cik=1385867 https://www.sec.gov/Archives/edgar/data/138586... Item 1.Business1<p style="margin-bottom:0pt;ma... 1385867 COUPA SOFTWARE INC US SAN MATEO 7372 Prepackaged Software (mass reproduction of sof...
3 0001353283-18-000004 2018-03-30 21:21:46 UTC 2018-02-01 financialEntities/params;cik=1353283 https://www.sec.gov/Archives/edgar/data/135328... Item 1. BusinessOverviewSplunk provides innov... 1353283 SPLUNK INC NaN SAN FRANCISCO 7372 Prepackaged Software (mass reproduction of sof...
4 0001660134-18-000007 2018-03-12 20:45:43 UTC 2018-02-01 financialEntities/params;cik=1660134 https://www.sec.gov/Archives/edgar/data/166013... Item 1. BusinessOverview Okta is the leading i... 1660134 OKTA, INC. US SAN FRANCISCO 7372 Prepackaged Software (mass reproduction of sof...
```
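The top-five-industry filter described above was applied upstream during the BigQuery extraction, so the file we load here is already pre-filtered. Purely as an illustration, expressed in pandas that kind of filter might look roughly like the sketch below (a hypothetical reconstruction, not the actual extraction code):

```python
# Hypothetical sketch: identify the five SIC codes with the most distinct
# companies, then keep only filings from those industries. The real filtering
# happened in BigQuery before we received the data.
top5_sic = (
    data.groupby("SIC")["CIK"]
        .nunique()                     # number of distinct companies per industry
        .sort_values(ascending=False)
        .head(5)
        .index
)
top5_filings = data[data["SIC"].isin(top5_sic)]
print(len(top5_filings), "filings across the top five industries")
```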

Cleaning

If you look closely, almost all coDescription values start with something like “Item 1. Business Overview”. Some even contain HTML code, identified by its surrounding < and > angle brackets. One of the most important things to keep in mind is that our analysis can only be as good as our data.

In an effort to improve our data, we remove as much duplicated boilerplate text and HTML code as possible, as well as extra whitespace and stray quotation marks. This is done in the clean_data_fn() function of the std_func.py file, located here.

We also remove numbers, as they don’t actually provide us with any semantic meaning.

As a final filter, we also remove any filings with fewer than 250 characters in the coDescription column, as they either don't contain enough data for us to analyze or belong to a shell company with no actual business.
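To make these steps concrete, here is a minimal sketch of the kind of cleaning described above. It is not the actual clean_data_fn() from std_func.py; the exact rules and regular expressions in that file may differ.

```python
import re

def clean_description(text):
    """Rough sketch of the cleaning steps: HTML, headers, quotes, numbers, whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                                       # strip HTML tags
    text = re.sub(r"item\s*1\.?\s*business", " ", text, flags=re.IGNORECASE)   # drop the repeated section header
    text = re.sub(r"[\"“”']", " ", text)                                       # remove stray quotation marks
    text = re.sub(r"\d+", " ", text)                                           # remove numbers
    text = re.sub(r"\s+", " ", text).strip().lower()                           # collapse whitespace
    return text

cleaned = data["coDescription"].astype(str).map(clean_description)
data["coDescription"] = cleaned
data = data[cleaned.str.len() >= 250]    # keep filings with at least 250 characters
```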

After cleaning, the coDescription column looks more like this:

```
0    mongodb is the leading modern, general purpose database platform. our robust platform enables de...
1    salesforce is a global leader in customer relationship management, or crm, software. we deliver ...
3    splunk provides innovative software solutions that enable organizations to gain real-time operat...
4    okta is the leading independent provider of identity for the enterprise. our mission is to enabl...
5    veeva is a leading provider of industry cloud solutions for the global life sciences industry. w...
Name: coDescription, dtype: object
```

After cleaning, there are 675 filings remaining.

Much better!

Pre-processing

Now that the data has been cleaned, it's time to dig in and really analyze it. Through our analysis, we found that words such as “are” and “is”, or “running” and “ran”, carry the same underlying meaning yet are represented as different words. To condense the sheer amount of data we have (in number of words), we perform lemmatization, the process of reducing words to their base form: “are” and “is” are converted to “be”, while “running” and “ran” are converted to “run”. With fewer word variations to deal with, our analysis is bound to improve!
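As an illustration, here is a minimal lemmatization sketch using NLTK's WordNet lemmatizer. This is only an assumed library choice for demonstration; the project's own pre-processing code (linked below) may use a different library or settings.

```python
# Minimal lemmatization sketch (assumes NLTK is installed).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["is", "are", "running", "ran"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# is -> be, are -> be, running -> run, ran -> run
```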

Another pre-processing step taken was removing stop words, which include words such as “the”, “and”, “that”, and “is”, among many more. These words carry little meaning on their own, and our goal is to extract as much semantic information as possible out of our data. Since these stop words don't contribute to that goal (and they take up a LOT of room; just read the previous sentence and count the stop words!), we remove them to further reduce the amount of data we need to process.
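A minimal sketch of stop-word removal is shown below, using NLTK's English stop-word list as an assumed example; the exact list used in the project's code may differ.

```python
# Minimal stop-word removal sketch (assumes NLTK is installed).
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
sentence = "splunk provides innovative software solutions that enable organizations to gain operational intelligence"
print(" ".join(w for w in sentence.split() if w not in stop_words))
# splunk provides innovative software solutions enable organizations gain operational intelligence
```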

You can find the code here.

After these steps, our coDescription column looks like this:

```
0    mongodb leading modern general purpose database platform robust platform enables developer build...
1    salesforce global leader customer relationship management crm software deliver software internet...
3    splunk provides innovative software solution enable organization gain operational intelligence h...
4    okta leading independent provider identity enterprise mission enable organization use technology...
5    veeva leading provider industry cloud solution global life science industry founded premise clou...
Name: coDescription_stopwords, dtype: object
```

This leaves us with detailed, concise business descriptions with as much fluff removed as possible. Our analysis depends on retaining as much information as possible while removing the extraneous bits that don't contribute. This cleaned text, stored in the coDescription_stopwords column, is what we use throughout the majority of our analysis.