Data ETL and Methods of evaluation
This report relies upon data that has been analyzed and preprocessed in order to maximize information extraction. The original data was generously provided to us by Ubineer, a fintech start-up based out of the University of Toronto. This page goes in depth into the methods we used to obtain, clean, and preprocess our data. The methods used for plotting and evaluating the results are also described below.
The data
As mentioned above, we obtained our data from Ubineer, which stores it in Google's BigQuery platform. They have already done a lot of the legwork, processing gigabytes' worth of annual reports and extracting text fields such as the Business Description, Business Risk, and Management's Discussion and Analysis of Financial Condition and Results of Operations. In this report, we've focused on the Business Description data.
Schema

Column | Description
---|---
accessionNumber | NA
filingDate | When the report was filed with the SEC
reportingDate | The date for which the report was prepared
financialEntity | Contains the CIK, specifying which company the report belongs to
htmlFile | Link to the report
coDescription | The Business Description section of the report
An example row of the data is below:
accessionNumber | filingDate | reportingDate | financialEntity | htmlFile | coDescription
---|---|---|---|---|---
0001144204-09-017197 | 2009-03-31 10:22:32 UTC | 2008-12-31 05:00:00 UTC | financialEntities/params;cik=1140028 | https://www.sec.gov/Archives/edgar/data/1140028/000114420409017197/v143585_10k.htm | Item 1Business4<a href="#item1a"
Notice how the Business Description column (coDescription) also contains some HTML noise.
Extraction
Within the BigQuery interface we were able to query the data, although due to its sheer size (2 GB+) we were unable to extract it efficiently for analysis. Our supervisor, Professor Sotirios Damouras, was able to assist us in not only extracting the coDescription data, but also linking each filing with a company name, its respective SIC code (identifying the company's operating industry), and the country and city where it is headquartered. This data only contains filings from 2018, as the team agreed it would be best to avoid years plagued by COVID while still having the most up-to-date information. As mentioned previously, the sheer size of the data available prevented us from extracting all the company filings, so we decided to keep only companies from the top five industries (ranked by number of companies). That gave us 1127 unique filings to analyze.
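The industry filter described above can be sketched as follows. This is an illustrative reconstruction, not the actual extraction query: the toy six-row frame and the choice of top-N value are assumptions made for demonstration.

```python
import pandas as pd

# Hypothetical miniature of the filings data; the real set has 1127 rows
data = pd.DataFrame({
    "CIK": [1, 2, 3, 4, 5, 6],
    "SIC": [7372, 7372, 2836, 7372, 2836, 6022],
})

# Keep only filings whose SIC code is among the top N industries,
# ranked by the number of companies filing under that code
top_sic = data["SIC"].value_counts().nlargest(2).index
filtered = data[data["SIC"].isin(top_sic)]
```

The same `value_counts().nlargest(...)` pattern with N=5 reproduces the "top five industries by company count" filter.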
Here is a snippet:
import pandas as pd
data = pd.read_json("../data/bq_2018_top5SIC.json", lines = True)
data.head()
 | accessionNumber | filingDate | reportingDate | financialEntity | htmlFile | coDescription | CIK | name | countryinc | cityma | SIC | SIC_desc
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001441816-18-000028 | 2018-03-30 20:12:23 UTC | 2018-02-01 | financialEntities/params;cik=1441816 | https://www.sec.gov/Archives/edgar/data/144181... | Item 1. BusinessOverviewMongoDB is the leading... | 1441816 | MONGODB, INC. | US | NEW YORK | 7372 | Prepackaged Software (mass reproduction of sof... |
1 | 0001108524-18-000011 | 2018-03-09 22:01:46 UTC | 2018-02-01 | financialEntities/params;cik=1108524 | https://www.sec.gov/Archives/edgar/data/110852... | ITEM 1. BUSINESSOverviewSalesforce is a global... | 1108524 | SALESFORCE COM INC | US | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
2 | 0001564590-18-006986 | 2018-03-28 21:27:30 UTC | 2018-02-01 | financialEntities/params;cik=1385867 | https://www.sec.gov/Archives/edgar/data/138586... | Item 1.Business1<p style="margin-bottom:0pt;ma... | 1385867 | COUPA SOFTWARE INC | US | SAN MATEO | 7372 | Prepackaged Software (mass reproduction of sof... |
3 | 0001353283-18-000004 | 2018-03-30 21:21:46 UTC | 2018-02-01 | financialEntities/params;cik=1353283 | https://www.sec.gov/Archives/edgar/data/135328... | Item 1. BusinessOverviewSplunk provides innov... | 1353283 | SPLUNK INC | NaN | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
4 | 0001660134-18-000007 | 2018-03-12 20:45:43 UTC | 2018-02-01 | financialEntities/params;cik=1660134 | https://www.sec.gov/Archives/edgar/data/166013... | Item 1. BusinessOverview Okta is the leading i... | 1660134 | OKTA, INC. | US | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
Cleaning
If you look closely, almost all coDescription values start with something like "Item 1. Business Overview". Some even contain HTML code, identified by its surrounding < and > angle brackets. One of the most important things to keep in mind is that our analysis can only be as good as our data.
In an effort to improve our data, we remove as much duplicated boilerplate text and HTML code as possible, as well as empty space and stray quotation marks. This is done in the clean_data_fn() function in the std_func.py file, located here.
We also remove numbers, as they don’t actually provide us with any semantic meaning.
As a final filter, we also remove any filings with fewer than 250 characters in the coDescription column, as they either don't have enough data for us to analyze or belong to shell companies with no actual business.
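The cleaning steps above can be sketched roughly as follows. This is not the actual clean_data_fn() from std_func.py, just an illustrative regex-based approximation of the same steps:

```python
import re

def clean_description(text: str) -> str:
    """Illustrative cleaner: strip HTML tags, the 'Item 1. Business'
    boilerplate, digits, stray quotes, and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                              # HTML tags
    text = re.sub(r"item\s*1\.?\s*business", " ", text, flags=re.I)   # boilerplate prefix
    text = re.sub(r"\d+", " ", text)                                  # numbers
    text = re.sub(r"[\"'“”]", " ", text)                              # stray quotations
    return re.sub(r"\s+", " ", text).strip().lower()                  # whitespace, casing
```

Applying this to the column and then dropping rows where the cleaned text is shorter than 250 characters mirrors the final filter described above.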
After cleaning, the coDescription column looks more like this:
0 mongodb is the leading modern, general purpose database platform. our robust platform enables de...
1 salesforce is a global leader in customer relationship management, or crm, software. we deliver ...
3 splunk provides innovative software solutions that enable organizations to gain real-time operat...
4 okta is the leading independent provider of identity for the enterprise. our mission is to enabl...
5 veeva is a leading provider of industry cloud solutions for the global life sciences industry. w...
Name: coDescription, dtype: object
After cleaning, there are 675 filings remaining.
Much better!
Pre-processing
Now that the data has been cleaned, it's time to dig in and really analyze it. Through our analysis, we found that words such as "are" and "is", or "running" and "ran", carry essentially the same meaning, yet are represented as different words. In order to condense the sheer amount of data we have (in number of words), we perform lemmatization: the process of reducing words to their base form. "are" and "is" are converted to "be", while "running" and "ran" become "run". With fewer word variations to deal with, our analysis is bound to improve!
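The idea can be shown with a minimal sketch. The tiny lemma table below is a toy stand-in for a real lemmatizer (pipelines like this typically use NLTK's WordNetLemmatizer or spaCy; we make no claim about which library the report used):

```python
# Toy lemma table mapping inflected forms to their base form;
# a real lemmatizer covers the whole vocabulary.
LEMMAS = {"are": "be", "is": "be", "running": "run", "ran": "run"}

def lemmatize(text: str) -> str:
    """Replace each token with its base form when one is known."""
    return " ".join(LEMMAS.get(tok, tok) for tok in text.split())
```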
Another pre-processing step taken was removing stop words, such as "the", "and", "that", and "is", among many more. These words don't carry any meaning themselves, and our goal is to extract as much semantic information as possible out of our data. As these stop words don't contribute to that goal (and they take up a LOT of room; just read the previous sentence and count the stop words!), we remove them to further reduce the amount of data we need to process.
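Stop-word removal can be sketched like so; the stop-word set here is a tiny illustrative subset (real pipelines usually pull a full list from a library such as NLTK):

```python
# Tiny illustrative subset of a stop-word list
STOP_WORDS = {"the", "and", "that", "is", "a", "of", "to", "in"}

def remove_stopwords(text: str) -> str:
    """Drop any token found in the stop-word set."""
    return " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)
```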
You can find the code here.
After these steps, our coDescription column looks like this:
0 mongodb leading modern general purpose database platform robust platform enables developer build...
1 salesforce global leader customer relationship management crm software deliver software internet...
3 splunk provides innovative software solution enable organization gain operational intelligence h...
4 okta leading independent provider identity enterprise mission enable organization use technology...
5 veeva leading provider industry cloud solution global life science industry founded premise clou...
Name: coDescription_stopwords, dtype: object
This results in detailed, concise business descriptions with as much fluff removed as possible. Our analysis depends on retaining as much information as possible while reducing the extraneous bits that don't contribute. This is the data we use throughout the majority of our analysis, stored in the coDescription_stopwords column.