Data ETL and Methods of evaluation
This report relies upon data that has been analyzed and preprocessed in order to maximize information extraction. The original data was generously provided to us by Ubineer, a fintech start-up based out of the University of Toronto. This page goes in depth into the methods we used to obtain, clean, and preprocess our data. The methods used for plotting and evaluating the results are also described below.
The data
As mentioned above, we obtained our data from Ubineer, which stores it in Google's BigQuery platform. They have already done a lot of the legwork, processing gigabytes' worth of annual reports and extracting text fields such as the Business Description, Business Risk, and Management's Discussion and Analysis of Financial Condition and Results of Operations. In this report, we've focused on the Business Description data.
Schema

Column | Description
---|---
accessionNumber | NA
filingDate | When the report was filed with the SEC
reportingDate | The date for which the report was prepared
financialEntity | Contains the CIK, specifying which company the report belongs to
htmlFile | Link to the report
coDescription | The Business Description section of the report
An example row of the data is below:
accessionNumber | filingDate | reportingDate | financialEntity | htmlFile | coDescription
---|---|---|---|---|---
0001144204-09-017197 | 2009-03-31 10:22:32 UTC | 2008-12-31 05:00:00 UTC | financialEntities/params;cik=1140028 | https://www.sec.gov/Archives/edgar/data/1140028/000114420409017197/v143585_10k.htm | Item 1Business4<a href="#item1a"
Notice how the Business Description column (coDescription) also contains some HTML noise.
Extraction
Within the BigQuery interface we were able to query the data, although due to its sheer size (2 GB+) we were unable to extract it efficiently for analysis. Our supervisor, Professor Sotirios Damouras, was able to assist us in not only extracting the coDescription data, but also linking each filing with a company name, its respective SIC code (identifying the company's operating industry), and the country and city where it is headquartered. This data only contains filings from 2018, as the team agreed it would be best to avoid years plagued by COVID while still having the most up-to-date information. As mentioned previously, the sheer size of the data available prevented us from extracting all the company filings, so we decided to keep only companies from the top five industries (ranked by number of companies). That gave us 1127 unique filings to analyze.
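The industry filter described above can be sketched as follows. This is an illustrative reconstruction, not the actual extraction query: the toy six-row frame and the choice of top-N value are assumptions made for demonstration.

```python
import pandas as pd

# Hypothetical miniature of the filings data; the real set has 1127 rows
data = pd.DataFrame({
    "CIK": [1, 2, 3, 4, 5, 6],
    "SIC": [7372, 7372, 2836, 7372, 2836, 6022],
})

# Keep only filings whose SIC code is among the top N industries,
# ranked by the number of companies filing under that code
top_sic = data["SIC"].value_counts().nlargest(2).index
filtered = data[data["SIC"].isin(top_sic)]
```

The same `value_counts().nlargest(...)` pattern with N=5 reproduces the "top five industries by company count" filter.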
Here is a snippet:
import pandas as pd
data = pd.read_json("../data/bq_2018_top5SIC.json", lines = True)
data.head()
 | accessionNumber | filingDate | reportingDate | financialEntity | htmlFile | coDescription | CIK | name | countryinc | cityma | SIC | SIC_desc
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001441816-18-000028 | 2018-03-30 20:12:23 UTC | 2018-02-01 | financialEntities/params;cik=1441816 | https://www.sec.gov/Archives/edgar/data/144181... | Item 1. BusinessOverviewMongoDB is the leading... | 1441816 | MONGODB, INC. | US | NEW YORK | 7372 | Prepackaged Software (mass reproduction of sof... |
1 | 0001108524-18-000011 | 2018-03-09 22:01:46 UTC | 2018-02-01 | financialEntities/params;cik=1108524 | https://www.sec.gov/Archives/edgar/data/110852... | ITEM 1. BUSINESSOverviewSalesforce is a global... | 1108524 | SALESFORCE COM INC | US | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
2 | 0001564590-18-006986 | 2018-03-28 21:27:30 UTC | 2018-02-01 | financialEntities/params;cik=1385867 | https://www.sec.gov/Archives/edgar/data/138586... | Item 1.Business1<p style="margin-bottom:0pt;ma... | 1385867 | COUPA SOFTWARE INC | US | SAN MATEO | 7372 | Prepackaged Software (mass reproduction of sof... |
3 | 0001353283-18-000004 | 2018-03-30 21:21:46 UTC | 2018-02-01 | financialEntities/params;cik=1353283 | https://www.sec.gov/Archives/edgar/data/135328... | Item 1. BusinessOverviewSplunk provides innov... | 1353283 | SPLUNK INC | NaN | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
4 | 0001660134-18-000007 | 2018-03-12 20:45:43 UTC | 2018-02-01 | financialEntities/params;cik=1660134 | https://www.sec.gov/Archives/edgar/data/166013... | Item 1. BusinessOverview Okta is the leading i... | 1660134 | OKTA, INC. | US | SAN FRANCISCO | 7372 | Prepackaged Software (mass reproduction of sof... |
Cleaning
If you look closely, almost all coDescription values start with something like "Item 1. Business Overview". Some even contain HTML code, identified by its surrounding < and > angle brackets. One of the most important things to keep in mind is that our analysis can only be as good as our data.
In an effort to improve our data, we remove as much duplicated boilerplate text and HTML code as possible, as well as empty space and stray quotation marks. This is done in the clean_data_fn() function in the std_func.py file, located here.
We also remove numbers, as they don’t actually provide us with any semantic meaning.
As a final filter, we also remove any filings with fewer than 250 characters in the coDescription column, as they either don't have enough data for us to analyze or belong to shell companies with no actual business.
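The cleaning steps above can be sketched roughly as follows. This is not the actual clean_data_fn() from std_func.py, just an illustrative regex-based approximation of the same steps:

```python
import re

def clean_description(text: str) -> str:
    """Illustrative cleaner: strip HTML tags, the 'Item 1. Business'
    boilerplate, digits, stray quotes, and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                              # HTML tags
    text = re.sub(r"item\s*1\.?\s*business", " ", text, flags=re.I)   # boilerplate prefix
    text = re.sub(r"\d+", " ", text)                                  # numbers
    text = re.sub(r"[\"'“”]", " ", text)                              # stray quotations
    return re.sub(r"\s+", " ", text).strip().lower()                  # whitespace, casing
```

Applying this to the column and then dropping rows where the cleaned text is shorter than 250 characters mirrors the final filter described above.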
After cleaning, the coDescription column looks more like this:
0 mongodb is the leading modern, general purpose database platform. our robust platform enables de...
1 salesforce is a global leader in customer relationship management, or crm, software. we deliver ...
3 splunk provides innovative software solutions that enable organizations to gain real-time operat...
4 okta is the leading independent provider of identity for the enterprise. our mission is to enabl...
5 veeva is a leading provider of industry cloud solutions for the global life sciences industry. w...
Name: coDescription, dtype: object
After cleaning, there are 675 filings remaining.
Much better!
Pre-processing
Now that the data has been cleaned, it's time to dig in and really analyze it. Through our analysis, we found that words such as "are" and "is", or "running" and "ran", carry essentially the same meaning, yet are represented as different words. In order to condense the sheer amount of data we have (in number of words), we perform lemmatization: the process of reducing words to their base form. "are" and "is" are converted to "be", while "running" and "ran" become "run". With fewer word variations to deal with, our analysis is bound to improve!
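The idea can be shown with a minimal sketch. The tiny lemma table below is a toy stand-in for a real lemmatizer (pipelines like this typically use NLTK's WordNetLemmatizer or spaCy; we make no claim about which library the report used):

```python
# Toy lemma table mapping inflected forms to their base form;
# a real lemmatizer covers the whole vocabulary.
LEMMAS = {"are": "be", "is": "be", "running": "run", "ran": "run"}

def lemmatize(text: str) -> str:
    """Replace each token with its base form when one is known."""
    return " ".join(LEMMAS.get(tok, tok) for tok in text.split())
```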
Another pre-processing step taken was removing stop words, such as "the", "and", "that", and "is", among many more. These words don't carry any meaning themselves, and our goal is to extract as much semantic information as possible out of our data. As these stop words don't contribute to that goal (and they take up a LOT of room; just read the previous sentence and count the stop words!), we remove them to further reduce the amount of data we need to process.
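Stop-word removal can be sketched like so; the stop-word set here is a tiny illustrative subset (real pipelines usually pull a full list from a library such as NLTK):

```python
# Tiny illustrative subset of a stop-word list
STOP_WORDS = {"the", "and", "that", "is", "a", "of", "to", "in"}

def remove_stopwords(text: str) -> str:
    """Drop any token found in the stop-word set."""
    return " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)
```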
You can find the code here.
After these steps, our coDescription column looks like this:
0 mongodb leading modern general purpose database platform robust platform enables developer build...
1 salesforce global leader customer relationship management crm software deliver software internet...
3 splunk provides innovative software solution enable organization gain operational intelligence h...
4 okta leading independent provider identity enterprise mission enable organization use technology...
5 veeva leading provider industry cloud solution global life science industry founded premise clou...
Name: coDescription_stopwords, dtype: object
This results in detailed, concise business descriptions with as much fluff removed as possible. Our analysis depends on retaining as much information as possible while reducing the extraneous bits that don't contribute. This is the data we use throughout the majority of our analysis, stored in the coDescription_stopwords column.