Dynamic Company Embedding¶

This notebook looks at the word embeddings for companies and at what has changed between the years 2016 and 2018, using techniques we’ve explored earlier such as

tf-idf (term frequency - inverse document frequency)
word2vec
doc2vec

import os
import json
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '..')
%load_ext autoreload
%autoreload 2
%aimport std_func

# Hide warnings
import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv("../data/preprocessed.csv")
data_t = pd.read_csv("../mary/filtered_timeseries_data.csv")

We will examine these five companies and their sequential annual reports:

OKTA
Z-Scaler
NFLX
IBM
GE

We have 564 unique companies that are within the corpus of data we will examine.

Diving into the embeddings¶

In this section, we look at the three embedding techniques employed earlier and how they can help us analyze how four companies have changed year after year.

tf-idf (term frequency - inverse document frequency)¶

For tf-idf we can look directly at the terms extracted and analyze how they’ve changed over time.

We have written a function that computes the 20 terms that experienced the largest changes in value year over year, for each company and each year. Below are the results after running our data through it.
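The function itself is not reproduced on this page. As a rough sketch of the idea only, assume a tf-idf matrix for a single company with one row per annual filing (in chronological order) and one column per term. The name `top_tfidf_deltas` and the toy values below are hypothetical, not the notebook's actual implementation.

```python
import pandas as pd

def top_tfidf_deltas(tfidf: pd.DataFrame, k: int = 20) -> pd.DataFrame:
    """Return the k terms with the largest absolute tf-idf change
    for each pair of consecutive filings.

    tfidf: rows are filings in chronological order, columns are terms.
    """
    deltas = tfidf.diff().iloc[1:]  # row i holds filing i minus filing i-1
    rows = []
    for step, (_, row) in enumerate(deltas.iterrows()):
        # Rank terms by the magnitude of the change; the sign is kept in the output
        top_terms = row.abs().sort_values(ascending=False).head(k).index
        for term in top_terms:
            rows.append({"YoY": f"year {step} to year {step + 1}",
                         "Term": term,
                         "Delta": row[term]})
    return pd.DataFrame(rows)

# Toy tf-idf values for three filings of one company (made-up numbers)
toy = pd.DataFrame(
    {"mobile application": [0.40, 0.10, 0.10],
     "customer support": [0.05, 0.15, 0.10],
     "end user": [0.10, 0.10, 0.30]},
    index=[2018, 2019, 2020],
)
result = top_tfidf_deltas(toy, k=2)
print(result)
```

Ranking by absolute value is what lets large decreases (such as a negative delta for “mobile application”) appear alongside large increases in the tables that follow.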

OKTA¶

We can now examine OKTA and which terms experienced the largest changes in value year over year.

# OKTA
OKTA = delta_tfidf[delta_tfidf.iloc[:,0] == 1660134].iloc[:60,:].groupby("YoY").head(5)

2018 - 2019¶

The earliest OKTA filing available is from 2018. We can see from this analysis that between 2018 and 2019, the largest decrease in tf-idf value (term importance) was for “mobile application”, while terms such as “help customer” and “customer support” rose in importance.

From Wikipedia,

It provides cloud software that helps companies manage and secure user authentication into applications, and for developers to build identity controls into applications, website web services and devices.

We can reason that throughout 2019, Okta shifted their focus away from mobile applications to some degree, and increased their focus on customer care.

Assuming they have many business customers with a variety of applications, we could consider this the year Okta shifted their services a little further from mobile.

OKTA[:5]

	CIK	Name	YoY	Term	Delta
68080	1660134	OKTA	year 0 to year 1	mobile application	-0.296099
68081	1660134	OKTA	year 0 to year 1	help customer	0.087498
68082	1660134	OKTA	year 0 to year 1	customer support	0.075500
68083	1660134	OKTA	year 0 to year 1	market segment	0.070916
68084	1660134	OKTA	year 0 to year 1	outside united state	0.068456

2019 - 2020¶

In 2020, it seems the “end user” became more important to Okta, supported by the increased importance of “customer need” as well.

This year could be categorized as a shift toward focusing on users.

OKTA[5:10]

	CIK	Name	YoY	Term	Delta
68100	1660134	OKTA	year 1 to year 2	end user	0.125978
68101	1660134	OKTA	year 1 to year 2	market segment	-0.124971
68102	1660134	OKTA	year 1 to year 2	customer need	0.114765
68103	1660134	OKTA	year 1 to year 2	technology platform	0.071739
68104	1660134	OKTA	year 1 to year 2	new technology	0.057768

2020 - 2021¶

We see an increase in importance of terms like “merger agreement”, “common stock”, “per share”, and “equity interest”. These terms are all related to mergers and acquisitions, and a quick search reveals that in May of 2021, Okta successfully completed a merger with another company named Auth0.

You could consider this M&A event to carry the most importance to Okta’s shareholders at the end of 2021, which is why it was given such importance in the Business Description section.

OKTA[10:15]

	CIK	Name	YoY	Term	Delta
68120	1660134	OKTA	year 2 to year 3	merger agreement	0.256254
68121	1660134	OKTA	year 2 to year 3	common stock	0.134086
68122	1660134	OKTA	year 2 to year 3	per share	0.119493
68123	1660134	OKTA	year 2 to year 3	equity interest	0.105603
68124	1660134	OKTA	year 2 to year 3	purchase price	0.082064

Z-Scaler¶

Z-scaler is a cloud security company, headquartered in San Jose, California. The company’s cloud-native technology platform, the Zscaler Zero Trust Exchange, is designed to help enterprise customers secure their employees, applications, and data as infrastructure and applications move to the cloud and as employees connect to work remotely, off the traditional corporate network.

Wikipedia

Knowing what kind of company Z-Scaler is now, let’s try to infer what these changes in term importance mean.

2019-2020¶

Unfortunately, we only have 2 years of data for Z-Scaler. In 2020, it seems they had a large focus on “data center”, and terms that included “service” increased in importance. We could infer that in 2020, the company, which provides security to its clients in the public cloud or private data centers, had an increased focus on the latter group of services.

# Z-Scaler
delta_tfidf[delta_tfidf.iloc[:,0] == 1713683].iloc[:60,:].groupby("YoY").head(5)

	CIK	Name	YoY	Term	Delta
68320	1713683	Z-Scaler	year 0 to year 1	data center	0.123903
68321	1713683	Z-Scaler	year 0 to year 1	approximately billion	-0.074951
68322	1713683	Z-Scaler	year 0 to year 1	service include	0.065553
68323	1713683	Z-Scaler	year 0 to year 1	service provider	0.064729
68324	1713683	Z-Scaler	year 0 to year 1	credit card	-0.041534

Netflix¶

Netflix is a company that usually needs no introduction, but I will give one regardless.

Netflix, Inc. is an American subscription streaming service and production company. Launched on August 29, 1997, it offers a film and television series library through distribution deals as well as its own productions, known as Netflix Originals.

Let’s look at what changes Netflix as a company went through over the years.

2008 - 2009¶

Recall that these are the terms that experienced the largest absolute changes in tf-idf value year over year. This implies that in 2009, Netflix talked about their “business model”, “customer support” and “proprietary technology” less. It’s hard to gauge what this translated into, but it is interesting nonetheless.

# Netflix
nlfx = delta_tfidf[delta_tfidf.iloc[:,0] == 1065280].iloc[:80,:].groupby("YoY").head(5)
nlfx[:5]

	CIK	Name	YoY	Term	Delta
67680	1065280	NFLX	year 0 to year 1	business model	-0.089397
67681	1065280	NFLX	year 0 to year 1	throughout united	0.082932
67682	1065280	NFLX	year 0 to year 1	customer support	-0.067874
67683	1065280	NFLX	year 0 to year 1	high level	-0.061801
67684	1065280	NFLX	year 0 to year 1	proprietary technology	-0.058393

2009 - 2010¶

In 2010, it seems “license agreement”, “third party”, and “intellectual property” all increased in importance.

This is quite important, as according to Britannica,

In 2010 Netflix introduced a streaming-only plan that offered unlimited streaming service but no DVDs. Netflix then expanded beyond the United States by offering the streaming-only plan in Canada in 2010, in Latin America and the Caribbean in 2011, and in the United Kingdom, Ireland, and Scandinavia in 2012.

This year marked a huge change in Netflix’s business model, and of course their company business description shows this.

nlfx[5:10]

	CIK	Name	YoY	Term	Delta
67700	1065280	NFLX	year 1 to year 2	high level	0.105325
67701	1065280	NFLX	year 1 to year 2	proprietary technology	-0.074626
67702	1065280	NFLX	year 1 to year 2	license agreement	0.071091
67703	1065280	NFLX	year 1 to year 2	third party	0.060303
67704	1065280	NFLX	year 1 to year 2	intellectual property	0.045675

2010 - 2011¶

As Netflix’s new business venture started to take off, so did their investments into it and its importance to the business, as highlighted by the increased importance of “united state”, since they first released the streaming service in the United States. Note that “web site”, by contrast, saw the largest decrease of the year. Throughout 2011, Netflix experienced quite poor demand as consumers were still recovering from the 2008 economic crash, so Netflix had to assure its investors of its ability to stay in business. That could explain why “investor relation” gained a high degree of importance.

nlfx[10:15]

	CIK	Name	YoY	Term	Delta
67720	1065280	NFLX	year 2 to year 3	web site	-0.323084
67721	1065280	NFLX	year 2 to year 3	investor relation	0.191036
67722	1065280	NFLX	year 2 to year 3	united state	0.166511
67723	1065280	NFLX	year 2 to year 3	sec filing	0.156843
67724	1065280	NFLX	year 2 to year 3	license agreement	-0.140434

2011 - 2012¶

As noted above, Netflix expanded to the United Kingdom among other countries in 2012, which may explain the decrease in focus/importance of “united state”. The rest of the changes in term importance are hard to explain at a glance, so make of them what you will. :)

nlfx[15:20]

	CIK	Name	YoY	Term	Delta
67740	1065280	NFLX	year 3 to year 4	united state	-0.126168
67741	1065280	NFLX	year 3 to year 4	web site	-0.122782
67742	1065280	NFLX	year 3 to year 4	consolidated net	0.111090
67743	1065280	NFLX	year 3 to year 4	service offering	-0.100654
67744	1065280	NFLX	year 3 to year 4	supplementary data	0.099262

General Electric¶

General Electric is a household brand that manufactures everything from washing machines to jet engines. As per Wikipedia,

General Electric Company (GE) is an American multinational conglomerate incorporated in New York State and headquartered in Boston. Until 2021, the company operated in sectors including aviation, power, renewable energy, digital industry, additive manufacturing, locomotives, and venture capital and finance but has since divested from several areas, now primarily consisting of the first four segments.

So they have their hands in many baskets. What can changes in the words used in their annual reports tell us about what they’re doing as a company from year to year?

2011 - 2012¶

In 2012, terms such as “oil gas”, “financial service”, and “completed acquisition” increased in importance, implying GE may have made significant investments into the Oil & Gas industry as well as the Financial Services industry, perhaps through acquisitions.

From the Acquisitions section on their Wikipedia page, during 2011 and 2012,

In March 2011, GE announced that it had completed the acquisition of privately held Lineage Power Holdings from The Gores Group. In April 2011, GE announced it had completed its purchase of John Wood plc’s Well Support Division for $2.8 billion.

In 2011, GE Capital sold its $2 billion Mexican assets to Santander for 162 million and exit the business in Mexico. Santander additionally assumed the portfolio debts of GE Capital in the country. Following this, GE Capital focused in its core business and shed its non-core assets.

In June 2012, CEO and President of GE Jeff Immelt said that the company would invest ₹3 billion to accelerate its businesses in Karnataka. In October 2012, GE acquired $7 billion worth of bank deposits from MetLife Inc.

All transactions were related to the energy sector (related to Oil & Gas) or finance (Santander is a large Spanish bank).

# GE
ge = delta_tfidf[delta_tfidf.iloc[:,0] == 40545].iloc[:60,:].groupby("YoY").head(5)
ge[:5]

	CIK	Name	YoY	Term	Delta
68360	40545	GE	year 0 to year 1	oil gas	0.100988
68361	40545	GE	year 0 to year 1	financial service	0.076844
68362	40545	GE	year 0 to year 1	product service	-0.073995
68363	40545	GE	year 0 to year 1	completed acquisition	0.062822
68364	40545	GE	year 0 to year 1	service segment	0.048669

2012 - 2013¶

In 2013, it seems GE made further divestments in financial services, as terms such as “credit card” and “financial service” decreased in importance, as did “power generation” and “real estate” to a lesser extent. A quick Google search does not explain these changes in General Electric’s business, so there are currently no conclusions to draw from this.

ge[5:10]

	CIK	Name	YoY	Term	Delta
68380	40545	GE	year 1 to year 2	credit card	-0.093631
68381	40545	GE	year 1 to year 2	financial service	-0.069370
68382	40545	GE	year 1 to year 2	power generation	-0.057595
68383	40545	GE	year 1 to year 2	real estate	-0.045220
68384	40545	GE	year 1 to year 2	ownership interest	-0.040749

2013 - 2014¶

As with 2013, the changes in term importance for 2014 are quite broad. One term that does stand out, however, is “north american”, as GE announced plans for a new facility in Cincinnati this year.

Later in 2014, General Electric announced plans to open its global operations center in Cincinnati, Ohio. The Global Operations Center opened in October 2016 as home to GE’s multifunctional shared services organization. It supports the company’s finance/accounting, human resources, information technology, supply chain, legal and commercial operations, and is one of GE’s four multifunctional shared services centers worldwide in Pudong, China; Budapest, Hungary; and Monterrey, Mexico.

ge[10:15]

	CIK	Name	YoY	Term	Delta
68400	40545	GE	year 2 to year 3	real estate	0.055807
68401	40545	GE	year 2 to year 3	entered agreement	-0.045193
68402	40545	GE	year 2 to year 3	north american	0.044864
68403	40545	GE	year 2 to year 3	financial service	-0.040293
68404	40545	GE	year 2 to year 3	first quarter	0.037850

word2vec¶

Now, the issue with word2vec is that the output feature columns don’t represent a term or topic. Luckily, most of our data is already labelled and comes with industry classifications.

There are 171 industries listed in our data of 564 companies.

Let’s perform word2vec on this data. Once we have embeddings for each company, we can compute the mean vector of each industry, which we can use as a pseudo-industry embedding. This is the vector we will compare companies against to infer whether they’ve moved away from or toward any given industry. It is calculated on the entire time series of company filings.

Below is the code that extracts this information.

from gensim.models.word2vec import Word2Vec
from gensim import utils
from sklearn.metrics.pairwise import cosine_similarity

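# Tokenize each business description into a list of lowercase tokens
time_processed = final["coDescription_stopwords"].apply(lambda x: utils.simple_preprocess(x))

# https://stackoverflow.com/questions/46560861/relation-between-word2vec-vector-size-and-total-number-of-words-scanned
model_w = Word2Vec(time_processed, vector_size = 300)

def doc_to_vec(text):
    # Collect the vectors of all words that are in the model's vocabulary
    word_vecs = [model_w.wv[w] for w in text if w in model_w.wv]

    # Fall back to a zero vector when no word is in the vocabulary
    if len(word_vecs) == 0:
        return np.zeros(model_w.vector_size)

    # The document embedding is the average of its word vectors
    return np.mean(word_vecs, axis = 0)

doc_vec = pd.DataFrame(time_processed.apply(doc_to_vec).tolist())

# Attach filing metadata to the document vectors (rows are aligned on the index)
data_w2v = pd.concat([final.loc[:,["filingDate","CIK", "Name", "SIC_Descrip"]],doc_vec],axis = 1)

# Mean vector per industry, used as a pseudo-industry embedding
industry_vectors = data_w2v.iloc[:,3:].groupby("SIC_Descrip").mean()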

To calculate the distance between the annual reports of our selected companies and the industries, we will compute the cosine similarity scores.

For each year of filings, we only care about the industries with the smallest distance (greatest cosine similarity) and how these top industries have changed over time.
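As a reminder of what the score measures, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. Here is a minimal NumPy sketch with made-up three-dimensional vectors (the real embeddings have 300 dimensions). The helper `cosine_sim` is illustrative and is not the scikit-learn function used in this notebook.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

company = np.array([1.0, 2.0, 0.0])      # hypothetical company embedding
industry_a = np.array([2.0, 4.0, 0.0])   # same direction, twice the length
industry_b = np.array([0.0, 0.0, 3.0])   # orthogonal to the company vector

sim_a = cosine_sim(company, industry_a)
sim_b = cosine_sim(company, industry_b)
print(sim_a, sim_b)
```

Because the score ignores vector length, a long filing and a short filing that use similar vocabulary can still score close to 1 against the same industry vector.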

The inner workings of this are quite simple.

For each year, we extract the top 5 industries with the highest similarity scores. These industry names are stored and this step is repeated for every year.
Since companies change year over year, different industries could appear in the top 5, which is why some figures below will have more than 5 industries visualized.
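The two steps above can be sketched as a small helper. This is an illustrative version, assuming a DataFrame that contains only numeric similarity columns (one row per filing, one column per industry). The name `top_industries` and the toy scores are hypothetical.

```python
import pandas as pd

def top_industries(sim: pd.DataFrame, k: int = 5) -> set:
    """Union of the k most similar industries across all filings.

    sim: one row per filing, one column per industry, values are similarity scores.
    """
    selected = set()
    for _, row in sim.iterrows():
        # Keep the names of the k highest-scoring industries for this filing
        selected.update(row.nlargest(k).index)
    return selected

# Toy similarity scores for two filings and four industries (made-up numbers)
toy_sim = pd.DataFrame(
    {"Software": [0.90, 0.80],
     "Hardware": [0.70, 0.40],
     "Retail": [0.20, 0.60],
     "Banking": [0.10, 0.10]},
    index=["2019", "2020"],
)
industries = top_industries(toy_sim, k=2)
print(sorted(industries))
```

With `k=2`, three industries are returned here because the second-ranked industry differs between the two filings, which is the same reason some plots show more than five lines.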

Okta¶

okta_w2v = pd.concat([data_w2v[data_w2v.loc[:,"CIK"] == 1660134].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_w2v[data_w2v.loc[:,"CIK"] == 1660134].iloc[:,4:], industry_vectors),
                        columns = industry_vectors.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Okta throughout its history
okta_industries = set((okta_w2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,okta_w2v.shape[0]):
    okta_industries.update(okta_w2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(okta_w2v, okta_industries, "Okta")

The above plot shows the industries that ranked among Okta’s five closest in any year. Its similarity to these industries stayed quite consistent over the 4 time points, except that between 2020 and 2021 its similarity to “Computer Integrated Systems Design” increased greatly and its similarity to “Computer and Office Equipment” increased a bit less.

Okta’s main service is to handle identity and access management, so naturally integrating into its clients’ services and internal systems is expected.

The increase in similarity to “Computer and Office Equipment” is harder to interpret, but perhaps it was due to the increased mentions of work from home setups within many companies and how Okta’s services should adapt/have adapted to it.

Z-Scaler¶

zs_w2v = pd.concat([data_w2v[data_w2v.loc[:,"CIK"] == 1713683].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_w2v[data_w2v.loc[:,"CIK"] == 1713683].iloc[:,4:], industry_vectors),
                        columns = industry_vectors.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Z-Scaler throughout its history
zs_industries = set((zs_w2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,zs_w2v.shape[0]):
    zs_industries.update(zs_w2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(zs_w2v, zs_industries, "Z-Scaler")

Across the board, it seems Z-Scaler’s similarity to its closest industries decreased. This suggests that Z-Scaler is potentially moving away from the computer/software related industry as a whole.

This interpretation should be taken with a large grain of salt, as we are only visualizing changes over one year.

Netflix¶

nlfx_w2v = pd.concat([data_w2v[data_w2v.loc[:,"CIK"] == 1065280].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_w2v[data_w2v.loc[:,"CIK"] == 1065280].iloc[:,4:], industry_vectors),
                        columns = industry_vectors.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Netflix throughout its history
nlfx_industries = set((nlfx_w2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,nlfx_w2v.shape[0]):
    nlfx_industries.update(nlfx_w2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(nlfx_w2v, nlfx_industries, "Netflix")

Netflix experienced quite a tumultuous 5 years, which is no surprise as this was the exact turning point when they first released their streaming service.

This chart depicts quite clearly how Netflix moved significantly away from industries such as “Electronic Computers”, “Computer and Office Equipment” and “Electronic Parts and Equipment”. In 2008, they released a home set-top box manufactured by Roku which allowed customers to play movies and shows all for one fee. As most people know, they moved away from that and started offering their streaming services online starting in 2010.

What can’t be easily explained are the dramatic increases in the “Catalog and mail-order houses (electronic shopping websites)” and “Household Audio and Video Equipment” industries. Presumably their streaming services could be purchased on Netflix’s website, thereby making them an electronic shopping website, but that still doesn’t explain the increase in the latter industry. At this stage of the company, Netflix was moving away from physical products and toward their streaming service business.

General Electric¶

ge_w2v = pd.concat([data_w2v[data_w2v.loc[:,"CIK"] == 40545].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_w2v[data_w2v.loc[:,"CIK"] == 40545].iloc[:,4:], industry_vectors),
                        columns = industry_vectors.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with General Electric throughout its history
ge_industries = set((ge_w2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,ge_w2v.shape[0]):
    ge_industries.update(ge_w2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(ge_w2v, ge_industries, "General Electric")

Finally, we have General Electric. It’s quite interesting to see how its similarity to its top industries has declined consistently year over year. The movement of this graph implies General Electric was in a large divestment stage in its top related industries.

Notes on how the analysis was performed¶

In the above word2vec section, one aspect of some companies may have been missed: for companies with large divestments in their top related industries, what did they move toward? Our analysis here focused primarily on the industries identified as being the most similar, as it’s quite informative to speak to the industries a company operates within. But some companies, such as Netflix and General Electric, could have experienced large increases in similarity with industries that are very far from where their embeddings are.

For example, Netflix could be moving much closer to the “Prepackaged Software” industry, but because its similarity is very low (say 0.50), even a large increase to, say, 0.75 would go unnoticed because that is still too low to be considered one of Netflix’s top 5 industries.
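One possible way to surface such moves, sketched here under the assumption of a DataFrame holding only numeric similarity columns (one row per filing in chronological order), is to rank industries by their change in similarity rather than by their level. The function name `largest_movers` and the toy scores are hypothetical and were not part of the analysis above.

```python
import pandas as pd

def largest_movers(sim: pd.DataFrame, k: int = 5) -> pd.Series:
    """Industries with the largest absolute change in similarity
    between the first and the last filing."""
    change = sim.iloc[-1] - sim.iloc[0]
    # Order by magnitude of the change while keeping its sign in the result
    order = change.abs().sort_values(ascending=False).index
    return change.reindex(order).head(k)

# Toy scores: "Prepackaged Software" is never in the top 2 by level,
# but it has by far the largest change (0.50 to 0.75)
toy_sim = pd.DataFrame(
    {"Electronic Computers": [0.95, 0.90],
     "Household Audio and Video Equipment": [0.92, 0.93],
     "Prepackaged Software": [0.50, 0.75]},
    index=["first filing", "last filing"],
)
movers = largest_movers(toy_sim, k=2)
print(movers)
```

Ranking by change instead of level would complement the top-5 plots rather than replace them, since small absolute similarities can still be noisy.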

doc2vec¶

Now we’ll look at how embeddings created using doc2vec have changed over time for our four companies.

from gensim.models import doc2vec
from collections import namedtuple

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(final["coDescription_stopwords"]):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4)

doc_vec_2 = pd.DataFrame([model.dv[doc] for doc in np.arange(0,len(docs))])
data_d2v = pd.concat([final.loc[:,["filingDate","CIK", "Name", "SIC_Descrip"]],doc_vec_2],axis = 1)

industry_vectors_d2v = data_d2v.iloc[:,3:].groupby("SIC_Descrip").mean()

Okta¶

okta_d2v = pd.concat([data_d2v[data_d2v.loc[:,"CIK"] == 1660134].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_d2v[data_d2v.loc[:,"CIK"] == 1660134].iloc[:,4:], industry_vectors_d2v),
                        columns = industry_vectors_d2v.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Okta throughout its history
okta_industries_d2v = set((okta_d2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,okta_d2v.shape[0]):
    okta_industries_d2v.update(okta_d2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(okta_d2v, okta_industries_d2v, "Okta")

The above plot shows the top 5 closest industries to Okta over time, calculated using doc2vec embeddings. It’s quite interesting to see “Business Services, NEC (tobacco sheeting service)” ranked so highly for Okta here.

Z-Scaler¶

zs_d2v = pd.concat([data_d2v[data_d2v.loc[:,"CIK"] == 1713683].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_d2v[data_d2v.loc[:,"CIK"] == 1713683].iloc[:,4:], industry_vectors_d2v),
                        columns = industry_vectors_d2v.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Z-Scaler throughout its history
zs_industries = set((zs_d2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,zs_d2v.shape[0]):
    zs_industries.update(zs_d2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(zs_d2v, zs_industries, "Z-Scaler")

Across the board, it seems Z-Scaler’s similarity to its closest industries increased, in contrast to our word2vec results. This suggests that Z-Scaler is moving closer to the computer integrated systems design/modem related industries.

This interpretation should be taken with a large grain of salt as well, as we are only visualizing changes over one year and the results are not consistent with our word2vec results.

Netflix¶

nlfx_d2v = pd.concat([data_d2v[data_d2v.loc[:,"CIK"] == 1065280].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_d2v[data_d2v.loc[:,"CIK"] == 1065280].iloc[:,4:], industry_vectors_d2v),
                        columns = industry_vectors_d2v.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with Netflix throughout its history
nlfx_industries = set((nlfx_d2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,nlfx_d2v.shape[0]):
    nlfx_industries.update(nlfx_d2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(nlfx_d2v, nlfx_industries, "Netflix")

A quick look at the largest increases in cosine similarity tells us that this graph is quite meaningless, although fun to look at. The largest jumps in cosine similarity occurred with the “Wholesale Trade–Durable Goods” industry and the “Business Services, NEC (tobacco sheeting service)” industry, which are far cries from Netflix’s products and services.

As per definition,

Industries in the Wholesale Trade, Durable Goods subsector sell or arrange the purchase or sale of capital or durable goods to other businesses. Durable goods are new or used items generally with a normal life expectancy of three years or more. Durable goods wholesale trade establishments are engaged in wholesaling products, such as motor vehicles, furniture, construction materials, machinery and equipment (including household-type appliances), metals and minerals (except petroleum), sporting goods, toys and hobby goods, recyclable materials, and parts.

This result is unfortunately not very interpretable.

General Electric¶

ge_d2v = pd.concat([data_d2v[data_d2v.loc[:,"CIK"] == 40545].iloc[:,0].reset_index(drop=True),
           pd.DataFrame(cosine_similarity(data_d2v[data_d2v.loc[:,"CIK"] == 40545].iloc[:,4:], industry_vectors_d2v),
                        columns = industry_vectors_d2v.index.tolist())],
          axis = 1)

# Gets the set of top industries associated with General Electric throughout its history
ge_industries = set((ge_d2v.iloc[0,1:].sort_values(ascending = False).head(5).index.tolist()))
for i in np.arange(1,ge_d2v.shape[0]):
    ge_industries.update(ge_d2v.iloc[i,1:].sort_values(ascending = False).head(5).index.tolist())

std_func.dynamic_plt(ge_d2v, ge_industries, "General Electric")

Finally, we have General Electric. This graph is quite interesting, as GE is a multinational conglomerate which has its hands in many industries. The graph illustrates this aspect of GE quite well, as its most similar industries are quite unrelated to each other. Due to the scale of this company, it is also quite difficult to interpret why certain industries increased in similarity, and as we’ve seen above it may be best to take dynamic analysis results using doc2vec with a grain of salt.

Conclusion¶

On this page, you’ve learned about our methodology for measuring and analyzing changes in business descriptions. These changes helped us identify the largest changes in each of these companies, whether an increase or decrease in the importance of certain terms, or a shift toward or away from its closest industries.

As this work is unsupervised and exploratory, it’s quite hard to extract any concrete results that could lead to actionable insight. This is more a look into the techniques available to perform NLP analysis. We hope this was informative about how time series data and text embeddings have many novel applications waiting to be explored.
