N-gram Embeddings - Cosine Similarity Analysis

Next, we use cosine similarity distances to measure how similar company descriptions are to one another. In this notebook, we simply use n-gram count embeddings for the cosine similarity analysis.

Cosine similarity measures the similarity between two vectors of an inner product space. In text analysis, a document can be represented by its elements (words) and the frequency of each element. In our case each document is a company's business description: every description becomes a vector of term frequencies, and comparing these vectors yields a cosine similarity score between documents that measures how similar two companies are in terms of their business description.
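Formally, for two term-frequency vectors $A$ and $B$, the cosine similarity is the cosine of the angle between them:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}$$

Since term frequencies are non-negative, the score falls between 0 (no shared terms) and 1 (identical relative term frequencies).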

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the preprocessed filings: company descriptions (stopwords removed)
# plus the SIC industry labels used for evaluation.
df = pd.read_csv('../data/preprocessed.csv',
                 usecols = ['reportingDate', 'name',
                            'coDescription_stopwords', 'SIC', 'SIC_desc'])
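As a quick check, the loaded frame should contain one row per company and the five selected columns:

df.shape  # expect (675, 5): 675 company descriptions, 5 retained columns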

Cosine Similarity Analysis

Word Counting

For this cosine similarity analysis, we treat each sequence of 2 to 4 words (an n-gram) as one term and keep only the 600 most frequent terms.

from sklearn.feature_extraction.text import CountVectorizer

# Count 2- to 4-grams, keeping the 600 most frequent terms
vectorizer = CountVectorizer(ngram_range=(2, 4),
                             max_features=600)

count_data = vectorizer.fit_transform(df['coDescription_stopwords'])
wordsCount = pd.DataFrame(count_data.toarray(),
                          columns=vectorizer.get_feature_names_out())
wordsCount = wordsCount.set_index(df['name'])
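As a quick sanity check (not part of the original pipeline), we can list the most frequent n-grams across the corpus:

# Ten most frequent 2-to-4-grams summed over all company descriptions
wordsCount.sum(axis=0).sort_values(ascending=False).head(10)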

Here is the n-gram embedding matrix, with the 600 2-to-4-grams as columns and the 675 companies as rows.

wordsCount
Columns (first and last 10 of the 600 n-grams shown): ability make; accounting standard; acquire property; act; act act amended; additional information; adequately capitalized; adverse effect; adverse effect business; adverse event; ...; wa million; weighted average; well capitalized; wholly owned; wholly owned subsidiary; wide range; within day; working interest; year ended; year ended december

name
MONGODB, INC.                                      0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   3   0   0   5   0
SALESFORCE COM INC                                 0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0   0
SPLUNK INC                                         0  0  0  0  1  2  0  0  0  0  ...   0   0   0   0   0   0   0   0   0   0
OKTA, INC.                                         0  0  0  0  0  1  0  0  0  0  ...   0   0   0   0   0   1   0   0   1   0
VEEVA SYSTEMS INC                                  0 12  0  1  4  1  0  7  4  0  ...  18   4   0   0   0   0   1   0 102   0
...
AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.   0  0  1  0  0  1  0  1  0  0  ...   0   0   0   0   0   0   0   0   2   2
CYCLACEL PHARMACEUTICALS, INC.                     0  0  0  0  0  1  0  1  0  1  ...   0   0   0   0   0   0   1   0   0   0
ZOETIS INC.                                        0 17  0  0  0 12  0  3  0  0  ...  20   5   0   1   1   0   2   0  84  83
STAG INDUSTRIAL, INC.                              0  0  1  0  1  0  0  1  1  0  ...   0   0   0   0   0   0   0   0   2   2
EQUINIX INC                                        0  0  0  0  2  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   2   2

675 rows × 600 columns

Cosine Similarity Computation

Now we feed the 2-to-4-gram embeddings into the cosine similarity computation to analyze text similarity.

# Compute pairwise cosine similarity between all description vectors
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = pd.DataFrame(cosine_similarity(wordsCount))
cosine_sim = cosine_sim.set_index(df['name'])
cosine_sim.columns = df['name']

The description similarity between two companies ranges from 0 to 1: the higher the cosine similarity score, the more similar their descriptions are.

cosine_sim
Columns (first and last 10 of the 675 companies shown, in the same order as the rows): MONGODB, INC.; SALESFORCE COM INC; SPLUNK INC; OKTA, INC.; VEEVA SYSTEMS INC; AUTODESK INC; INTERNATIONAL WESTERN PETROLEUM, INC.; DAYBREAK OIL & GAS, INC.; ETERNAL SPEECH, INC.; ETERNAL SPEECH, INC.; ...; OMEGA HEALTHCARE INVESTORS INC; TABLEAU SOFTWARE INC; HORIZON PHARMA PLC; MERRIMACK PHARMACEUTICALS INC; REVEN HOUSING REIT, INC.; AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.; CYCLACEL PHARMACEUTICALS, INC.; ZOETIS INC.; STAG INDUSTRIAL, INC.; EQUINIX INC

name
MONGODB, INC.                                     1.000000  0.445455  0.610272  0.620961  0.500762  0.338268  0.065380  0.052345  0.000000  0.000000  ...  0.050935  0.630465  0.436327  0.143385  0.066598  0.135839  0.144678  0.189609  0.178397  0.102958
SALESFORCE COM INC                                0.445455  1.000000  0.635969  0.455189  0.196053  0.418546  0.043515  0.064999  0.000000  0.000000  ...  0.029326  0.492079  0.300027  0.133831  0.201221  0.201230  0.145089  0.075038  0.277952  0.354856
SPLUNK INC                                        0.610272  0.635969  1.000000  0.665648  0.274023  0.373142  0.019112  0.073553  0.000000  0.000000  ...  0.018032  0.569939  0.330028  0.116923  0.109538  0.142041  0.128467  0.136418  0.194072  0.273502
OKTA, INC.                                        0.620961  0.455189  0.665648  1.000000  0.195672  0.399874  0.013240  0.093942  0.000000  0.000000  ...  0.013905  0.579884  0.541775  0.163709  0.109948  0.144051  0.170361  0.111937  0.163588  0.074624
VEEVA SYSTEMS INC                                 0.500762  0.196053  0.274023  0.195672  1.000000  0.079927  0.074096  0.030179  0.075713  0.075713  ...  0.424046  0.280852  0.153335  0.083683  0.128762  0.211695  0.060273  0.501041  0.332207  0.064207
...
AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.  0.135839  0.201230  0.142041  0.144051  0.211695  0.106627  0.027594  0.048087  0.000000  0.000000  ...  0.284525  0.114080  0.075274  0.048741  0.578793  1.000000  0.039971  0.136184  0.471651  0.042298
CYCLACEL PHARMACEUTICALS, INC.                    0.144678  0.145089  0.128467  0.170361  0.060273  0.094262  0.010770  0.025407  0.000000  0.000000  ...  0.015318  0.193458  0.462759  0.683597  0.047288  0.039971  1.000000  0.035694  0.080139  0.013121
ZOETIS INC.                                       0.189609  0.075038  0.136418  0.111937  0.501041  0.069267  0.039015  0.022235  0.065917  0.065917  ...  0.159082  0.327556  0.148224  0.051060  0.163391  0.136184  0.035694  1.000000  0.207232  0.031911
STAG INDUSTRIAL, INC.                             0.178397  0.277952  0.194072  0.163588  0.332207  0.169739  0.044467  0.057905  0.000000  0.000000  ...  0.424106  0.242169  0.179394  0.068313  0.407758  0.471651  0.080139  0.207232  1.000000  0.038365
EQUINIX INC                                       0.102958  0.354856  0.273502  0.074624  0.064207  0.060531  0.002205  0.013749  0.000000  0.000000  ...  0.018944  0.068787  0.035503  0.011838  0.043938  0.042298  0.013121  0.031911  0.038365  1.000000

675 rows × 675 columns
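As a consistency check (not part of the original pipeline), we can recompute a single entry of this matrix directly from the n-gram counts; the two labels below are taken from the index shown above and assumed to be unique:

# Recompute the MONGODB / SPLUNK entry by hand from the count vectors
a = wordsCount.loc["MONGODB, INC."].to_numpy(dtype=float)
b = wordsCount.loc["SPLUNK INC"].to_numpy(dtype=float)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# ~0.610272, matching cosine_sim.loc["MONGODB, INC.", "SPLUNK INC"]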

Performance Evaluation

Predictions Based on the Closest Cosine Similarity Distance

We predict each company's industry as the SIC class of its nearest neighbor in cosine similarity distance, and use these predictions to evaluate how well the 2-to-4-gram embeddings recover the SIC classification.
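std_func below is this project's shared helper module. As a rough sketch, the nearest-neighbour rule it is assumed to implement looks like this (the helper's actual code may differ):

# Hypothetical sketch of the nearest-neighbour prediction rule
sim = cosine_sim.to_numpy().copy()
np.fill_diagonal(sim, -1.0)            # a company cannot be its own neighbour
nearest = sim.argmax(axis=1)           # most similar other company, per row
y_true = df["SIC_desc"].to_numpy()
y_pred = y_true[nearest]               # inherit the nearest neighbour's industry
print("accuracy:", (y_true == y_pred).mean())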

# std_func: the project's shared helper module (assumed imported earlier)
prediction, accuracy, cm = std_func.get_accuracy(cosine_sim, df)
cosine_sim_conf = std_func.conf_mat_cosine(cosine_sim, df)
cosine_sim_conf
                                                y_true                                             y_pred
0    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
1    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
2    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
3    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
4    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
...                                                ...                                                ...
670                     Real Estate Investment Trusts                      Real Estate Investment Trusts
671                        Pharmaceutical Preparations                        Pharmaceutical Preparations
672                        Pharmaceutical Preparations  Prepackaged Software (mass reproduction of sof...
673                     Real Estate Investment Trusts                      Real Estate Investment Trusts
674                     Real Estate Investment Trusts                      Real Estate Investment Trusts

675 rows × 2 columns

[Figure: confusion matrix of true vs. predicted SIC industries (1_Cosine_Similarity_Distances_16_1.png)]

We can see from the confusion matrix above that cosine similarity analysis on 2-to-4-gram embeddings gives an accuracy of 89% averaged across industries. For Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks (commercial banking), the accuracy is above 90%; Pharmaceutical Preparations has the lowest accuracy at 76%.

Plotting

Plotting on the Cosine Similarity Matrix

We use PCA to perform the dimensionality reduction. First, we make a 2-D plot of the cosine similarity matrix.

plot_cos = std_func.pca_visualize_2d(cosine_sim, df.loc[:,["name","SIC_desc"]])
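Since std_func.pca_visualize_2d is a project helper, here is a minimal sketch of the idea, assuming it wraps sklearn's PCA (the helper's actual plotting code may differ):

# Hypothetical sketch, not the helper's exact implementation:
# project the similarity matrix onto its principal components and
# scatter the first two, coloured by SIC industry.
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
coords = pca.fit_transform(cosine_sim)

fig, ax = plt.subplots(figsize=(8, 6))
for industry in df["SIC_desc"].unique():
    mask = (df["SIC_desc"] == industry).to_numpy()
    ax.scatter(coords[mask, 0], coords[mask, 1], s=10, label=industry)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(fontsize=7)
plt.show()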

Next, we have a 3-D plot of the first three principal components, which capture the most variance.

std_func.pca_visualize_3d(plot_cos)

We can see from the 3-D plot above that three of the industries form well-separated clusters, especially State Commercial Banks. The Prepackaged Software industry, however, sits close to the other clusters.

We can look at the explained variance of each dimension of the PCA embedding of our cosine similarity matrix below:

plot_cos[0].explained_variance_ratio_
array([0.43705121, 0.21549028, 0.13752174, 0.05257744, 0.03654605,
       0.01467293, 0.00914707, 0.00835553, 0.0072873 , 0.00633825])

The total explained variance of the first three dimensions is:

plot_cos[0].explained_variance_ratio_[0:3].sum()
0.7900632276689792

The first three dimensions explain about 79% of the total variance in the data.

Conclusion Reporting

from sklearn.metrics import classification_report

# Pass labels explicitly so each report row lines up with its target name
print(classification_report(prediction["y_true"], prediction["y_pred"],
                            labels=df["SIC_desc"].unique(),
                            target_names=df["SIC_desc"].unique()))
                                                      precision    recall  f1-score   support

Prepackaged Software (mass reproduction of software)       0.88      0.85      0.87        80
                     Crude Petroleum and Natural Gas       0.91      0.93      0.92       208
                         Pharmaceutical Preparations       0.77      0.76      0.77        80
                       Real Estate Investment Trusts       0.94      0.95      0.94       191
         State Commercial Banks (commercial banking)       0.96      0.94      0.95       116

                                            accuracy                           0.91       675
                                           macro avg       0.89      0.89      0.89       675
                                        weighted avg       0.91      0.91      0.91       675

From the classification report above, we can conclude that cosine similarity analysis on 2-to-4-gram embeddings gives a good result on SIC classification, particularly for Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks (commercial banking).