N-gram Embeddings - Cosine Similarity Analysis

Next, we use cosine similarity distances to measure how similar company descriptions are to one another. In this notebook, we simply use n-gram count embeddings for the cosine similarity analysis.

Cosine similarity measures the similarity between two vectors of an inner product space. In text analysis, a document can be represented by its elements (words) and the frequency of each element. In our case each document is a company's business description: every description becomes a vector of term frequencies, and comparing these vectors yields a cosine similarity score between documents that measures how similar two companies are in terms of their business description.
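Formally, for two term-frequency vectors $A$ and $B$, the cosine similarity is the cosine of the angle between them:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \, \sqrt{\sum_{i} B_i^2}}$$

Since term frequencies are non-negative, the score falls between 0 (no shared terms) and 1 (identical relative term frequencies).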

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the preprocessed filings: company descriptions (stopwords removed)
# plus the SIC industry labels used for evaluation.
df = pd.read_csv('../data/preprocessed.csv',
                 usecols = ['reportingDate', 'name',
                            'coDescription_stopwords', 'SIC', 'SIC_desc'])
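As a quick check, the loaded frame should contain one row per company and the five selected columns:

df.shape  # expect (675, 5): 675 company descriptions, 5 retained columns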

Cosine Similarity Analysis

Word Counting

For this cosine similarity analysis, we treat each sequence of 2 to 4 words (an n-gram) as one term and keep only the 600 most frequent terms.

from sklearn.feature_extraction.text import CountVectorizer

# Count 2- to 4-grams, keeping the 600 most frequent terms
vectorizer = CountVectorizer(ngram_range=(2, 4),
                             max_features=600)

count_data = vectorizer.fit_transform(df['coDescription_stopwords'])
wordsCount = pd.DataFrame(count_data.toarray(),
                          columns=vectorizer.get_feature_names_out())
wordsCount = wordsCount.set_index(df['name'])
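As a quick sanity check (not part of the original pipeline), we can list the most frequent n-grams across the corpus:

# Ten most frequent 2-to-4-grams summed over all company descriptions
wordsCount.sum(axis=0).sort_values(ascending=False).head(10)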

Here is the n-gram embedding matrix, with the 600 2-to-4-grams as columns and the 675 companies as rows.

wordsCount
Columns (first and last 10 of the 600 n-grams shown): ability make; accounting standard; acquire property; act; act act amended; additional information; adequately capitalized; adverse effect; adverse effect business; adverse event; ...; wa million; weighted average; well capitalized; wholly owned; wholly owned subsidiary; wide range; within day; working interest; year ended; year ended december

name
MONGODB, INC.                                      0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   3   0   0   5   0
SALESFORCE COM INC                                 0  0  0  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   0   0
SPLUNK INC                                         0  0  0  0  1  2  0  0  0  0  ...   0   0   0   0   0   0   0   0   0   0
OKTA, INC.                                         0  0  0  0  0  1  0  0  0  0  ...   0   0   0   0   0   1   0   0   1   0
VEEVA SYSTEMS INC                                  0 12  0  1  4  1  0  7  4  0  ...  18   4   0   0   0   0   1   0 102   0
...
AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.   0  0  1  0  0  1  0  1  0  0  ...   0   0   0   0   0   0   0   0   2   2
CYCLACEL PHARMACEUTICALS, INC.                     0  0  0  0  0  1  0  1  0  1  ...   0   0   0   0   0   0   1   0   0   0
ZOETIS INC.                                        0 17  0  0  0 12  0  3  0  0  ...  20   5   0   1   1   0   2   0  84  83
STAG INDUSTRIAL, INC.                              0  0  1  0  1  0  0  1  1  0  ...   0   0   0   0   0   0   0   0   2   2
EQUINIX INC                                        0  0  0  0  2  0  0  0  0  0  ...   0   0   0   0   0   0   0   0   2   2

675 rows × 600 columns

Cosine Similarity Computation

Now we feed the 2-to-4-gram embeddings into the cosine similarity computation to analyze text similarity.

# Compute pairwise cosine similarity between all description vectors
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = pd.DataFrame(cosine_similarity(wordsCount))
cosine_sim = cosine_sim.set_index(df['name'])
cosine_sim.columns = df['name']

The description similarity between two companies ranges from 0 to 1: the higher the cosine similarity score, the more similar their descriptions are.

cosine_sim
Columns (first and last 10 of the 675 companies shown, in the same order as the rows): MONGODB, INC.; SALESFORCE COM INC; SPLUNK INC; OKTA, INC.; VEEVA SYSTEMS INC; AUTODESK INC; INTERNATIONAL WESTERN PETROLEUM, INC.; DAYBREAK OIL & GAS, INC.; ETERNAL SPEECH, INC.; ETERNAL SPEECH, INC.; ...; OMEGA HEALTHCARE INVESTORS INC; TABLEAU SOFTWARE INC; HORIZON PHARMA PLC; MERRIMACK PHARMACEUTICALS INC; REVEN HOUSING REIT, INC.; AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.; CYCLACEL PHARMACEUTICALS, INC.; ZOETIS INC.; STAG INDUSTRIAL, INC.; EQUINIX INC

name
MONGODB, INC.                                     1.000000  0.445455  0.610272  0.620961  0.500762  0.338268  0.065380  0.052345  0.000000  0.000000  ...  0.050935  0.630465  0.436327  0.143385  0.066598  0.135839  0.144678  0.189609  0.178397  0.102958
SALESFORCE COM INC                                0.445455  1.000000  0.635969  0.455189  0.196053  0.418546  0.043515  0.064999  0.000000  0.000000  ...  0.029326  0.492079  0.300027  0.133831  0.201221  0.201230  0.145089  0.075038  0.277952  0.354856
SPLUNK INC                                        0.610272  0.635969  1.000000  0.665648  0.274023  0.373142  0.019112  0.073553  0.000000  0.000000  ...  0.018032  0.569939  0.330028  0.116923  0.109538  0.142041  0.128467  0.136418  0.194072  0.273502
OKTA, INC.                                        0.620961  0.455189  0.665648  1.000000  0.195672  0.399874  0.013240  0.093942  0.000000  0.000000  ...  0.013905  0.579884  0.541775  0.163709  0.109948  0.144051  0.170361  0.111937  0.163588  0.074624
VEEVA SYSTEMS INC                                 0.500762  0.196053  0.274023  0.195672  1.000000  0.079927  0.074096  0.030179  0.075713  0.075713  ...  0.424046  0.280852  0.153335  0.083683  0.128762  0.211695  0.060273  0.501041  0.332207  0.064207
...
AMERICAN REALTY CAPITAL NEW YORK CITY REIT, INC.  0.135839  0.201230  0.142041  0.144051  0.211695  0.106627  0.027594  0.048087  0.000000  0.000000  ...  0.284525  0.114080  0.075274  0.048741  0.578793  1.000000  0.039971  0.136184  0.471651  0.042298
CYCLACEL PHARMACEUTICALS, INC.                    0.144678  0.145089  0.128467  0.170361  0.060273  0.094262  0.010770  0.025407  0.000000  0.000000  ...  0.015318  0.193458  0.462759  0.683597  0.047288  0.039971  1.000000  0.035694  0.080139  0.013121
ZOETIS INC.                                       0.189609  0.075038  0.136418  0.111937  0.501041  0.069267  0.039015  0.022235  0.065917  0.065917  ...  0.159082  0.327556  0.148224  0.051060  0.163391  0.136184  0.035694  1.000000  0.207232  0.031911
STAG INDUSTRIAL, INC.                             0.178397  0.277952  0.194072  0.163588  0.332207  0.169739  0.044467  0.057905  0.000000  0.000000  ...  0.424106  0.242169  0.179394  0.068313  0.407758  0.471651  0.080139  0.207232  1.000000  0.038365
EQUINIX INC                                       0.102958  0.354856  0.273502  0.074624  0.064207  0.060531  0.002205  0.013749  0.000000  0.000000  ...  0.018944  0.068787  0.035503  0.011838  0.043938  0.042298  0.013121  0.031911  0.038365  1.000000

675 rows × 675 columns
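As a consistency check (not part of the original pipeline), we can recompute a single entry of this matrix directly from the n-gram counts; the two labels below are taken from the index shown above and assumed to be unique:

# Recompute the MONGODB / SPLUNK entry by hand from the count vectors
a = wordsCount.loc["MONGODB, INC."].to_numpy(dtype=float)
b = wordsCount.loc["SPLUNK INC"].to_numpy(dtype=float)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# ~0.610272, matching cosine_sim.loc["MONGODB, INC.", "SPLUNK INC"]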

Performance Evaluation

Predictions Based on the Closest Cosine Similarity Distance

We predict each company's industry as the SIC class of its nearest neighbor in cosine similarity distance, and use these predictions to evaluate how well the 2-to-4-gram embeddings recover the SIC classification.
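std_func below is this project's shared helper module. As a rough sketch, the nearest-neighbour rule it is assumed to implement looks like this (the helper's actual code may differ):

# Hypothetical sketch of the nearest-neighbour prediction rule
sim = cosine_sim.to_numpy().copy()
np.fill_diagonal(sim, -1.0)            # a company cannot be its own neighbour
nearest = sim.argmax(axis=1)           # most similar other company, per row
y_true = df["SIC_desc"].to_numpy()
y_pred = y_true[nearest]               # inherit the nearest neighbour's industry
print("accuracy:", (y_true == y_pred).mean())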

# std_func: the project's shared helper module (assumed imported earlier)
prediction, accuracy, cm = std_func.get_accuracy(cosine_sim, df)
cosine_sim_conf = std_func.conf_mat_cosine(cosine_sim, df)
cosine_sim_conf
                                                y_true                                             y_pred
0    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
1    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
2    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
3    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
4    Prepackaged Software (mass reproduction of sof...  Prepackaged Software (mass reproduction of sof...
...                                                ...                                                ...
670                     Real Estate Investment Trusts                      Real Estate Investment Trusts
671                        Pharmaceutical Preparations                        Pharmaceutical Preparations
672                        Pharmaceutical Preparations  Prepackaged Software (mass reproduction of sof...
673                     Real Estate Investment Trusts                      Real Estate Investment Trusts
674                     Real Estate Investment Trusts                      Real Estate Investment Trusts

675 rows × 2 columns

[Figure: confusion matrix of true vs. predicted SIC industries (1_Cosine_Similarity_Distances_16_1.png)]

We can see from the confusion matrix above that cosine similarity analysis on 2-to-4-gram embeddings gives an accuracy of 89% averaged across industries. For Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks (commercial banking), the accuracy is above 90%; Pharmaceutical Preparations has the lowest accuracy at 76%.

Plotting

Plotting on the Cosine Similarity Matrix

We use PCA to perform the dimensionality reduction. First, we make a 2-D plot of the cosine similarity matrix.

plot_cos = std_func.pca_visualize_2d(cosine_sim, df.loc[:,["name","SIC_desc"]])
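Since std_func.pca_visualize_2d is a project helper, here is a minimal sketch of the idea, assuming it wraps sklearn's PCA (the helper's actual plotting code may differ):

# Hypothetical sketch, not the helper's exact implementation:
# project the similarity matrix onto its principal components and
# scatter the first two, coloured by SIC industry.
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
coords = pca.fit_transform(cosine_sim)

fig, ax = plt.subplots(figsize=(8, 6))
for industry in df["SIC_desc"].unique():
    mask = (df["SIC_desc"] == industry).to_numpy()
    ax.scatter(coords[mask, 0], coords[mask, 1], s=10, label=industry)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(fontsize=7)
plt.show()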

Next, we have a 3-D plot of the first three principal components, which capture the most variance.

std_func.pca_visualize_3d(plot_cos)

We can see from the 3-D plot above that three of the industries form well-separated clusters, especially State Commercial Banks. The Prepackaged Software industry, however, sits close to the other clusters.

We can look at the explained variance of each dimension of the PCA embedding of our cosine similarity matrix below:

plot_cos[0].explained_variance_ratio_
array([0.43705121, 0.21549028, 0.13752174, 0.05257744, 0.03654605,
       0.01467293, 0.00914707, 0.00835553, 0.0072873 , 0.00633825])

The total explained variance of the first three dimensions is:

plot_cos[0].explained_variance_ratio_[0:3].sum()
0.7900632276689792

The first three dimensions explain about 79% of the total variance in the data.

Conclusion Reporting

from sklearn.metrics import classification_report

# Pass labels explicitly so each report row lines up with its target name
print(classification_report(prediction["y_true"], prediction["y_pred"],
                            labels=df["SIC_desc"].unique(),
                            target_names=df["SIC_desc"].unique()))
                                                      precision    recall  f1-score   support

Prepackaged Software (mass reproduction of software)       0.88      0.85      0.87        80
                     Crude Petroleum and Natural Gas       0.91      0.93      0.92       208
                         Pharmaceutical Preparations       0.77      0.76      0.77        80
                       Real Estate Investment Trusts       0.94      0.95      0.94       191
         State Commercial Banks (commercial banking)       0.96      0.94      0.95       116

                                            accuracy                           0.91       675
                                           macro avg       0.89      0.89      0.89       675
                                        weighted avg       0.91      0.91      0.91       675

From the classification report above, we can conclude that cosine similarity analysis on 2-to-4-gram embeddings gives a good result on SIC classification, particularly for Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks (commercial banking).