## Document Embedding Results 

The table below summarizes the recall, broken down by industry, of every technique we explored for creating document embeddings.


In [1]:
import pandas as pd
import seaborn as sns

# Industries; every value in the table is a recall score.
cols = ["Prepackaged Software (Recall)", "Crude Petroleum and Natural Gas (Recall)",
        "Pharmaceutical Preparations (Recall)", "Real Estate Investment Trusts (Recall)",
        "State Commercial Banks (Recall)", "Weighted Average (Recall)"]

# One row per embedding technique, in the same order as `methods`.
data = [[0.89, 0.94, 0.75, 0.94, 0.95, 0.91],
        [0.85, 0.93, 0.76, 0.95, 0.94, 0.91],
        [0.86, 0.96, 0.91, 0.97, 0.97, 0.95],
        [0.76, 0.95, 0.46, 0.85, 0.82, 0.82],
        [0.88, 0.97, 0.86, 0.91, 0.94, 0.92],
        [0.67, 0.56, 0.60, 0.68, 0.65, 0.63],
        [0.83, 0.96, 0.90, 0.96, 0.96, 0.94]]
methods = ["TF-IDF", "N-grams - Cosine Similarity", "POS Tagging - Cosine Similarity", "Word2Vec",
           "Doc2Vec", "TwoTowers", "Universal Sentence Encoder"]

# Build the table with techniques as rows, then transpose so industries become the rows.
df = (
    pd.DataFrame(data, index=methods, columns=cols)
    .T.round(2)
    .rename_axis("Recall/Sensitivity of Industry")
    .rename_axis("Embedding Technique", axis="columns")
)

# Light teal colormap: darker cells mark higher recall.
cm = sns.light_palette("#5CCDC6", n_colors=35, as_cmap=True)

df.style.background_gradient(cmap=cm)

| Recall/Sensitivity of Industry | TF-IDF | N-grams - Cosine Similarity | POS Tagging - Cosine Similarity | Word2Vec | Doc2Vec | TwoTowers | Universal Sentence Encoder |
|--------------------|------|------|------|------|------|------|------|
| Prepackaged Software (Recall) | 0.89 | 0.85 | 0.86 | 0.76 | 0.88 | 0.67 | 0.83 |
| Crude Petroleum and Natural Gas (Recall) | 0.94 | 0.93 | 0.96 | 0.95 | 0.97 | 0.56 | 0.96 |
| Pharmaceutical Preparations (Recall) | 0.75 | 0.76 | 0.91 | 0.46 | 0.86 | 0.60 | 0.90 |
| Real Estate Investment Trusts (Recall) | 0.94 | 0.95 | 0.97 | 0.85 | 0.91 | 0.68 | 0.96 |
| State Commercial Banks (Recall) | 0.95 | 0.94 | 0.97 | 0.82 | 0.94 | 0.65 | 0.96 |
| Weighted Average (Recall) | 0.91 | 0.91 | 0.95 | 0.82 | 0.92 | 0.63 | 0.94 |
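For readers unfamiliar with the metric, the sketch below shows one common way to compute per-industry recall and a weighted average from true and predicted labels. It assumes the weighted average is weighted by class support (the number of true examples per industry), as in scikit-learn's `average="weighted"` option; the notebook does not state how the figures above were produced. The labels and the helper names `per_class_recall` and `weighted_recall` are hypothetical and exist only for illustration.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return ({label: recall}, {label: support}) for the given labels."""
    support = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    recalls = {label: hits[label] / n for label, n in support.items()}
    return recalls, support


def weighted_recall(y_true, y_pred):
    """Average the per-class recalls, weighting each class by its support."""
    recalls, support = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[c] * support[c] / total for c in recalls)


# Hypothetical toy labels, NOT the project's data.
y_true = ["software", "software", "software", "banks", "banks", "pharma"]
y_pred = ["software", "software", "banks", "banks", "banks", "software"]

recalls, support = per_class_recall(y_true, y_pred)
overall = weighted_recall(y_true, y_pred)
print(recalls)
print(round(overall, 2))
```

Under this weighting, the weighted-average recall equals the overall fraction of correctly classified examples, so large industries influence it more than small ones.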


## Conclusion

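- Most of the embedding techniques achieve high recall for industry classification: six of the seven reach a weighted-average recall of 0.82 or higher. The best model, POS Tagging with cosine similarity, reaches `95% recall` when predicted similar companies are matched against their industry categories.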


- Of our seven models, four (TF-IDF, N-grams, Word2Vec and Doc2Vec) have their `lowest recall` on the `Pharmaceutical Preparations` industry. *(Potential reasons: business descriptions use words too similar to those used in other industries, or the companies are not well classified.)*


- Every model except TwoTowers is `better` at classifying companies in the `Crude Petroleum and Natural Gas`, `Real Estate Investment Trusts` and `State Commercial Banks` industries than in `Prepackaged Software` and `Pharmaceutical Preparations`.
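The observations above can be checked directly against the table. The snippet below is a minimal sketch in plain Python that copies the per-industry recall values from the results table (omitting the weighted-average row) and finds, for each technique, its weakest industry and whether the three "easier" industries all beat the other two. The names `weakest`, `pharma_weakest` and `easy_beats_hard` are introduced here for illustration only.

```python
# Per-industry recall copied from the results table, in this industry order.
industries = ["Prepackaged Software", "Crude Petroleum and Natural Gas",
              "Pharmaceutical Preparations", "Real Estate Investment Trusts",
              "State Commercial Banks"]
recall = {
    "TF-IDF": [0.89, 0.94, 0.75, 0.94, 0.95],
    "N-grams - Cosine Similarity": [0.85, 0.93, 0.76, 0.95, 0.94],
    "POS Tagging - Cosine Similarity": [0.86, 0.96, 0.91, 0.97, 0.97],
    "Word2Vec": [0.76, 0.95, 0.46, 0.85, 0.82],
    "Doc2Vec": [0.88, 0.97, 0.86, 0.91, 0.94],
    "TwoTowers": [0.67, 0.56, 0.60, 0.68, 0.65],
    "Universal Sentence Encoder": [0.83, 0.96, 0.90, 0.96, 0.96],
}

# Industry with the lowest recall for each technique.
weakest = {m: industries[s.index(min(s))] for m, s in recall.items()}
pharma_weakest = [m for m, ind in weakest.items()
                  if ind == "Pharmaceutical Preparations"]

# The three industries the conclusion singles out as easier to classify.
easy = {"Crude Petroleum and Natural Gas", "Real Estate Investment Trusts",
        "State Commercial Banks"}


def easy_beats_hard(scores):
    """True if every 'easy' industry has higher recall than every other one."""
    easy_scores = [s for i, s in zip(industries, scores) if i in easy]
    hard_scores = [s for i, s in zip(industries, scores) if i not in easy]
    return min(easy_scores) > max(hard_scores)


consistent = [m for m, s in recall.items() if easy_beats_hard(s)]

print("Weakest on Pharmaceutical Preparations:", pharma_weakest)
print("Easy industries beat the rest for:", consistent)
```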