Document Embedding Results
The table below summarizes the recall achieved by each technique we explored for creating document embeddings, broken down by industry.
Industry (Recall/Sensitivity) | TF-IDF | N-grams - Cosine Similarity | POS Tagging - Cosine Similarity | Word2Vec | Doc2Vec | TwoTowers | Universal Sentence Encoder |
---|---|---|---|---|---|---|---|
Prepackaged Software | 0.89 | 0.85 | 0.86 | 0.76 | 0.88 | 0.67 | 0.83 |
Crude Petroleum and Natural Gas | 0.94 | 0.93 | 0.96 | 0.95 | 0.97 | 0.56 | 0.96 |
Pharmaceutical Preparations | 0.75 | 0.76 | 0.91 | 0.46 | 0.86 | 0.60 | 0.90 |
Real Estate Investment Trusts | 0.94 | 0.95 | 0.97 | 0.85 | 0.91 | 0.68 | 0.96 |
State Commercial Banks | 0.95 | 0.94 | 0.97 | 0.82 | 0.94 | 0.65 | 0.96 |
Weighted Average | 0.91 | 0.91 | 0.95 | 0.82 | 0.92 | 0.63 | 0.94 |
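For readers unfamiliar with the metric, the following is a minimal sketch of how per-industry recall and a weighted-average recall can be computed. It assumes the weighted average weights each industry's recall by its number of true samples (the convention used by scikit-learn's `recall_score(average="weighted")`); the source does not state which weighting was used. The function names and the toy labels are hypothetical and are not taken from our data.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / true samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}

def weighted_average_recall(y_true, y_pred):
    """Average of per-class recalls, weighted by each class's true-sample count (assumed weighting)."""
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[label] * support[label] for label in support) / total

# Hypothetical toy labels: software has 2 of 3 correct, banks has 2 of 2 correct.
y_true = ["software", "software", "software", "banks", "banks"]
y_pred = ["software", "software", "banks", "banks", "banks"]

recalls = per_class_recall(y_true, y_pred)
weighted = weighted_average_recall(y_true, y_pred)
print(recalls)
print(weighted)
```

Under this weighting, the weighted-average recall equals the overall fraction of correctly classified samples, so industries with more companies influence the average more.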
Conclusion
Most of the embedding techniques achieve high recall on industry classification. We reached a weighted-average recall of up to 95% (POS Tagging - Cosine Similarity) for predicting similar companies when matched with their categories.
Four of our seven models have their lowest recall on the Pharmaceutical Preparations industry. Potential reasons: the business descriptions use words too similar to those used in other industries, or the companies are not well classified.
Almost every model is better at classifying companies in the Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks industries.
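The per-industry observations can be checked directly against the results table. The sketch below copies the recall values from the table and tallies, for each model, the industry with the lowest recall; the variable names are illustrative only.

```python
from collections import Counter

industries = [
    "Prepackaged Software",
    "Crude Petroleum and Natural Gas",
    "Pharmaceutical Preparations",
    "Real Estate Investment Trusts",
    "State Commercial Banks",
]

# Recall per model, in the same industry order as above (values from the results table).
recall_by_model = {
    "TF-IDF": [0.89, 0.94, 0.75, 0.94, 0.95],
    "N-grams - Cosine Similarity": [0.85, 0.93, 0.76, 0.95, 0.94],
    "POS Tagging - Cosine Similarity": [0.86, 0.96, 0.91, 0.97, 0.97],
    "Word2Vec": [0.76, 0.95, 0.46, 0.85, 0.82],
    "Doc2Vec": [0.88, 0.97, 0.86, 0.91, 0.94],
    "TwoTowers": [0.67, 0.56, 0.60, 0.68, 0.65],
    "Universal Sentence Encoder": [0.83, 0.96, 0.90, 0.96, 0.96],
}

# For each model, find the industry whose recall is the minimum.
lowest_industry = {
    model: industries[scores.index(min(scores))]
    for model, scores in recall_by_model.items()
}

# Count how many models share each lowest-recall industry.
lowest_counts = Counter(lowest_industry.values())
for industry, count in lowest_counts.most_common():
    print(f"{industry}: lowest recall for {count} model(s)")
```

The tally shows Pharmaceutical Preparations as the weakest industry for four models (TF-IDF, N-grams, Word2Vec, and Doc2Vec), with Prepackaged Software weakest for two and Crude Petroleum and Natural Gas weakest only for TwoTowers.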