Document Embedding Results

The table below summarizes the recall (sensitivity) achieved on each industry by every technique we explored to create document embeddings.

Embedding Technique                       TF-IDF  N-grams - Cosine Similarity  POS Tagging - Cosine Similarity  Word2Vec  Doc2Vec  TwoTowers  Universal Sentence Encoder
Recall/Sensitivity of Industry
Prepackaged Software (Recall)             0.89    0.85                         0.86                             0.76      0.88     0.67       0.83
Crude Petroleum and Natural Gas (Recall)  0.94    0.93                         0.96                             0.95      0.97     0.56       0.96
Pharmaceutical Preparations (Recall)      0.75    0.76                         0.91                             0.46      0.86     0.60       0.90
Real Estate Investment Trusts (Recall)    0.94    0.95                         0.97                             0.85      0.91     0.68       0.96
State Commercial Banks (Recall)           0.95    0.94                         0.97                             0.82      0.94     0.65       0.96
Weighted Average (Recall)                 0.91    0.91                         0.95                             0.82      0.92     0.63       0.94
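
For readers unfamiliar with the metric, the following is a minimal sketch of how per-industry recall and a weighted average can be computed. It assumes the "Weighted Average" row weights each industry's recall by the number of companies truly in that industry (the same convention as scikit-learn's average="weighted"); the report does not state the weighting explicitly. The function names and the toy labels are hypothetical and are not the project's data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / items truly in that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def weighted_recall(y_true, y_pred):
    """Average of per-class recall, weighted by each class's support (assumed convention)."""
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[label] * support[label] for label in support) / total


# Toy labels for illustration only (hypothetical, not taken from the report).
y_true = ["software", "software", "software", "banks", "banks"]
y_pred = ["software", "software", "banks", "banks", "banks"]

recalls = per_class_recall(y_true, y_pred)
overall = weighted_recall(y_true, y_pred)
print(recalls, overall)
```

Under this weighting, the weighted average recall equals the overall fraction of companies assigned to their true industry, so industries with more companies influence the figure more.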

Conclusion

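  • Most of the embedding techniques achieve high recall on industry classification: six of the seven models reach a weighted-average recall of at least 0.82. The best model (POS Tagging - Cosine Similarity) reaches 95% recall when its predicted similar companies are matched against their industry categories.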

  • Four of our seven models (TF-IDF, N-grams, Word2Vec, and Doc2Vec) have their lowest recall on the Pharmaceutical Preparations industry. Potential reasons: the business descriptions use words too similar to those used in other industries, or the companies are not well classified.

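  • Almost every model is better at classifying companies in the Crude Petroleum and Natural Gas, Real Estate Investment Trusts, and State Commercial Banks industries than in the other two. The exception is TwoTowers, whose lowest recall (0.56) is on Crude Petroleum and Natural Gas.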
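
The per-industry observations above can be checked directly against the table. The sketch below copies the recall values from the table into a dictionary (the variable names are our own, chosen for illustration) and finds the industry on which each model has its lowest recall.

```python
# Column order of each list below, matching the table rows.
industries = [
    "Prepackaged Software",
    "Crude Petroleum and Natural Gas",
    "Pharmaceutical Preparations",
    "Real Estate Investment Trusts",
    "State Commercial Banks",
]

# Recall per model, copied from the results table.
recall = {
    "TF-IDF": [0.89, 0.94, 0.75, 0.94, 0.95],
    "N-grams - Cosine Similarity": [0.85, 0.93, 0.76, 0.95, 0.94],
    "POS Tagging - Cosine Similarity": [0.86, 0.96, 0.91, 0.97, 0.97],
    "Word2Vec": [0.76, 0.95, 0.46, 0.85, 0.82],
    "Doc2Vec": [0.88, 0.97, 0.86, 0.91, 0.94],
    "TwoTowers": [0.67, 0.56, 0.60, 0.68, 0.65],
    "Universal Sentence Encoder": [0.83, 0.96, 0.90, 0.96, 0.96],
}

# Industry with the lowest recall for each model.
worst = {
    model: industries[values.index(min(values))]
    for model, values in recall.items()
}

# Models whose weakest industry is Pharmaceutical Preparations.
pharma_worst = [
    model for model, industry in worst.items()
    if industry == "Pharmaceutical Preparations"
]

for model, industry in worst.items():
    print(f"{model}: lowest recall on {industry}")
print(f"{len(pharma_worst)} of {len(recall)} models are weakest on Pharmaceutical Preparations")
```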