Universal Sentence Encoder
The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. It is a pre-trained model created by Google, trained on sources such as Wikipedia, web news, web question-answer pages, and discussion forums. The model comes in two variants: one built on a Transformer encoder and the other on a Deep Averaging Network (DAN). The Transformer variant is more computationally intensive but more accurate, while the DAN variant trades some accuracy for lower computational cost. In our work, the DAN model has produced results with high accuracy, so we do not require the Transformer alternative. The input is variable-length English text and the output is a normalised 512-dimensional vector.
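As a rough illustration of how such embeddings are used, the snippet below computes semantic similarity as a dot product of unit-length 512-dimensional vectors. The vectors here are random stand-ins, not real USE output; in practice the model itself is loaded from TF Hub, as sketched in the comments.

```python
import numpy as np

# In practice the model would be loaded from TF Hub, e.g.:
#   import tensorflow_hub as hub
#   embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#   vectors = embed(["some text", "other text"]).numpy()
# Here we use random unit vectors as stand-ins for the 512-dim output.

rng = np.random.default_rng(0)

def unit(v):
    """L2-normalise a vector, matching USE's normalised output."""
    return v / np.linalg.norm(v)

a = unit(rng.normal(size=512))
b = unit(rng.normal(size=512))

# Because both vectors have unit length, the dot product equals
# their cosine similarity and lies in [-1, 1].
similarity = float(a @ b)
print(round(similarity, 4))
```

For unit vectors the dot product and cosine similarity coincide, which is why normalised output makes downstream similarity computations cheap.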
import pandas as pd

# Load the pre-computed USE embeddings (one 512-dimensional row per company)
embeddings = pd.read_csv('embeddings.csv', index_col=0)
embeddings.head()
name | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
MONGODB, INC. | -0.048347 | -0.048348 | -0.047972 | 0.048348 | -0.048074 | 0.024746 | -0.040604 | -0.048346 | 0.048348 | -0.048348 | ... | -0.048325 | -0.048348 | -0.048348 | 0.021898 | 0.047686 | -0.047848 | 0.046755 | 0.048289 | -0.048348 | -0.041822
SALESFORCE COM INC | -0.048440 | -0.048481 | -0.011648 | 0.048481 | -0.048470 | 0.020522 | 0.000173 | -0.048477 | 0.048317 | -0.048481 | ... | -0.047321 | -0.048481 | -0.048481 | -0.016805 | 0.048388 | -0.032797 | 0.046606 | 0.017406 | -0.048481 | -0.040719
SPLUNK INC | -0.047489 | -0.047792 | -0.047772 | 0.047791 | -0.047792 | -0.047775 | -0.047787 | -0.047636 | 0.047789 | -0.047758 | ... | -0.046150 | -0.047792 | -0.047792 | -0.047715 | 0.047761 | -0.047766 | 0.047790 | 0.047792 | -0.047792 | -0.047661
OKTA, INC. | -0.048333 | -0.048679 | -0.026585 | 0.048682 | -0.048568 | 0.048673 | -0.010152 | -0.046081 | 0.048555 | -0.048633 | ... | -0.048677 | -0.048682 | -0.048682 | -0.001725 | 0.048465 | -0.046822 | 0.048279 | 0.048415 | -0.048682 | -0.047951
VEEVA SYSTEMS INC | -0.045855 | -0.045855 | -0.045855 | 0.045855 | -0.045854 | 0.045855 | -0.045855 | -0.045855 | 0.045855 | -0.045855 | ... | -0.045855 | -0.045855 | -0.045855 | -0.040072 | -0.045855 | -0.044339 | 0.045855 | 0.045855 | -0.045855 | -0.045141
5 rows × 512 columns
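Since USE outputs normalised vectors, every row of the embeddings frame should have (approximately) unit L2 norm. A quick sanity check, sketched here with synthetic data standing in for `embeddings.csv`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 random 512-dim rows, normalised like USE output.
rng = np.random.default_rng(1)
data = rng.normal(size=(5, 512))
data /= np.linalg.norm(data, axis=1, keepdims=True)
embeddings = pd.DataFrame(data, index=["A", "B", "C", "D", "E"])

# Every row norm should be ~1.0 for a normalised embedding matrix.
norms = np.linalg.norm(embeddings.to_numpy(), axis=1)
print(norms.round(6))
```

Running the same check on the real `embeddings.csv` confirms whether the stored vectors kept their normalisation through the CSV round trip.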
Plotting
# Project the embeddings with PCA and visualise them in 2-D and 3-D
plot_d2v = std_func.pca_visualize_2d(embeddings, df.loc[:, ["name", "SIC_desc"]])
std_func.pca_visualize_3d(plot_d2v)
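`std_func` is a project-specific helper module; a minimal equivalent of the 2-D projection step using scikit-learn's `PCA` might look like the following (random data stands in for the real 512-dimensional embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real embedding matrix: 100 companies x 512 dims.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 512))

# Reduce to 2 principal components for plotting, roughly what
# a helper like pca_visualize_2d would do before drawing the scatter.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(coords.shape)  # (100, 2)
```

The resulting 2-D coordinates can then be scattered and coloured by `SIC_desc`, e.g. with matplotlib, to see how well the industries separate.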
Performance Evaluation
# Classify each company by dot-product similarity and collect the
# predictions, overall accuracy, and confusion matrix
dot_product_df, accuracy, cm = std_func.dot_product(embeddings, df)
from sklearn.metrics import classification_report
print(classification_report(dot_product_df["y_true"], dot_product_df["y_pred"], target_names=df["SIC_desc"].unique()))
precision recall f1-score support
Prepackaged Software (mass reproduction of software) 0.90 0.83 0.87 78
Crude Petroleum and Natural Gas 0.95 0.96 0.95 180
Pharmaceutical Preparations 0.91 0.90 0.90 67
Real Estate Investment Trusts 0.94 0.96 0.95 185
State Commercial Banks (commercial banking) 0.95 0.96 0.96 108
accuracy 0.94 618
macro avg 0.93 0.92 0.93 618
weighted avg 0.94 0.94 0.94 618
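The exact logic inside `std_func.dot_product` is project-specific, but one plausible reading of a dot-product classifier is nearest-centroid: average each industry's embeddings into a centroid, then assign every company to the industry whose centroid yields the largest dot product (for unit vectors, the highest cosine similarity). A toy sketch under that assumption:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)

# Toy data: 3 industries with distinct random "directions", 20 firms each.
centers = rng.normal(size=(3, 512))
y_true = np.repeat([0, 1, 2], 20)
X = centers[y_true] + 0.1 * rng.normal(size=(60, 512))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# One centroid per industry, normalised so the dot product is cosine similarity.
centroids = np.stack([X[y_true == k].mean(axis=0) for k in range(3)])
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Predict the industry whose centroid gives the largest dot product.
y_pred = (X @ centroids.T).argmax(axis=1)
cm = confusion_matrix(y_true, y_pred)
accuracy = float((y_true == y_pred).mean())
print(accuracy)
```

On these well-separated toy clusters the classifier is near-perfect; on the real data, the off-diagonal entries of `cm` show which industry pairs the embeddings confuse.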
From the confusion matrix and the classification report, we can conclude that the Universal Sentence Encoder model does a good job of classifying companies into their industry categories. More specifically, the model performs best on companies in the Crude Petroleum and Natural Gas, Real Estate Investment Trust, and Commercial Banking industries.