Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. It is a pre-trained model created by Google, trained on sources such as Wikipedia, web news, web question-answer pages, and discussion forums. The model comes in two variants: one built on a Transformer encoder and the other on a Deep Averaging Network (DAN). The Transformer variant is computationally more intensive but gives better results, while the DAN variant trades some accuracy for lower computational cost. In our work, the DAN variant has been accurate enough that we do not need the Transformer alternative. The input is variable-length English text and the output is a normalised 512-dimensional vector.
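Because the encoder's outputs are L2-normalised, the dot product of two embeddings is exactly their cosine similarity. A minimal sketch with NumPy (the 512-dimensional vectors here are random stand-ins, not real encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two encoder outputs: random 512-dimensional vectors,
# L2-normalised just as the Universal Sentence Encoder normalises its embeddings.
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors the dot product *is* the cosine similarity.
dot = a @ b
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(dot, cosine))  # True
```

This is why the dot-product comparison used later in this section is a valid similarity measure for these embeddings.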

embeddings = pd.read_csv('embeddings.csv', index_col=0)
embeddings.head()
0 1 2 3 4 5 6 7 8 9 ... 502 503 504 505 506 507 508 509 510 511
name
MONGODB, INC. -0.048347 -0.048348 -0.047972 0.048348 -0.048074 0.024746 -0.040604 -0.048346 0.048348 -0.048348 ... -0.048325 -0.048348 -0.048348 0.021898 0.047686 -0.047848 0.046755 0.048289 -0.048348 -0.041822
SALESFORCE COM INC -0.048440 -0.048481 -0.011648 0.048481 -0.048470 0.020522 0.000173 -0.048477 0.048317 -0.048481 ... -0.047321 -0.048481 -0.048481 -0.016805 0.048388 -0.032797 0.046606 0.017406 -0.048481 -0.040719
SPLUNK INC -0.047489 -0.047792 -0.047772 0.047791 -0.047792 -0.047775 -0.047787 -0.047636 0.047789 -0.047758 ... -0.046150 -0.047792 -0.047792 -0.047715 0.047761 -0.047766 0.047790 0.047792 -0.047792 -0.047661
OKTA, INC. -0.048333 -0.048679 -0.026585 0.048682 -0.048568 0.048673 -0.010152 -0.046081 0.048555 -0.048633 ... -0.048677 -0.048682 -0.048682 -0.001725 0.048465 -0.046822 0.048279 0.048415 -0.048682 -0.047951
VEEVA SYSTEMS INC -0.045855 -0.045855 -0.045855 0.045855 -0.045854 0.045855 -0.045855 -0.045855 0.045855 -0.045855 ... -0.045855 -0.045855 -0.045855 -0.040072 -0.045855 -0.044339 0.045855 0.045855 -0.045855 -0.045141

5 rows × 512 columns

Plotting

plot_d2v = std_func.pca_visualize_2d(embeddings, df.loc[:,["name","SIC_desc"]])
std_func.pca_visualize_3d(plot_d2v)
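`std_func` is a project-specific helper module; the 2-D projection it computes can be approximated with scikit-learn's PCA. A sketch on random stand-in embeddings (the shapes are assumptions, not the project's actual data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))  # stand-in for 100 company embeddings

# Project the 512-dimensional embeddings onto their first two
# principal components for a 2-D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(coords.shape)  # (100, 2)
```

The same idea extends to three components for the 3-D view.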

Performance Evaluation

dot_product_df, accuracy, cm = std_func.dot_product(embeddings,df)
from sklearn.metrics import classification_report
print(classification_report(dot_product_df["y_true"], dot_product_df["y_pred"], target_names=df["SIC_desc"].unique()))
                                                      precision    recall  f1-score   support

Prepackaged Software (mass reproduction of software)       0.90      0.83      0.87        78
                     Crude Petroleum and Natural Gas       0.95      0.96      0.95       180
                         Pharmaceutical Preparations       0.91      0.90      0.90        67
                       Real Estate Investment Trusts       0.94      0.96      0.95       185
         State Commercial Banks (commercial banking)       0.95      0.96      0.96       108

                                            accuracy                           0.94       618
                                           macro avg       0.93      0.92      0.93       618
                                        weighted avg       0.94      0.94      0.94       618
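`std_func.dot_product` is project-specific, but a common way to build such a classifier is to score each normalised embedding against the mean embedding (centroid) of every industry and predict the highest-scoring one. A hedged sketch on synthetic data (the names and shapes here are illustrative assumptions, not the project's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_samples = 5, 512, 200

# Synthetic stand-ins: one unit "industry centroid" per class,
# plus noisy member embeddings drawn around their centroid.
centroids = rng.normal(size=(n_classes, dim))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

y_true = rng.integers(0, n_classes, size=n_samples)
X = centroids[y_true] + 0.1 * rng.normal(size=(n_samples, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Dot product against each centroid; since all vectors are unit-length,
# this ranks classes by cosine similarity.
scores = X @ centroids.T
y_pred = scores.argmax(axis=1)
accuracy = (y_pred == y_true).mean()
print(accuracy)  # high on this easy synthetic data
```

With real embeddings the classes overlap more, which is why the report above shows per-class precision and recall rather than a single number.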

From the confusion matrix and the classification report, we can conclude that the Universal Sentence Encoder model does a good job of classifying companies into their industry categories. More specifically, this model is best at classifying companies in the Crude Petroleum and Natural Gas, Real Estate Investment Trust, and Commercial Banking industries.
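The confusion matrix referenced here can be reproduced from the same predictions with scikit-learn (a sketch on placeholder labels; in the project, `cm` is returned by `std_func.dot_product` and the real labels live in `dot_product_df`):

```python
from sklearn.metrics import confusion_matrix

# Placeholder true/predicted labels standing in for the
# y_true / y_pred columns of dot_product_df.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes;
# off-diagonal entries show which classes get confused.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Reading down each column also shows which industry absorbs the misclassifications, which is how the strongest classes above were identified.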