Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. It is a pre-trained model created by Google, trained on sources such as Wikipedia, web news, web question-answer pages, and discussion forums. The model comes in two variants: one built on a Transformer encoder and the other on a Deep Averaging Network (DAN). The Transformer variant is computationally more intensive but gives better results, while the DAN variant trades some accuracy for lower computational cost. In our work, the DAN variant has been accurate enough that we do not need the Transformer alternative. The input is variable-length English text and the output is a normalised 512-dimensional vector.
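Because the encoder's outputs are L2-normalised, the dot product of two embeddings is exactly their cosine similarity. A minimal sketch with NumPy (the 512-dimensional vectors here are random stand-ins, not real encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two encoder outputs: random 512-dimensional vectors,
# L2-normalised just as the Universal Sentence Encoder normalises its embeddings.
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors the dot product *is* the cosine similarity.
dot = a @ b
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(dot, cosine))  # True
```

This is why the dot-product comparison used later in this section is a valid similarity measure for these embeddings.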

embeddings = pd.read_csv('embeddings.csv', index_col=0)
embeddings.head()
0 1 2 3 4 5 6 7 8 9 ... 502 503 504 505 506 507 508 509 510 511
name
MONGODB, INC. -0.048347 -0.048348 -0.047972 0.048348 -0.048074 0.024746 -0.040604 -0.048346 0.048348 -0.048348 ... -0.048325 -0.048348 -0.048348 0.021898 0.047686 -0.047848 0.046755 0.048289 -0.048348 -0.041822
SALESFORCE COM INC -0.048440 -0.048481 -0.011648 0.048481 -0.048470 0.020522 0.000173 -0.048477 0.048317 -0.048481 ... -0.047321 -0.048481 -0.048481 -0.016805 0.048388 -0.032797 0.046606 0.017406 -0.048481 -0.040719
SPLUNK INC -0.047489 -0.047792 -0.047772 0.047791 -0.047792 -0.047775 -0.047787 -0.047636 0.047789 -0.047758 ... -0.046150 -0.047792 -0.047792 -0.047715 0.047761 -0.047766 0.047790 0.047792 -0.047792 -0.047661
OKTA, INC. -0.048333 -0.048679 -0.026585 0.048682 -0.048568 0.048673 -0.010152 -0.046081 0.048555 -0.048633 ... -0.048677 -0.048682 -0.048682 -0.001725 0.048465 -0.046822 0.048279 0.048415 -0.048682 -0.047951
VEEVA SYSTEMS INC -0.045855 -0.045855 -0.045855 0.045855 -0.045854 0.045855 -0.045855 -0.045855 0.045855 -0.045855 ... -0.045855 -0.045855 -0.045855 -0.040072 -0.045855 -0.044339 0.045855 0.045855 -0.045855 -0.045141

5 rows × 512 columns

Plotting

plot_d2v = std_func.pca_visualize_2d(embeddings, df.loc[:,["name","SIC_desc"]])
std_func.pca_visualize_3d(plot_d2v)
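`std_func` is a project-specific helper module; the 2-D projection it computes can be approximated with scikit-learn's PCA. A sketch on random stand-in embeddings (the shapes are assumptions, not the project's actual data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))  # stand-in for 100 company embeddings

# Project the 512-dimensional embeddings onto their first two
# principal components for a 2-D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(coords.shape)  # (100, 2)
```

The same idea extends to three components for the 3-D view.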

Performance Evaluation

dot_product_df, accuracy, cm = std_func.dot_product(embeddings,df)
from sklearn.metrics import classification_report
print(classification_report(dot_product_df["y_true"], dot_product_df["y_pred"], target_names=df["SIC_desc"].unique()))
                                                      precision    recall  f1-score   support

Prepackaged Software (mass reproduction of software)       0.90      0.83      0.87        78
                     Crude Petroleum and Natural Gas       0.95      0.96      0.95       180
                         Pharmaceutical Preparations       0.91      0.90      0.90        67
                       Real Estate Investment Trusts       0.94      0.96      0.95       185
         State Commercial Banks (commercial banking)       0.95      0.96      0.96       108

                                            accuracy                           0.94       618
                                           macro avg       0.93      0.92      0.93       618
                                        weighted avg       0.94      0.94      0.94       618
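`std_func.dot_product` is project-specific, but a common way to build such a classifier is to score each normalised embedding against the mean embedding (centroid) of every industry and predict the highest-scoring one. A hedged sketch on synthetic data (the names and shapes here are illustrative assumptions, not the project's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_samples = 5, 512, 200

# Synthetic stand-ins: one unit "industry centroid" per class,
# plus noisy member embeddings drawn around their centroid.
centroids = rng.normal(size=(n_classes, dim))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

y_true = rng.integers(0, n_classes, size=n_samples)
X = centroids[y_true] + 0.1 * rng.normal(size=(n_samples, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Dot product against each centroid; since all vectors are unit-length,
# this ranks classes by cosine similarity.
scores = X @ centroids.T
y_pred = scores.argmax(axis=1)
accuracy = (y_pred == y_true).mean()
print(accuracy)  # high on this easy synthetic data
```

With real embeddings the classes overlap more, which is why the report above shows per-class precision and recall rather than a single number.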

From the confusion matrix and the classification report, we can conclude that the Universal Sentence Encoder model does a good job of classifying companies into their industry categories. More specifically, this model is best at classifying companies in the Crude Petroleum and Natural Gas, Real Estate Investment Trust, and Commercial Banking industries.
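The confusion matrix referenced here can be reproduced from the same predictions with scikit-learn (a sketch on placeholder labels; in the project, `cm` is returned by `std_func.dot_product` and the real labels live in `dot_product_df`):

```python
from sklearn.metrics import confusion_matrix

# Placeholder true/predicted labels standing in for the
# y_true / y_pred columns of dot_product_df.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes;
# off-diagonal entries show which classes get confused.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Reading down each column also shows which industry absorbs the misclassifications, which is how the strongest classes above were identified.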