Doc2Vec

Continuing our use of neural networks to build company embeddings, we now explore doc2vec.

This works much in the same way as Word2Vec, except that on input each word is also tagged with the document/filing it came from, giving us ready-made document vectors.

Let's get to the code!

First we need to load in the functions and data:

import os
import json
import pandas as pd
import numpy as np
import sys
sys.path.insert(0, '..')
%load_ext autoreload
%autoreload 2
%aimport std_func


# Hide warnings
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("../data/preprocessed.csv")

Thanks to the gensim package, it’s quite easy to implement doc2vec.

from gensim.models import doc2vec
from collections import namedtuple

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(df["coDescription_stopwords"]):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size = 100, window = 10, min_count = 1, workers = 4)

Like Word2Vec, we now also have a vector matrix, but this time one vector per document rather than per word. We specified only 100 dimensions due to computational limitations, and because more dimensions most likely would not have helped. (This hyper-parameter can be tuned later.)

And here we have the vectors for each company.

0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 1.069693 0.178045 -1.494915 1.975032 0.821181 -0.237772 -0.952073 -0.296611 -3.084478 -1.059440 ... -0.373738 -1.569935 -3.687043 0.155367 0.716785 -0.532835 -0.148241 2.769564 -0.610424 -1.721455
1 0.576653 -1.412108 -2.440013 2.347250 0.058851 0.432892 -0.617848 -0.992708 -1.153835 0.419545 ... -1.235869 -0.771894 -2.954011 -0.232072 -0.430426 -0.326765 -0.040944 0.135734 0.554869 -1.329786
2 -1.158351 -0.213180 -2.251126 2.846082 0.778042 0.325837 -0.533828 -0.531405 -3.166338 -0.376345 ... -0.097557 0.173748 -4.151709 1.077599 -0.321112 0.823240 -2.134309 0.777602 1.781383 -4.190876
3 0.042752 0.426476 -1.134072 1.108936 -0.686854 -0.832317 -0.604564 -0.258482 -1.826194 -0.151843 ... -1.789804 -2.944545 -2.614629 -1.172493 -1.010654 0.851054 -0.698927 1.539879 1.511696 -1.148776
4 -1.723339 -2.885207 -2.834879 2.373493 -1.766554 0.772403 -1.874856 2.813653 -1.315448 1.487623 ... -3.770599 1.020818 -4.491113 2.180514 3.247819 2.867793 -0.491228 2.126635 -3.065429 -2.985566

5 rows × 100 columns

Plotting the results

Here are the results of the doc2vec semantic company embedding after dimensionality reduction using PCA.
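The plotting itself is handled by `std_func` in our pipeline; a minimal sketch of what such a PCA projection involves, using random stand-in data in place of the real document-vector matrix and industry labels:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 100))       # stand-in for the 100-dim vectors
labels = np.array(["Oil", "Software"] * 5)  # stand-in industry labels

# Project the 100-dim embeddings down to 2 principal components
coords = PCA(n_components=2).fit_transform(doc_vecs)
for industry in np.unique(labels):
    mask = labels == industry
    plt.scatter(coords[mask, 0], coords[mask, 1], label=industry)
plt.legend()
plt.xlabel("PC1")
plt.ylabel("PC2")
```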

These look great! It seems doc2vec was able to create embeddings for our companies that separated them by industry very well, even after the PCA dimensionality reduction.

Performance Evaluation

conf_mat = std_func.conf_mat(doc_vec_2,df)
[Confusion matrix plot for the doc2vec classification]
dot_product_df, accuracy, cm = std_func.dot_product(doc_vec_2,df)
from sklearn.metrics import classification_report
print(classification_report(dot_product_df["y_true"], dot_product_df["y_pred"], target_names=df["SIC_desc"].unique()))
                                                      precision    recall  f1-score   support

Prepackaged Software (mass reproduction of software)       0.92      0.88      0.90        80
                     Crude Petroleum and Natural Gas       0.94      0.95      0.94       208
                         Pharmaceutical Preparations       0.82      0.86      0.84        80
                       Real Estate Investment Trusts       0.90      0.92      0.91       191
         State Commercial Banks (commercial banking)       0.98      0.93      0.96       116

                                            accuracy                           0.92       675
                                           macro avg       0.91      0.91      0.91       675
                                        weighted avg       0.92      0.92      0.92       675

From the confusion matrix and the classification report, we can conclude that the doc2vec company embeddings do a great job of classifying companies into their industries. This is in line with our observations of the PCA plots, which separated companies in different industries very well.