In [2]:
import pandas as pd
import seaborn as sns

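# Styler callbacks: each returns a CSS color rule for a single table cell.
# Same_12, Same_13 and Same_all are global sets of shared company names
# that are reassigned for each industry in the cells below.
def highlight_12(val):
    # Blue: company appears in both the sample and cosine similarity portfolios.
    color = 'blue' if val in Same_12 else 'black'
    return 'color: %s' % color

def highlight_13(val):
    # Green: company appears in both the sample and factor model portfolios.
    color = 'green' if val in Same_13 else 'black'
    return 'color: %s' % color

def highlight_all(val):
    # Red: company appears in all three portfolios.
    color = 'red' if val in Same_all else 'black'
    return 'color: %s' % color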

## Portfolio Analysis Results
This section presents the portfolio performance table and the weights tables for each industry under the three estimates: sample, cosine similarity, and factor model.

### Portfolio Performance
Because no optimal portfolio could be generated for the industry `Crude Petroleum and Natural Gas` under the sample estimate or the cosine similarity estimate, we exclude this industry from the performance comparison table.



In [3]:
columns = pd.MultiIndex.from_product([["Prepackaged Software", 
                                       "Pharmaceutical Preparations", "Real Estate Investment Trusts", 
                                       "State Commercial Banks",],
                                      ['Sample', 'Cosine Similarity', 'Factor Model']])

data = [[0.6,1.1,1.1,
         1.2,1.2,1.4,
         0.5,0.6,0.6,
         1.2,1.1,1.0],
        [2.4,2.9,4.1,
         2.1,2.6,3.3,
         1.8,1.7,2.4,
         2.7,2.2,3.6],
        [-0.57,-0.30,-0.21,
         -0.35,-0.32,-0.18,
         -0.80,-0.81,-0.57,
         -0.28,-0.38,-0.28]]

methods = ["Expected Annual Return", "Annual Volatility", "Sharpe Ratio"]

df = pd.DataFrame(data, index = methods, columns = columns).T.round(2)

cm = sns.light_palette("#5CCDC6", n_colors = 35, as_cmap=True)

df.style.background_gradient(cmap=cm)

Industry,Estimate,Expected Annual Return,Annual Volatility,Sharpe Ratio
Prepackaged Software,Sample,0.6,2.4,-0.57
Prepackaged Software,Cosine Similarity,1.1,2.9,-0.3
Prepackaged Software,Factor Model,1.1,4.1,-0.21
Pharmaceutical Preparations,Sample,1.2,2.1,-0.35
Pharmaceutical Preparations,Cosine Similarity,1.2,2.6,-0.32
Pharmaceutical Preparations,Factor Model,1.4,3.3,-0.18
Real Estate Investment Trusts,Sample,0.5,1.8,-0.8
Real Estate Investment Trusts,Cosine Similarity,0.6,1.7,-0.81
Real Estate Investment Trusts,Factor Model,0.6,2.4,-0.57
State Commercial Banks,Sample,1.2,2.7,-0.28
State Commercial Banks,Cosine Similarity,1.1,2.2,-0.38
State Commercial Banks,Factor Model,1.0,3.6,-0.28



From the table above, the factor model portfolios show higher annual volatility than the sample portfolios in every industry, while the cosine similarity portfolios give results similar to those of the sample portfolios.

Next, we examine each model's choice of companies to determine whether cosine similarity analysis and the factor model can be used to construct portfolios similar to those built from the sample estimate.
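
The comparison can be made concrete by measuring how far each estimate's volatility lies from the sample estimate's. The following is a minimal sketch that only re-uses the annual volatility figures from the performance table; the names `volatility`, `gaps` and `mean_abs_gap` are introduced here for illustration and are not part of the original analysis.

```python
# Annual volatility per industry, copied from the performance table,
# in the order (Sample, Cosine Similarity, Factor Model).
volatility = {
    "Prepackaged Software": (2.4, 2.9, 4.1),
    "Pharmaceutical Preparations": (2.1, 2.6, 3.3),
    "Real Estate Investment Trusts": (1.8, 1.7, 2.4),
    "State Commercial Banks": (2.7, 2.2, 3.6),
}

# Signed difference from the sample estimate, rounded to the table's precision.
gaps = {
    industry: {
        "Cosine Similarity": round(cos_sim - sample, 2),
        "Factor Model": round(factor - sample, 2),
    }
    for industry, (sample, cos_sim, factor) in volatility.items()
}


def mean_abs_gap(estimate):
    """Average absolute volatility gap to the sample estimate across industries."""
    return sum(abs(gap[estimate]) for gap in gaps.values()) / len(gaps)


for industry, gap in gaps.items():
    print(industry, gap)
print("mean |gap| cosine similarity:", round(mean_abs_gap("Cosine Similarity"), 2))
print("mean |gap| factor model:", round(mean_abs_gap("Factor Model"), 2))
```

On these figures the factor model gap is positive in all four industries, and the average absolute gap of the cosine similarity estimate is the smaller of the two.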


### Portfolio Weights
We present the portfolio weights tables to compare the three estimates.

First, when comparing the sample covariance estimate with the estimate based on the cosine similarity of the business descriptions, we highlight the companies in common in `blue`.

Then, when comparing the sample covariance estimate with the factor model estimate based on business descriptions and return data, we highlight the companies in common in `green`.

Companies that appear in all three constructed portfolios are highlighted in `red` in the portfolio weights table.
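
The highlighting scheme described above amounts to set intersections plus a cell-wise styling function. The sketch below illustrates the idea with made-up company names; `make_highlighter` is a hypothetical helper that builds the styling function from a set and a color, instead of reading a global set as the notebook's `highlight_12`, `highlight_13` and `highlight_all` do.

```python
def make_highlighter(shared_names, color):
    """Return a cell-wise styling function that colors names in `shared_names`."""
    def highlight(val):
        # Weights and padding cells are never in the set, so they stay black.
        return f"color: {color if val in shared_names else 'black'}"
    return highlight


# Hypothetical company names, for illustration only.
sample = {"ALPHA CORP", "BETA INC", "GAMMA LLC"}
cosine = {"BETA INC", "GAMMA LLC", "DELTA CO"}
factor = {"GAMMA LLC", "EPSILON SA"}

same_12 = sample & cosine      # shared by sample and cosine similarity
same_13 = sample & factor      # shared by sample and factor model
same_all = same_12 & same_13   # shared by all three

highlight_12 = make_highlighter(same_12, "blue")
highlight_13 = make_highlighter(same_13, "green")
highlight_all = make_highlighter(same_all, "red")

print(highlight_12("BETA INC"))
print(highlight_all("BETA INC"))
```

A function built this way can be passed to `Styler.applymap` in the same way as the notebook's functions; in pandas 2.1 and later the same method is available as `Styler.map`, and `applymap` is deprecated.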

#### Prepackaged Software (mass reproduction of software)

In [4]:
sample_software = pd.read_csv("data/min_vol_sample_Prepackaged_Software.csv")
cos_sim_software = pd.read_csv("data/min_vol_cos_sim_Prepackaged_Software.csv")
factor_model_software = pd.read_csv("data/min_vol_factor_model_Prepackaged_Software.csv")

sample_software = sample_software.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
cos_sim_software = cos_sim_software.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
factor_model_software = factor_model_software.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate'], 
                                      ['Company Name', 'Weight']])
software_12 = pd.concat([sample_software, cos_sim_software], axis=1)
software_12.columns = columns
software_12 = software_12.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
software_13 = pd.concat([sample_software, factor_model_software], axis=1)
software_13.columns = columns
software_13 = software_13.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
software = pd.concat([pd.concat([sample_software, cos_sim_software], axis=1), factor_model_software], axis=1)
software.columns = columns
software = software.fillna(" ")

Same_12 = (set(sample_software.Company_Name) & set(cos_sim_software.Company_Name)) 
Same_13 = (set(sample_software.Company_Name) & set(factor_model_software.Company_Name)) 
Same_all = Same_12 & Same_13
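
The sort, concatenate, relabel and pad steps in the cell above are repeated for every industry, so they could be factored into one function. This is a minimal sketch under the assumption that each input frame has the `Company_Name` and `Weight` columns used above; `build_comparison` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd


def build_comparison(portfolios):
    """Rank each portfolio by weight and place the rankings side by side.

    `portfolios` maps a column label (e.g. 'Sample Estimate') to a DataFrame
    with the columns `Company_Name` and `Weight`.
    """
    ranked = [
        frame[["Company_Name", "Weight"]]
        .sort_values(by="Weight", ascending=False)
        .reset_index(drop=True)
        for frame in portfolios.values()
    ]
    table = pd.concat(ranked, axis=1)  # shorter portfolios are padded with NaN
    table.columns = pd.MultiIndex.from_product(
        [list(portfolios), ["Company Name", "Weight"]]
    )
    return table.fillna(" ")


# Toy data with made-up names, only to show the shape of the result.
sample = pd.DataFrame({"Company_Name": ["A", "B", "C"], "Weight": [0.3, 0.5, 0.2]})
factor = pd.DataFrame({"Company_Name": ["B", "D"], "Weight": [0.6, 0.4]})

table = build_comparison({"Sample Estimate": sample, "Factor Model Estimate": factor})
print(table)
```

Passing two frames gives a pairwise table and passing three gives the combined table, so one function covers all three layouts used for each industry.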

##### Sample Estimate V.S. Cosine Similarity Estimate

In [5]:
software_12.style.applymap(highlight_12)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,"BLACK KNIGHT, INC.",0.2,"BLACK KNIGHT, INC.",0.2
1,AWARE INC /MA/,0.2,ORACLE CORP,0.16315
2,"ACI WORLDWIDE, INC.",0.11314,ANSYS INC,0.1539
3,ORACLE CORP,0.0917,ULTIMATE SOFTWARE GROUP INC,0.1035
4,"NUANCE COMMUNICATIONS, INC.",0.08608,NATIONAL INSTRUMENTS CORP,0.09372
5,COMMVAULT SYSTEMS INC,0.07381,"Q2 HOLDINGS, INC.",0.0619
6,"QUALYS, INC.",0.06668,"NUANCE COMMUNICATIONS, INC.",0.05947
7,QUMU CORP,0.05153,"ACI WORLDWIDE, INC.",0.04754
8,"ENDURANCE INTERNATIONAL GROUP HOLDINGS, INC.",0.02554,GSE SYSTEMS INC,0.04031
9,MICROSTRATEGY INC,0.0216,REALPAGE INC,0.02937


##### Sample Estimate V.S. Factor Model Estimate

In [6]:
software_13.style.applymap(highlight_13)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,"BLACK KNIGHT, INC.",0.2,"BLACK KNIGHT, INC.",0.2
1,AWARE INC /MA/,0.2,"ACI WORLDWIDE, INC.",0.2
2,"ACI WORLDWIDE, INC.",0.11314,ORACLE CORP,0.2
3,ORACLE CORP,0.0917,NATIONAL INSTRUMENTS CORP,0.11657
4,"NUANCE COMMUNICATIONS, INC.",0.08608,SALESFORCE COM INC,0.09549
5,COMMVAULT SYSTEMS INC,0.07381,AWARE INC /MA/,0.06064
6,"QUALYS, INC.",0.06668,ULTIMATE SOFTWARE GROUP INC,0.05857
7,QUMU CORP,0.05153,ANSYS INC,0.03257
8,"ENDURANCE INTERNATIONAL GROUP HOLDINGS, INC.",0.02554,REALPAGE INC,0.02255
9,MICROSTRATEGY INC,0.0216,"POLARITYTE, INC.",0.01095


##### All Three Estimates

In [7]:
software.style.applymap(highlight_all)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight,Company Name,Weight
0,"BLACK KNIGHT, INC.",0.2,"BLACK KNIGHT, INC.",0.2,"BLACK KNIGHT, INC.",0.2
1,AWARE INC /MA/,0.2,ORACLE CORP,0.16315,"ACI WORLDWIDE, INC.",0.2
2,"ACI WORLDWIDE, INC.",0.11314,ANSYS INC,0.1539,ORACLE CORP,0.2
3,ORACLE CORP,0.0917,ULTIMATE SOFTWARE GROUP INC,0.1035,NATIONAL INSTRUMENTS CORP,0.11657
4,"NUANCE COMMUNICATIONS, INC.",0.08608,NATIONAL INSTRUMENTS CORP,0.09372,SALESFORCE COM INC,0.09549
5,COMMVAULT SYSTEMS INC,0.07381,"Q2 HOLDINGS, INC.",0.0619,AWARE INC /MA/,0.06064
6,"QUALYS, INC.",0.06668,"NUANCE COMMUNICATIONS, INC.",0.05947,ULTIMATE SOFTWARE GROUP INC,0.05857
7,QUMU CORP,0.05153,"ACI WORLDWIDE, INC.",0.04754,ANSYS INC,0.03257
8,"ENDURANCE INTERNATIONAL GROUP HOLDINGS, INC.",0.02554,GSE SYSTEMS INC,0.04031,REALPAGE INC,0.02255
9,MICROSTRATEGY INC,0.0216,REALPAGE INC,0.02937,"POLARITYTE, INC.",0.01095


For the industry `Prepackaged Software`, 6/15 of the companies in the sample estimated portfolio and the cosine similarity estimated portfolio are the same, and 5/15 of the companies in the sample estimated portfolio and the factor model estimated portfolio are the same. When we compare all three portfolios, there are 4 companies in common.
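
The overlap counts come from set intersections of the company names. As an illustration, the sketch below applies the same intersections to only the ten largest holdings displayed in the tables above. Because it sees only those ten rows per estimate, its counts are lower bounds on the full-portfolio counts quoted in the text; the variable names are introduced here for illustration.

```python
# Ten largest holdings per estimate, copied from the displayed tables.
sample_top10 = {
    "BLACK KNIGHT, INC.", "AWARE INC /MA/", "ACI WORLDWIDE, INC.", "ORACLE CORP",
    "NUANCE COMMUNICATIONS, INC.", "COMMVAULT SYSTEMS INC", "QUALYS, INC.",
    "QUMU CORP", "ENDURANCE INTERNATIONAL GROUP HOLDINGS, INC.", "MICROSTRATEGY INC",
}
cos_sim_top10 = {
    "BLACK KNIGHT, INC.", "ORACLE CORP", "ANSYS INC", "ULTIMATE SOFTWARE GROUP INC",
    "NATIONAL INSTRUMENTS CORP", "Q2 HOLDINGS, INC.", "NUANCE COMMUNICATIONS, INC.",
    "ACI WORLDWIDE, INC.", "GSE SYSTEMS INC", "REALPAGE INC",
}
factor_top10 = {
    "BLACK KNIGHT, INC.", "ACI WORLDWIDE, INC.", "ORACLE CORP",
    "NATIONAL INSTRUMENTS CORP", "SALESFORCE COM INC", "AWARE INC /MA/",
    "ULTIMATE SOFTWARE GROUP INC", "ANSYS INC", "REALPAGE INC", "POLARITYTE, INC.",
}

same_12 = sample_top10 & cos_sim_top10   # sample vs. cosine similarity
same_13 = sample_top10 & factor_top10    # sample vs. factor model
same_all = same_12 & same_13             # present in all three

print(len(same_12), len(same_13), len(same_all))
print(sorted(same_all))
```

On the displayed rows the three counts are 4, 4 and 3, which is consistent with the full-portfolio counts of 6, 5 and 4.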

#### Pharmaceutical Preparations

In [9]:
sample_pharm = pd.read_csv("data/min_vol_sample_Pharmaceutical_Preparations.csv")
cos_sim_pharm = pd.read_csv("data/min_vol_cos_sim_Pharmaceutical_Preparations.csv")
factor_model_pharm = pd.read_csv("data/min_vol_factor_model_Pharmaceutical_Preparations.csv")

In [10]:
sample_pharm = sample_pharm.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
cos_sim_pharm = cos_sim_pharm.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
factor_model_pharm = factor_model_pharm.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)

In [11]:
columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate'], 
                                      ['Company Name', 'Weight']])
pharm_12 = pd.concat([sample_pharm, cos_sim_pharm], axis=1)
pharm_12.columns = columns
pharm_12 = pharm_12.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
pharm_13 = pd.concat([sample_pharm, factor_model_pharm], axis=1)
pharm_13.columns = columns
pharm_13 = pharm_13.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
pharm = pd.concat([pd.concat([sample_pharm, cos_sim_pharm], axis=1), factor_model_pharm], axis=1)
pharm.columns = columns
pharm = pharm.fillna(" ")

Same_12 = (set(sample_pharm.Company_Name) & set(cos_sim_pharm.Company_Name)) 
Same_13 = (set(sample_pharm.Company_Name) & set(factor_model_pharm.Company_Name)) 
Same_all = Same_12 & Same_13

##### Sample Estimate V.S. Cosine Similarity Estimate

In [12]:
pharm_12.style.applymap(highlight_12)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,"MERCK & CO., INC.",0.2,ZOETIS INC.,0.2
1,JOHNSON & JOHNSON,0.17878,PFIZER INC,0.2
2,BRISTOL MYERS SQUIBB CO,0.12824,JOHNSON & JOHNSON,0.18756
3,"ASSEMBLY BIOSCIENCES, INC.",0.05775,"MERCK & CO., INC.",0.13753
4,"PROPHASE LABS, INC.",0.0512,BIOSPECIFICS TECHNOLOGIES CORP,0.07394
5,ORAMED PHARMACEUTICALS INC.,0.04982,BIOMARIN PHARMACEUTICAL INC,0.04572
6,STEMLINE THERAPEUTICS INC,0.04273,BRISTOL MYERS SQUIBB CO,0.03719
7,"IMPRIMIS PHARMACEUTICALS, INC.",0.04181,LILLY ELI & CO,0.03562
8,PFENEX INC.,0.03777,XENCOR INC,0.02108
9,BIODELIVERY SCIENCES INTERNATIONAL INC,0.0368,"PACIRA PHARMACEUTICALS, INC.",0.01883


##### Sample Estimate V.S. Factor Model Estimate

In [13]:
pharm_13.style.applymap(highlight_13)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,"MERCK & CO., INC.",0.2,PFIZER INC,0.2
1,JOHNSON & JOHNSON,0.17878,JOHNSON & JOHNSON,0.2
2,BRISTOL MYERS SQUIBB CO,0.12824,"MERCK & CO., INC.",0.2
3,"ASSEMBLY BIOSCIENCES, INC.",0.05775,ZOETIS INC.,0.2
4,"PROPHASE LABS, INC.",0.0512,LILLY ELI & CO,0.1825
5,ORAMED PHARMACEUTICALS INC.,0.04982,BIOSPECIFICS TECHNOLOGIES CORP,0.0175
6,STEMLINE THERAPEUTICS INC,0.04273,,
7,"IMPRIMIS PHARMACEUTICALS, INC.",0.04181,,
8,PFENEX INC.,0.03777,,
9,BIODELIVERY SCIENCES INTERNATIONAL INC,0.0368,,


##### All Three Estimates

In [14]:
pharm.style.applymap(highlight_all)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight,Company Name,Weight
0,"MERCK & CO., INC.",0.2,ZOETIS INC.,0.2,PFIZER INC,0.2
1,JOHNSON & JOHNSON,0.17878,PFIZER INC,0.2,JOHNSON & JOHNSON,0.2
2,BRISTOL MYERS SQUIBB CO,0.12824,JOHNSON & JOHNSON,0.18756,"MERCK & CO., INC.",0.2
3,"ASSEMBLY BIOSCIENCES, INC.",0.05775,"MERCK & CO., INC.",0.13753,ZOETIS INC.,0.2
4,"PROPHASE LABS, INC.",0.0512,BIOSPECIFICS TECHNOLOGIES CORP,0.07394,LILLY ELI & CO,0.1825
5,ORAMED PHARMACEUTICALS INC.,0.04982,BIOMARIN PHARMACEUTICAL INC,0.04572,BIOSPECIFICS TECHNOLOGIES CORP,0.0175
6,STEMLINE THERAPEUTICS INC,0.04273,BRISTOL MYERS SQUIBB CO,0.03719,,
7,"IMPRIMIS PHARMACEUTICALS, INC.",0.04181,LILLY ELI & CO,0.03562,,
8,PFENEX INC.,0.03777,XENCOR INC,0.02108,,
9,BIODELIVERY SCIENCES INTERNATIONAL INC,0.0368,"PACIRA PHARMACEUTICALS, INC.",0.01883,,


For the industry `Pharmaceutical Preparations`, 4/21 of the companies in the sample estimated portfolio and the cosine similarity estimated portfolio are the same, and 3/21 of the companies in the sample estimated portfolio and the factor model estimated portfolio are the same. When we compare all three portfolios, there are 3 companies in common.

#### Real Estate Investment Trusts

In [15]:
sample_real_estate = pd.read_csv("data/min_vol_sample_Real_Estate_Investment_Trusts.csv")
cos_sim_real_estate = pd.read_csv("data/min_vol_cos_sim_Real_Estate_Investment_Trusts.csv")
factor_model_real_estate = pd.read_csv("data/min_vol_factor_model_Real_Estate_Investment_Trusts.csv")

In [16]:
sample_real_estate = sample_real_estate.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
cos_sim_real_estate = cos_sim_real_estate.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
factor_model_real_estate = factor_model_real_estate.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)

In [17]:
columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate'], 
                                      ['Company Name', 'Weight']])
real_estate_12 = pd.concat([sample_real_estate, cos_sim_real_estate], axis=1)
real_estate_12.columns = columns
real_estate_12 = real_estate_12.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
real_estate_13 = pd.concat([sample_real_estate, factor_model_real_estate], axis=1)
real_estate_13.columns = columns
real_estate_13 = real_estate_13.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
real_estate = pd.concat([pd.concat([sample_real_estate, cos_sim_real_estate], axis=1), factor_model_real_estate], axis=1)
real_estate.columns = columns
real_estate = real_estate.fillna(" ")

Same_12 = (set(sample_real_estate.Company_Name) & set(cos_sim_real_estate.Company_Name)) 
Same_13 = (set(sample_real_estate.Company_Name) & set(factor_model_real_estate.Company_Name)) 
Same_all = Same_12 & Same_13

##### Sample Estimate V.S. Cosine Similarity Estimate 

In [18]:
real_estate_12.style.applymap(highlight_12)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,EQUITY COMMONWEALTH,0.2,EQUITY COMMONWEALTH,0.16327
1,GREAT AJAX CORP.,0.2,SUN COMMUNITIES INC,0.14907
2,HMG COURTLAND PROPERTIES INC,0.12513,GREAT AJAX CORP.,0.13806
3,PUBLIC STORAGE,0.10938,EQUINIX INC,0.07068
4,ARES COMMERCIAL REAL ESTATE CORP,0.09107,"GAMING & LEISURE PROPERTIES, INC.",0.06734
5,CIM COMMERCIAL TRUST CORP,0.05461,PUBLIC STORAGE,0.06339
6,IMPAC MORTGAGE HOLDINGS INC,0.05108,DUKE REALTY CORP,0.05369
7,CROWN CASTLE INTERNATIONAL CORP,0.04875,HIGHWOODS PROPERTIES INC,0.05347
8,LADDER CAPITAL CORP,0.0442,"MFA FINANCIAL, INC.",0.05101
9,ALEXANDERS INC,0.02285,ANNALY CAPITAL MANAGEMENT INC,0.05094


##### Sample Estimate V.S. Factor Model Estimate 

In [19]:
real_estate_13.style.applymap(highlight_13)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,EQUITY COMMONWEALTH,0.2,GREAT AJAX CORP.,0.2
1,GREAT AJAX CORP.,0.2,"STARWOOD PROPERTY TRUST, INC.",0.2
2,HMG COURTLAND PROPERTIES INC,0.12513,EQUITY COMMONWEALTH,0.2
3,PUBLIC STORAGE,0.10938,ARES COMMERCIAL REAL ESTATE CORP,0.09357
4,ARES COMMERCIAL REAL ESTATE CORP,0.09107,ALEXANDRIA REAL ESTATE EQUITIES INC,0.07207
5,CIM COMMERCIAL TRUST CORP,0.05461,SUN COMMUNITIES INC,0.05974
6,IMPAC MORTGAGE HOLDINGS INC,0.05108,TWO HARBORS INVESTMENT CORP.,0.05921
7,CROWN CASTLE INTERNATIONAL CORP,0.04875,"GAMING & LEISURE PROPERTIES, INC.",0.04173
8,LADDER CAPITAL CORP,0.0442,ESSEX PROPERTY TRUST INC,0.03164
9,ALEXANDERS INC,0.02285,"UDR, INC.",0.02051


##### All Three Estimates

In [20]:
real_estate.style.applymap(highlight_all)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight,Company Name,Weight
0,EQUITY COMMONWEALTH,0.2,EQUITY COMMONWEALTH,0.16327,GREAT AJAX CORP.,0.2
1,GREAT AJAX CORP.,0.2,SUN COMMUNITIES INC,0.14907,"STARWOOD PROPERTY TRUST, INC.",0.2
2,HMG COURTLAND PROPERTIES INC,0.12513,GREAT AJAX CORP.,0.13806,EQUITY COMMONWEALTH,0.2
3,PUBLIC STORAGE,0.10938,EQUINIX INC,0.07068,ARES COMMERCIAL REAL ESTATE CORP,0.09357
4,ARES COMMERCIAL REAL ESTATE CORP,0.09107,"GAMING & LEISURE PROPERTIES, INC.",0.06734,ALEXANDRIA REAL ESTATE EQUITIES INC,0.07207
5,CIM COMMERCIAL TRUST CORP,0.05461,PUBLIC STORAGE,0.06339,SUN COMMUNITIES INC,0.05974
6,IMPAC MORTGAGE HOLDINGS INC,0.05108,DUKE REALTY CORP,0.05369,TWO HARBORS INVESTMENT CORP.,0.05921
7,CROWN CASTLE INTERNATIONAL CORP,0.04875,HIGHWOODS PROPERTIES INC,0.05347,"GAMING & LEISURE PROPERTIES, INC.",0.04173
8,LADDER CAPITAL CORP,0.0442,"MFA FINANCIAL, INC.",0.05101,ESSEX PROPERTY TRUST INC,0.03164
9,ALEXANDERS INC,0.02285,ANNALY CAPITAL MANAGEMENT INC,0.05094,"UDR, INC.",0.02051


For the industry `Real Estate Investment Trusts`, 6/13 of the companies in the sample estimated portfolio and the cosine similarity estimated portfolio are the same, and 4/13 of the companies in the sample estimated portfolio and the factor model estimated portfolio are the same. When we compare all three portfolios, there are 3 companies in common.

#### State Commercial Banks (commercial banking)

In [21]:
sample_banks = pd.read_csv("data/min_vol_sample_State_Commercial_Banks.csv")
cos_sim_banks = pd.read_csv("data/min_vol_cos_sim_State_Commercial_Banks.csv")
factor_model_banks = pd.read_csv("data/min_vol_factor_model_State_Commercial_Banks.csv")

In [22]:
sample_banks = sample_banks.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
cos_sim_banks = cos_sim_banks.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)
factor_model_banks = factor_model_banks.sort_values(by=["Weight"], ascending=False).reset_index(drop=True)

In [23]:
columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate'], 
                                      ['Company Name', 'Weight']])
banks_12 = pd.concat([sample_banks, cos_sim_banks], axis=1)
banks_12.columns = columns
banks_12 = banks_12.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
banks_13 = pd.concat([sample_banks, factor_model_banks], axis=1)
banks_13.columns = columns
banks_13 = banks_13.fillna(" ")

columns = pd.MultiIndex.from_product([['Sample Estimate', 'Cosine Similarity Estimate', 'Factor Model Estimate'], 
                                      ['Company Name', 'Weight']])
banks = pd.concat([pd.concat([sample_banks, cos_sim_banks], axis=1), factor_model_banks], axis=1)
banks.columns = columns
banks = banks.fillna(" ")

Same_12 = (set(sample_banks.Company_Name) & set(cos_sim_banks.Company_Name)) 
Same_13 = (set(sample_banks.Company_Name) & set(factor_model_banks.Company_Name)) 
Same_all = Same_12 & Same_13

##### Sample Estimate V.S. Cosine Similarity Estimate

In [24]:
banks_12.style.applymap(highlight_12)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,INVESTAR HOLDING CORP,0.1944,BANNER CORP,0.2
1,GUARANTY FEDERAL BANCSHARES INC,0.17724,INVESTAR HOLDING CORP,0.16789
2,VILLAGE BANK & TRUST FINANCIAL CORP.,0.13994,CITIZENS & NORTHERN CORP,0.11305
3,"RELIANT BANCORP, INC.",0.12273,BANK OF NEW YORK MELLON CORP,0.09816
4,"CAROLINA TRUST BANCSHARES, INC.",0.11786,INDEPENDENT BANK CORP /MI/,0.0954
5,BANK OF NEW YORK MELLON CORP,0.09533,EAST WEST BANCORP INC,0.08342
6,CITIZENS & NORTHERN CORP,0.05375,ENTERPRISE FINANCIAL SERVICES CORP,0.07078
7,FIRST COMMUNITY CORP /SC/,0.05076,S&T BANCORP INC,0.05201
8,MACKINAC FINANCIAL CORP /MI/,0.02478,BANK OF HAWAII CORP,0.04935
9,"FAUQUIER BANKSHARES, INC.",0.02143,HOWARD BANCORP INC,0.02931


##### Sample Estimate V.S. Factor Model Estimate

In [25]:
banks_13.style.applymap(highlight_13)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight
0,INVESTAR HOLDING CORP,0.1944,INVESTAR HOLDING CORP,0.2
1,GUARANTY FEDERAL BANCSHARES INC,0.17724,MACKINAC FINANCIAL CORP /MI/,0.17138
2,VILLAGE BANK & TRUST FINANCIAL CORP.,0.13994,BANK OF THE JAMES FINANCIAL GROUP INC,0.13657
3,"RELIANT BANCORP, INC.",0.12273,HOPFED BANCORP INC,0.09872
4,"CAROLINA TRUST BANCSHARES, INC.",0.11786,BANK OF HAWAII CORP,0.07586
5,BANK OF NEW YORK MELLON CORP,0.09533,GUARANTY FEDERAL BANCSHARES INC,0.05649
6,CITIZENS & NORTHERN CORP,0.05375,"CB FINANCIAL SERVICES, INC.",0.05269
7,FIRST COMMUNITY CORP /SC/,0.05076,COMMERCE BANCSHARES INC /MO/,0.05022
8,MACKINAC FINANCIAL CORP /MI/,0.02478,BANK OF NEW YORK MELLON CORP,0.04777
9,"FAUQUIER BANKSHARES, INC.",0.02143,OLD LINE BANCSHARES INC,0.04698


##### All Three Estimates

In [26]:
banks.style.applymap(highlight_all)

Unnamed: 0_level_0,Sample Estimate,Sample Estimate,Cosine Similarity Estimate,Cosine Similarity Estimate,Factor Model Estimate,Factor Model Estimate
Unnamed: 0_level_1,Company Name,Weight,Company Name,Weight,Company Name,Weight
0,INVESTAR HOLDING CORP,0.1944,BANNER CORP,0.2,INVESTAR HOLDING CORP,0.2
1,GUARANTY FEDERAL BANCSHARES INC,0.17724,INVESTAR HOLDING CORP,0.16789,MACKINAC FINANCIAL CORP /MI/,0.17138
2,VILLAGE BANK & TRUST FINANCIAL CORP.,0.13994,CITIZENS & NORTHERN CORP,0.11305,BANK OF THE JAMES FINANCIAL GROUP INC,0.13657
3,"RELIANT BANCORP, INC.",0.12273,BANK OF NEW YORK MELLON CORP,0.09816,HOPFED BANCORP INC,0.09872
4,"CAROLINA TRUST BANCSHARES, INC.",0.11786,INDEPENDENT BANK CORP /MI/,0.0954,BANK OF HAWAII CORP,0.07586
5,BANK OF NEW YORK MELLON CORP,0.09533,EAST WEST BANCORP INC,0.08342,GUARANTY FEDERAL BANCSHARES INC,0.05649
6,CITIZENS & NORTHERN CORP,0.05375,ENTERPRISE FINANCIAL SERVICES CORP,0.07078,"CB FINANCIAL SERVICES, INC.",0.05269
7,FIRST COMMUNITY CORP /SC/,0.05076,S&T BANCORP INC,0.05201,COMMERCE BANCSHARES INC /MO/,0.05022
8,MACKINAC FINANCIAL CORP /MI/,0.02478,BANK OF HAWAII CORP,0.04935,BANK OF NEW YORK MELLON CORP,0.04777
9,"FAUQUIER BANKSHARES, INC.",0.02143,HOWARD BANCORP INC,0.02931,OLD LINE BANCSHARES INC,0.04698


For the industry `State Commercial Banks`, 3/11 of the companies in the sample estimated portfolio and the cosine similarity estimated portfolio are the same, and 8/11 of the companies in the sample estimated portfolio and the factor model estimated portfolio are the same. When we compare all three portfolios, there are 3 companies in common.

#### Crude Petroleum and Natural Gas
Since there is no optimal portfolio generated for the Crude Petroleum and Natural Gas industry, we do not include this industry.

### Conclusion

Overall, the cosine similarity estimate comes closer to the sample covariance than the factor model estimate does. In the performance table, the sample estimate and the cosine similarity estimate give more similar expected returns and annual volatilities for the minimum-variance portfolio.

For the industries `Prepackaged Software` and `Pharmaceutical Preparations`, the sample portfolios contain a larger selection of companies, while for the other two industries all three estimates generate portfolios of similar size.

Taking the sample estimated portfolio as the reference, most portfolios generated from the cosine similarity estimate and the factor model estimate share fewer than half of its companies. The exception is the industry `State Commercial Banks`, where 8/11 of the companies in the sample estimated portfolio and the factor model estimated portfolio are the same. This is the only industry in which the factor model estimate constructs a portfolio more similar to the sample portfolio than the cosine similarity estimate does.
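
The "fewer than half" statement can be checked directly from the overlap fractions quoted in the industry summaries. The sketch below tabulates those fractions and lists the comparisons that exceed one half; the dictionary layout and the name `majority` are introduced here for illustration.

```python
# (shared companies, denominator) as quoted in the industry summaries above.
overlap = {
    ("Prepackaged Software", "Cosine Similarity"): (6, 15),
    ("Prepackaged Software", "Factor Model"): (5, 15),
    ("Pharmaceutical Preparations", "Cosine Similarity"): (4, 21),
    ("Pharmaceutical Preparations", "Factor Model"): (3, 21),
    ("Real Estate Investment Trusts", "Cosine Similarity"): (6, 13),
    ("Real Estate Investment Trusts", "Factor Model"): (4, 13),
    ("State Commercial Banks", "Cosine Similarity"): (3, 11),
    ("State Commercial Banks", "Factor Model"): (8, 11),
}

# Comparisons in which more than half of the companies are shared.
majority = [key for key, (shared, total) in overlap.items() if shared / total > 0.5]

for (industry, estimate), (shared, total) in overlap.items():
    print(f"{industry:32s} {estimate:18s} {shared}/{total} = {shared / total:.2f}")
print(majority)
```

Only the factor model comparison for `State Commercial Banks` exceeds one half.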

To sum up, the feasibility of constructing similar portfolios from document embeddings of companies' business descriptions in SEC filings is low, but the results do show that textual analysis can produce portfolios that partly overlap with the sample-based ones.

Our research has certain limitations. Because the topics are selected by unsupervised learning, the accuracy of the topic selection in the factor model is uncertain. Moreover, we have not confirmed how informative the words used in the word embeddings are.

For future research, we may apply topic modeling to the risk disclosure section of SEC filings and use the resulting risk factors in the factor model. The business description of a company may not explain much of its returns or its correlation with other companies, whereas its risk disclosure may reveal more information.