Vision Statement

Financial professionals face an overwhelming amount of financial literature that must be processed and analyzed to arrive at an investment thesis and recommendation: which security to choose, at what price point to buy, what the exit strategy is, and where in the capital structure to invest. This requires reading through hundreds of pages of text, such as macroeconomic research papers, news, annual and quarterly reports, and earnings transcripts. Once a security is purchased, multi-page performance and risk reports must be analyzed manually. To alleviate this burden, there is growing interest in applying NLP algorithms to generate concise summaries that focus on decision-relevant information while leaving unnecessary verbiage aside. This can be achieved by finding labeled data, generating it through subject matter experts (SMEs), or using data extraction methods combined with pre-set templates to generate financial summaries.

Our vision is to create a comprehensive financial reporting solution that caters to the needs of institutional investors, while also making algorithm-generated reports accessible to retail investors. We recognize the vast market potential for such a solution, and our goal is to level the playing field by providing timely and impactful financial information to all. In the US alone, there are an estimated 10,000 institutional investors, and with around 6,000 publicly listed companies, there is a demand for financial statement summaries that runs into the millions per year. With additional reports such as quarterly statements, the potential market size is even larger. Moreover, the retail market is also significant, with over 100 million user names at the six largest brokerages, and assuming an average of 20 companies per portfolio, the demand for financial reports is immense.
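
As a rough illustration of the market-size arithmetic above, the back-of-envelope sketch below multiplies the cited counts; the per-investor coverage figure is an assumption introduced only for this example.

```python
# Back-of-envelope sizing of the demand figures cited above. The coverage and
# usage numbers marked "assumption" are illustrative, not sourced.
institutional_investors = 10_000        # estimated US institutional investors
companies_reviewed_per_year = 200       # assumption: filings reviewed per investor per year
institutional_demand = institutional_investors * companies_reviewed_per_year
print(f"Institutional summaries needed per year: ~{institutional_demand:,}")   # ~2,000,000

retail_accounts = 100_000_000           # user names at the six largest brokerages
holdings_per_portfolio = 20             # assumed average companies per portfolio
retail_touchpoints = retail_accounts * holdings_per_portfolio
print(f"Retail company-report touchpoints: ~{retail_touchpoints:,}")           # 2,000,000,000
```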

Data


Our team has acquired a comprehensive dataset consisting of 191 10-K reports from the SEC for the 2021 fiscal year. The dataset includes both the extracted JSON format and the raw HTML format of each report. This dataset is a valuable resource for financial professionals who want to analyze and compare the financial performance of different companies over the past year.

Our dataset consists of 10-K reports, which are annual filings required by the U.S. Securities and Exchange Commission (SEC). These reports provide a comprehensive overview of a company's financial performance throughout the year. To create a high-quality and diverse dataset, we randomly selected 48 10-K reports from fiscal year 2021, representing companies across various industries and market capitalizations.
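
For readers who want to reproduce the data collection step, the sketch below shows one way to list a company's 10-K filings from SEC EDGAR's public submissions endpoint; it is a simplified illustration, not our exact acquisition code, and the User-Agent contact string is a placeholder.

```python
# Sketch: list recent 10-K filings for one company via SEC EDGAR's submissions endpoint.
# SEC requests a descriptive User-Agent; replace the placeholder contact before use.
import requests

HEADERS = {"User-Agent": "10-K summarization research project (contact@example.com)"}

def list_10k_filings(cik: int) -> list[tuple[str, str]]:
    """Return (filing_date, document_url) pairs for a company's recent 10-K filings."""
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    recent = requests.get(url, headers=HEADERS, timeout=30).json()["filings"]["recent"]
    filings = []
    for form, acc_no, doc, date in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"], recent["filingDate"]
    ):
        if form == "10-K":
            doc_url = (
                f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/"
                f"{acc_no.replace('-', '')}/{doc}"
            )
            filings.append((date, doc_url))
    return filings

# Example: Apple Inc. has CIK 320193.
print(list_10k_filings(320193))
```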

Our subject matter expert, with extensive experience in finance and financial analysis, has manually generated gold-standard summaries for these 48 reports. These summaries serve as labels that will be used to evaluate the performance of our model, ensuring its ability to generate accurate and concise summaries for 10-K reports in the future.

The primary focus of our dataset is Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A). This section is considered crucial for financial analysts, as it contains insights and explanations from the company's executives regarding their performance during the fiscal year. The MD&A section offers an in-depth analysis of the company's financial condition, results of operations, risk factors, and future outlook.

By concentrating on Item 7, we aim to develop a model that can effectively extract and summarize the most relevant information for financial analysts and other users interested in understanding a company's financial performance. With our dataset, we hope to advance the state of natural language processing and machine learning in the finance domain, facilitating more efficient and accurate analysis of financial reports.

To ensure the accuracy and reliability of our summaries, we leveraged our team's subject matter expertise to manually generate these gold-standard labels, giving us confidence that the reference summaries are both comprehensive and accurate.

Exploratory Data Analysis

In this section, we present the results of our exploratory data analysis, which provides insights into the characteristics of our dataset. The analysis focuses on the length and distribution of the input text, as well as the most frequently occurring words in Item 7.

As evident from the distribution charts and the summary table on the left-hand side, the input text in our dataset is quite long. On average, each report's Item 7 comprises approximately 10,000 word tokens and 260 sentences. While there is some variation, the number of word tokens and sentences per report is relatively normally distributed around these means, demonstrating a consistent structure and content length across the sampled reports.

The right-hand side chart illustrates the most frequent words found in Item 7. As expected, this section contains key terms associated with a company's financial performance, such as "revenue," "cash," "costs," "expense," and "expenses." The high frequency of these words confirms that our dataset effectively captures the essential elements of a company's financial analysis.

By understanding these dataset characteristics, we can better tailor our model development and evaluation processes to ensure that the resulting summaries accurately reflect the most important aspects of each report. This EDA also helps to inform users of the inherent features and biases present within our dataset, contributing to a more transparent and robust analysis.
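
A minimal sketch of the statistics summarized above, assuming item7_texts is a list holding each report's extracted Item 7 text; it reports average token and sentence counts and the most frequent non-stopword terms.

```python
# Sketch of the EDA: word-token counts, sentence counts, and most frequent words
# across each report's Item 7 text (item7_texts is assumed to be a list of strings).
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def describe_item7(item7_texts: list[str]) -> dict:
    stop = set(stopwords.words("english"))
    word_counts, sent_counts, vocab = [], [], Counter()
    for text in item7_texts:
        words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
        word_counts.append(len(words))
        sent_counts.append(len(sent_tokenize(text)))
        vocab.update(w for w in words if w not in stop)
    return {
        "avg_word_tokens": sum(word_counts) / len(word_counts),   # ~10,000 in our sample
        "avg_sentences": sum(sent_counts) / len(sent_counts),     # ~260 in our sample
        "top_words": vocab.most_common(20),                       # "revenue", "cash", ...
    }
```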

Data Processing and Modeling Pipeline

Our project follows a systematic data processing and modeling pipeline to ensure accurate and efficient analysis of 10-K reports. The pipeline consists of the following steps:

  1. Data Collection: We started with 48 10-K reports in HTML format as our input data, ensuring a diverse and representative sample.
  2. HTML Parsing: We implemented two versions of data pre-processing. In the first version, we removed all tables, as they lose their format and are difficult to summarize meaningfully. In the second version, in addition to removing tables, we filtered out non-important sections based on their headings and re-ordered and grouped the sections by themes such as revenue, debt, and liquidity.
  3. Model Comparison: We selected the BART model architecture for our project and fine-tuned it using the 48 gold-standard summary labels we generated. We also added a keyword attention layer to enhance the model's performance, which we discuss in detail later. As a comparison, we chose GPT-3.5 Turbo, a popular model that is widely available through API calls.
  4. Generated Summaries: With our fine-tuned BART model and the GPT-3.5 Turbo model, we generated summaries for the 10-K reports in our dataset.
  5. Evaluation Metrics: To evaluate the quality of the generated summaries, we applied both machine learning evaluation metrics – ROUGE scores and BERTScore – and human evaluation (a short scoring sketch follows the pipeline description below).

By following this comprehensive pipeline, we ensure that our model effectively processes the input data, generates high-quality summaries, and is rigorously compared to existing state-of-the-art models. The combination of machine learning metrics and human evaluation allows us to measure the performance of our model objectively and accurately, providing valuable insights for further improvements and applications.
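
To make step 5 concrete, here is a hedged sketch of scoring one generated summary against its gold-standard label with the rouge-score and bert-score packages; the exact settings used in our experiments may differ.

```python
# Sketch of the evaluation step: ROUGE-1/2/L F1 and BERTScore F1 for one summary pair.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def evaluate_summary(generated: str, reference: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {name: s.fmeasure for name, s in scorer.score(reference, generated).items()}
    # bert_score expects lists of candidates and references; we keep the F1 component.
    _, _, f1 = bert_score([generated], [reference], lang="en")
    return {**rouge, "bertscore_f1": f1.mean().item()}
```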

Modeling Approach

In this project, we have adopted a comprehensive approach to model development and experimentation to ensure the most effective summarization of financial documents.

Baseline BART Model

We started with the BART (Bidirectional and Auto-Regressive Transformers) architecture, specifically the bart-large-cnn model, a large model fine-tuned on the CNN/Daily Mail dataset. BART is a sequence-to-sequence model that combines a bidirectional encoder with an autoregressive decoder to generate abstractive summaries.
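
A minimal usage sketch of this baseline with the Hugging Face transformers library; the excerpt string is a placeholder, and in practice Item 7 must be chunked or truncated to fit BART's roughly 1,024-token input limit.

```python
# Baseline: abstractive summarization with facebook/bart-large-cnn via transformers.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Placeholder excerpt; real Item 7 text must be truncated or chunked to ~1,024 tokens.
item7_excerpt = "Revenue increased 12% year over year, driven by growth in our core segment ..."
result = summarizer(item7_excerpt, max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])
```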

Fine-tuning Baseline Model

To improve the performance of the baseline BART model, we fine-tuned it by incorporating a Keywords_attention layer. This layer allows the model to focus on crucial financial keywords, resulting in more relevant and accurate summarization.

In the forward-pass stage of the fine-tuning process, the model's hidden states are first transformed into a keywords space. A tanh activation function is then applied to the transformed hidden states, producing keyword scores. These scores represent the relevance of each keyword in the context of the input text.

During the training phase, the keyword scores are multiplied by a weight matrix (W) and passed through a softmax function to generate attention weights for each keyword. These attention weights indicate the importance of each keyword in the summary generation process. The attention weights are then used to produce the attention_output, which contributes to the final summary.
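
The sketch below illustrates this forward pass as a standalone PyTorch module; the dimensions and the way it is wired into BART's decoder are assumptions made for illustration rather than our exact implementation.

```python
# Illustrative Keywords_attention layer, following the forward pass described above.
import torch
import torch.nn as nn

class KeywordsAttention(nn.Module):
    def __init__(self, hidden_dim: int, keyword_dim: int):
        super().__init__()
        self.to_keyword_space = nn.Linear(hidden_dim, keyword_dim)  # project hidden states
        self.W = nn.Linear(keyword_dim, 1, bias=False)              # trainable weight matrix W

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the BART encoder/decoder
        keyword_scores = torch.tanh(self.to_keyword_space(hidden_states))  # keyword relevance
        attention_weights = torch.softmax(self.W(keyword_scores), dim=1)   # (batch, seq_len, 1)
        attention_output = (attention_weights * hidden_states).sum(dim=1)  # weighted context vector
        return attention_output

# Example shapes: a batch of 2 sequences of length 16 with hidden size 1024 (BART-large).
out = KeywordsAttention(hidden_dim=1024, keyword_dim=64)(torch.randn(2, 16, 1024))
print(out.shape)  # torch.Size([2, 1024])
```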

By fine-tuning the baseline BART model with a Keywords_attention layer, we aim to develop a more focused and effective summarization model that can provide financial analysts and decision-makers with concise and actionable insights.

Experiment

Data Pre-processing

In the data pre-processing stage of our experiment, we undertook several steps to ensure the extracted information from 10-K reports was well-structured and relevant for summarization:

  • Remove tables: We removed all tables from the text to focus solely on the textual content.
  • Filter text, re-order, and group by big themes: The text data was filtered, re-ordered, and grouped based on the following major themes to improve the model's understanding of the financial context:
    • Business Overview
    • Results of Operations
    • Revenues
    • Gross Profit Margin
    • Interest expense
    • Operating Expenses
    • Operating Income
    • Liquidity
    • Debt
    • Not important (removed)

These pre-processing steps allowed us to provide a cleaner and more organized dataset for the model, ensuring more accurate and relevant summaries.
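
The sketch below illustrates the two pre-processing steps above with BeautifulSoup; the heading keywords used to map sections to themes are assumptions made for this example, not our full rule set.

```python
# Sketch of the pre-processing: drop <table> elements, then keep only sections whose
# headings match one of the big themes. THEMES keywords are illustrative assumptions.
from bs4 import BeautifulSoup

THEMES = {
    "Business Overview": ["overview", "our business"],
    "Results of Operations": ["results of operations"],
    "Revenues": ["revenue", "net sales"],
    "Gross Profit Margin": ["gross profit", "gross margin"],
    "Interest Expense": ["interest expense"],
    "Operating Expenses": ["operating expenses"],
    "Operating Income": ["operating income"],
    "Liquidity": ["liquidity", "capital resources"],
    "Debt": ["debt", "borrowings"],
}

def preprocess_item7(html: str) -> dict[str, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):   # tables lose their layout when flattened to text
        table.decompose()
    grouped: dict[str, list[str]] = {theme: [] for theme in THEMES}
    for heading in soup.find_all(["h1", "h2", "h3", "h4", "b", "strong"]):
        title = heading.get_text(" ", strip=True).lower()
        for theme, keywords in THEMES.items():
            if any(k in title for k in keywords):
                body = heading.find_next("p")  # simplified: take the paragraph after the heading
                grouped[theme].append(body.get_text(" ", strip=True) if body else "")
    return grouped  # sections matching no theme are treated as "not important" and dropped
```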

Model Comparison

In our project, we compared the performance of multiple summarization models to assess how our model compares to the GPT-3.5 model and how the final model improved over its baseline. The models under comparison included:

  1. BART Baseline Model: The original BART model without any fine-tuning, providing a baseline for comparison.
  2. BART Model trained with 50 labels using K-fold cross-validation: The fine-tuned BART model, which incorporated the Keywords_attention layer and was trained on 50 gold-standard labels using K-fold cross-validation to ensure robustness (a minimal sketch of this setup follows the list).
  3. GPT-3.5-turbo Model: The OpenAI GPT-3.5-turbo model, a powerful general-purpose language model accessed through API calls, used as an external point of comparison.
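
As referenced in item 2, the sketch below shows the K-fold setup on the small labeled set; fine_tune() and evaluate() are hypothetical stand-ins for the full BART training and ROUGE/BERTScore evaluation code, and the data pairs are placeholders.

```python
# K-fold cross-validation over the small labeled set. fine_tune() and evaluate() are
# hypothetical stubs standing in for BART fine-tuning and ROUGE/BERTScore evaluation.
from statistics import mean
from sklearn.model_selection import KFold

labeled_pairs = [(f"item 7 text {i}", f"gold summary {i}") for i in range(50)]  # placeholders

def fine_tune(pairs):
    """Hypothetical: fine-tune BART + keyword attention on the training pairs."""
    return None

def evaluate(model, pairs):
    """Hypothetical: score the model on held-out pairs; returns a single number."""
    return 0.0

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(labeled_pairs):
    train = [labeled_pairs[i] for i in train_idx]
    val = [labeled_pairs[i] for i in val_idx]
    model = fine_tune(train)
    fold_scores.append(evaluate(model, val))

print("Mean cross-validated score:", mean(fold_scores))
```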

Results and Evaluation

In this project, we incorporated an SME (Subject Matter Expert) evaluation as part of the result assessment process. SMEs were provided with a Google Form to evaluate the summarizations generated by different NLP models for 10-K financial reports. They were given access to 10 original 10-K reports and, for each report, three corresponding summarizations produced by distinct models.

The SMEs were asked to rate the quality of each summarized report on a scale of 1-10, with 6 being the minimum acceptable quality for a report written by a human. The evaluation criteria included completeness, accuracy, clarity, and conciseness of the summarizations. The SMEs were advised not to consider formatting issues in their ratings; their primary focus was on how effectively each summarization captured the key financial information from the original report.

The SME evaluation played a crucial role in understanding the real-world applicability of the NLP models and their effectiveness in summarizing financial information. This feedback from industry experts helped to identify areas for improvement and refine the NLP models, ultimately providing better financial report summarizations for decision-makers and analysts.

Model | Data Preprocessing | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | SME review
BART Baseline | Tables removed | 0.30 | 0.15 | 0.30 | 0.80 | N/A
BART Baseline | Tables removed; filtered and grouped by theme | 0.41 | 0.22 | 0.39 | 0.84 | 3.83
BART trained with 50 labels | Tables removed | 0.47 | 0.29 | 0.46 | 0.84 | N/A
BART trained with 50 labels | Tables removed; filtered and grouped by theme | 0.51 | 0.32 | 0.49 | 0.85 | 5.07
Comparison: gpt-3.5-turbo | Tables removed | 0.41 | 0.20 | 0.36 | 0.85 | N/A
Comparison: gpt-3.5-turbo | Tables removed; filtered and grouped by theme | 0.49 | 0.28 | 0.47 | 0.86 | 4.10

This table shows that the BART model trained with ~50 labels and using the data preprocessing method that removes tables, filters, and groups sections by theme achieves the highest ROUGE and BERT scores. It also receives the highest SME review score, indicating the effectiveness of our approach in generating summaries that meet the expectations of financial analysts. In comparison, the GPT-3.5 Turbo model performs competitively but falls slightly short of the performance achieved by our fine-tuned BART model, particularly in the ROUGE-2 score and SME review.

Overall, our study demonstrates the potential of utilizing advanced NLP models, such as BART, for the summarization of financial reports. By combining effective data preprocessing techniques and fine-tuning strategies, we can develop models that offer valuable assistance to financial analysts and other stakeholders in understanding and interpreting complex financial documents more efficiently.

The results indicate that proper data preprocessing, such as table removal, filtering, and theme grouping, significantly improves the performance of both BART and GPT-3.5-turbo models. Furthermore, the BART model trained with 50 labels demonstrates the importance of fine-tuning with domain-specific data.

Conclusion and Future Directions

In conclusion, our BART model, trained with 50 labels, has demonstrated its effectiveness in generating summaries of financial reports. It not only outperformed gpt-3.5-turbo but also showed significant improvement over the baseline model. This was achieved by adding a keyword attention layer and training with 50 labels using K-fold cross-validation. However, we must acknowledge that despite these improvements, there is still room for further enhancement to reach the level of human performance, which we have assumed to correspond to a score of 6. Our model's current SME evaluation score is 5.1, so there is still work to be done.

As for future directions, with more budget and resources we would like to train our model with thousands more labels or explore unsupervised learning using large language models. This would help improve the model's performance, bringing it closer to that of human-generated summaries. We also want to expand our scope beyond Item 7 of the 10-K report and include data from other items for a more comprehensive analysis, allowing us to generate even more valuable insights for users. Furthermore, it would be beneficial to incorporate table data and convey numerical insights within the summaries, providing a more in-depth understanding of the financial information at hand. Lastly, we plan to deploy the model on the cloud, allowing users to generate real-time outputs for any publicly traded company through a user-friendly web interface (a minimal service sketch follows). This would greatly enhance the accessibility of our solution, making it a valuable tool for a wide range of users in the finance industry.
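
As a sketch of that planned deployment, the minimal FastAPI service below accepts Item 7 text and returns a summary; the endpoint name and the summarize_10k() helper are assumptions for illustration, not the implementation behind the demo site.

```python
# Minimal FastAPI sketch of the planned web service. summarize_10k() is a placeholder
# standing in for the fine-tuned BART + keyword-attention summarizer.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="10-K Summarizer (sketch)")

class SummaryRequest(BaseModel):
    item7_text: str

def summarize_10k(text: str) -> str:
    """Placeholder: return a truncated echo instead of a real model-generated summary."""
    return text[:200] + "..."

@app.post("/summarize")
def summarize(req: SummaryRequest) -> dict:
    return {"summary": summarize_10k(req.item7_text)}

# Run locally with: uvicorn app:app --reload
```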

Demo Website

https://10-k-financial.com/

Team Members

Dmitry Baron

Haibi Lu

Viola Pu