Vision Statement

Financial professionals face an overwhelming amount of financial literature that must be processed and analyzed to arrive at an investment thesis and recommendation: which security to choose, at what price point to buy, what the exit strategy is, and where in the capital structure to invest. This requires reading through hundreds of pages of text, such as macroeconomic research papers, news, annual and quarterly reports, and earnings transcripts. Once a security is purchased, multi-page performance and risk reports must be analyzed manually. To alleviate this burden, there is growing interest in applying NLP algorithms to generate concise summaries that focus on decision-relevant information while leaving unnecessary verbiage aside. This can be achieved by finding labeled data, generating it through subject matter experts (SMEs), or using data extraction methods combined with pre-set templates to generate financial summaries.

Our vision is to create a comprehensive financial reporting solution that caters to the needs of institutional investors, while also making algorithm-generated reports accessible to retail investors. We recognize the vast market potential for such a solution, and our goal is to level the playing field by providing timely and impactful financial information to all. In the US alone, there are an estimated 10,000 institutional investors, and with around 6,000 publicly listed companies, there is a demand for financial statement summaries that runs into the millions per year. With additional reports such as quarterly statements, the potential market size is even larger. Moreover, the retail market is also significant, with over 100 million user names at the six largest brokerages, and assuming an average of 20 companies per portfolio, the demand for financial reports is immense.
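
As a rough illustration of the market-size arithmetic above, the back-of-envelope sketch below multiplies the cited counts; the per-investor coverage figure is an assumption introduced only for this example.

```python
# Back-of-envelope sizing of the demand figures cited above. The coverage and
# usage numbers marked "assumption" are illustrative, not sourced.
institutional_investors = 10_000        # estimated US institutional investors
companies_reviewed_per_year = 200       # assumption: filings reviewed per investor per year
institutional_demand = institutional_investors * companies_reviewed_per_year
print(f"Institutional summaries needed per year: ~{institutional_demand:,}")   # ~2,000,000

retail_accounts = 100_000_000           # user names at the six largest brokerages
holdings_per_portfolio = 20             # assumed average companies per portfolio
retail_touchpoints = retail_accounts * holdings_per_portfolio
print(f"Retail company-report touchpoints: ~{retail_touchpoints:,}")           # 2,000,000,000
```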

Data


Our team has acquired a comprehensive dataset consisting of 191 10-K reports from the SEC for the 2021 fiscal year. The dataset includes both the extracted JSON format and the raw HTML format of each report. This dataset is a valuable resource for financial professionals who want to analyze and compare the financial performance of different companies over the past year.

Our dataset consists of 10-K reports, which are annual filings required by the U.S. Securities and Exchange Commission (SEC). These reports provide a comprehensive overview of a company's financial performance throughout the year. To create a high-quality and diverse dataset, we randomly selected 48 10-K reports from fiscal year 2021, representing companies across various industries and market capitalizations.
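
For readers who want to reproduce the data collection step, the sketch below shows one way to list a company's 10-K filings from SEC EDGAR's public submissions endpoint; it is a simplified illustration, not our exact acquisition code, and the User-Agent contact string is a placeholder.

```python
# Sketch: list recent 10-K filings for one company via SEC EDGAR's submissions endpoint.
# SEC requests a descriptive User-Agent; replace the placeholder contact before use.
import requests

HEADERS = {"User-Agent": "10-K summarization research project (contact@example.com)"}

def list_10k_filings(cik: int) -> list[tuple[str, str]]:
    """Return (filing_date, document_url) pairs for a company's recent 10-K filings."""
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    recent = requests.get(url, headers=HEADERS, timeout=30).json()["filings"]["recent"]
    filings = []
    for form, acc_no, doc, date in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"], recent["filingDate"]
    ):
        if form == "10-K":
            doc_url = (
                f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/"
                f"{acc_no.replace('-', '')}/{doc}"
            )
            filings.append((date, doc_url))
    return filings

# Example: Apple Inc. has CIK 320193.
print(list_10k_filings(320193))
```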

Our subject matter expert, with extensive experience in finance and financial analysis, has manually generated gold-standard summaries for these 48 reports. These summaries serve as labels that will be used to evaluate the performance of our model, ensuring its ability to generate accurate and concise summaries for 10-K reports in the future.

The primary focus of our dataset is Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations (MD&A). This section is considered crucial for financial analysts, as it contains insights and explanations from the company's executives regarding their performance during the fiscal year. The MD&A section offers an in-depth analysis of the company's financial condition, results of operations, risk factors, and future outlook.

By concentrating on Item 7, we aim to develop a model that can effectively extract and summarize the most relevant information for financial analysts and other users interested in understanding a company's financial performance. With our dataset, we hope to advance the state of natural language processing and machine learning in the finance domain, facilitating more efficient and accurate analysis of financial reports.

To ensure the accuracy and reliability of our summaries, we leveraged our team's subject matter expertise to manually generate these gold-standard labels, giving us confidence that the reference summaries are both comprehensive and accurate.

Exploratory Data Analysis

In this section, we present the results of our exploratory data analysis, which provides insights into the characteristics of our dataset. The analysis focuses on the length and distribution of the input text, as well as the most frequently occurring words in Item 7.

As evident from the distribution charts and the summary table on the left-hand side, the input text in our dataset is quite long. On average, each report's Item 7 comprises approximately 10,000 word tokens and 260 sentences. While there is some variation, the number of word tokens and sentences per report is relatively normally distributed around these means, demonstrating a consistent structure and content length across the sampled reports.

The right-hand side chart illustrates the most frequent words found in Item 7. As expected, this section contains key terms associated with a company's financial performance, such as "revenue," "cash," "costs," "expense," and "expenses." The high frequency of these words confirms that our dataset effectively captures the essential elements of a company's financial analysis.

By understanding these dataset characteristics, we can better tailor our model development and evaluation processes to ensure that the resulting summaries accurately reflect the most important aspects of each report. This EDA also helps to inform users of the inherent features and biases present within our dataset, contributing to a more transparent and robust analysis.
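
A minimal sketch of the statistics summarized above, assuming item7_texts is a list holding each report's extracted Item 7 text; it reports average token and sentence counts and the most frequent non-stopword terms.

```python
# Sketch of the EDA: word-token counts, sentence counts, and most frequent words
# across each report's Item 7 text (item7_texts is assumed to be a list of strings).
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def describe_item7(item7_texts: list[str]) -> dict:
    stop = set(stopwords.words("english"))
    word_counts, sent_counts, vocab = [], [], Counter()
    for text in item7_texts:
        words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
        word_counts.append(len(words))
        sent_counts.append(len(sent_tokenize(text)))
        vocab.update(w for w in words if w not in stop)
    return {
        "avg_word_tokens": sum(word_counts) / len(word_counts),   # ~10,000 in our sample
        "avg_sentences": sum(sent_counts) / len(sent_counts),     # ~260 in our sample
        "top_words": vocab.most_common(20),                       # "revenue", "cash", ...
    }
```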

Data Processing and Modeling Pipeline

Our project follows a systematic data processing and modeling pipeline to ensure accurate and efficient analysis of 10-K reports. The pipeline consists of the following steps:

  1. Data Collection: We started with 48 10-K reports in HTML format as our input data, ensuring a diverse and representative sample.
  2. HTML Parsing: We implemented two versions of data pre-processing. In the first version, we removed all tables, as they lose their format and are difficult to summarize meaningfully. In the second version, in addition to removing tables, we filtered out non-important sections based on their headings and re-ordered and grouped the sections by themes such as revenue, debt, and liquidity.
  3. Model Comparison: We selected the BART model architecture for our project and fine-tuned it using the 48 gold-standard summary labels we generated. We also added a keyword attention layer to enhance the model's performance, which we discuss in detail later. As a comparison, we chose GPT-3.5 Turbo, a popular model that is widely available through API calls.
  4. Generated Summaries: With our fine-tuned BART model and the GPT-3.5 Turbo model, we generated summaries for the 10-K reports in our dataset.
  5. Evaluation Metrics: To evaluate the quality of the generated summaries, we applied both machine learning evaluation metrics – ROUGE scores and BERTScore – and human evaluation (a short scoring sketch follows the pipeline description below).

By following this comprehensive pipeline, we ensure that our model effectively processes the input data, generates high-quality summaries, and is rigorously compared to existing state-of-the-art models. The combination of machine learning metrics and human evaluation allows us to measure the performance of our model objectively and accurately, providing valuable insights for further improvements and applications.
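
To make step 5 concrete, here is a hedged sketch of scoring one generated summary against its gold-standard label with the rouge-score and bert-score packages; the exact settings used in our experiments may differ.

```python
# Sketch of the evaluation step: ROUGE-1/2/L F1 and BERTScore F1 for one summary pair.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

def evaluate_summary(generated: str, reference: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {name: s.fmeasure for name, s in scorer.score(reference, generated).items()}
    # bert_score expects lists of candidates and references; we keep the F1 component.
    _, _, f1 = bert_score([generated], [reference], lang="en")
    return {**rouge, "bertscore_f1": f1.mean().item()}
```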

Modeling Approach

In this project, we have adopted a comprehensive approach to model development and experimentation to ensure the most effective summarization of financial documents.

Baseline BART Model

We started with the BART (Bidirectional and Auto-Regressive Transformers) architecture, specifically the bart-large-cnn model, a large model fine-tuned on the CNN/Daily Mail dataset. BART is a sequence-to-sequence model that combines a bidirectional encoder with an autoregressive decoder to generate abstractive summaries.
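
A minimal usage sketch of this baseline with the Hugging Face transformers library; the excerpt string is a placeholder, and in practice Item 7 must be chunked or truncated to fit BART's roughly 1,024-token input limit.

```python
# Baseline: abstractive summarization with facebook/bart-large-cnn via transformers.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Placeholder excerpt; real Item 7 text must be truncated or chunked to ~1,024 tokens.
item7_excerpt = "Revenue increased 12% year over year, driven by growth in our core segment ..."
result = summarizer(item7_excerpt, max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])
```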

Fine-tuning Baseline Model

To improve the performance of the baseline BART model, we fine-tuned it by incorporating a Keywords_attention layer. This layer allows the model to focus on crucial financial keywords, resulting in more relevant and accurate summarization.

In the forward-pass stage of the fine-tuning process, the model's hidden states are first transformed into a keywords space. A tanh activation function is then applied to the transformed hidden states, producing keyword scores. These scores represent the relevance of each keyword in the context of the input text.

During the training phase, the keyword scores are multiplied by a weight matrix (W) and passed through a softmax function to generate attention weights for each keyword. These attention weights indicate the importance of each keyword in the summary generation process. The attention weights are then used to produce the attention_output, which contributes to the final summary.
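
The sketch below illustrates this forward pass as a standalone PyTorch module; the dimensions and the way it is wired into BART's decoder are assumptions made for illustration rather than our exact implementation.

```python
# Illustrative Keywords_attention layer, following the forward pass described above.
import torch
import torch.nn as nn

class KeywordsAttention(nn.Module):
    def __init__(self, hidden_dim: int, keyword_dim: int):
        super().__init__()
        self.to_keyword_space = nn.Linear(hidden_dim, keyword_dim)  # project hidden states
        self.W = nn.Linear(keyword_dim, 1, bias=False)              # trainable weight matrix W

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the BART encoder/decoder
        keyword_scores = torch.tanh(self.to_keyword_space(hidden_states))  # keyword relevance
        attention_weights = torch.softmax(self.W(keyword_scores), dim=1)   # (batch, seq_len, 1)
        attention_output = (attention_weights * hidden_states).sum(dim=1)  # weighted context vector
        return attention_output

# Example shapes: a batch of 2 sequences of length 16 with hidden size 1024 (BART-large).
out = KeywordsAttention(hidden_dim=1024, keyword_dim=64)(torch.randn(2, 16, 1024))
print(out.shape)  # torch.Size([2, 1024])
```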

By fine-tuning the baseline BART model with a Keywords_attention layer, we aim to develop a more focused and effective summarization model that can provide financial analysts and decision-makers with concise and actionable insights.

Experiment

Data Pre-processing

In the data pre-processing stage of our experiment, we undertook several steps to ensure the extracted information from 10-K reports was well-structured and relevant for summarization:

  • Remove tables: We removed all tables from the text to focus solely on the textual content.
  • Filter text, re-order, and group by big themes: The text data was filtered, re-ordered, and grouped based on the following major themes to improve the model's understanding of the financial context:
    • Business Overview
    • Results of Operations
    • Revenues
    • Gross Profit Margin
    • Interest expense
    • Operating Expenses
    • Operating Income
    • Liquidity
    • Debt
    • Not important (removed)

These pre-processing steps allowed us to provide a cleaner and more organized dataset for the model, ensuring more accurate and relevant summaries.
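
The sketch below illustrates the two pre-processing steps above with BeautifulSoup; the heading keywords used to map sections to themes are assumptions made for this example, not our full rule set.

```python
# Sketch of the pre-processing: drop <table> elements, then keep only sections whose
# headings match one of the big themes. THEMES keywords are illustrative assumptions.
from bs4 import BeautifulSoup

THEMES = {
    "Business Overview": ["overview", "our business"],
    "Results of Operations": ["results of operations"],
    "Revenues": ["revenue", "net sales"],
    "Gross Profit Margin": ["gross profit", "gross margin"],
    "Interest Expense": ["interest expense"],
    "Operating Expenses": ["operating expenses"],
    "Operating Income": ["operating income"],
    "Liquidity": ["liquidity", "capital resources"],
    "Debt": ["debt", "borrowings"],
}

def preprocess_item7(html: str) -> dict[str, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):   # tables lose their layout when flattened to text
        table.decompose()
    grouped: dict[str, list[str]] = {theme: [] for theme in THEMES}
    for heading in soup.find_all(["h1", "h2", "h3", "h4", "b", "strong"]):
        title = heading.get_text(" ", strip=True).lower()
        for theme, keywords in THEMES.items():
            if any(k in title for k in keywords):
                body = heading.find_next("p")  # simplified: take the paragraph after the heading
                grouped[theme].append(body.get_text(" ", strip=True) if body else "")
    return grouped  # sections matching no theme are treated as "not important" and dropped
```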

Model Comparison

In our project, we compared the performance of multiple summarization models to assess how our model compares to the GPT-3.5 model and how the final model improved over its baseline. The models under comparison included:

  1. BART Baseline Model: The original BART model without any fine-tuning, providing a baseline for comparison.
  2. BART Model trained with 50 labels using K-fold cross-validation: The fine-tuned BART model, which incorporated the Keywords_attention layer and was trained on 50 gold-standard labels using K-fold cross-validation to ensure robustness (a minimal sketch of this setup follows the list).
  3. GPT-3.5-turbo Model: The OpenAI GPT-3.5-turbo model, a powerful general-purpose language model accessed through API calls, used as an external point of comparison.
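
As referenced in item 2, the sketch below shows the K-fold setup on the small labeled set; fine_tune() and evaluate() are hypothetical stand-ins for the full BART training and ROUGE/BERTScore evaluation code, and the data pairs are placeholders.

```python
# K-fold cross-validation over the small labeled set. fine_tune() and evaluate() are
# hypothetical stubs standing in for BART fine-tuning and ROUGE/BERTScore evaluation.
from statistics import mean
from sklearn.model_selection import KFold

labeled_pairs = [(f"item 7 text {i}", f"gold summary {i}") for i in range(50)]  # placeholders

def fine_tune(pairs):
    """Hypothetical: fine-tune BART + keyword attention on the training pairs."""
    return None

def evaluate(model, pairs):
    """Hypothetical: score the model on held-out pairs; returns a single number."""
    return 0.0

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(labeled_pairs):
    train = [labeled_pairs[i] for i in train_idx]
    val = [labeled_pairs[i] for i in val_idx]
    model = fine_tune(train)
    fold_scores.append(evaluate(model, val))

print("Mean cross-validated score:", mean(fold_scores))
```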

Results and Evaluation

In this project, we incorporated an SME (Subject Matter Expert) evaluation as part of the result assessment process. SMEs were provided with a Google Form to evaluate the summarizations generated by different NLP models for 10-K financial reports. They were given access to 10 original 10-K reports and, for each report, three corresponding summarizations produced by distinct models.

The SMEs were asked to rate the quality of each summarized report on a scale of 1-10, with 6 being the minimum acceptable quality for a report written by a human. The evaluation criteria included completeness, accuracy, clarity, and conciseness of the summarizations. The SMEs were advised not to consider formatting issues in their ratings; their primary focus was on how effectively each summarization captured the key financial information from the original report.

The SME evaluation played a crucial role in understanding the real-world applicability of the NLP models and their effectiveness in summarizing financial information. This feedback from industry experts helped to identify areas for improvement and refine the NLP models, ultimately providing better financial report summarizations for decision-makers and analysts.

Model | Data Preprocessing | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | SME review
BART Baseline | Tables removed | 0.30 | 0.15 | 0.30 | 0.80 | N/A
BART Baseline | Tables removed; filtered and grouped by theme | 0.41 | 0.22 | 0.39 | 0.84 | 3.83
BART trained with 50 labels | Tables removed | 0.47 | 0.29 | 0.46 | 0.84 | N/A
BART trained with 50 labels | Tables removed; filtered and grouped by theme | 0.51 | 0.32 | 0.49 | 0.85 | 5.07
Comparison: gpt-3.5-turbo | Tables removed | 0.41 | 0.20 | 0.36 | 0.85 | N/A
Comparison: gpt-3.5-turbo | Tables removed; filtered and grouped by theme | 0.49 | 0.28 | 0.47 | 0.86 | 4.10

This table shows that the BART model trained with ~50 labels and using the data preprocessing method that removes tables, filters, and groups sections by theme achieves the highest ROUGE and BERT scores. It also receives the highest SME review score, indicating the effectiveness of our approach in generating summaries that meet the expectations of financial analysts. In comparison, the GPT-3.5 Turbo model performs competitively but falls slightly short of the performance achieved by our fine-tuned BART model, particularly in the ROUGE-2 score and SME review.

Overall, our study demonstrates the potential of utilizing advanced NLP models, such as BART, for the summarization of financial reports. By combining effective data preprocessing techniques and fine-tuning strategies, we can develop models that offer valuable assistance to financial analysts and other stakeholders in understanding and interpreting complex financial documents more efficiently.

The results indicate that proper data preprocessing, such as table removal, filtering, and theme grouping, significantly improves the performance of both BART and GPT-3.5-turbo models. Furthermore, the BART model trained with 50 labels demonstrates the importance of fine-tuning with domain-specific data.

Conclusion and Future Directions

In conclusion, our BART model, trained with 50 labels, has demonstrated its effectiveness in generating summaries of financial reports. It not only outperformed gpt-3.5-turbo but also showed significant improvement over the baseline model. This was achieved by adding a keyword attention layer and training with 50 labels using K-fold cross-validation. However, we must acknowledge that despite these improvements, there is still room for further enhancement to reach the level of human performance, which we have assumed to correspond to a score of 6. Our model's current SME evaluation score is 5.1, so there is still work to be done.

As for future directions, with more budget and resources we would like to train our model with thousands more labels or explore unsupervised learning using large language models. This would help improve the model's performance, bringing it closer to that of human-generated summaries. We also want to expand our scope beyond Item 7 of the 10-K report and include data from other items for a more comprehensive analysis, allowing us to generate even more valuable insights for users. Furthermore, it would be beneficial to incorporate table data and convey numerical insights within the summaries, providing a more in-depth understanding of the financial information at hand. Lastly, we plan to deploy the model on the cloud, allowing users to generate real-time outputs for any publicly traded company through a user-friendly web interface (a minimal service sketch follows). This would greatly enhance the accessibility of our solution, making it a valuable tool for a wide range of users in the finance industry.
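
As a sketch of that planned deployment, the minimal FastAPI service below accepts Item 7 text and returns a summary; the endpoint name and the summarize_10k() helper are assumptions for illustration, not the implementation behind the demo site.

```python
# Minimal FastAPI sketch of the planned web service. summarize_10k() is a placeholder
# standing in for the fine-tuned BART + keyword-attention summarizer.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="10-K Summarizer (sketch)")

class SummaryRequest(BaseModel):
    item7_text: str

def summarize_10k(text: str) -> str:
    """Placeholder: return a truncated echo instead of a real model-generated summary."""
    return text[:200] + "..."

@app.post("/summarize")
def summarize(req: SummaryRequest) -> dict:
    return {"summary": summarize_10k(req.item7_text)}

# Run locally with: uvicorn app:app --reload
```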

Demo Website

https://10-k-financial.com/

Team Members

Dmitry Baron

Haibi Lu

Viola Pu