Joint Entity and Relation Extraction Model with Fine-tuned BERT Transformer using spaCy

Tirth Patel
Dec 3, 2021

In recent years, the knowledge graph has attained significant achievements in many fields and has become one of the core driving forces behind the development of the internet and artificial intelligence. However, there is no mature knowledge graph in the field of finance, so studying the construction of a financial knowledge graph is of great significance. Named entity recognition and relation extraction are key steps in constructing such a knowledge graph.

Introduction

One of the most useful applications of NLP technology is information extraction from unstructured texts (contracts, financial documents, healthcare records, etc.), enabling automated querying of the data to derive new insights. Traditionally, named entity recognition (NER) has been widely used to identify entities inside a text and store the data for advanced querying and filtering. However, NER alone is not enough to semantically understand unstructured text, since it does not tell us how the entities are related to each other. Performing NER and relation extraction jointly opens up a whole new way of information retrieval through knowledge graphs, where you can navigate across nodes to discover hidden relationships. We can add relation extraction to the pipeline using spaCy's new Thinc library, training the relation extraction model by following the steps outlined in spaCy's documentation. We will then compare the performance of the relation classifier built on transformers against classical ML algorithms.
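To give a rough picture of what the end of such a pipeline looks like, here is a minimal sketch that consumes predictions from a pipeline trained with the relation extraction component in spaCy's example project. The model path, the confidence threshold, and the layout of the custom doc._.rel extension are assumptions taken from that tutorial, not a built-in spaCy API.

```python
import spacy

# Hypothetical path to a pipeline trained with spaCy's relation
# extraction tutorial component (rel_component); loading it requires
# the tutorial's custom component code to be importable.
nlp = spacy.load("training/model-best")

doc = nlp("Microsoft acquired LinkedIn for $26.2 billion in 2016.")

# In the tutorial, predictions live in the custom doc._.rel extension:
# a dict mapping (head_token_offset, child_token_offset) pairs to
# {relation_label: score} dicts.
for (head, child), scores in doc._.rel.items():
    for label, score in scores.items():
        if score >= 0.5:  # arbitrary confidence threshold
            print(f"{doc[head]} -[{label}]-> {doc[child]} ({score:.2f})")
```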

Data Gathering and Annotation

For this tutorial, we will use Microsoft's annual report, which is available online. Supervised machine learning (ML) models need labeled data, but the majority of data collected in raw form lacks labels. So, the first step before building an ML model is to have the raw data labeled by domain experts. To do so, we surveyed several annotation tools and found Doccano to be an easy tool for collaborative text annotation. The latest version of Doccano supports annotation features for text classification, sequence labeling (named entity recognition, NER), and sequence-to-sequence (machine translation, text summarization) use cases.
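To make the handoff from annotation to training concrete, below is a minimal sketch that converts a Doccano-style JSONL export into spaCy's binary training format. The file name and the exact field names (text, entities, start_offset, end_offset, label) are assumptions about the export layout; check your Doccano version's output before reusing this.

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

# Assumed Doccano-style JSONL export: one JSON object per line.
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = []
        for ent in record["entities"]:
            # Snap character offsets to token boundaries where needed.
            span = doc.char_span(
                ent["start_offset"], ent["end_offset"],
                label=ent["label"], alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans
        db.add(doc)

# Binary corpus that spaCy's training config can consume directly.
db.to_disk("train.spacy")
```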

Why Transformers?

Transformers have recently achieved promising results on many natural language processing tasks; however, their behaviour on information extraction in business scenarios is not yet well understood. This article bridges that gap by investigating how transformers extract information from domain-specific financial documents. To do that, we take advantage of architectures pre-trained on a considerable amount of general data and fine-tune them to our downstream IE task using transfer learning.

Transformers provide an appropriate solution for data representation by using contextual embeddings learned from a large amount of data. However, they still need to be adapted to downstream tasks in specific domains. To do that, we fine-tuned the models on the downstream IE task using sample data from each dataset: the pre-trained weights of the transformers were first reused and then adjusted during fine-tuning.
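As a minimal illustration of that reuse-then-adjust loop, the sketch below runs one fine-tuning step on a pre-trained BERT encoder for relation classification with Hugging Face's transformers library. The model name, label count, learning rate, and toy batch are placeholder assumptions, not the exact setup used here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reuse pre-trained weights (bert-base-uncased is a placeholder choice)
# with a fresh classification head, e.g. one class per relation type.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch: one sentence and a placeholder gold relation label.
batch = tokenizer(["Microsoft acquired LinkedIn in 2016."],
                  return_tensors="pt")
labels = torch.tensor([1])

# One fine-tuning step: gradients flow back into the pre-trained
# weights, which the optimizer then adjusts.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```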

Transformers have truly transformed the domain of NLP and I am particularly excited about their application in information extraction. I would like to give a shoutout to Explosion AI (spaCy developers) and Hugging Face for providing open-source solutions that facilitate the adoption of transformers.

If you need data annotation for your project, don't hesitate to try out the Doccano annotation tool. It offers programmable labeling options (such as ML auto-annotation, regular expressions, and dictionaries) to minimize hand annotation.
