Legal Texts for Competition Impact Assessment
A labeled dataset of more than 4000 paragraphs mapped to different categories of potentially anticompetitive provisions
Legal Texts for CIA Dataset (9 MB)
4153 rows
Download here
About this dataset
One of the most useful applications of Natural Language Processing in the legal profession is text and document classification. The practice of law deals with large volumes of legal documents that need to be sorted and organized. Such administrative tasks can take a significant amount of time and manpower and can shift a lawyer’s focus away from tasks that need more human expertise and judgment.
In my master’s thesis Legal Text Classification for Competition Impact Assessment Studies (CIA), I trained machine and deep learning models to aid the initial document screening and review process that is part of every CIA. The goal of the models is to (1) identify laws that contain provisions that can potentially harm and restrict market competition, and (2) distinguish between multiple categories of these restrictions. These are essentially binary and multiclass classification problems.
Pre-labeled data is an essential element of classification tasks. Machine learning algorithms learn from examples that map an outcome variable to the text or input data. For this exercise I collected a set of legal texts (bills, acts, resolutions, decrees, executive orders) matched to categories of anticompetitive provisions. I cleaned and augmented the data to ensure a high-quality and balanced training set. The final dataset consists of 4,153 labeled paragraphs averaging 45 words in length.
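As a quick orientation, the snippet below loads the dataset and inspects the label distribution. The filename is an assumption (use whatever name the downloaded file has, assuming a CSV export), and I assume here that Category_New holds the final labels among the columns Law, Text, Category, Category_New, and text_clean.

```python
import pandas as pd

# Load the labeled dataset; the filename is an assumption,
# adjust it to the downloaded file.
df = pd.read_csv("legal_texts_cia.csv")

print(df.shape)                           # expected: (4153, 5)
print(df["Category_New"].value_counts())  # paragraphs per category
```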
Pre-processing of legal texts and labels
The OECD has a competition division that conducts CIAs in different industries and economies. The legal documents used in this study were gathered from official government databases of seven countries¹ where a round of CIA studies was recently conducted. The corpus includes 273 unique pieces of sector legislation. I collapsed these into unique paragraphs and applied the usual text pre-processing steps: removal of numeric characters and symbols, lowercasing, and lemmatization.
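A minimal sketch of those pre-processing steps, with NLTK's WordNet lemmatizer standing in for whichever lemmatizer the thesis actually used:

```python
import re
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet") once

lemmatizer = WordNetLemmatizer()

def preprocess(paragraph: str) -> str:
    # Strip numeric characters and symbols, keeping only letters and whitespace
    text = re.sub(r"[^A-Za-z\s]", " ", paragraph)
    # Lowercase, then lemmatize each token
    tokens = [lemmatizer.lemmatize(tok) for tok in text.lower().split()]
    return " ".join(tokens)

print(preprocess("Article 12: Licenses are valid for 5 years."))
# -> "article license are valid for year"
```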
The outcome labels are based on the OECD’s CIA checklist. There are four categories of potentially anticompetitive provisions.
Category | Description | Sub-categories
---|---|---
A | Limits the number or range of suppliers | Exclusivity rights, licenses and permits, cost of entry, geographical barriers
B | Limits the ability of suppliers to compete | Price regulation, advertising restrictions, unduly strict quality standards, cost of production
C | Reduces the incentive of suppliers to compete | Self- or co-regulatory regimes, publishing of sensitive business information, exemption from antitrust laws
D | Limits the choices and information available to customers | Buyer-seller information asymmetry, consumer switching costs
Paragraphs from the legal text corpus are assigned to one of the four categories. The initial assignments are based on manual annotations by the OECD, published in existing CIA reports. Most of these are hard-coded in PDF files and had to be parsed and re-encoded.
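The parsing step can be done with any PDF text-extraction library; here is a sketch using pdfplumber, with a hypothetical report path:

```python
import pdfplumber

# Pull raw text out of a published CIA report so the annotated
# paragraphs can be re-encoded into a tabular dataset
with pdfplumber.open("oecd_cia_report.pdf") as pdf:
    pages = [page.extract_text() or "" for page in pdf.pages]
raw_text = "\n".join(pages)
```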
There are other miscellaneous provisions (e.g. administrative burden and grandfather rules) that I combined with category D; I refer to this new group as Others. I also include a None category for paragraphs that do not fall under any of the above restrictions. The resulting preliminary dataset has 2,104 paragraphs, with the assigned labels skewed toward category A². To address this imbalance, I use a data augmentation technique called back translation.
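One way to express that relabeling, assuming the original OECD annotations sit in a Category column (the mapping below is illustrative, not the exact code from the thesis):

```python
# Collapse category D and miscellaneous provisions into "Others";
# unannotated paragraphs become "None"
relabel = {"A": "A", "B": "B", "C": "C", "D": "Others", "Misc": "Others"}
df["Category_New"] = df["Category"].map(relabel).fillna("None")
print(df["Category_New"].value_counts())  # reveals the skew toward category A
```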
Data augmentation and back translation of texts
Back translation involves translating text into a “host” language and then back into the original language (in this case English). The result is new sentences with variations in word choice and stylistic properties, but without significant alterations to their meaning. I use 16 host languages for the back translation of legal texts³.
The code below shows the functions used to implement back translation. The full source code and all notebooks used for this project can be found in this repository.
```python
from transformers import MarianMTModel, MarianTokenizer

# download language translators
def get_translator(lang, from_eng=True):
    prefix = "en-" if from_eng else ""
    suffix = "-en" if not from_eng else ""
    translator = f'Helsinki-NLP/opus-mt-{prefix}{lang}{suffix}'
    translator_tokenizer = MarianTokenizer.from_pretrained(translator)
    translator_model = MarianMTModel.from_pretrained(translator)
    print(f'{prefix}{lang}{suffix} translator downloaded.')
    return translator_model, translator_tokenizer

# translate text
def run_translation(batch_text, model, tokenizer, lang):
    formated_batch_texts = [">>{}<< {}".format(lang, text) for text in batch_text]
    translated = model.generate(**tokenizer(formated_batch_texts, return_tensors="pt", padding=True))
    translated_texts = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
    return translated_texts

# back translation
def back_translation(model, tokenizer, model_back, tokenizer_back, batch_texts, org_lang="en", temp_lang="fr"):
    temp_trans_batch = run_translation(batch_texts, model, tokenizer, temp_lang)
    back_trans_batch = run_translation(temp_trans_batch, model_back, tokenizer_back, org_lang)
    return back_trans_batch
```
Below are sample back translations of the text:
The public notice can include requirements on the brand or model.
Language | Translation | Back translation
---|---|---
French | L’avis public peut inclure des exigences relatives à la marque ou au modèle. | The public notice may include trademark or model requirements.
Dutch | De bekendmaking kan eisen van merk of model bevatten. | The publication may contain requirements of a trade mark or a model.
Italian | L’avviso pubblico può includere requisiti di marca o modello. | The public notice may include make or model requirements.
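The snippet below looks up models and tokenizers from dictionaries keyed by language. A sketch of how those dictionaries could be populated with get_translator (the dictionary names match the snippet):

```python
# Download a forward (en -> lang) and backward (lang -> en) translator
# for each sampled host language
models, tokenizers, models_back, tokenizers_back = {}, {}, {}, {}
for lang in ['fr', 'nl', 'it']:
    models[lang], tokenizers[lang] = get_translator(lang, from_eng=True)
    models_back[lang], tokenizers_back[lang] = get_translator(lang, from_eng=False)
```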
```python
sample_text = 'The public notice can include requirements on the brand or model'
sample_set = []
lang_sample = ['fr', 'nl', 'it']

for lang in lang_sample:
    # translators for each host language, loaded with get_translator
    model = models[lang]
    tokenizer = tokenizers[lang]
    model_back = models_back[lang]
    tokenizer_back = tokenizers_back[lang]
    temp_trans_batch = run_translation([sample_text], model, tokenizer, lang)
    back_trans_batch = run_translation(temp_trans_batch, model_back, tokenizer_back, "en")
    sample_set.append([lang, temp_trans_batch[0], back_trans_batch[0]])
```
After back translation, the final dataset contains 4,153 labeled paragraphs with a 21-31-23-25 percent split across the categories.
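That split can be checked directly from the label column (again assuming the final labels live in Category_New):

```python
# Percentage of paragraphs per category after augmentation
print((df["Category_New"].value_counts(normalize=True) * 100).round(0))
```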