EU Press Materials on Digital Policy
A clean corpus of over 4000 documents scraped from the EU Press Corner
EU Digital Speeches Dataset (39 MB)
4078 rows
Download here
About this dataset
Digital platforms such as Google, Amazon, and Facebook play a crucial role in today’s technology-driven era. The increasing presence and necessity of digital innovation in our daily lives have prompted governments worldwide to actively regulate platform activities in an effort to prevent and address perceived risks and harms.
The European Union is a global leader in this regulatory effort, enacting landmark legislation like the Digital Services Act (DSA) and the Digital Markets Act (DMA). Among other things, these regulations aim to safeguard consumer welfare, address issues concerning data privacy and misinformation, and promote healthy market competition. The European Commission communicates these initiatives to various stakeholders through comprehensive documentation available on the EU Press Corner. The archive comprises a wide variety of press materials like speeches, statements, news articles, and factsheets on the digital policy discourse.
The information contained in these documents can be used to explore and answer a range of policy questions. For example, topic modeling can reveal the focus of the EU’s digital strategy, sentiment analysis can show how digital policy issues are communicated, and named entity recognition can identify key players and stakeholders.
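As a rough illustration of the last of these, the sketch below runs named entity recognition over the text column with spaCy and counts the organisations mentioned. The column name matches the dataset; the file name and the choice of model are assumptions, not part of the original pipeline.

import pandas as pd
import spacy
from collections import Counter

# Assumed file name for the downloaded dataset; adjust to wherever the CSV is saved.
df = pd.read_csv("eu_digital_speeches.csv")

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Count organisations mentioned across a sample of speeches.
org_counts = Counter()
for text in df["text"].dropna().head(100):
    doc = nlp(text)
    org_counts.update(ent.text for ent in doc.ents if ent.label_ == "ORG")

print(org_counts.most_common(10))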
With the goal of using communication materials to analyze EU digital policy, I scraped the EU Press Corner website to construct a dataset of speeches that contain the keyword "digital".
The whole process is rather straightforward. Applying keyword and document type filters to the archive yields paginated results of all press materials that satisfy the given criteria. I parse the resulting URLs to obtain the main text and other metadata: document type, title, date published, and a link to the document. I then apply some quality checks (e.g. removing duplicates and empty articles) and save the result as CSV and JSON files, which can later be imported as a dataframe.
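The quality-check step looks roughly like the sketch below. It assumes the scraper's raw output is in raw.csv (the file the extraction function further down writes to) with a header row, and the output file names are placeholders.

import pandas as pd

# Read the scraper's raw output; assumes a header row was written when the file was created.
raw = pd.read_csv("raw.csv")

# Drop exact duplicates and rows with no main text.
clean = raw.drop_duplicates()
clean = clean.dropna(subset=["text"])
clean = clean[clean["text"].str.strip() != ""]

# Save as CSV and JSON for later use as a dataframe (placeholder file names).
clean.to_csv("eu_digital_speeches.csv", index=False)
clean.to_json("eu_digital_speeches.json", orient="records")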
A part of this process can be seen in the code below. The full source code for the scraper can be found in this repository.
import csv

import numpy as np
from tqdm import tqdm

# load_page() is defined in the full scraper source: it fetches a URL and
# returns the parsed page as a BeautifulSoup object.

def extract_text(links):
    '''Function for extracting information from each search result.'''
    print(f"Number of links: {len(links)}")
    with tqdm(total=len(links), desc="Extracting text") as pbar:
        with open('raw.csv', 'a', newline='', encoding='utf-8') as csvfile:
            fieldnames = ["document", "title", "date", "location", "text", "link"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            for link in links:
                soup = load_page(link)
                title_elem = doc_elems = paragraph_elem = None

                try:
                    # The Press Corner pages use the ECL component classes.
                    title_elem = soup.find(class_="ecl-heading ecl-heading--h1 ecl-u-color-white")
                    doc_elems = soup.find_all(class_="ecl-meta__item")
                    paragraph_elem = soup.find(class_="ecl-paragraph")
                except Exception as e:
                    print(f"Error on link {link}: {e}")

                title = title_elem.text if title_elem else np.nan
                paragraph = paragraph_elem.text if paragraph_elem else np.nan

                # The first three meta items hold the document type, date, and location.
                if doc_elems:
                    doc, date, loc = (elem.text if elem else np.nan for elem in doc_elems[:3])
                else:
                    doc, date, loc = np.nan, np.nan, np.nan

                writer.writerow({"document": doc,
                                 "title": title,
                                 "date": date,
                                 "location": loc,
                                 "text": paragraph,
                                 "link": str(link)})
                pbar.update(1)
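A call site might look something like the sketch below. collect_links is a hypothetical stand-in for the part of the scraper (in the repository) that walks the paginated search results; its name and arguments are assumptions.

# Hypothetical usage: collect_links() stands in for the pagination logic in the full scraper.
links = collect_links(keyword="digital", doc_type="speech")
extract_text(links)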