EU Press Materials on Digital Policy

A clean corpus of over 4,000 documents scraped from the EU Press Corner website

EU Digital Speeches Dataset (39 MB)
4078 rows
Download here

Dataset preview (columns: document, title, date, text, link)
About this dataset

Digital platforms such as Google, Amazon, and Facebook play a crucial role in today’s technology-driven era. The increasing presence and necessity of digital innovation in our daily lives has prompted governments worldwide to actively regulate platform activities in an effort to prevent and address perceived risks and harms.

The European Union is a global leader in this regulatory effort, enacting landmark legislation like the Digital Services Act (DSA) and the Digital Markets Act (DMA). Among other things, these regulations aim to safeguard consumer welfare, address issues concerning data privacy and misinformation, and promote healthy market competition. The European Commission communicates these initiatives to various stakeholders through comprehensive documentation available on the EU Press Corner. The archive comprises a wide variety of press materials like speeches, statements, news articles, and factsheets on the digital policy discourse.

The information contained in these documents can be used to explore and answer a range of policy questions: topic modeling can reveal the focus of the EU’s digital strategy, sentiment analysis can show how digital policy issues are communicated, and named entity recognition can identify key players and stakeholders.
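
To give a concrete starting point, the sketch below fits a simple LDA topic model to the speech texts with scikit-learn. It is an illustration rather than part of the scraper: the file name eu_digital_speeches.csv is a placeholder for the downloaded dataset, and the text column follows the preview above.

# Minimal topic-modeling sketch over the speech texts (illustrative only).
# "eu_digital_speeches.csv" is a placeholder name for the downloaded dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv("eu_digital_speeches.csv").dropna(subset=["text"])

# Build a document-term matrix, dropping very common and very rare terms.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
dtm = vectorizer.fit_transform(df["text"])

# Fit a 10-topic LDA model.
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Print the top words of each topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top_words)}")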

With the goal of using communication materials to analyze EU digital policy, I scraped the EU Press Corner website to construct a dataset of speeches that contain the keyword "digital".

The whole process is rather straightforward. Applying keyword and document type filters to the archive yields paginated results of all press materials that satisfy the given criteria. I parse each result URL to obtain the main text and other metadata: document type, title, date published, and a link to the document. I then apply some quality checks (e.g. removing duplicates and empty articles) and save the result as CSV and JSON files, which can later be imported as a dataframe.

Part of this process, the text extraction step, is shown in the code below. The full source code for the scraper can be found in this repository.

import csv

import numpy as np
from tqdm import tqdm


def extract_text(links):
    '''Extract the title, metadata, and main text from each search result.'''

    print(f"Number of links: {len(links)}")

    with tqdm(total=len(links), desc="Extracting text") as pbar:
        with open('raw.csv', 'a', newline='', encoding='utf-8') as csvfile:
            fieldnames = ["document", "title", "date", "location", "text", "link"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            for link in links:
                # load_page() fetches the page and returns a BeautifulSoup object
                # (defined elsewhere in the scraper).
                soup = load_page(link)
                title_elem = doc_elems = paragraph_elem = None

                try:
                    title_elem = soup.find(class_="ecl-heading ecl-heading--h1 ecl-u-color-white")
                    doc_elems = soup.find_all(class_="ecl-meta__item")
                    paragraph_elem = soup.find(class_="ecl-paragraph")

                except Exception as e:
                    print(f"Error on link {link}: {e}")

                title = title_elem.text if title_elem else np.nan
                paragraph = paragraph_elem.text if paragraph_elem else np.nan

                # The first three metadata items hold the document type, date, and
                # location; pad with None so the unpacking never fails.
                doc_elems = (doc_elems or [])[:3]
                doc_elems += [None] * (3 - len(doc_elems))
                doc, date, loc = (elem.text if elem else np.nan for elem in doc_elems)

                writer.writerow({
                    "document": doc,
                    "title": title,
                    "date": date,
                    "location": loc,
                    "text": paragraph,
                    "link": str(link)
                })

                pbar.update(1)
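
After extraction, the quality checks mentioned earlier could look roughly like the pandas sketch below. This is a sketch rather than the scraper's exact cleaning code: it assumes raw.csv was written without a header row (as in the extraction code above), and the output file names are placeholders.

import pandas as pd

# Rough sketch of the post-extraction cleaning step (not the scraper's exact code).
# raw.csv is assumed to have no header row, matching the extraction code above.
columns = ["document", "title", "date", "location", "text", "link"]
raw = pd.read_csv("raw.csv", names=columns)

clean = (
    raw.drop_duplicates()          # remove duplicate articles
       .dropna(subset=["text"])    # remove empty articles
       .reset_index(drop=True)
)

# Placeholder output file names.
clean.to_csv("speeches.csv", index=False)
clean.to_json("speeches.json", orient="records", force_ascii=False)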