Publications

Web scraping by end users

Authors

Tacuri, A., FIRMENICH ZORRILLA, SERGIO DAMIAN, Fernández, A., Riva, F., Urbieta, M., Rossi, G.

External publication

No

Venue

IEEE Access

Scope

Article

Nature

Scientific

JCR quartile

SJR quartile

Publication date

01/01/2025

Scopus Id

2-s2.0-105023183778

Abstract

Web scraping has been studied from various perspectives, encompassing automatic and AI-based approaches as well as a wide range of programming libraries that expedite development. As the volume of available web content grows, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites' search engines dynamically retrieve data that are unavailable via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping: the extent to which end users can make decisions during the scraper specification process, understand the information sources, and understand how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing key concerns in web scraping. An evaluation of the approach and toolset yielded promising results. © 2025 Institute of Electrical and Electronics Engineers Inc. All rights reserved.

Keywords

Automatic programming; Data assimilation; Data mining; Extraction; Human computer interaction; Human engineering; Search engines; Specifications; Systems analysis; Tools; User centered design; User interfaces; Websites; Computer interaction; End user computing; End-users; Programming library; Scraper specification; Toolsets; User-centred; Web data extraction; Web Mining; Web scrapings; Data integration

Members of Universidad Loyola