About

What Observatorio Lázaro is, how it works, and who is behind it.

What is Observatorio Lázaro?

Observatorio Lázaro is a project that automatically analyses and extracts the anglicisms appearing every day in the news published by some twenty Spanish press outlets, including elDiario.es, El País, El Mundo, ABC, La Vanguardia, El Confidencial, 20minutos, Agencia EFE, La Marea, El Economista, Marca, Fotogramas, Rolling Stone, Elle and El Mundo Today.

Every day, Lázaro reads the press, detects unassimilated borrowings (mostly anglicisms), records them in a database and publishes the data on this website, where they can be freely searched, compared and downloaded.
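The daily cycle described above (read, detect, record, query) can be sketched as follows. This is a minimal illustration and not the observatory's actual code: the table layout, the function names and the toy lexicon-based detector are all assumptions made for the example, with the lexicon lookup standing in for the real machine learning model.

```python
import sqlite3

# Hypothetical stand-in for the real detector: a tiny lexicon lookup.
TOY_LEXICON = {"streaming", "podcast", "fake news"}


def toy_detector(text):
    """Return the lexicon terms found in the text (placeholder for a model)."""
    lowered = text.lower()
    return sorted(term for term in TOY_LEXICON if term in lowered)


def record_borrowings(conn, articles, detect):
    """Run the detector on every article and store one row per borrowing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS borrowings (day TEXT, outlet TEXT, term TEXT)"
    )
    rows = [
        (day, outlet, term)
        for day, outlet, text in articles
        for term in detect(text)
    ]
    conn.executemany("INSERT INTO borrowings VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def term_counts(conn):
    """Count how many times each term was recorded."""
    cursor = conn.execute(
        "SELECT term, COUNT(*) FROM borrowings GROUP BY term ORDER BY term"
    )
    return dict(cursor.fetchall())


# Invented sample data: (day, outlet, article text).
articles = [
    ("2024-01-01", "outlet_a", "El podcast llega al streaming."),
    ("2024-01-01", "outlet_b", "Un nuevo podcast sobre fake news."),
]
conn = sqlite3.connect(":memory:")
inserted = record_borrowings(conn, articles, toy_detector)
counts = term_counts(conn)
print(inserted, counts)
```

Keeping the detector as a plain function argument means the storage and query steps do not depend on which model produces the borrowings.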

How does Lázaro work?

The core of the project is a machine learning model that detects possible foreign words (mostly anglicisms) in the Spanish-language press. Although the model was trained to extract anglicisms, it occasionally extracts borrowings from other languages too.

Lázaro's anglicism extraction model is a BiLSTM-CRF that uses embeddings trained on bilingual ES-EN text, as well as subword embeddings (BPE embeddings and character embeddings). Technical information about the model is available in this scientific paper. An earlier version of the observatory (live from April 2020 to August 2022) ran on a CRF model; the details of that earlier model can be read in this document.
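A sequence labelling model of this kind assigns one label to each token, and the borrowings are then read off as spans of consecutive labelled tokens. The sketch below shows that last decoding step only. It assumes a BIO tagging scheme with the labels `ENG` and `OTHER`, which is an assumption for illustration; the function name `bio_to_spans` and the sample sentence are likewise invented.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (text, label) spans.

    An "I-" tag that does not continue a span with the same label is
    treated leniently as the start of a new span.
    """
    spans = []
    current_tokens = []
    current_label = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current_tokens and label == current_label:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        if prefix in ("B", "I"):
            current_tokens = [token]
            current_label = label
        else:  # "O": outside any borrowing
            current_tokens = []
            current_label = None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Las", "fake", "news", "llegan", "al", "prime", "time", "del", "anime"]
tags = ["O", "B-ENG", "I-ENG", "O", "O", "B-ENG", "I-ENG", "O", "B-OTHER"]
spans = bio_to_spans(tokens, tags)
print(spans)
```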

The observatory's code and the training corpus are available on GitHub. The trained, ready-to-use detection model is available through HuggingFace and the Python library pylazaro.

Since extraction is fully automatic, the data may contain errors: words wrongly labelled as anglicisms, or anglicisms that go unnoticed.
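These two kinds of error correspond to false positives and false negatives, and they are usually summarised as precision and recall. The following sketch computes both, plus F1, from a set of predicted borrowings and a set of hand-annotated ones. The sample data are invented for the example and do not reflect the model's real performance.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted borrowings against gold annotations (both sets)."""
    true_positives = len(predicted & gold)
    false_positives = len(predicted - gold)  # wrongly labelled as borrowings
    false_negatives = len(gold - predicted)  # borrowings that went unnoticed
    precision = (
        true_positives / (true_positives + false_positives)
        if predicted else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives)
        if gold else 0.0
    )
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return precision, recall, f1


# Each item is (sentence id, borrowing text); invented sample data.
gold = {(0, "streaming"), (1, "fake news"), (2, "podcast")}
predicted = {(0, "streaming"), (1, "fake news"), (3, "Twitter")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

In this toy case one prediction is spurious and one annotated borrowing is missed, so precision and recall are both two thirds.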

A talk given at the 2021 Trabalengua conference (in Spanish) explains the inner workings of the project.

How to cite

If Observatorio Lázaro or its data are used in research, they can be cited as follows:

@misc{observatoriolazaro,
  author    = {{\'A}lvarez Mellado, Elena},
  title     = {Observatorio L{\'a}zaro: observatorio del anglicismo
               en la prensa espa{\~n}ola},
  year      = {2020},
  url       = {https://observatoriolazaro.es},
  note      = {Accessed: 2026-06-23}
}

To cite the detection model, the reference is the ACL 2022 paper:

@inproceedings{alvarez-mellado-lignos-2022-detecting,
  title     = {Detecting Unassimilated Borrowings in {S}panish:
               {A}n Annotated Corpus and Approaches to Modeling},
  author    = {{\'A}lvarez Mellado, Elena and Lignos, Constantine},
  booktitle = {Proceedings of the 60th Annual Meeting of the
               Association for Computational Linguistics
               (Volume 1: Long Papers)},
  year      = {2022},
  publisher = {Association for Computational Linguistics},
  pages     = {3868--3888},
  doi       = {10.18653/v1/2022.acl-long.268}
}

Publications

Bot: @lazarobot

The new anglicisms Lázaro finds (those the model has not seen before) are posted daily on Twitter and BlueSky, together with their context of appearance and a link to the news article.
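One simple way to single out new terms is to compare each day's detections against a stored set of terms already known. The sketch below takes that approach. How the observatory actually decides that a term is new is not described here, so the normalisation rule, the `seen` set and the function names are assumptions made for illustration.

```python
def normalise(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def new_borrowings(detected, seen):
    """Return the detected terms that are not in `seen`, in order.

    `seen` is updated in place, so a term repeated on the same day is
    reported only once.
    """
    fresh = []
    for term in detected:
        key = normalise(term)
        if key not in seen:
            seen.add(key)
            fresh.append(term)
    return fresh


# Invented sample data.
seen = {"streaming", "podcast"}
detected = ["Streaming", "quiet quitting", "podcast", "Quiet  Quitting"]
fresh = new_borrowings(detected, seen)
print(fresh)
```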

What is Lázaro not?

The purpose of the project is to observe, describe and analyse anglicism usage in the Spanish press. Under no circumstances does the project aim to shame, single out or criticise the use of anglicisms or the people who use them. Nor does it aim to propose alternative translations.

The motivation behind Observatorio Lázaro is not to defend some supposed linguistic purity of Spanish, but to study the phenomenon of lexical borrowing in the press empirically, from a data-driven perspective.

Why Lázaro?

The project's name is a tribute to the Spanish philologist Lázaro Carreter, whose columns on linguistic prescription in the media (and especially on the use of anglicisms) were very popular in Spain throughout the 1980s and 1990s.

Awards

In the media

Research using Observatorio Lázaro

Credits

Observatorio Lázaro is a project by Elena Álvarez Mellado. The seed of the project was conceived at the BLT Lab (Broadening Linguistic Technologies) at Brandeis University (Massachusetts) under the supervision of Constantine Lignos. It was then developed as a PhD project in the Natural Language Processing and Information Retrieval research group at UNED under the supervision of Julio Gonzalo and Constantine Lignos.