Project

A special focus is on less-resourced European languages, including Estonian, Slovenian, Croatian, Finnish, Latvian, ...

SOLUTIONS

EMA enables journalists, editors and researchers to automate text analysis processes, including:

COMMENT MODERATION

Analyse and filter user comments (e.g., detecting hate speech)

KEYWORD EXTRACTION

Extract important keywords and identify persons and organisations from articles to generate tags

ARTICLE GENERATION

Generate text from raw data to produce articles for specific topics

TECHNOLOGY

Our solutions are based on state-of-the-art natural language processing using neural embeddings (deep learning) and transfer learning. This allows us to train language models and solutions with less data, making them practical for less-resourced languages such as Estonian, Slovenian, Croatian, Finnish and Latvian.

HOW TO USE OUR TOOLS (ACCESS POINTS)?

Selected tools can be used as pre-trained models (ready to use), available as APIs or Docker images.
For a larger number of tools, we provide the code so that developers and scientists can train the models on their own data.
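
As a rough illustration of the API access route, the sketch below shows how a client might send text to a tool running locally (for example from a Docker image) and read back a JSON reply. The URL, the port, the `text` field and the JSON reply format are all assumptions made for this example; each tool documents its own endpoints and payloads.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL, port and route depend on the tool.
API_URL = "http://localhost:5000/extract_keywords"


def build_request(text, url=API_URL):
    """Build a JSON POST request carrying the text to analyse.

    The {"text": ...} payload shape is an assumption for illustration.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_tool(text, url=API_URL, timeout=10):
    """Send the request and decode the reply, assuming the tool returns JSON."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a tool listening on the assumed address, `call_tool("Some article text")` would return the decoded reply. `build_request` can be inspected on its own without any network access.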

Web application with GUI (Texta Toolkit)

EMBEDDIA tools' APIs (GitHub)

Docker (GitLab)

DEMO

Want to try some of our tools quickly and easily online? Click below!

Start journey!

TOOLS EXPLORER

| Name | Functionality | Description | Languages it is possible to train on | Languages an example is available in | Licence |
|---|---|---|---|---|---|
| TNT-KID | Keyword extraction | A system for automatic keyword extraction; must be trained on a corpus of articles with human-assigned keywords. Pre-trained models are also available for a range of languages - see other TNT-KID entries here. | all | | MIT |
| TNT-KID | Keyword extraction | A system for automatic keyword extraction, trained on a corpus of articles with human-assigned keywords. Pre-trained version with API, for English. | all | | MIT |
| TNT-KID | Keyword extraction | A system for automatic keyword extraction, trained on a corpus of articles with human-assigned keywords. Pre-trained version with API, for Croatian; annotators were 24sata editors. | all | | MIT |
| TNT-KID | Keyword extraction | A system for automatic keyword extraction, trained on a corpus of articles with human-assigned keywords. Pre-trained version with API, for Slovenian. | all | | MIT |
| TNT-KID | Keyword extraction | A system for automatic keyword extraction, trained on a corpus of articles with human-assigned keywords. Pre-trained version with API, for Latvian; annotators were Latvian Delfi staff. | all | | MIT |
| TNT-KID | Keyword extraction | A system for automatic keyword extraction, trained on a corpus of articles with human-assigned keywords. Pre-trained version with API, for Estonian; annotators were Ekspress Meedia staff. | all | | MIT |
| TEXTA Tagger | General text classification | A general toolkit for producing supervised text classifiers based on user-defined search results. | all | Estonian | GPLv3 |
| TEXTA Bert Tagger | General text classification | A supervised method for classifying texts using BERT models. | all | Estonian | GPLv3 |
| TEXTA MLP | Multilingual preprocessor | A multilingual processing tool incorporating several approaches for extraction of named entities (people, organizations, locations); works for several languages and different types of text. | Stanza supported languages | English, Estonian, Croatian, Lithuanian, Finnish | GPLv3 |
| NER API | Named entity recognition | Provides an API giving easy access to the NER BERT BiLSTM models (see other entry). | all | Croatian, Slovenian, Finnish, Russian, Swedish, Latvian, Lithuanian and Estonian | MIT |
| NER BERT BiLSTM | Named entity recognition | A Named Entity Recognition system based on a standard BiLSTM+CNN+CRF architecture but that uses additional types of embeddings to improve performance. | all | Croatian, Slovenian, Finnish, Russian, Swedish, Latvian, Lithuanian and Estonian | MIT |
| COMMENTSUM | Comment summarization | Provides an API giving easy access to COMMENTSUM: a multilingual summarization tool that returns the most informative sentences from a set of user comments. | all | English, German, Slovenian, Croatian | MIT |
| COMMENTSUM | Comment summarization | A multilingual summarization tool that returns the most informative sentences from a set of user comments. | all | English, German, Slovenian, Croatian | MIT |
| Scalable and interpretable semantic shift detection | Semantic shift detection | Detects changes in word meaning over time, genre or across different media sources, and visualizes the differences, using representations derived from a BERT model. | all | English, Latin, German, Swedish | MIT |
| News sentiment | Sentiment analysis | Labels a news article as positive, negative, or neutral based on its sentiment, using a fine-tuned BERT model. | all | Slovenian, Croatian | MIT |
| Language variety classification | Language variety classification | Detects which language variety or dialect a text is written in, using neural networks combined with linguistic features. | all | Croatian, Serbian, Bosnian, German, Arabic, Portuguese, Spanish, English | MIT |
| Dutch author profiling | Dutch cross-genre gender classification | Detects the likely gender (male vs. female) of the author of a short text, using a range of linguistic features. | all | Dutch | MIT |
| English and Spanish author profiling | Gender and bot vs human classification | Identifies whether a short text was written by a bot or a human, and (for humans) their gender (male vs. female), using a range of linguistic features. Pre-trained version with API, in English and Spanish. | all | English, Spanish | MIT |
| English and Spanish author profiling | Gender and bot vs human classification | Identifies whether a short text was written by a bot or a human, and (for humans) their gender (male vs. female), using a range of linguistic features. | all | English, Spanish | MIT |
| Semantic shift detection by averaging contextual embeddings | Semantic shift detection | Detects changes in word meaning over time, and how translations change over time, in different languages, using representations derived from a BERT model. | all | English, Slovenian | MIT |
| Multi-lingual semantic shift detection | Semantic shift detection | Detects changes in word meaning over time, using representations derived from a BERT model. | all | English, Latin, German, Swedish | MIT |
| NEL Filter | Named entity linking filter | A post-processing filter that improves the accuracy of a Named Entity Linking tool by using heuristics and data from Wikidata and DBpedia. | all | Finnish, English, French and German | MIT |
| NER FEDA | Named entity recognition | A Named Entity Recognition system created to train models using multiple datasets, regardless of whether they use the same tagsets or not. This can improve performance, while sharing common aspects but also tagging documents using a variety of tagsets. | all | Russian, Slovenian, Bulgarian, Polish, Czech and Ukrainian | MIT |
| NER Multitask | Named entity recognition | A Named Entity Recognition system that explores multiple simple methods for improving the performance of trained models. | all | Croatian, Slovenian, Finnish, Russian, Swedish, Latvian, Lithuanian and Estonian | MIT |
| Multilingual Entity Linking | Named entity linking | Links named entities (names of organizations, places and people) to entries in Wikidata, which in turn can be linked to Wikipedia. | all | Croatian, Slovenian, Finnish, Russian, Swedish, Latvian, Lithuanian, Estonian, French, German, English | Apache 2.0 |
| Stacked NER | Named entity recognition | A multilingual Named Entity Recognition system based on fine-tuned BERT models. | all | Croatian, Slovenian, Finnish, Russian, Swedish, Latvian, Lithuanian, Estonian, French, German, English | MIT |
| Cross-lingual article linking | Article linking | Takes as input an article in one language and a list of candidate articles in another language, and outputs the candidate article that is most similar to the query. | all | Estonian, Latvian | MIT |
| Dynamic multilingual topic modelling | Topic evolution tracking | Takes as input thematically-aligned documents in two or more languages divided into time slices, and outputs topics for every time slice and language that shows the evolution of a topic for that language. | all | English, German, Finnish, Swedish | MIT |
| Tweet BERT-iment | Sentiment analysis | Labels a short text (e.g. a tweet) as positive, negative, or neutral based on its sentiment, using a fine-tuned BERT model. | all | Croatian, English, Russian, Slovenian | MIT |
| Fake-News spreaders detection on Twitter | Fake news detection | Detects users likely to be spreading fake news, based on a collection of their texts, using a range of linguistic features. | all | English, Spanish | MIT |
| Detection of COVID-19 related Fake-News | Fake news detection | Detects users likely to be spreading fake news, based on a collection of their texts, using stacked neural networks. | all | English | MIT |
| RaKUn | Keyword extraction | API giving easy access to RaKUn: extracts keywords from text, by turning text into a graph in which the most important nodes mostly turn out to be keywords. Does not need any training (it is unsupervised) so it can be used for any language. | all | English, Slovenian | GPL3 |
| RaKUn | Keyword extraction | Extracts keywords from text, by turning text into a graph in which the most important nodes mostly turn out to be keywords. Does not need any training (it is unsupervised) so it can be used for any language. | all | English, Slovenian | GPL3 |
| COVID-NLG | News article generation | Uses template-based natural language generation to describe salient statistics about the COVID-19 situation on a national level. The use of template-based processing gives a very high guarantee that the output is factually correct, but limits the fluency of the produced text. | no | English | |
| Eurostat NLG | News article generation | Uses template-based natural language generation to produce reports about several Eurostat statistics, most prominently consumer price index data for European countries, given the country the report should focus on, and the language the report should be in. Also supports some statistical datasets beyond Eurostat, but reports about those can only be generated in English and Finnish. The use of template-based methods ensures the output is highly accurate, but limits the fluency of the output. | no | English, Finnish, Croatian, Russian, Slovenian, Estonian | |
| Comment analysis NLG | Comment reporting | Analyses a set of (online) news comments using various other EMBEDDIA text analysis tools, and then produces a brief natural language report designed to give a general understanding of what things are being discussed, and how. | no | Input: English, all, Croatian; Output: English | |
| Simple Sentiment Analysis | Sentiment analysis | Provides the CardiffNLP sentiment analysis model (NOT developed within EMBEDDIA) as an online microservice that can be used by the Comment Analysis NLG tool. | no | all | |
| Comment Moderation | Comment moderation | Detects whether newspaper comments should be blocked by moderators, using a fine-tuned BERT model. Can output the moderation policy rule which a blocked comment contravenes, given a suitable dataset/model. | all | Croatian | MIT |
| Comment Moderation API | Comment moderation | Detects whether newspaper comments should be blocked by moderators, using a fine-tuned BERT model. | all | Croatian | MIT |
| Comment Topic Modelling API | Comment topic modelling | Analyses newspaper comments in terms of the topics they discuss, using a version of the Embedded Topic Model; topics are automatically detected and can be supplied as lists of keywords. | all | Croatian | MIT |
| TeMoCo/TeMoTopic | Visualisation | Visualizes changes in topic distribution and associated keywords in a document or collection of articles. | all | all | MIT |
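
The RaKUn entries describe turning text into a graph whose most important nodes tend to be keywords. The sketch below illustrates that general idea only: it builds a word co-occurrence graph and ranks words by how many distinct neighbours they have. It is not RaKUn's actual algorithm or API; the stopword list, the window size and the degree-based scoring are simplifying assumptions chosen for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real tools use fuller lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "by", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_graph(tokens, window=2):
    """Link each token to the tokens that follow it within the window."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_keywords(text, top_k=3, window=2):
    """Return the top_k words with the most distinct neighbours.

    Ties are broken alphabetically so the result is deterministic.
    """
    graph = build_graph(tokenize(text), window)
    ranked = sorted(graph, key=lambda w: (-len(graph[w]), w))
    return ranked[:top_k]


if __name__ == "__main__":
    sample = (
        "graph methods rank words; a graph links words; "
        "the graph scores nodes; graph centrality picks keywords"
    )
    print(rank_keywords(sample, top_k=2))
```

Because no training data is involved, the same procedure runs on text in any language once a suitable tokenizer and stopword list are supplied, which mirrors the unsupervised property noted in the table.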