A new web-based corpus management and analysis platform


  • For corpus management…

    • Support for hierarchically structured data, parallel corpora and multimedia data (audio, video, images aligned with transcribed text).

    • Data structures and search algorithms optimised to handle large corpora, i.e. > 2 billion tokens.

    • CLARIN integration, harvestable metadata and persistent identifiers so corpora are sustained and can be easily found.

    • Federated login (eduGAIN, OpenIdP) with fine-grained authorization to control access according to the license for each corpus.

  • Analysis functionality includes…

    • Powerful query syntax, with both textual and graphical interfaces to form queries.

    • Wordlists, concordances and collocations.

    • Distribution statistics showing the frequencies of query results relative to chosen parameters, e.g. year, speaker, genre.

    • Searchable manual corpus annotation, with the option of controlled vocabulary.
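
To illustrate what a concordance and a distribution statistic compute, here is a minimal Python sketch. It is not Corpuscle's implementation or query syntax: the function names `kwic` and `relative_frequency`, the toy documents and the `year` metadata field are all assumptions made for this example, and a platform built for billions of tokens would use indexed data structures rather than a linear scan.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Keyword-in-context: return (left context, match, right context) per hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def relative_frequency(docs, keyword, key):
    """Hits per million tokens, grouped by a metadata field (hypothetical schema)."""
    hits, totals = Counter(), Counter()
    for doc in docs:
        group = doc[key]
        totals[group] += len(doc["tokens"])
        hits[group] += sum(1 for t in doc["tokens"] if t.lower() == keyword.lower())
    return {g: hits[g] / totals[g] * 1_000_000 for g in totals}


# Toy corpus: two invented documents with a 'year' metadata field.
docs = [
    {"year": 2010, "tokens": "the climate is changing and the climate debate continues".split()},
    {"year": 2011, "tokens": "a new climate report was published".split()},
]

if __name__ == "__main__":
    for left, match, right in kwic(docs[0]["tokens"], "climate", width=2):
        print(f"{left:>12} [{match}] {right}")
    print(relative_frequency(docs, "climate", "year"))
```

Normalising by the number of tokens in each group is what makes the frequencies comparable across years, speakers or genres of different sizes.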

  • Corpora already available include…

    • From ICAME – the Brown family, COLT (including audio), Helsinki, London-Lund

    • Aviskorpus – over 1 billion words from Norwegian newspapers

    • Wikipedia and ‘web as corpus’ corpora

    • A parallel Norwegian-Spanish corpus

    • NTAP – 660 million words from English blogs related to climate change

  • Try Corpuscle for yourself

    • Since Corpuscle works with eduGAIN, if you have an account at an academic institution you may be able to use that to log in to Corpuscle. In case that doesn’t work, it is possible to sign in with an OpenIdP account. Even without logging in you can access some corpora; this depends on the license for each corpus.

    • Go to http://clarino.uib.no/corpuscle

    • Log in

      • Choose eduGAIN from the top of the page and a search box will appear for you to find your institution; then follow the instructions.

      • Otherwise, choose OpenIdP and follow the instructions.

      • Or, try some unrestricted corpora, e.g. the Wikipedia corpus and KIAP.

    • Click on ‘Corpus List’ in the left-hand menu, and choose a corpus.

    • Read and accept the license conditions.

    • Click ‘Query’ in the menu. Enter a single word or a phrase and a concordance will be generated.

    • Follow the ‘Documentation’ link to explore other functionality.

  • Contact information

    • Corpuscle is developed and maintained by Uni Research Computing, Bergen, Norway.

    • For general enquiries about Corpuscle, email Paul Meurer – [email protected]

    • For information about the ICAME resources in Corpuscle, see http://icame.uib.no/clarin or email Knut Hofland – [email protected]

ICAME has been distributing corpora since 1979. This will now continue using Corpuscle: ICAME resources are being made available for on-line analysis, and for download.

The further development of Corpuscle and the ongoing integration of ICAME resources have been made possible through CLARINO – the Norwegian part of CLARIN (Common Language Resources and Technology Infrastructure).

