Corpora: From magnetic tape to web access
Knut Hofland, [email protected], icame.uib.no/history/poster.ppt
UNIFOB Aksis, Bergen, Norway.
The Brown Corpus was compiled in 1961-64
12.02.1977: ICAME founded in Oslo
29-30.03.1979: First ICAME conference in Bergen
1977-79: ICAME News started March 1978
Converted the Brown Corpus from the original punched-card format (first sample below) to a more readable format (second sample below), and corrected errors found during the tagging of the corpus (from 1971-78).
**R**T *THE *FULTON *COUNTY *GRAND *JURY SAID *FRIDAY AN INVESTIGATION 0010E1A01
OF *ATLANTA**AS RECENT PRIMARY ELECTION PRODUCED **QNO EVIDENCE**U TH 0020E1A01
AT ANY IRREGULARITIES TOOK PLACE. **R**T *THE JURY FURTHER SAID IN TER 0030E1A01
M-END PRESENTMENTS THAT THE *CITY *EXECUTIVE *COMMITTEE, WHICH HAD OVE 0040E1A01
R-ALL CHARGE OF THE ELECTION, **QDESERVES THE PRAISE AND THANKS OF THE 0050E1A01
*CITY OF *ATLANTA**U FOR THE MANNER IN WHICH THE ELECTION WAS CONDUCT 0060E1A01
ED. **R**T *THE *SEPTEMBER-*OCTOBER TERM JURY HAD BEEN CHARGED BY *FUL 0070E1A01
TON *SUPERIOR *COURT *JUDGE *DURWOOD *PYE TO INVESTIGATE REPORTS OF PO 0080E1A01
SSIBLE **QIRREGULARITIES**U IN THE HARD-FOUGHT PRIMARY WHICH WAS WON B 0090E1A01
Y *MAYOR-NOMINATE *IVAN *ALLEN *JR**.. **R**T **Q*ONLY A RELATIVE HAND 0100E1A01
A01 0010 1 The Fulton County Grand Jury said Friday an investigation
A01 0020 1 of Atlanta's recent primary election produced "no evidence"
A01 0020 9 that any irregularities took place.
A01 0030 5 The jury further said in term-end presentments that
A01 0040 3 the City Executive Committee, which had over-all charge
A01 0050 2 of the election, "deserves the praise and thanks of
A01 0050 11 the City of Atlanta" for the manner in which the election
A01 0060 11 was conducted.
A01 0070 1 The September-October term jury had been charged
A01 0070 9 by Fulton Superior Court Judge Durwood Pye to investigate
A01 0080 8 reports of possible "irregularities" in the hard-fought
A01 0090 6 primary which was won by Mayor-nominate Ivan Allen
A01 0100 5 Jr&.
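The conversion illustrated by the two samples can be sketched in a few lines of Python. This is only a reconstruction from the samples shown here, not the program that was actually used. The assumptions are: text occupies columns 1-70 of each 80-column card and the rest holds the card id; `*` marks a capital letter; `**A` is an apostrophe; `**Q` and `**U` open and close a quotation; `**R**T` marks the start of a sentence and is simply dropped. The leading blank on the second card is also an assumption (without it two words would run together). Other codes, such as the `**.` seen in the last card, are left untouched. The names `join_cards` and `decode` are hypothetical.

```python
import re

TEXT_COLUMNS = 70  # assumption: columns 1-70 hold text, the rest the card id


def join_cards(cards):
    """Concatenate the text fields of successive 80-column card images."""
    return "".join(card[:TEXT_COLUMNS].ljust(TEXT_COLUMNS) for card in cards)


def decode(stream):
    """Turn the upper-case card stream into ordinary mixed-case text."""
    # Drop the sentence-start marker (meaning inferred from the samples).
    stream = stream.replace("**R**T", " ")
    # Double-asterisk codes inferred from the samples.
    for code, char in (("**A", "'"), ("**Q", '"'), ("**U", '"')):
        stream = stream.replace(code, char)
    # "*X" keeps its capital; every other letter becomes lower case.
    stream = re.sub(
        r"\*([A-Z])|([A-Z])",
        lambda m: m.group(1) or m.group(2).lower(),
        stream,
    )
    # Collapse the blanks left behind by removed markers.
    return " ".join(stream.split())


cards = [
    "**R**T *THE *FULTON *COUNTY *GRAND *JURY SAID *FRIDAY AN INVESTIGATION 0010E1A01",
    " OF *ATLANTA**AS RECENT PRIMARY ELECTION PRODUCED **QNO EVIDENCE**U TH 0020E1A01",
    "AT ANY IRREGULARITIES TOOK PLACE. **R**T *THE JURY FURTHER SAID IN TER 0030E1A01",
]
print(decode(join_cards(cards)))
```

Because words wrap across cards (TH / AT, TER / M-END), the text fields are joined into one stream before decoding rather than being decoded card by card.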
The LOB Corpus was completed in Oslo/Bergen in 1979. Concordances were made for both the Brown and LOB corpora. The texts and concordances were distributed on magnetic tape and microfiche. One fiche = 207 pages (each with 72 lines of 132 columns). The LOB concordance contained frequency counts from the Brown Corpus. The LOB KWIC concordance filled 100 fiches.
The London-Lund Corpus was distributed on tape.
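As an illustration of what a KWIC (keyword in context) concordance line is, and of how much the fiche format could hold, here is a minimal Python sketch. It is not the program used for the Brown and LOB concordances; the function name `kwic_lines` and the punctuation handling are assumptions made for the example. Only the page geometry (207 pages, 72 lines, 132 columns) comes from the figures above.

```python
LINES_PER_PAGE = 72
COLUMNS_PER_LINE = 132
PAGES_PER_FICHE = 207


def kwic_lines(text, keyword, width=COLUMNS_PER_LINE):
    """Return one line per occurrence of keyword, with the keyword aligned."""
    tokens = text.split()
    side = (width - len(keyword) - 2) // 2  # columns of context on each side
    lines = []
    for i, token in enumerate(tokens):
        word = token.strip(".,;:\"'")  # ignore surrounding punctuation
        if word.lower() != keyword.lower():
            continue
        left = " ".join(tokens[:i])[-side:]
        right = " ".join(tokens[i + 1:])[:side]
        lines.append(f"{left:>{side}} {word} {right}")
    return lines


sample = (
    "The Fulton County Grand Jury said Friday an investigation of Atlanta's "
    'recent primary election produced "no evidence" that any irregularities '
    "took place. The jury further said in term-end presentments that the City "
    "Executive Committee, which had over-all charge of the election, "
    '"deserves the praise and thanks of the City of Atlanta" for the manner '
    "in which the election was conducted."
)
for line in kwic_lines(sample, "election", width=60):
    print(line)

# Capacity implied by the fiche geometry, one concordance line per occurrence.
lines_per_fiche = PAGES_PER_FICHE * LINES_PER_PAGE  # 14,904 lines
print(lines_per_fiche, 100 * lines_per_fiche)       # 100 fiches: 1,490,400 lines
```

Each occurrence of the keyword takes one output line, so the number of lines in a full concordance equals the number of running words indexed.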
1970s Mainframe computers: Univac, IBM, ICL
1971 Floppy disk (diskette)
1975 Altair 8800 Personal computer
1976 Apple I
1977 Apple II
1978 VisiCalc, spreadsheet
1979 WordStar, word processing software
1980 Seagate 5.25” 5 MB hard disk
1981 IBM PC (4.77 MHz, 16/64 kB RAM, 160 kB 5.25” diskette, MS-DOS, CGA)
1982 Commodore 64
1983 IBM PC XT (128 kB RAM, 10 MB HD, 360 kB diskette)
1983 Apple Lisa, first GUI
1984 Apple Macintosh (128 kB, 400 kB 3.5” diskette)
1984 First HP laser printer (Apple LaserWriter with PostScript, 1985)
1984 IBM PC AT (286 6-10 MHz, 20 MB HD, 256kB RAM, 1.2 MB diskette, EGA)
1984 MS-DOS 3.1
1985 Windows 1
1985 Philips CM-100 CD-ROM (Apple 1988)
1987 PS/2 (386 8-20 MHz, 640 kB RAM, 1.44 MB 3.5” diskette, 20-70 MB HD, VGA)
1990 World Wide Web, text version
1990 Typical PC: 486 25 MHz, 4 MB RAM, 150 MB HD
1992 Windows 3.1
1993 Mosaic graphic web client
1994 MS-DOS 6.0
1995 Windows 95
1997 Typical PC: Pentium II 233 MHz, 64 MB RAM, 4 GB disk
2001 Windows XP
2007 Windows Vista
2009 Portable PC: Dual Core 2.2 GHz, 4 GB RAM, 400 GB HD
2009 Desktop PC: Quad Core 2.6 GHz, 16 GB RAM, 1000 GB HD
Development of computers 1970-2009
Moore's law: transistor count doubles every two years
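Moore's law is a statement about transistor counts, but the same doubling-time arithmetic can be applied to other figures in the timeline. The Python sketch below does this for hard disk capacity (5 MB in 1980, 1000 GB in 2009) and RAM (64 kB in 1981, 16 GB in 2009). The function name `doubling_time_years` is made up for the example, and the comparison is only illustrative, since it sets a top-of-the-range 1981 PC against a 2009 desktop.

```python
import math


def doubling_time_years(start_value, end_value, years):
    """Years per doubling if growth from start to end had been steady."""
    doublings = math.log2(end_value / start_value)
    return years / doublings


# Hard disk: 5 MB (1980) -> 1,000,000 MB (2009), a factor of 200,000.
disk = doubling_time_years(5, 1_000_000, 2009 - 1980)

# RAM: 64 kB (1981) -> 16 GB = 16 * 1024 * 1024 kB (2009), a factor of 2**18.
ram = doubling_time_years(64, 16 * 1024 * 1024, 2009 - 1981)

print(f"disk: {disk:.2f} years per doubling")  # about 1.65
print(f"RAM:  {ram:.2f} years per doubling")   # about 1.56
```

On these figures both disk capacity and RAM doubled somewhat faster than once every two years.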
1981: London-Lund KWIC concordance available on tape.
1982-1985: POS-tagging of LOB in Lancaster and Bergen (CLAWS1, Constituent Likelihood Automatic Word-tagging System). Word list and suffix list for look-up were based on the tagged Brown Corpus. Text and concordance available on tape.
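The look-up step mentioned above, word list first and suffix list as a fallback, can be sketched as follows in Python. This is not CLAWS1 code: the entries are toy examples invented for illustration (using Brown-style tags), the longest-suffix-first order and the default tags are assumptions, and the later step in which the system chooses among candidate tags by likelihood is not shown.

```python
# Toy entries invented for illustration; not taken from the CLAWS1 resources.
WORD_LIST = {"the": ["AT"], "jury": ["NN"], "said": ["VBD", "VBN"]}
SUFFIX_LIST = {"tion": ["NN"], "ing": ["VBG", "NN"], "ly": ["RB"], "s": ["NNS", "VBZ"]}
DEFAULT_TAGS = ["NN", "VB", "JJ"]  # assumption: open-class fallback

MAX_SUFFIX = max(len(suffix) for suffix in SUFFIX_LIST)


def candidate_tags(word):
    """Return the possible tags for a word: word list, then suffix, then default."""
    key = word.lower()
    if key in WORD_LIST:
        return WORD_LIST[key]
    # Try the longest suffix first, always leaving at least one stem letter.
    for n in range(min(len(key) - 1, MAX_SUFFIX), 0, -1):
        tags = SUFFIX_LIST.get(key[-n:])
        if tags:
            return tags
    return DEFAULT_TAGS


for w in ["The", "investigation", "reports", "Pye"]:
    print(w, candidate_tags(w))
```

A word found in neither list receives the default candidates, so every word gets at least one tag to disambiguate later.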
1987: Melbourne-Surrey Corpus available (100,000 words of newspaper text). ICAME News became ICAME Journal. A version of the Brown Corpus indexed with the MS-DOS program WordCruncher was made by Randall L. Jones of Brigham Young University (11 MB including index files). The index was efficient enough for the program to run on a standard IBM PC XT/AT. Distribution on diskettes started. The Kolhapur Corpus (Indian English) and the Lancaster Spoken English Corpus were added to the collection. A mail-based info-server was started (FAFSRV at NOBERGEN, EARN/BITNET).
1990: Polytechnic of Wales Corpus.
1992: Lancaster Parsed Corpus, Corpora list started. FTP info-server. Gopher server in 1993.
ICAME CD-ROM collection, version 1. Contained the Brown, LOB, Kolhapur, London-Lund and Helsinki corpora, all indexed by WordCruncher. Macintosh/Unix version of the texts. Texts also indexed by MS-DOS program TA
Hard disk capacity: 1980 = 5 MB, 2009 = 1,000,000 MB
1995: Newdigate newsletters, ICAME web-site, 900 members on Corpora list
2000: ICAME CD-ROM, version 2. COLT CD-ROM with sound files. Internet search of the main corpora for holders of the CD-ROM.
2009: More than 3000 members on the Corpora list.
Content of ICAME CD, version 2 (figure not reproduced)
More material, new CD/DVD
More corpora searchable on Internet
Part of CLARIN (www.clarin.eu)