Author Schäfer, Roland.
Title Web corpus construction [electronic resource] / Roland Schäfer and Felix Bildhauer.
Publication Info. San Rafael, Calif. (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, c2013.
Location Call No. Status Notes
 Libraries Electronic Books  ELECTRONIC BOOKS-DDA    AVAIL. ONLINE
Description 1 online resource.
Series Synthesis digital library of engineering and computer science.
Synthesis lectures on human language technologies ; # 22. 1947-4059
Note Part of: Synthesis digital library of engineering and computer science.
Title from PDF t.p. (viewed on August 14, 2013).
Series from website.
Bibliography Includes bibliographical references (p. 111-128).
Contents 1. Web corpora --
2. Data collection -- 2.1 Introduction -- 2.2 The structure of the web -- 2.2.1 General properties -- 2.2.2 Accessibility and stability of web pages -- 2.2.3 What's in a (national) top level domain? -- 2.2.4 Problematic segments of the web -- 2.3 Crawling basics -- 2.3.1 Introduction -- 2.3.2 Corpus construction from search engine results -- 2.3.3 Crawlers and crawler performance -- 2.3.4 Configuration details and politeness -- 2.3.5 Seed URL generation -- 2.4 More on crawling strategies -- 2.4.1 Introduction -- 2.4.2 Biases and the pagerank -- 2.4.3 Focused crawling --
3. Post-processing -- 3.1 Introduction -- 3.2 Basic cleanups -- 3.2.1 HTML stripping -- 3.2.2 Character references and entities -- 3.2.3 Character sets and conversion -- 3.2.4 Further normalization -- 3.3 Boilerplate removal -- 3.3.1 Introduction to boilerplate -- 3.3.2 Feature extraction -- 3.3.3 Choice of the machine learning method -- 3.4 Language identification -- 3.5 Duplicate detection -- 3.5.1 Types of duplication -- 3.5.2 Perfect duplicates and hashing -- 3.5.3 Near duplicates, Jaccard coefficients, and shingling --
4. Linguistic processing -- 4.1 Introduction -- 4.2 Basics of tokenization, part-of-speech tagging, and lemmatization -- 4.2.1 Tokenization -- 4.2.2 Part-of-speech tagging -- 4.2.3 Lemmatization -- 4.3 Linguistic post-processing of noisy data -- 4.3.1 Introduction -- 4.3.2 Treatment of noisy data -- 4.4 Tokenizing web texts -- 4.4.1 Example: missing whitespace -- 4.4.2 Example: emoticons -- 4.5 POS tagging and lemmatization of web texts -- 4.5.1 Tracing back errors in POS tagging -- 4.6 Orthographic normalization -- 4.7 Software for linguistic post-processing --
5. Corpus evaluation and comparison -- 5.1 Introduction -- 5.2 Rough quality check -- 5.2.1 Word and sentence lengths -- 5.2.2 Duplication -- 5.3 Measuring corpus similarity -- 5.3.1 Inspecting frequency lists -- 5.3.2 Hypothesis testing with χ² -- 5.3.3 Hypothesis testing with Spearman's rank correlation -- 5.3.4 Using test statistics without hypothesis testing -- 5.4 Comparing keywords -- 5.4.1 Keyword extraction with χ² -- 5.4.2 Keyword extraction using the ratio of relative frequencies -- 5.4.3 Variants and refinements -- 5.5 Extrinsic evaluation -- 5.6 Corpus composition -- 5.6.1 Estimating corpus composition -- 5.6.2 Measuring corpus composition -- 5.6.3 Interpreting corpus composition -- 5.7 Summary --
Bibliography -- Authors' biographies.
Indexed In: Compendex
INSPEC
Google scholar
Google book search
Note Also available in print.
Reproduction Electronic reproduction. Perth, W.A. Available via World Wide Web.
Subject Corpora (Linguistics) -- Data processing.
Computational linguistics.
Web search engines.
corpus creation
web corpora
web crawling
web characterization
boilerplate removal
language identification
duplicate detection
near-duplicate detection
tokenization
POS tagging
noisy data
corpus evaluation
corpus comparison
keyword extraction
Added Author Bildhauer, Felix.
Ebooks Corporation
Related To Print version: 9781608459834
ISBN 9781608459841 (electronic bk.)
1608459845 (electronic bk.)
9781608459834 (pbk.)
Standard No. 10.2200/S00508ED1V01Y201305HLT022 doi