METHOD FOR DETERMINING THE REPRESENTATIVITY OF A CORPUS

Description

Representativeness remains one of the most controversial issues in corpus linguistics. For specialized corpora, which tend to be much smaller than so-called “general” or “reference” corpora, representativeness is key; indeed, it is one of their defining characteristics. In practice, the minimum size that a specialized corpus must have has not yet been determined objectively: there is no consensus on the minimum number of documents or words a corpus needs in order to be considered valid and representative of the population it is meant to represent.

The present invention is an efficient solution for determining a posteriori the minimum size of a corpus or textual collection, regardless of the language or textual type of the collection. It establishes the minimum representativeness threshold by means of an algorithm (N-Cor) that analyses lexical density as a function of the incremental growth of the corpus.

On this basis, a computer implementation has been developed as a Java application called ReCor. The system has the following main classes: a) Words (computation algorithm, file reading and writing); b) Gui (user interface); and c) Graphic Window (adapter for graphical representation).
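
The incremental analysis described above can be illustrated with a minimal Java sketch. It is not the ReCor or N-Cor code: it assumes that lexical density is measured as the ratio of distinct words (types) to running words (tokens), that documents are added one at a time, and that the curve is treated as stable once the change between two consecutive points falls below a threshold. The class name `LexicalDensitySketch`, the method names, the tokenization rule and the `epsilon` parameter are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; names and the stabilization rule are assumptions,
// not taken from the ReCor application.
public class LexicalDensitySketch {

    /**
     * Returns the cumulative types/tokens ratio after each document is added.
     * Words are lower-cased and split on any run of non-letter, non-digit characters.
     */
    public static List<Double> cumulativeTypeTokenRatio(List<String> documents) {
        Set<String> types = new HashSet<>();
        long tokens = 0;
        List<Double> ratios = new ArrayList<>();
        for (String doc : documents) {
            for (String word : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                types.add(word);
                tokens++;
            }
            ratios.add(tokens == 0 ? 0.0 : (double) types.size() / tokens);
        }
        return ratios;
    }

    /**
     * Returns the index of the first document at which the ratio changes by less
     * than epsilon with respect to the previous document, or -1 if it never does.
     * The threshold is a hypothetical stabilization criterion.
     */
    public static int stabilizationIndex(List<Double> ratios, double epsilon) {
        for (int i = 1; i < ratios.size(); i++) {
            if (Math.abs(ratios.get(i) - ratios.get(i - 1)) < epsilon) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("a b c", "a b d", "a b c d");
        List<Double> ratios = cumulativeTypeTokenRatio(corpus);
        System.out.println("Cumulative types/tokens ratios: " + ratios);
        System.out.println("Stabilizes at document index: " + stabilizationIndex(ratios, 0.3));
    }
}
```

Plotting the returned ratios against the number of documents gives the kind of curve from which a minimum corpus size could be read. The threshold used in `main` is arbitrary and would need to be chosen for a real collection.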

 

Advantages

Among the advantages of the present invention are:
• It is independent of the language or textual type of the collection of documents analyzed.
• It establishes the minimum threshold of representativeness of a corpus.
• It includes input data, output data, graphical representation and output files.
• It is easy to use, with a friendly interface.

 

Uses and Applications

The present invention is a computer-implemented method for processing data, particularly linguistic data and information, and is used to determine the representativeness of a corpus.

 

Patent Number

ES2320511B1

 

Applicants

Universidad De Málaga

 

Inventors

Gloria Corpas Pastor, Miriam Seghiri Dominguez, Romano Maggi

 

Filing Date

05/12/2006 

Protection Level: National (Spain)

Processing Status: Spanish patent
