Pal, in Data Mining (Fourth Edition), 2017

Information Extraction

Another general class of text mining problems is metadata extraction. Metadata was mentioned above as data about data: in the realm of text the term generally refers to salient features of a work, such as its author, title, subject classification, subject headings, and keywords. Metadata is a kind of highly structured (and therefore actionable) document summary. The idea of metadata is often expanded to encompass words or phrases that stand for objects or “entities” in the world, leading to the notion of entity extraction. Ordinary documents are full of such terms: phone numbers, fax numbers, street addresses, email addresses, email signatures, abstracts, tables of contents, lists of references, tables, figures, captions, meeting announcements, Web addresses, and more. In addition, there are countless domain-specific entities, such as international standard book numbers (ISBNs), stock symbols, chemical structures, and mathematical equations. These terms act as single vocabulary items, and many document processing tasks can be significantly improved if they are identified as such. They can aid searching, interlinking, and cross-referencing between documents.

How can textual entities be identified? Rote learning, i.e., dictionary lookup, is one idea, particularly when coupled with existing resources: lists of personal names and organizations, information about locations from gazetteers, or abbreviation and acronym dictionaries.
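
To make the dictionary-lookup idea concrete, here is a minimal Python sketch of entity identification by rote lookup. It is an illustration under stated assumptions, not the method of any particular system: the `GAZETTEER` entries, the entity type labels, and the helper names `tokenize` and `find_entities` are all hypothetical, and the greedy longest-match strategy is one simple design choice among several.

```python
import re

# Hypothetical gazetteer: lowercase phrase -> entity type.
# In practice this would be loaded from name lists, gazetteers,
# or abbreviation dictionaries; these entries are illustrative only.
GAZETTEER = {
    "new york": "LOCATION",
    "new york times": "ORGANIZATION",
    "ada lovelace": "PERSON",
    "isbn": "ABBREVIATION",
}

# Longest phrase (in tokens) that the lookup ever needs to try.
MAX_PHRASE_LEN = max(len(phrase.split()) for phrase in GAZETTEER)


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def find_entities(text):
    """Return (phrase, type) pairs found by greedy longest-match lookup."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first so that
        # "New York Times" is preferred over "New York".
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[candidate]))
                i += n  # treat the whole phrase as a single vocabulary item
                matched = True
                break
        if not matched:
            i += 1
    return entities


if __name__ == "__main__":
    print(find_entities("Ada Lovelace read the New York Times in New York."))
```

Trying the longest candidate first means a multiword entry is treated as one vocabulary item rather than being split into shorter matches. A lookup of this kind only finds phrases that are already listed in the dictionary.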