Natural Language Processing
- NLP Model
- Lexical Processing (extracting the words that are most relevant to the topic/text at hand)
- Syntactic Processing (Understanding the grammar)
- Semantic Processing (Understanding the meaning)
- PMI
- Pointwise Mutual Information (PMI) is a metric used in advanced tokenisation techniques to identify words in a text that collectively refer to a single entity (or term).
- It helps identify words that usually go together, representing a term/entity as opposed to treating them as independent words.
- e.g., "Indian Institute of Technology": while each of the words Indian, Institute & Technology has its own meaning or standing, together they represent a single entity/term.
- PMI is calculated as follows:
- pmi(x; y) = log( P(x, y) / (P(x) P(y)) )
where x & y are the individual words that may collectively refer to a single term/entity, i.e., the log of the ratio between the probability of x & y occurring together and the product of the probabilities of x & y occurring separately. A PMI well above 0 means the words occur together more often than they would if they were independent.
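The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokeniser: it assumes the probabilities are estimated from raw counts in a single token list, that "occurring together" means x immediately followed by y, and that the log is base 2 (the notes do not specify a base). The function name `pmi` and the toy sentence are made up for the example.

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """Estimate PMI of the adjacent pair (x, y) from raw counts in `tokens`.

    Assumptions: P(x) and P(y) are unigram relative frequencies, P(x, y) is
    the relative frequency of the bigram "x y", and the log is base 2.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    if p_xy == 0:
        # The pair never occurs together, so PMI is negative infinity.
        return float("-inf")
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus (hypothetical), already lower-cased and split on whitespace.
tokens = "new york is big and new york is old and the city is new".split()

score_new_york = pmi(tokens, "new", "york")
score_is_new = pmi(tokens, "is", "new")
print(score_new_york > score_is_new)
```

On this toy corpus "new york" scores higher than "is new", which is the kind of signal a tokeniser could use to keep "new york" together as one term. Real corpora usually need a minimum-count cutoff as well, because PMI tends to overrate very rare pairs.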
- Syntactic Processing (steps)
- Parsing: understanding the various parts of a sentence and how they relate to each other, i.e., identifying verbs, nouns, subjects, objects, etc.
- Parts of Speech (POS) tagging, often grouped under shallow parsing
- Constituency (or phrase-structure) parsing
- Dependency parsing
- Dependency grammar (as opposed to constituency grammar) can be traced back to Pāṇini's grammar rules for Sanskrit.
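To make the difference between the parsing styles above concrete, here is a small sketch that writes out both analyses of one sentence as plain Python data. Nothing here is produced by a parser: the trees are hand-written for illustration, the POS tags follow Penn Treebank conventions, the relation names follow Universal Dependencies conventions, and the helper names (`leaves`, `pos_tags`, `dependents`) are hypothetical.

```python
# Hand-written analyses of "The dog chased the cat" (not parser output).

# Constituency view: nested phrases, written as (label, child, ...) tuples.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)

# Dependency view: (head, relation, dependent) edges between words.
# "chased" is the root, so it appears only as a head.
dependencies = [
    ("chased", "nsubj", "dog"),
    ("chased", "obj", "cat"),
    ("dog", "det", "The"),
    ("cat", "det", "the"),
]

def pos_tags(tree):
    """Return (word, tag) pairs from the leaves of a constituency tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(pos_tags(child))
    return pairs

def leaves(tree):
    """Return the words of the sentence, left to right."""
    return [word for word, _tag in pos_tags(tree)]

def dependents(edges, head, relation):
    """Return the words attached to `head` by `relation`."""
    return [dep for h, rel, dep in edges if h == head and rel == relation]

print(leaves(constituency))
print(pos_tags(constituency))
print(dependents(dependencies, "chased", "nsubj"))
```

The constituency tree groups words into phrases (NP, VP), and the POS tags fall out of its lowest level. The dependency edges skip the phrases and link words directly, which makes questions such as "what is the subject of chased?" a one-line lookup.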
- Topic modelling
- This pertains to determining the topic of a given document, also called aboutness, i.e., what the document or a particular chunk of text is about.
- Aboutness is not binary but a matter of degree, e.g., a text on sugar can be about health, the sugar industry, or diabetes to varying degrees.
- Topic modelling/extraction approaches
- PLSA - Probabilistic Latent Semantic Analysis
- LDA - Latent Dirichlet Allocation
- ESA - Explicit Semantic Analysis
- NLP Resources