Monday, October 28, 2019

Natural Language Processing

  • NLP Model
    • Lexical Processing (basic extraction of the words most relevant to the topic/text at hand)
    • Syntactic Processing (Understanding the grammar)
    • Semantic Processing (Understanding the meaning)
  • PMI
    • Pointwise Mutual Information (PMI) is a metric used in advanced tokenisation techniques to identify words in a text that together refer to a single entity (or term).
    • It helps identify words that usually occur together and represent one term/entity, rather than treating them as independent words.
      • e.g. "Indian Institute of Technology": while each of the words Indian, Institute and Technology has its own meaning, together they represent a single entity/term.
    • PMI is calculated as follows:
      • pmi(x; y) = log( P(x, y) / (P(x) P(y)) )
        where x and y are the individual words that together refer to a single term/entity
      • i.e., the log of the ratio between the probability of x and y occurring together and the product of the probabilities of x and y occurring on their own. If the two words were independent, the ratio would be 1 and the PMI 0; a high PMI means they co-occur more often than chance would predict.
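The PMI formula above can be tried out on a toy example. The following is a minimal sketch, assuming x and y are adjacent words (a bigram), that probabilities are estimated from raw counts in a tiny made-up corpus, and that the log is base 2; the function name `pmi_scores` and the corpus are illustrative, not part of any library.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return PMI (log base 2) for every adjacent word pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams          # P(x, y): the pair occurring together
        p_x = unigrams[x] / n_unigrams    # P(x): x occurring on its own
        p_y = unigrams[y] / n_unigrams    # P(y): y occurring on its own
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Made-up corpus: "new york" always appears together, "the" appears everywhere.
tokens = "the new york the big the old new york the".split()
scores = pmi_scores(tokens)
print(round(scores[("new", "york")], 2))  # 2.47
print(round(scores[("the", "new")], 2))   # 0.47
```

The pair ("new", "york") scores higher than ("the", "new") because "the" is frequent on its own, so its co-occurrence with "new" is less surprising. On a real corpus the counts would need smoothing or a minimum-frequency cut-off, since PMI overrates rare pairs.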
  • Syntactic Processing (steps)
    • Parsing: understanding the parts of a sentence and how they relate to each other, i.e. identifying verbs, nouns, subjects, objects, etc.
      • Part-of-Speech (POS) tagging, a form of shallow parsing
      • Constituency (or Paradigmatic) parsing
      • Dependency parsing
  • Dependency grammar (as opposed to constituency grammar) can be traced back to Pāṇini's grammar rules.
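To make the difference between the three kinds of parsing concrete, here is a small sketch of what each one produces for a single sentence. The tags, tree and arcs are written by hand for illustration (they are assumptions, not the output of a real parser), and the label names are common conventions rather than a fixed standard.

```python
# Toy sentence; everything below is hand-annotated, not produced by a parser.
sentence = ["The", "dog", "chased", "the", "cat"]

# 1. POS tagging: one tag per token, no structure between tokens.
pos_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]

# 2. Constituency parse: nested phrases, written as (label, children...).
constituency = (
    "S",
    ("NP", ("DET", "The"), ("NOUN", "dog")),
    ("VP", ("VERB", "chased"),
           ("NP", ("DET", "the"), ("NOUN", "cat"))),
)

# 3. Dependency parse: word-to-word arcs as (head index, relation, dependent index).
#    A head index of -1 marks the root of the sentence.
dependencies = [
    (2, "nsubj", 1),   # dog    <- chased (subject)
    (1, "det", 0),     # The    <- dog
    (2, "obj", 4),     # cat    <- chased (object)
    (4, "det", 3),     # the    <- cat
    (-1, "root", 2),   # chased is the root
]

def leaves(tree):
    """Collect the words at the leaves of a constituency tree, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

print(leaves(constituency))
```

The constituency tree groups words into phrases (NP, VP), while the dependency arcs link each word directly to the word it depends on, so every token has exactly one head.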
  • Topic modelling
    • This is about determining the topic of a given document, also called its aboutness, i.e. what the document or a particular chunk of text is about.
    • Aboutness is not binary but a matter of degree, e.g. a text mentioning sugar can be about health, the sugar industry or diabetes to varying degrees.
    • Topic modelling/extraction approaches
      • PLSA - Probabilistic Latent Semantic Analysis
      • LDA - Latent Dirichlet Allocation
      • ESA - Explicit Semantic Analysis
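The idea of aboutness as a degree can be illustrated with a toy score. The sketch below is loosely in the spirit of ESA, in that each topic is given explicitly as a bag of words and a document is compared with it by cosine similarity; it is not PLSA or LDA, which learn latent topics from a corpus. The topic vocabularies and the function names are hypothetical, hand-picked for this example.

```python
import math
from collections import Counter

# Hypothetical, hand-picked topic vocabularies (an assumption for illustration).
concepts = {
    "health": "sugar diet calories diabetes exercise health".split(),
    "sugar industry": "sugar cane mill export price industry".split(),
    "diabetes": "diabetes insulin glucose sugar blood".split(),
}

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def aboutness(text):
    """Score how strongly `text` is about each topic, from 0 (not at all) to 1."""
    doc = Counter(text.lower().split())
    return {name: cosine(doc, Counter(words)) for name, words in concepts.items()}

scores = aboutness("high blood sugar raises insulin demand and diabetes risk")
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

The same sentence gets a non-zero score for all three topics, highest for diabetes and lowest for the sugar industry, which is the graded notion of aboutness described above.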

  • NLP Resources
