Using Natural Language Processing (NLP) to extract keywords from webpages and calculate a user's intention profile from their browsing history

The work done by dataspectrum was just what you would expect from top-notch professionals: every problem was solved accurately, promptly and efficiently. Look-alike recommendation and keyword-extraction/text-summarization systems were developed from business requirements all the way to a complete solution running in production. I would certainly recommend dataspectrum as the right team for a high-standards project.

Client: OpenRTB Modular Platform is an open real-time bidding platform: a SaaS for media buyers, advertising agencies, AdTech startups, publishers and advertisers. You can run ads on a CPM basis, use a white-label DSP/SSP solution or create a custom product for your specific needs, as shown in the following case study.

Problem: choose the 15 out of 10000 keywords that best describe a particular webpage

Understanding a user’s intentions is crucial for ad personalization (choosing the one ad, out of thousands of possible ads, that is most interesting to a particular user). The best way to do this is to summarize their browsing history and thereby find their “hottest” topics. For example, if we find that 35% of the webpages John visits contain ‘baseball’, ‘bat’ or ‘new york yankees’, then we can assume that John is fond of baseball and it makes sense to show him Yankees merchandise ads. This is needed not only for automatic ad recommendation but also for humans who read the profiles to get insights about their audience.

Key Highlights


Digital Advertising


  • Headquarters: USA
  • Operating: Worldwide

Technologies in Use

  • Python, Scikit-learn, Pandas, Gensim, Jupyter, IPython, NumPy, Flask, HDFS, Aerospike, Java 8, Deeplearning4j, JUnit, Spring Framework (Boot, Integration, Batch, Security, MVC), JSoup, boilerpipe
  • Deployment: Docker
  • Environment: AWS, Ubuntu

Big Data Scale

  • 110M webpages summarized and rapidly growing

Solution: TF-IDF, Word2Vec, SyntaxNet, Neural Networks, etc

Natural Language Processing (NLP) is a hard task, mainly because human language requires human intelligence (or true artificial intelligence, which as of 2016 does not exist) to be understood. In other words, there is no exact algorithm for it the way there is for multiplying real numbers or rendering 3D objects, so heuristics and machine learning have to be used. From a height of 10000 feet the process looks like this:

  1. The webpage is downloaded
  2. Meaningful text is extracted from the HTML code
  3. The text is parsed and augmented with part-of-speech (POS) tags & relations
  4. Named entities (NE) are found
  5. The whole extracted text is lemmatized
  6. The lemmatized text is split, according to a dictionary, into ngrams: unigrams, bigrams and trigrams
  7. The top 1100 globally most frequent ngrams are removed from the list
  8. The TF-IDF weight of each remaining ngram is computed
  9. The top 20 “heaviest” (by TF-IDF weight) ngrams are picked
  10. The word2vec representation of each of those ngrams is found
  11. The word2vec representations, their TF-IDF weights & all named entities are returned as the result
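Steps 6–9 of the pipeline above can be sketched in miniature. This is a simplified pure-Python illustration (function names are mine, and the real system uses a fixed dictionary, a 1100-entry stop-ngram list and a much larger corpus), not the production code:

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Split a lemmatized token list into uni-, bi- and trigrams (step 6)."""
    out = []
    for n in range(1, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def top_keywords(doc_tokens, corpus, stop_ngrams=frozenset(), k=20):
    """Return the k 'heaviest' ngrams of one document by TF-IDF (steps 7-9)."""
    n_docs = len(corpus)
    doc_grams = [g for g in ngrams(doc_tokens) if g not in stop_ngrams]
    tf = Counter(doc_grams)
    scored = []
    for gram, freq in tf.items():
        # Document frequency: how many corpus documents contain this ngram.
        df = sum(1 for d in corpus if gram in set(ngrams(d)))
        idf = math.log(n_docs / (1 + df)) + 1  # smoothed IDF
        scored.append((gram, (freq / len(doc_grams)) * idf))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```

On a toy corpus, `top_keywords(["baseball", "bat", "game"], corpus, k=3)` ranks ‘bat’ above ‘baseball’ when ‘baseball’ also occurs in other documents and therefore carries a lower IDF.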

Now that every webpage is summarized, how do we calculate a user’s global interests?
  1. First of all, the whole word2vec vector space is split into 10000 “equal” non-intersecting subspaces via K-means clustering. This is done only once and is the same for every user. Each cluster has its own unique index, a positive integer from 1 to 10000.
  2. For every webpage the user has visited, the word2vec vectors found in the previous phase are taken and their clusters are computed.
  3. A webpage-intention vector is built. It is a 10000-dimensional vector of zeros, except at the coordinates whose indexes were found in step 2: those hold the sum of the TF-IDF weights of the webpage’s keywords that fell into the corresponding cluster. So it is a sparse 10000-dimensional vector with a maximum of 25 non-zero values.
  4. The mean of the webpage-intention vectors over every entry in the particular user’s history is returned as the result.
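The profile-building steps above can be sketched in pure Python. This is an illustrative toy that assumes the K-means centroids have already been fitted; the function names are hypothetical, and the real system works with high-dimensional word2vec vectors and 10000 clusters rather than the tiny 2-d example used here:

```python
def nearest_cluster(vec, centroids):
    """Index of the K-means centroid closest to a word2vec vector (squared L2)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
    return dists.index(min(dists))

def page_intention(keywords, centroids, dim):
    """Webpage-intention vector: per-cluster sum of the keywords' TF-IDF weights.

    `keywords` is a list of (word2vec_vector, tfidf_weight) pairs for one page.
    """
    vec = [0.0] * dim
    for w2v_vec, tfidf_weight in keywords:
        vec[nearest_cluster(w2v_vec, centroids)] += tfidf_weight
    return vec

def user_profile(pages, centroids, dim):
    """Mean of the webpage-intention vectors over the user's whole history."""
    vecs = [page_intention(p, centroids, dim) for p in pages]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Because no step here depends on any other user, the computation parallelizes trivially across the browsing history.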

If the result is to be consumed by another program (such as the deep neural network in look-alike ads recommendation), it is left as is. If it is to be consumed by a human (for example, to find insights about an audience), the top-N “heaviest” vector coordinates are picked and, for each, the 5–10 words from the word2vec space closest to the center of the corresponding cluster are returned. Both results can be strengthened with the named entities derived earlier: if a particular NE is found in a sufficient number of the history’s webpages, it may be concatenated to the result as a single one-hot-encoded coordinate or, in the insights scenario, simply added to the resulting list of words. Each step of the process deserves its own article, but to keep this concise only a few highlights will be given.
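The human-readable insights path can be illustrated the same way: take the heaviest profile coordinates and, for each, return the vocabulary words closest to that cluster’s centroid by cosine similarity. The vectors, vocabulary and function name below are toy placeholders, not production values:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def profile_insights(profile, centroids, vocab, top_clusters=2, words_per_cluster=2):
    """For the 'heaviest' clusters of a user profile, list the vocabulary
    words nearest each cluster centroid -- a human-readable topic summary."""
    heaviest = sorted(range(len(profile)), key=lambda i: profile[i], reverse=True)
    insights = {}
    for i in heaviest[:top_clusters]:
        ranked = sorted(vocab, key=lambda w: cosine(vocab[w], centroids[i]), reverse=True)
        insights[i] = ranked[:words_per_cluster]
    return insights
```

A profile dominated by a “sports” cluster would thus surface words like ‘baseball’ and ‘bat’ to the advertiser, rather than an opaque 10000-dimensional vector.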

Extracting text from HTML

Boilerpipe is a neat library designed exactly for that. It runs a small neural network under the hood.

Part-of-speech & relations tagging

SyntaxNet is a state-of-the-art parser open-sourced by Google.

Named entities extraction

A custom recurrent neural network (LSTM) is utilized. About ten different features serve as input; POS tags are among the most important.


In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The TF-IDF value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
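A common variant of the statistic (one of several weighting schemes in use) can be computed directly; this snippet is a worked illustration, not the exact weighting the production system uses:

```python
import math

def tf_idf(term_count, doc_len, n_docs, docs_with_term):
    """tf-idf = (term frequency within the document) * log(N / df)."""
    tf = term_count / doc_len          # how prominent the term is locally
    idf = math.log(n_docs / docs_with_term)  # how rare it is globally
    return tf * idf
```

For a term appearing 3 times in a 100-word page, in a corpus of 1000 pages of which 10 contain the term, the weight is 0.03 × log(100) ≈ 0.138; had 500 pages contained the term, the same local frequency would score far lower.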


The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words. It does so without human intervention. Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or to cluster documents and classify them by topic.
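The “man is to boy as woman is to girl” property comes from vector arithmetic in the embedding space. The snippet below demonstrates the mechanism with tiny hand-crafted 2-d vectors (axis 0 ≈ gender, axis 1 ≈ age); these are purely illustrative toys, whereas real Word2vec vectors are high-dimensional and learned from text, e.g. with Gensim:

```python
import math

# Hand-crafted toy "embeddings" -- NOT real word2vec output.
VECS = {
    "man":    (1.0, 1.0),
    "woman":  (-1.0, 1.0),
    "boy":    (1.0, -1.0),
    "girl":   (-1.0, -1.0),
    "person": (0.0, 1.0),
    "child":  (0.0, -1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vector arithmetic b - a + c,
    then return the nearest word (excluding the query words)."""
    target = tuple(VECS[b][i] - VECS[a][i] + VECS[c][i] for i in range(2))
    candidates = [w for w in VECS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECS[w], target))
```

With these vectors, `analogy("boy", "man", "girl")` lands exactly on the “woman” direction, because the boy→man offset encodes the same age shift as girl→woman.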

Delivering solution: Horizontally scalable stateless docker-based microservices

The end solution consists of numerous independent microservices, each responsible for its own specific task (collecting, parsing, processing, etc.). This approach, sometimes called the “unix way”, has many advantages. Since none of the services share any state, they can be scaled to any capacity instantly, with a click of a mouse, with no conditions to meet and nothing to worry about. Docker is used as the container engine.
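As a rough illustration, a container image for one such stateless service might look like the following (base image, file names and port are hypothetical, not the production configuration):

```dockerfile
# Hypothetical Dockerfile for one stateless keyword-extraction microservice.
FROM python:3-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# The service writes no local state: every replica is interchangeable,
# so scaling out is just starting more containers behind a load balancer.
EXPOSE 8080
CMD ["python", "service.py"]
```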


Keyword extraction & intention calculation is an integral part of the look-alike recommendation system, which delivered a 21% conversion boost (case study). It is also used as a tool that helps advertisers get insights about their audience, although KPIs for that use case are harder to measure.