How Google and Facebook Research Drives Much of Big Data in Finance

By Irene Aldridge

This article first appeared in the Big Data Finance magazine

Social media has fascinated Finance for about a decade. Extracting sentiment from online posts has proven both innovative as a gauge of investor sentiment and profitable for estimating the direction of the impending price move and volatility. Companies like AbleMarkets, a Big Data platform and a supplier of an Internet sentiment index for most U.S.-based stocks, and Suite LLC, a provider of industrial-grade derivatives pricing and risk management software, agree: Internet sentiment is highly predictive of impending volatility.

In addition to social media sentiment, a new kind of social media influence is entering Finance as we know it: social media Big Data techniques. Discussed in detail at the upcoming two-day Big Data Finance conference on May 9–10, 2019, at Cornell Tech in NYC, the techniques are poised to disrupt even today’s cutting-edge financial applications and bring a completely new order into the financial system.

When we speak of true Big Data and the discipline’s techniques, many if not most of the tools were originally developed or perfected by the social media companies, today’s ‘unicorn’ multibillion-dollar entities like Google and Facebook. In addition to the efficient data parsing, factorization and content recognition discussed in the previous section, much of the social media research has been geared toward fast identification of ‘on-the-fly’, or dynamic, transformations of data.

One of Google’s leading applications, for example, was its efficient system for ranking web links. Google pioneered and optimized the use of web-crawling “spiders” to rank the World Wide Web. The underlying sampling technology, on which Google founders Sergey Brin and Larry Page were working during their PhD studies at Stanford, is directly applicable to many traditionally time-consuming financial applications.

Google’s approach to automatically and efficiently ranking web content works as follows. The Google web spider starts at a random web page, where it scans all the links to other web pages and randomly selects one to follow. Once on the next page, the spider repeats its activity, identifying and randomly following additional links until it reaches a page with no outbound links, and then goes back ‘home’ to start the process from the beginning. There, the spider once again selects a new web page at random, identifies all the links presented on the page, selects one link at random, follows it and repeats the process until another “dead-end” web page is reached. While the spider crawls the Internet, it records and transmits its activity back to Google databases.

In randomly sampling the links, Google imitates the web surfing activity of random individuals browsing at their leisure or for some other purpose. Once Google performs the surfing operation a sufficient number of times, it can estimate the ‘transition probabilities’ of the web universe, that is, the probabilities of moving from any one page to any other. In general, processes governed by such transitions are known as Markov Chains. From the sampled Markov Chain transitions, Google perfected a fast Big Data technique for extrapolating the transitions for the entire web-browsing population, allowing Google to efficiently rank the entire universe of websites in a very short amount of time.
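The random-surfer idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual implementation: the four-page link graph `LINKS`, the function names `crawl` and `transition_probabilities`, and the step count are all hypothetical. The sketch follows random links, restarts at a random page after a dead end, counts the observed transitions, and ranks pages by how often the surfer lands on them.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy web: page -> outbound links. Page "D" is a dead end.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

def crawl(links, n_steps, seed=0):
    """Follow random links, restarting at a random page after a dead end."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = Counter()                 # how often each page was visited
    transitions = defaultdict(Counter) # observed link-following moves
    page = rng.choice(pages)
    for _ in range(n_steps):
        visits[page] += 1
        outbound = links[page]
        if not outbound:
            # Dead end: go back 'home' and start again from a random page.
            page = rng.choice(pages)
            continue
        nxt = rng.choice(outbound)
        transitions[page][nxt] += 1
        page = nxt
    return visits, transitions

def transition_probabilities(transitions):
    """Turn raw transition counts into estimated probabilities per source page."""
    probs = {}
    for src, counts in transitions.items():
        total = sum(counts.values())
        probs[src] = {dst: c / total for dst, c in counts.items()}
    return probs

visits, transitions = crawl(LINKS, n_steps=100_000)
probs = transition_probabilities(transitions)

# Rank pages by the share of time the random surfer spent on each.
total_visits = sum(visits.values())
ranking = sorted(visits, key=visits.get, reverse=True)
for page in ranking:
    print(page, round(visits[page] / total_visits, 3))
```

On this toy graph the estimated probability of moving from A to B settles near 0.5, since A has two outbound links, and the most frequently visited page comes out on top of the ranking.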

What does any of this have to do with Finance? Markov Chains often occur in Finance. The most obvious application is transitions among the credit ratings of a borrower. Here, the borrower may stay in the same rating bucket, move up the ladder or move down. Markov Chains are also the foundation of the Poisson process, a model that is used to approximate information diffusion in the markets, the long memory of disruptive events like economic crises and news announcements, jumps in the prices of instruments underlying derivatives, and other phenomena. A traditional approach to dealing with Markov Chains calls for lengthy and computationally expensive multiplication of matrices. Big Data, on the other hand, provides a speedy and computationally efficient solution discussed in this Chapter that makes Markov Chain estimation a breeze.
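To make the credit-rating example concrete, the sketch below compares the traditional route, repeated matrix multiplication, with a sampling route in the spirit of the random surfer. The article does not spell out the specific Big Data technique, so plain Monte Carlo simulation is used here as a stand-in. The rating buckets, the one-year transition matrix `P` and all function names are hypothetical and purely illustrative, not real rating-agency data.

```python
import random

RATINGS = ["A", "B", "C", "Default"]

# Hypothetical one-year transition probabilities (illustrative, not real data).
# Row i gives the probabilities of moving from rating i to each rating;
# every row sums to 1, and "Default" is absorbing.
P = [
    [0.90, 0.08, 0.015, 0.005],
    [0.05, 0.85, 0.08, 0.02],
    [0.01, 0.09, 0.80, 0.10],
    [0.00, 0.00, 0.00, 1.00],
]

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(p, years):
    """Traditional approach: multiply the one-year matrix by itself `years` times."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(years):
        result = mat_mul(result, p)
    return result

def simulate(p, start, years, n_paths, seed=0):
    """Sampling approach: follow many random rating paths and count where they end."""
    rng = random.Random(seed)
    states = list(range(len(p)))
    counts = [0] * len(p)
    for _ in range(n_paths):
        s = start
        for _ in range(years):
            s = rng.choices(states, weights=p[s])[0]
        counts[s] += 1
    return [c / n_paths for c in counts]

# Five-year outlook for a borrower currently rated "B" (index 1).
exact = mat_power(P, 5)[1]
approx = simulate(P, start=1, years=5, n_paths=50_000)
for name, e, a in zip(RATINGS, exact, approx):
    print(f"{name:8s} exact={e:.4f} sampled={a:.4f}")
```

For a four-state chain the exact product is, of course, the cheaper option. The sampling route becomes attractive only when the number of states is so large that forming and multiplying the full matrix is impractical.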

The innovation of the Google method, however, extends even further. Its intelligent sampling is extremely useful in all applications involving lots of data: take multi-asset portfolios with the number of instruments in excess of Russell 1500, or something like high-frequency trading, where the number of intraday data points reaches billions for one instrument in just one day. Not all data points are equal in their information content: some are packed with value, whereas others are just place fillers. Strategic Big Data sampling allows us to discriminate between content and noise on the fly, and to estimate price trends and distributions at high speed without processing every possible byte of information.

In the fastest-moving financial information environment to date, the newest motto is the good old “You snooze, you lose.” Big Data is upon us, and waiting to learn it is deadly.

The techniques described above will be discussed at the upcoming 7th Annual Big Data Finance conference at the Cornell Tech campus in NYC on May 9–10, 2019, where Irene Aldridge will be speaking on the subject.

Irene Aldridge is a Managing Director of AbleMarkets, a Big Data and AI Platform for Finance, and a Visiting Professor at Cornell University. Irene is the author of High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems (2nd edition, Wiley, 2013) and a co-author of Real-Time Risk: What Investors Should Know About Fintech, High-Frequency Trading and Flash Crashes (Wiley, 2017) and Big Data Science in Finance: Theory and Applications (forthcoming). Irene is also a member of the Editorial Board of the Journal of Financial Data Science.