by Srinivasan Seshadri (@sesh-zetatta) on Tuesday, 5 July 2011
- Session type
- Technical level
Hands-on knowledge of building a substantially large-scale system, and the experience and lessons learned along the way.
I was the founding CTO of Kosmix, which has since been acquired by Walmart for its e-commerce efforts. Kosmix started off in 2004 as a next-generation search engine; the thesis was that categorization is fundamental to understanding the (information in the) web.
In any case, as a result of this endeavour we built a hugely scalable system that crawled and indexed over 10 billion URLs and served over a million search queries each day. Righthealth.com, powered by Kosmix, became the #1 health web site in the world in terms of traffic.
We realized first-hand the need for a distributed file system such as GFS, for a job-tracking system that would automatically restart failed jobs, and for a computation framework that would make many computations simple and easy (now MapReduce, Pig, Hive, etc.).
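To make the computation-framework point concrete, here is a minimal in-process sketch of the MapReduce idea (a toy illustration of the map/shuffle/reduce phases, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the emitted pairs by key and sum the values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the web is large", "the web is categorized"]
result = reduce_phase(map_phase(docs))
# result["the"] == 2 and result["web"] == 2
```

The appeal of the framework is that the programmer writes only the map and reduce functions; partitioning the input, scheduling, and restarting failed tasks are the framework's job.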
We had to write some code in assembly to get the desired performance and keep the capex budget from blowing out of proportion.
I have also been involved with and helped several other large-data projects, such as Citrusleaf (www.citrusleaf.net) and Inmobi's data warehouse, and have helped a few companies reason about how to build what the Aadhaar (UID) project needs in terms of scale.
As far back as 1988 I was involved with building parallel database systems (Gamma at UW-Madison, and Brahma at IIT Bombay, where I was a faculty member), which in today's world would be called a DB system in the cloud -- but really there is nothing new conceptually in the idea of distributing data and work to multiple machines and collating the results.
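That scatter-gather pattern, common to the parallel databases of that era and to today's cloud systems alike, can be sketched in a few lines (a toy single-process illustration; real systems distribute the chunks across machines):

```python
def scatter(data, n_workers):
    # Partition the input into roughly equal chunks, one per worker.
    return [data[i::n_workers] for i in range(n_workers)]

def worker(chunk):
    # Each worker computes a partial result over its own chunk.
    return sum(chunk)

def gather(partials):
    # Collate the partial results into the final answer.
    return sum(partials)

data = list(range(100))
partials = [worker(chunk) for chunk in scatter(data, 4)]
assert gather(partials) == sum(data)  # same answer as a single machine
```

The hard parts in practice are not the pattern itself but data placement, skew, and fault tolerance, which is where the systems mentioned above differ.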
I am looking for feedback on which aspects, if any, of these experiences would be interesting to the community; we can tailor the session accordingly.
Here are some possible focus areas -- topics where we can delve deep:
i) Building a large data warehouse (using cloud computing). Issues: How large is large? Do we need real-time answers to queries? Are queries of the streaming variety (needing to look at only the latest data)? Depending on the tradeoffs that are possible, different solutions can be assembled from a combination of Hadoop, Hive, Pig, and other FOSS.
ii) Building a feature-rich, ultra-fast web search (using cloud computing). How does one build an ultra-fast search engine that also gives categorized results? How does one put together disparate media types in a search result? How does one rank these disparate media types?
iii) Building a large, scalable search backend (a system for crawling, indexing, annotating, categorizing, etc.) for billions of URLs.
None -- just good humour!