Introduction

Hydra, a distributed task processing system developed six years ago by social bookmarking service provider AddThis, has been released as open source under the Apache license, just like Hadoop, but it has not yet gained the same popularity and momentum. Its creators claim that the "multi-headed" platform is very adept at certain big data tasks, in particular the real-time processing of very large datasets.

Hydra is a big data storage and processing platform developed by Matt Abrams and his colleagues at AddThis. AddThis, formerly known as Clearspring, develops website widgets that let visitors easily share content through Twitter, Facebook, Pinterest, Google+, or Instagram.

As AddThis expanded its business, it struggled to keep up with the growing volume of user data. The company needed a scalable distributed system for real-time analysis of the data shared by its users. At the time, Hadoop could not meet AddThis's requirements, so the company developed Hydra.

Hydra is a distributed task processing system that supports both stream processing and batch processing. It uses a tree-based data structure to store and process data across clusters of thousands of nodes. It has a Linux-based file system that is compatible with ext3, ext4, and even ZFS. It also has a job/cluster management component that automatically assigns new jobs to the cluster and rebalances existing jobs. The system can also automatically back up data and handle node failures.
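
To make the idea of automatic job assignment and failure handling more concrete, here is a minimal sketch in Python. It is not Hydra's scheduler, and nothing in it comes from the Hydra code base: the function names, node names, job names, and cost figures are all hypothetical, and the greedy "least-loaded node first" rule is simply one common way to balance work.

```python
import heapq


def assign_jobs(node_names, job_costs):
    """Greedily place each job on the node with the lowest current load.

    Illustrative only: this is an assumed balancing rule, not Hydra's.
    """
    # Heap entries are (current_load, node_name); the smallest load pops first.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    placement = {name: [] for name in node_names}
    # Placing the most expensive jobs first tends to even out the final loads.
    for job, cost in sorted(job_costs.items(), key=lambda item: -item[1]):
        load, name = heapq.heappop(heap)
        placement[name].append(job)
        heapq.heappush(heap, (load + cost, name))
    return placement


def reassign_on_failure(placement, failed_node, job_costs):
    """Move the jobs of a failed node onto the surviving nodes."""
    orphaned = placement.pop(failed_node)
    heap = [
        (sum(job_costs[job] for job in jobs), name)
        for name, jobs in placement.items()
    ]
    heapq.heapify(heap)
    for job in sorted(orphaned, key=lambda j: -job_costs[j]):
        load, name = heapq.heappop(heap)
        placement[name].append(job)
        heapq.heappush(heap, (load + job_costs[job], name))
    return placement


# Hypothetical jobs with made-up relative costs.
jobs = {"parse-logs": 5, "build-tree": 8, "backup": 2, "query-index": 3}
placement = assign_jobs(["node-a", "node-b", "node-c"], jobs)
print(placement)

# Simulate losing one node and spreading its jobs over the others.
placement = reassign_on_failure(placement, "node-b", jobs)
print(placement)
```

A real cluster manager would also have to move the data that each job depends on, which is where the automatic backups mentioned above come in.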

Hydra includes many components: a distributed job execution system that processes tasks across heterogeneous clusters, a network-accessible file service, and local and remote backups (a safeguard against node failures, which are difficult to prevent).

Because it is built on a tree structure, Hydra can process streaming data and perform batch operations at the same time. Chris Burroughs, a member of the AddThis engineering department, first announced the open-sourcing of Hydra in a blog post on January 23rd. The post also gives an insightful description of Hydra: "It ingests stream data (such as log files) and generates aggregation trees, summary trees, or data transformation trees that can be used to explore (small queries), as part of machine learning (large queries), or to support real-time consoles on websites (lots of queries)."
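
The description above can be illustrated with a small sketch. The Python below is a toy model of the idea only, assuming a tree whose levels are fields of each log record and whose nodes hold running counts; it does not use Hydra's actual API or job configuration format, and the class, function, and field names are all hypothetical.

```python
from collections import defaultdict


class TreeNode:
    """One node of a toy aggregation tree: a counter plus named children."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TreeNode)


def ingest(root, record, path_fields):
    """Fold one record into the tree, incrementing counters along its path."""
    node = root
    node.count += 1
    for field in path_fields:
        node = node.children[record[field]]
        node.count += 1


def query(root, *path):
    """Return the aggregated count stored at the given path, or 0 if absent."""
    node = root
    for key in path:
        if key not in node.children:
            return 0
        node = node.children[key]
    return node.count


# Hypothetical share events, standing in for parsed log lines.
events = [
    {"date": "2014-01-23", "service": "twitter", "url": "/a"},
    {"date": "2014-01-23", "service": "twitter", "url": "/b"},
    {"date": "2014-01-23", "service": "facebook", "url": "/a"},
    {"date": "2014-01-24", "service": "twitter", "url": "/a"},
]

root = TreeNode()
for event in events:  # records are folded in one at a time, as a stream
    ingest(root, event, ["date", "service", "url"])

print(query(root, "2014-01-23"))             # shares on that day
print(query(root, "2014-01-23", "twitter"))  # of those, shares via Twitter
```

Because every record only updates the counters along one path, the tree can be kept current as new data arrives, and a query is a short walk from the root rather than a scan of the raw logs.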

Hydra was originally built for internal use, to help AddThis solve its own problems and to provide services to website operators.

AddThis continues to use Hydra to handle its massive data traffic and to analyze trends on its clients' websites. AddThis can help you understand what people share online and which topics are most popular. The social bookmarking service is used by more than 13 million websites, reaches 1.3 billion users a month, and averages 3 billion views per day, which generate 10 TB of data. At AddThis, Hydra runs on thousands of network nodes.

Abrams told Datanami via email, "We have been dealing with large datasets for a long time, and Hydra has always been very useful to us. We feel that it solves the problem of distributed data processing in a unique way."

Traditional Hadoop is oriented toward batch processing, while Hydra supports both batch processing and real-time stream processing. Abrams said, "The batch processing supported by Hydra mainly focuses on stream analysis and incremental data processing. It uses tree data structures to describe the data, which compresses the data naturally and allows efficient query access. Hydra can consume data from HDFS and produce data to it, but it completes operations on the local file system, which allows it to flexibly use other services on Hydra."

Hydra is now open source, and Abrams hopes that the software will be more widely used and further developed. "This will take some time, but we believe that in the future we will build a comprehensive Hydra open source community, so that both AddThis and the open source community can benefit from Hydra's future development. There are already some other companies using Hydra in Washington, D.C., and we look forward to further growth of the Hydra community."

In the autumn of 2013, Doug Cutting, the creator of Hadoop and chief architect of Cloudera, lamented the lack of alternatives to Hadoop. At the time, Cutting said, "How I look forward to more systems like Hadoop appearing..." Although Hadoop now dominates the big data industry, who can say that it will remain the only distributed computing platform for big data? I believe that the future development of Hydra will not disappoint him. On that future, I would like to quote Cutting once more: "The sky is the limit."
