The Apache Software Foundation is 20 years old this year and now supports more than 350 open source projects, all maintained by a community of more than 770 individual members and 7,000 committers distributed across six continents. Here are the Top Five Apache Software projects in 2019, as listed by the foundation.
Top Five Apache Software Projects in 2019
Released in 2006, Apache Hadoop is an open source software library for the distributed processing of large datasets across clusters of computers using simple programming models. A key feature of Hadoop is that the library will detect and handle failures at the application level. Essentially it’s a framework that facilitates distributed big data storage and big data processing.
The Java-based framework includes a storage layer called the Hadoop Distributed File System (HDFS). The file system splits large files into blocks, which are then spread out across different nodes in a computer cluster. Hadoop Common forms the base of the framework, as it holds all of the common libraries and files that support the other Hadoop modules.
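The block-splitting idea can be illustrated with a minimal Python sketch. This is not HDFS code: the function names, the tiny block size, the node names and the round-robin placement are all assumptions made for illustration, and real HDFS uses far larger blocks and a more sophisticated placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_count, nodes, replication=2):
    """Assign each block index to `replication` distinct nodes, round-robin.

    Hypothetical placement rule, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    return {
        index: [nodes[(index + r) % len(nodes)] for r in range(replication)]
        for index in range(block_count)
    }


# A 10-byte "file" split into 4-byte blocks and spread over three nodes.
file_data = b"x" * 10
blocks = split_into_blocks(file_data, block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"])
```

Because every block lives on more than one node in this sketch, losing a single node still leaves a copy of each block available, which is the intuition behind handling failures in software rather than relying on hardware.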
Since Hadoop has the most active visits and downloads of all Apache’s software offerings, it’s no surprise that a long list of companies rely on it for their data storage and processing needs.
One such user is Adobe, which notes: “We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We constantly write data to Apache HBase and run MapReduce jobs to process then store it back to Apache HBase or external systems.”
Apache Kafka, developed in 2011, is a distributed streaming platform that lets developers publish and subscribe to streams of records in a way similar to a message queue. Kafka is used to build data pipelines that stream in real time. It is also used to create applications that react to or transform an ingested real-time data stream.
Kafka is written in the Scala and Java programming languages. It stores streams of records in a cluster in categories called topics, and each record consists of a key, a value and a timestamp. It runs using four key APIs: Producer, Consumer, Streams and Connector. Kafka is used by many companies as a fault-tolerant publish-subscribe messaging system as well as a means to run real-time analytics on data streams.
The open-source software is used by LinkedIn, which incidentally first developed the platform, for activity stream data and operational metrics. Twitter uses it as part of its processing and archival infrastructure: “Because Kafka writes the messages it receives to disk and supports keeping multiple copies of each message, it is a durable store. Thus, once the information is in it we know that we can tolerate downstream delays or failures by processing, or reprocessing, the messages later.”
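To make the publish-subscribe model and the idea of reprocessing concrete, here is a minimal in-memory sketch in Python. It does not use the Kafka client libraries. The `Record`, `Topic` and `Consumer` classes are hypothetical names for illustration, and the sketch assumes a single unpartitioned log held in memory, whereas real Kafka persists records to disk across a cluster.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One entry in a topic: a key, a value and a timestamp."""
    key: str
    value: str
    timestamp: float


@dataclass
class Topic:
    """An append-only log of records (illustrative, in memory only)."""
    name: str
    log: list = field(default_factory=list)

    def publish(self, key, value):
        """Append a record and return its offset in the log."""
        self.log.append(Record(key, value, time.time()))
        return len(self.log) - 1


class Consumer:
    """Reads a topic from its own offset, so records can be replayed."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        """Return all records not yet seen and advance the offset."""
        records = self.topic.log[self.offset:]
        self.offset += len(records)
        return records

    def seek(self, offset):
        """Move back (or forward) to reprocess records from `offset`."""
        self.offset = offset


topic = Topic("page-views")
topic.publish("user-1", "/home")
topic.publish("user-2", "/pricing")

consumer = Consumer(topic)
first_read = consumer.poll()   # both records
second_read = consumer.poll()  # nothing new yet
consumer.seek(0)
replayed = consumer.poll()     # the same two records again
```

The records stay in the log after they are read, which is why a consumer that falls behind or fails can rewind its offset and process them again.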
Lucene is a search engine software library that provides a Java-based search and indexing platform. The engine supports ranked searching as well as a number of query types such as phrase queries, wildcard queries, proximity queries and range queries. Apache estimates that an index built with Lucene takes up roughly 20-30 percent of the size of the text indexed.
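The general technique behind this kind of search, an inverted index that records where each term occurs, can be sketched in a few lines of Python. This is not Lucene's API or its on-disk format: the `InvertedIndex` class and its methods are hypothetical, and tokenisation is assumed to be simple lowercasing and whitespace splitting.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the documents and positions where it occurs."""

    def __init__(self):
        # term -> {doc_id: [positions of the term in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        """Index a document using naive whitespace tokenisation."""
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(position)

    def term_query(self, term):
        """Return the ids of documents containing the term."""
        return set(self.postings.get(term.lower(), {}))

    def phrase_query(self, phrase):
        """Return the ids of documents containing the terms consecutively."""
        terms = phrase.lower().split()
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                # Every later term must sit exactly i positions after the first.
                if all(
                    start + i in self.postings.get(term, {}).get(doc_id, [])
                    for i, term in enumerate(terms[1:], 1)
                ):
                    hits.add(doc_id)
                    break
        return hits


index = InvertedIndex()
index.add(1, "open source search engine")
index.add(2, "search the open web")
```

Storing positions alongside document ids is what makes phrase and proximity queries possible: a plain term lookup only needs the ids, while a phrase match checks that the positions line up.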
Lucene was first written in Java back in 1999 by Doug Cutting, before the platform joined the Apache Software Foundation in 2001. Ports of it are now available in other programming languages, including Perl, C++, Python, Object Pascal, Ruby and PHP.
Lucene is used by Benipal Technologies, which states: “We are heavy Lucene users and have forked the Lucene / SOLR source code to create a high volume, high performance search cluster with MapReduce, HBase and katta integration, achieving indexing speeds as high as 3000 Documents per second with sub 20 ms response times on 100 Million + indexed documents.”
POI is an open-source API that programmers use to manipulate Microsoft Office file formats, such as the Office Open XML standards and Microsoft’s OLE 2 Compound Document format. With POI, programmers can create, display and modify Microsoft Office files using Java programs.
The German railway company Deutsche Bahn is among the major users, creating a software toolchain in order to establish a pan-European train protection system.
A part of that chain is a “domain-specific specification processor which reads the relevant requirements documents using Apache POI, enhances them and ultimately stores their contents as ReqIF. Contrary to DOC, this XML-based file format allows for proper traceability and versioning in a multi-tenant environment. Thus, it lends itself much better to the management and interchange of large sets of system requirements. The resulting ReqIF files are then consumed by the various tools in the later stages of the software development process.”
The name POI is an acronym for “Poor Obfuscation Implementation”, a joke by the original developers that the file formats they handled appeared to be deliberately obfuscated.
ZooKeeper is a centralised service that is used for maintaining configuration information. It’s a service for distributed systems and acts as a hierarchical key-value store, which is used for storing, managing and retrieving data. Essentially ZooKeeper is used to synchronise applications that are distributed across a cluster.
Working in conjunction with Hadoop, it effectively acts as a centralised repository where distributed applications can store and retrieve data.
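A hierarchical key-value store of this kind can be pictured as a tree of slash-separated paths, each holding a small piece of data. The Python sketch below is an in-memory illustration only, not the ZooKeeper client API. The `ZNodeTree` class, its method names and the example paths are assumptions, and it leaves out what makes ZooKeeper useful in practice, such as replication across servers and change notifications.

```python
class ZNodeTree:
    """A tiny in-memory tree of path -> data entries (illustrative only)."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        """Create a node; its parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = data

    def get(self, path):
        """Return the data stored at a path."""
        return self.nodes[path]

    def set(self, path, data):
        """Overwrite the data stored at an existing path."""
        if path not in self.nodes:
            raise KeyError(f"{path} does not exist")
        self.nodes[path] = data

    def children(self, path):
        """List the names of the direct children of a path."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self.nodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )


tree = ZNodeTree()
tree.create("/config")
tree.create("/config/db", b"host=10.0.0.5")
tree.create("/workers")
```

Every application in a cluster that reads `/config/db` from the same store sees the same value, which is the sense in which a shared tree like this can keep distributed applications in step.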
AdroitLogic, an enterprise integration and B2B service provider, states that it uses “ZooKeeper to implement node coordination, in clustering support. This allows the management of the complete cluster, or any specific node – from any other node connected via JMX. A Cluster wide command framework developed on top of the ZooKeeper coordination allows commands that fail on some nodes to be retried etc.”