Athana Data Science Services (ADSS) – London Atmospheric Emissions Inventory (LAEI) report

Athana Data Science Services (ADSS) can benefit greatly from adopting the Hadoop framework to meet the requirements of our client's task: understanding and gaining actionable insight from the London Atmospheric Emissions Inventory data. With Hadoop, ADSS can perform operations on data that remain scalable over time, something not possible on a conventional file system. The benefits and tools offered by the Hadoop framework are examined below in relation to our specific case.
Hadoop and its strengths

File systems are designed to navigate a large amount of data stored on a computer. A file system's main job is to organize the files, programs and multimedia that you store so that they can be easily retrieved. Space on a hard drive is limited, and over time free space diminishes until data can no longer be stored on the drive. As ADSS will ultimately be working on datasets compiled over a 50-year period, we cannot rely on conventional file systems. Conventional file systems do allow files to be compressed while still permitting applications to access the data: the operating system automatically decompresses a file when it is needed and compresses it again after use.
This is their most powerful advantage, particularly for systems with limited storage space. Local file systems nonetheless have a number of drawbacks. Flat-file databases hold records as unstructured text, so they cannot relate data in one file to data in another, which causes redundancy. Local file systems also usually do not support access by multiple users or requests that span multiple files: transactions, for example, are not supported, which makes it difficult to extract insight from the data. HDFS, by contrast, offers a number of advantages over a local file system.
HDFS is inexpensive because it relies on commodity storage disks, which are cheaper than the media used for enterprise-grade storage. As the data grows from one year's worth to fifty, ADSS will ultimately need HDFS to store this volume of data at low cost. HDFS can exceed 2 gigabits per second per computer while executing MapReduce tasks on a very low-cost shared network, which will help us obtain results faster. HDFS also supports replication, which ensures that no data is lost if any machine in the distributed environment fails (Nemschoff, 2013). The Hadoop framework allows the distributed processing of large data sets across clusters of computers: HDFS can store the data in the distributed environment, and MapReduce can perform analytical operations on our large datasets. Hadoop also lets us easily access new data sources with different data types, both structured and unstructured, and generate value from that data.
This means Hadoop is more flexible and allows us to make better use of a wider range of data. The data published in the London Atmospheric Emissions Inventory on the London Datastore website (LAEI, 2016) can be most effectively engaged with and processed by ADSS using the following technologies:

Apache Hive

Apache Hive is a Hadoop application for data warehousing. It offers a simple way to apply structure to large amounts of unstructured data, which will help us build structured datasets from the spreadsheets provided by the LAEI. Another useful Hive feature for getting insight from our datasets is its command-line tool for running queries. We can write queries in a SQL-like language called HQL, which Hive translates into MapReduce jobs executed on the Hadoop cluster.
Complex queries are supported through User Defined Functions (UDFs). These can be written in Java and referenced from an HQL query (Hortonworks, 2017). Another useful feature of Hive is that structure is applied to the data at the time it is read. This means we do not have to worry about formatting the data at the time it is stored in the Hadoop cluster.
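This read-time structuring (often called schema-on-read) can be illustrated with a small local Python sketch. The raw text, column names and types below are invented for illustration and are not the actual LAEI layout:

```python
import csv
import io

# Raw flat-file text as it might sit in HDFS: no schema is attached at write time.
raw_text = """Borough,Pollutant,Tonnes
Camden,NOx,412.5
Camden,PM10,38.2
Hackney,NOx,365.1
"""

# Schema-on-read: column names and types are applied only when the data is read.
schema = {"Borough": str, "Pollutant": str, "Tonnes": float}

def read_with_schema(text, schema):
    """Parse raw comma-separated text into typed records at read time."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {col: cast(row[col]) for col, cast in schema.items()}

records = list(read_with_schema(raw_text, schema))
```

Hive applies the same principle at cluster scale: the files stay as plain text in HDFS, and the table definition supplies the structure each time a query runs.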
Hive accepts a variety of formats, from unstructured flat files with comma- or space-separated text, through semi-structured JSON files, to structured HBase tables. We can also access Hive tables from external sources through a component called HiveServer2. We can take advantage of this capability in our use case to build visualizations of our Hive datasets using an external application.

Apache Pig

Apache Pig provides a high-level data-flow language, Pig Latin, which lets programmers achieve the same results as complex Java implementations in MapReduce without writing them. Pig reduces the length of code by up to 20 times through its multi-query approach, and many built-in operators such as joins, filters, ordering and sorting are supported in Apache Pig.
(Bugra, 2014).

Apache HBase

Apache HBase runs on top of HDFS and also supports MapReduce jobs. A useful feature for our use case is that HBase supports other high-level languages for data processing, which means ADSS can in turn offer the client a high degree of flexibility as their needs arise, since we can build very case-specific applications on top of it. We can use HBase to perform MapReduce operations for parallel processing of large volumes of data, and HBase also supports backing Hadoop MapReduce jobs with HBase tables (Mapr.com, 2018). Following is a command for creating a target directory when ingesting the spreadsheet data into structured storage within Hadoop:

hadoop fs -mkdir hdfs://localhost:5000/
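Once the data is in HDFS, a Map function can emit a (pollutant, value) pair for each spreadsheet record, and the Reduce step can then collapse those pairs. A minimal local Python sketch of this logic follows; the sample records are invented for illustration, and a real job would run through Hadoop Streaming or a Java MapReduce implementation rather than in-process:

```python
from collections import defaultdict

# Invented sample records standing in for rows of the LAEI spreadsheets.
records = [
    ("NOx", 412.5),
    ("PM10", 38.2),
    ("NOx", 365.1),
    ("PM10", 41.7),
]

def map_phase(records):
    """Map: emit one (pollutant, value) pair per input record."""
    for pollutant, value in records:
        yield pollutant, value

def reduce_phase(pairs):
    """Reduce: for each pollutant key, keep the maximum emitted value."""
    grouped = defaultdict(list)
    for pollutant, value in pairs:
        grouped[pollutant].append(value)
    return {pollutant: max(values) for pollutant, values in grouped.items()}

maxima = reduce_phase(map_phase(records))
```

In a real MapReduce job, Hadoop performs the grouping between the two phases (the shuffle), so the reduce function receives each pollutant's values already collected together.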
After that, a Reduce function simply finds the maximum value across the whole list for each pollutant.

Visualization and data exploration

The three most commonly used tools for data visualization with Hadoop are Apache Zeppelin, Apache Ambari and Kibana. These can help us provide valuable insights to ADSS clients without them needing to delve into the 'lower-level' data on their own.
Kibana offers a number of visualizations, such as histograms, line graphs, pie charts and sunbursts. Ambari gives a better understanding of cluster health and performance metrics through advanced visualizations and pre-built dashboards, with which we can build a suite from which our client can quickly gather the information needed for their own reporting in an easy, managed way. Apache Zeppelin, on the other hand, is a web-based multipurpose notebook for data-driven, interactive analytics, and can give our client full-time access to data insights from any location with an internet connection (Nurgaliev et al., 2016).

In conclusion, Hadoop provides a variety of strengths by itself and a good range of tools that keep data analysis tasks sustainable as client preferences change, while maintaining a level of scalability not offered by conventional data analysis and storage methods. ADSS will significantly improve the quality of its work by using Hadoop and its ecosystem for its work with the LAEI.

References:

Bugra (2014). Pig Advantages and Disadvantages.
Blog Bugra. Available at: http://bugra.github.io/work/notes/2014-02-08/pig-advantages-and-disadvantages/ [Accessed 31 Jan. 2018].

Hortonworks (2017). How to Process Data with Apache Hive. Blog Hortonworks. Available at: https://hortonworks.com/tutorial/how-to-process-data-with-apache-hive/ [Accessed 31 Jan. 2018].

London Datastore (2016). London Atmospheric Emissions Inventory (LAEI). London: London Datastore.

Mapr.com (2018). Apache HBase | MapR. [online] Available at: https://mapr.com/products/product-overview/apache-hbase/ [Accessed 31 Jan. 2018].

Nemschoff, M. (2013). Big data: 5 major advantages of Hadoop. Blog ITProPortal. Available at: https://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/ [Accessed 31 Jan. 2018].

Nurgaliev, I., Karavakis, E. and Aimar, A. (2016). Kibana, Grafana and Zeppelin on monitoring data. Openlab Summer Student report. [online] CERN. Available at: https://zenodo.org/record/61079#.WnI0NqiWZcQ [Accessed 31 Jan. 2018].