Athana Data Science Services (ADSS) – London Atmospheric Emissions Inventory (LAEI) report
Athana Data Science Services (ADSS) can benefit greatly from
adopting the Hadoop framework to meet our client's requirement of understanding,
and gaining actionable insight into, the London Atmospheric Emissions Inventory
data. With Hadoop, ADSS can perform operations on data that remain scalable
over time, something not possible on a conventional file system. The benefits
and tools offered by the Hadoop framework are examined below in relation to our
specific case.
Hadoop and its strengths
File systems are designed to navigate large amounts of data
stored on a computer. A file system's main job is to organize the files,
programs and multimedia that you store so that they can be easily retrieved.
Hard-drive space is limited, and over time free space
diminishes until data can no longer be stored on the drive. As ADSS will
ultimately be working on datasets compiled over a 50-year period, we cannot rely
on conventional file systems. Conventional file systems do allow files to be
compressed while still permitting applications to access the data: the
operating system automatically decompresses a file when it is needed and
compresses it again after use. This is their most powerful advantage,
particularly on systems with limited storage space.
Local file systems also have a number of drawbacks. Flat-file
databases store records as unstructured text, so they cannot relate data in one
file to data in another, which causes redundancy. Local file systems usually do
not support access by multiple users or requests that span multiple files:
transactions, for example, are not supported, which makes it harder to gain
insight from the data.
HDFS has a number of advantages over a local file system.
HDFS is inexpensive because it relies on commodity storage disks, which are
cheaper than the media used for enterprise-grade storage. As the data grows
from 1 to 50 years of records, ADSS will ultimately need HDFS to store this
much data at reasonable cost. HDFS can exceed 2 gigabits per second per
computer while executing MapReduce tasks on a very low-cost shared network,
which helps produce results faster. HDFS also supports replication, which
ensures that no data is lost if any machine in the distributed environment fails.
The Hadoop framework allows the distributed processing of large
datasets across clusters of computers. HDFS can be used to store data in a
distributed environment, and MapReduce can be used to perform analytical
operations on our large datasets.
Hadoop enables us to easily access new data sources with different
data types, both structured and unstructured, and to generate value from that
data. This means Hadoop is more flexible and allows us to make better use of a
wider range of data. The data published in the London Atmospheric Emissions
Inventory on the London Datastore website (LAEI, 2016) can be most effectively
processed by ADSS using the following tools.
Apache Hive is a Hadoop application for data warehousing. It
offers a simple way to apply structure to large amounts of unstructured data.
This property of Hive will help us build structured datasets from the
spreadsheets provided by the LAEI.
Another useful Hive feature we can use to gain insight from our
datasets is its command-line tool for running queries. We can write queries in
HQL, an SQL-like language, which Hive translates into MapReduce jobs executed
on the Hadoop cluster. Complex queries are supported through User Defined
Functions (UDFs), which can be written in Java and referenced from an HQL query
(Hortonworks, 2017).
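As a sketch, the kind of HQL query we might run over the LAEI data could look like the following (the table name laei_emissions and its columns are hypothetical, held in a Python string purely for illustration); Hive would compile such a query into one or more MapReduce jobs:

```python
# Hypothetical HQL for an assumed table "laei_emissions" with columns
# year, pollutant and tonnes; Hive translates this into MapReduce jobs.
query = """
SELECT year, pollutant, MAX(tonnes) AS peak_tonnes
FROM laei_emissions
GROUP BY year, pollutant
""".strip()

print(query)
```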
Another useful feature of Hive is that structure is applied to the
data at the time it is read (schema on read). This means we do not need to
worry about formatting the data when it is stored in the Hadoop cluster. Hive
accepts a variety of formats, from unstructured flat files with comma- or
space-separated text, through semi-structured JSON files, to structured HBase
tables.
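The schema-on-read idea can be illustrated in plain Python: the file holds only delimited text with no stored schema, and column names and types are applied at read time, as Hive does. The sample rows and schema below are hypothetical:

```python
import csv
import io

# Hypothetical raw LAEI-style flat file: no stored schema, just delimited text.
raw = io.StringIO(
    "2013,NOx,101.5\n"
    "2013,PM2.5,12.3\n"
)

# The schema is applied only when the data is read, not when it is stored.
schema = [("year", int), ("pollutant", str), ("tonnes", float)]

rows = [
    {name: cast(value) for (name, cast), value in zip(schema, record)}
    for record in csv.reader(raw)
]
print(rows[0]["tonnes"])  # 101.5
```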
We can also access Hive tables from external sources through a
component called HiveServer2. We can take advantage of this capability in our
use case to build visualizations of our datasets in Hive using an external
application.
Apache Pig provides a high-level data-flow language, Pig Latin,
which allows programmers to achieve the same results as MapReduce without
writing complex Java implementations. Pig can reduce the length of code by up
to 20 times through its multi-query approach (Bugra, 2014). Apache Pig supports
many built-in operators, including joins, filters, ordering and sorting.
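As a rough Python analogy (on hypothetical data), the operations below are what Pig Latin expresses as single FILTER, JOIN and ORDER statements, each of which would otherwise need a hand-written MapReduce job:

```python
# Hypothetical (pollutant, year, tonnes) records and (pollutant, limit) lookups.
emissions = [("NOx", 2013, 101.5), ("PM2.5", 2013, 12.3), ("NOx", 2014, 97.2)]
limits = [("NOx", 40.0), ("PM2.5", 25.0)]

# Analogous to: FILTER emissions BY year == 2013
y2013 = [row for row in emissions if row[1] == 2013]

# Analogous to: JOIN y2013 BY pollutant, limits BY pollutant
joined = [
    (p, yr, t, lim)
    for (p, yr, t) in y2013
    for (p2, lim) in limits
    if p == p2
]

# Analogous to: ORDER joined BY tonnes DESC
ordered = sorted(joined, key=lambda r: r[2], reverse=True)
print(ordered[0][0])  # NOx
```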
Apache HBase runs on top of HDFS and also supports
MapReduce jobs. Another useful feature for our use case is that HBase supports
other high-level languages for data processing, which means ADSS can in turn
offer a high degree of flexibility as client needs arise, since we can build
very case-specific applications on top of it. We can use HBase to perform
MapReduce operations for parallel processing of large volumes of data, and
HBase tables can also back Hadoop MapReduce jobs (Mapr.com, 2018).
The following commands ingest the spreadsheet data into structured storage
within HDFS:

hdfs dfs -copyFromLocal /usr/local/logs.xls hdfs://localhost:5000/
hdfs dfs -put /usr/local/hadoop/logs.csv hdfs://localhost:5000/
We can derive a number of analytical processes and insights from
the LAEI data once it has been ported into Hadoop.
A Map function can be used to extract the quantities of pollutants
released by road traffic and other modes of transport, by major industrial
sources, and by the smaller sources appearing in the LAEI datasets, and to emit
these as key/value pairs into an intermediate space managed by MapReduce.
Between the Map and Reduce phases, the key/value pairs are grouped by key, so
that each year is followed by a list of quantities for each pollutant. A Reduce
function then only has to find the maximum value in that list for each
pollutant.
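The map, group and reduce steps just described can be sketched in plain Python (the emission records are hypothetical; a real job would run over the LAEI files in HDFS):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical LAEI-style records: (year, pollutant, tonnes emitted).
records = [
    (2013, "NOx", 55.0), (2013, "NOx", 46.5),
    (2013, "PM10", 9.1), (2014, "NOx", 40.2),
]

# Map phase: emit ((year, pollutant), tonnes) key/value pairs.
mapped = [((year, pollutant), tonnes) for year, pollutant, tonnes in records]

# Shuffle phase: group the intermediate pairs by key, as MapReduce does.
mapped.sort(key=itemgetter(0))
grouped = {
    key: [tonnes for _, tonnes in pairs]
    for key, pairs in groupby(mapped, key=itemgetter(0))
}

# Reduce phase: take the maximum quantity for each (year, pollutant) key.
maxima = {key: max(values) for key, values in grouped.items()}
print(maxima[(2013, "NOx")])  # 55.0
```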
Visualization and data exploration
The three most commonly used tools for data visualization with
Hadoop are Apache Zeppelin, Apache Ambari and Kibana. These can help us provide
valuable insights to ADSS clients without their needing to delve into the
lower-level data themselves.
Kibana supports a number of visualizations, such as histograms,
line graphs, pie charts and sunbursts. Ambari gives a better understanding of
cluster health and performance metrics through advanced visualizations and
pre-built dashboards, from which we can build a suite that lets our client
quickly gather the information they need for their own reporting in an easy,
managed way. Apache Zeppelin, on the other hand, being a web-based multipurpose
notebook that enables data-driven, interactive analytics, can give our client
full-time access to data insights from any location with an internet connection
(Nurgaliev et al., 2016).
In conclusion, Hadoop provides a variety of strengths by itself
and a good range of tools that keep data-analysis tasks sustainable as client
preferences change, while maintaining a level of scalability not offered by
conventional data-analysis and storage methods. ADSS can significantly improve
the quality of its work by using Hadoop and its ecosystem for the LAEI project.
Bugra (2014). Pig Advantages and Disadvantages. [Blog] Bugra. Available at:
http://bugra.github.io/work/notes/2014-02-08/pig-advantages-and-disadvantages/
[Accessed 31 Jan. 2018].
Hortonworks (2017). How to Process Data with Apache Hive. [Blog] Hortonworks.
[Accessed 31 Jan. 2018].
London Datastore (2016). London Atmospheric Emissions Inventory (LAEI).
London: London Datastore.
Mapr.com (2018). Apache HBase | MapR. [online] Available at:
https://mapr.com/products/product-overview/apache-hbase/ [Accessed 31 Jan.
2018].
Nemschoff, M. (2013). Big data: 5 major advantages of Hadoop. [Blog] ITPortal.
[Accessed 31 Jan. 2018].
Nurgaliev, I., Karavakis, E. and Aimar, A. (2016). Kibana, Grafana and Zeppelin
on Monitoring data. Openlab Summer Student report. [online] CERN. Available at:
https://zenodo.org/record/61079#.WnI0NqiWZcQ [Accessed 31 Jan. 2018].