Publications


Title: An Effective and Scalable Data Modeling for Enterprise Big Data Platform


 

Abstract:

The enormous growth of the internet, enterprise applications, social media, and IoT devices has caused a huge spike in enterprise data. Big data platforms provide scalable storage to manage this growth and give decision-makers, stakeholders, and business users easier access to data. Classifying, organizing, and storing all of this data, and processing it to produce business insights, is a well-known challenge. The nature, variety, velocity, volume, and value of the data make it difficult to process effectively, and enterprises struggle to apply complex business rules, generate insights, and support data-driven decisions in a timely fashion. Because a big data lake integrates streams of data from many business units, stakeholders typically analyze enterprise-wide data across several data models. Data models are a vital component of a big data platform: depending on the models available, users may run complex processing, queries, and big table joins to generate the required metrics. Extracting value from the data is usually a time-consuming and resource-intensive process, so an enterprise big data platform clearly needs high-quality data modeling methods to reach an optimal mix of cost, performance, and quality. This paper addresses these challenges by proposing an effective and scalable way to organize and store data in a big data lake. It presents basic principles and a methodology for building scalable data models in a distributed environment, describes how the approach overcomes common challenges, and presents findings.
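
For illustration only (this sketch is not taken from the paper), one common way to organize data scalably in a lake is to cleanse raw events once into a partitioned, columnar curated table that downstream models and queries share. The PySpark snippet below shows the idea; all paths and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated-layer").getOrCreate()

# Raw zone: semi-structured events landed as-is (hypothetical path).
raw = spark.read.json("hdfs:///lake/raw/orders/")

# Curated zone: deduplicated, typed, and reduced to the columns models need.
curated = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_ts"))
           .select("order_id", "customer_id", "order_date", "amount"))

# Partitioned, columnar storage keeps downstream scans and joins scalable.
(curated.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("hdfs:///lake/curated/orders/"))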

 

Keywords:

Big Data, Big Data Lake, Scalable Data Modeling, Hadoop, Spark, Business Intelligence, Big Data Analytics


Conference: 2019 IEEE International Conference on Big Data (Big Data)


Title: Mid-Tier Models for Big Data


 

Abstract:

With the rise of big data, enterprises have started accumulating significantly more data than they consume. A big data lake makes data consumption easier for all stakeholders, analysts, and developers, but the variety, volume, and velocity of data, combined with business complexity, make it harder to process, organize, and store data so that analytical solutions can be served on time. It is often a major challenge for enterprises to cleanse, organize, classify, and store big data so that insights are available when needed. Data consistency also becomes a concern when multiple data models define similar metrics. Because numerous data sources are integrated into a single platform, stakeholders often analyze data from several subject areas, which leads to complex queries, big joins, and higher processing demands. Even with the cheap storage and processing power of Hadoop and other big data technologies, modeling big data is a time-consuming and error-prone process. This paper addresses that challenge by introducing mid-tier models for big data. It discusses a novel mid-tier data modeling approach for organizing and storing big data in distributed storage, outlines how it overcomes some of these challenges, and showcases an example.
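
As a hedged illustration of the mid-tier idea (a sketch under assumed table names, not the paper's exact design), the PySpark snippet below materializes the fact table joined with its dimensions once, so downstream metric queries read the mid-tier table instead of repeating big joins.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mid-tier-model").enableHiveSupport().getOrCreate()

# Hypothetical fact and dimension tables in the lake.
sales     = spark.table("lake.fct_sales")
customers = spark.table("lake.dim_customer")
products  = spark.table("lake.dim_product")

# Build the mid-tier table once: the expensive big joins happen only here.
mid_tier = (sales
            .join(customers, "customer_id")
            .join(products, "product_id")
            .select("sale_date", "region", "category", "quantity", "revenue"))

mid_tier.write.mode("overwrite").partitionBy("sale_date").saveAsTable("lake.mid_sales_enriched")

# Downstream metrics become simple aggregations over the mid-tier table.
daily_revenue = spark.sql(
    "SELECT sale_date, region, SUM(revenue) AS revenue "
    "FROM lake.mid_sales_enriched GROUP BY sale_date, region")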

 

Keywords:

Enterprise Data Models, Mid-Tier Data Models, Big Data Lake, Dimensional Models, Big Joins, Hadoop, Spark


Conference: 2019 IEEE 5th International Conference on Big Data Intelligence and Computing (DATACOM)



Title: Bridging Data Silos Using Big Data Integration


 

Abstract:

With cloud computing, cheap storage, and technology advancements, an enterprise uses many applications to run its business functions. These applications are not limited to transactions, customer service, sales, and finance; they also cover security, application logs, marketing, engineering, operations, HR, and many more. Each business vertical uses multiple applications that generate a huge amount of data, and on top of that, social media, IoT sensors, SaaS solutions, and mobile applications record exponential growth in data volume. In almost all enterprises, data silos exist across these applications, which can produce structured, semi-structured, or unstructured data at different velocities and volumes. Having all data sources integrated and generating timely insights helps overall decision-making. With recent developments in big data integration, data silos can be managed better and can generate tremendous value for enterprises. Big data integration offers flexibility, speed, and scalability for integrating large data sources, along with tools for generating analytical insights that help stakeholders make effective decisions. This paper presents an overview of data silos, the challenges they create, and how big data integration can help overcome them.
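
Purely as an illustration of the integration step (hypothetical sources and paths, not the paper's implementation), the PySpark sketch below lands data from three kinds of application silos (a relational export, application logs, and a SaaS CSV extract) into one data lake so they can be queried together.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silo-integration").getOrCreate()

# Structured, semi-structured, and flat-file silos (hypothetical paths).
crm_orders = spark.read.parquet("hdfs:///landing/crm/orders/")
app_logs   = spark.read.json("hdfs:///landing/app/logs/")
hr_people  = spark.read.option("header", True).csv("hdfs:///landing/hr/employees.csv")

# Land every silo in a common raw zone of the lake.
for name, df in [("crm_orders", crm_orders),
                 ("app_logs", app_logs),
                 ("hr_employees", hr_people)]:
    df.write.mode("overwrite").parquet(f"hdfs:///lake/raw/{name}/")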

 

Keywords:

Data Silo, Big Data, Data Pipelines, Integration, Data Lake, Hadoop


Journal: International Journal of Database Management Systems





Title: Overcoming Data Silos Through Big Data Integration


 

Abstract:

With cloud computing, cheap storage, and technology advancements, an enterprise uses many applications to run its business functions. These applications are not limited to transactions, customer service, sales, and finance; they also cover security, application logs, marketing, engineering, operations, HR, and many more. Each business vertical uses multiple applications that generate a huge amount of data, and on top of that, social media, IoT sensors, SaaS solutions, and mobile applications record exponential growth in data volume. In almost all enterprises, data silos exist across these applications, which can produce structured, semi-structured, or unstructured data at different velocities and volumes. Having all data sources integrated and generating timely insights helps overall decision-making. With recent developments in big data integration, data silos can be managed better and can generate tremendous value for enterprises. Big data integration offers flexibility, speed, and scalability for integrating large data sources, along with tools for generating analytical insights that help stakeholders make effective decisions. This paper presents an overview of data silos, the challenges they create, and how big data integration can help overcome them.

 

Keywords:

Data Silo, Big Data, Data Pipelines, Integration, Data Lake, Hadoop


Journal:



Title: Stimulate ML Development using Feature Store


Abstract:

The state of ML development has not changed much over the last few years. Data scientists and developers still spend 80% of their time on data wrangling, which not only impedes ML application development but also negatively impacts overall consumer satisfaction and enterprise decision-making. This article sheds light on a way to overcome these challenges.
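
As a minimal, hedged sketch of the idea (plain Python, not the article's actual tooling), a feature store lets each feature be defined once and reused by both training and serving code instead of being re-wrangled every time.

from typing import Callable, Dict

class FeatureStore:
    """Toy in-memory registry that illustrates the concept only."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        self._features[name] = fn  # one shared definition per feature

    def compute(self, name: str, record: dict) -> float:
        return self._features[name](record)

store = FeatureStore()
store.register("order_value_usd", lambda r: r["quantity"] * r["unit_price"])

# Training pipelines and the online service call the same definition.
print(store.compute("order_value_usd", {"quantity": 3, "unit_price": 19.99}))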



About Hyperight:

Hyperight is an international event service provider focused on creating network-oriented and crowdsourced business events. It is committed to delivering premium content and a unique customer experience through a platform where data practitioners from around the world come to learn.


Title: Unification of Machine Learning Features

 

Abstract:

In the Information Age, machine learning (ML) provides a competitive advantage to any business. Machine learning applications are not limited to driverless cars or online recommendations; they are widely used in healthcare, social services, government systems, telecommunications, and beyond. As enterprises try to step up their machine learning efforts, it is critical to have a long-term strategy, yet most enterprises are unable to truly realize the fruits of ML because of its complexity. Accessing a variety of data is easier today thanks to data democratization, distributed storage, technological advancements, and big data applications. Despite easier data access and recent advances in ML, developers still spend most of their time on data cleansing, data preparation, and data modeling for ML applications. These steps are often repeated and produce identical features, and because identical features can be processed inconsistently between training and testing, more issues surface at later stages of ML application development. Unifying ML features is an effective way to address these issues, and this paper presents several methods for achieving that unification.
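
To make the unification idea concrete (an illustrative sketch with assumed table and column names, not one of the paper's methods), the snippet below defines a feature once and applies that single definition in both the training path and the scoring path, avoiding the inconsistent processing described above.

from pyspark.sql import SparkSession, functions as F, Column

def days_since_signup(signup_col: Column) -> Column:
    # Single shared feature definition used everywhere.
    return F.datediff(F.current_date(), F.to_date(signup_col))

spark = SparkSession.builder.appName("feature-unification").enableHiveSupport().getOrCreate()

# Training path: historical customers (hypothetical table).
train_df = (spark.table("lake.dim_customer")
            .withColumn("days_since_signup", days_since_signup(F.col("signup_date"))))

# Scoring path: newly arrived records, processed with the exact same definition.
score_df = (spark.read.parquet("hdfs:///landing/crm/new_customers/")
            .withColumn("days_since_signup", days_since_signup(F.col("signup_date"))))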

 

Keywords:

ML Pipeline, Feature Engineering, ML Development, Feature store, Big Data Engineering


Conference: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC)



Title: The Democratization of Machine Learning Features

 

Abstract:

In the Machine Age, machine learning (ML) has become a secret sauce of success for any business. Machine learning applications are not limited to autonomous cars or robotics; they are widely used across almost all sectors, including finance, healthcare, entertainment, government systems, and telecommunications. Due to the lack of an enterprise ML strategy, many enterprises still repeat the same tedious steps and spend most of their time massaging the required data. Big data lakes and data democratization have made it easier to access a variety of data, yet despite this and decent advances in ML, engineers still spend significant time on data cleansing and feature engineering. Most of these steps are repeated, which produces identical features with small variations that lead to inconsistent results when testing and training ML applications. This often stretches the time to go live and increases the number of iterations needed to ship a final ML application. Sharing best practices and the best features not only saves time but also helps jumpstart ML application development. The democratization of ML features is a powerful way to share useful features, reduce time to go live, and enable rapid ML application development. It is one of the emerging trends in enterprise ML application development, and this paper presents a way to achieve ML feature democratization.
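
As a small, hedged illustration (hypothetical feature names and owners, not the paper's system), democratizing features amounts to publishing them with searchable metadata so other teams can discover and reuse them instead of rebuilding near-identical variants.

from dataclasses import dataclass
from typing import List

@dataclass
class FeatureMeta:
    name: str
    owner: str
    description: str

# A shared catalog that teams publish their features into.
catalog: List[FeatureMeta] = [
    FeatureMeta("order_value_usd", "sales-ml", "Order quantity times unit price"),
    FeatureMeta("days_since_signup", "growth-ml", "Customer tenure in days"),
]

def search(keyword: str) -> List[FeatureMeta]:
    """Keyword discovery over feature names and descriptions."""
    kw = keyword.lower()
    return [f for f in catalog if kw in f.name.lower() or kw in f.description.lower()]

# A team looking for order-related features finds an existing one to reuse.
print([f.name for f in search("order")])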

 

Keywords:

Democratize Feature, Feature Store, ML Pipelines, Machine Learning Development, Big Data Analytics


Conference: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI)