Category: Big Data

Text Normalization with Spark – Part 2

Overview This is the second post in a two-part series on Text Normalization using Spark. In this blog post, we explain the jargon of Apache Spark (jobs, stages, and executors) using the Text Normalization application and the Spark history server UI. To get a better understanding of the use case, please refer to our Text Normalization […]

Importing and Analyzing Data in Datameer

Overview Datameer, an end-to-end big data analytics platform, is built on Apache Hadoop to perform integration, analysis, and visualization of massive volumes of both structured and unstructured data. It can be rapidly integrated with both new and existing data sources to deliver an easy-to-use, cost-effective, and sophisticated solution for big data […]

Kylo Setup for Data Lake Management

Overview Kylo is a feature-rich data lake platform built on Apache Hadoop and Apache Spark. It provides a data lake solution enabling self-service data ingest, data preparation, and data discovery. It integrates best practices around metadata capture, security, and data quality. It contains many special-purpose routines for data lake operations leveraging Apache Spark and Apache […]

Apache Spark Performance Tuning – Degree of Parallelism

1 Overview 2 Spark Partition Principles 3 Understanding Use Case Performance 4 Understanding Spark Data Partitions 5 Spark Partition Tuning 5.1 Running Spark on YARN with Partition Tuning 6 Conclusion 7 References Overview This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers […]

Embrace Relationships with Neo4J, R & Java

1 Introduction 2 Use Case 3 Solution 3.1 Prerequisites 3.2 Download StackOverflow Dataset 3.3 Data Manipulation with R 3.4 Create Nodes and Relationship file with Java 3.5 Create GraphDB with Batch Importer 3.6 Visualize Graph with Neo4j 4 Conclusion 5 References Introduction Graphs are everywhere, used by everyone, for everything. Neo4j is one of the most popular […]

Customer Churn – Logistic Regression with R

1 Overview 2 Learning/Prediction Steps 2.1 Data Description 2.2 Data Preprocessing 2.3 Partitioning the Data & Logistic Regression 2.4 Model Summary 2.5 Prediction Accuracy 3 References Overview In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship. It is also referred to as loss of clients […]

Call Detail Record Analysis – K-means Clustering with R

1 Overview 2 Data Description 3 Data Source Features Description 4 Data Preprocessing 5 CDR Exploratory Data Analysis (EDA) 6 Call Detail Record Clustering 7 Conclusion 8 References Overview A Call Detail Record (CDR) is the information captured by telecom companies during the call, SMS, and Internet activity of a customer. This information provides greater insights […]

Bootstrap 3 – Tips and Tricks

1 Introduction 2 Use case 3 Solution 3.1 Five columns Layout 3.2 How to Enable Bootstrap 3 Hover Dropdowns 3.3 Don’t forget Container Fluid for Full Width Rows 3.4 Column ordering 3.5 Labels for Screen Readers 3.6 No Gutter Column 4 Conclusion 5 References Introduction Bootstrap is the most popular HTML, CSS, and JS framework […]

Advanced Avro: Schema Design & Reuse – Part 4

1 Introduction 2 Use Case 3 Solution 3.1 Create a Schema with nested sub-schemas: 3.2 Create Multiple sub-schemas and reuse: 3.3 Write a Java program that includes sub-schemas into main schema: 3.4 Convert the JSON file into binary Avro, and from binary Avro to JSON file using Avro Tools 4 Conclusion 5 References Introduction This […]