Registration: 23.04.2024

Sashi Kanth Pepalla

Skills

Spark
PySpark
HDFS
Hive
HBase
Cloudera
Python
Core Java
Scala
Shell Scripting
Cassandra
MongoDB
Avro
Parquet
ORC
JSON
CSV
Jupyter Notebook
Eclipse
TOAD
Talend
Maven
Git
Bitbucket
Jenkins
Databricks
MySQL
Oracle
Azure Data Factory
ADLS
Azure Databricks
S3
EMR
Redshift
Athena
Glue

Work experience

Big Data Engineer
10.2023 - present |Optum (UnitedHealth Group)
Azure Data Factory, ADLS, Snowflake, Spark, Databricks, Logic Apps, Containers, Hive, SQL, Python, PySpark
● 8+ years of professional programming and software development experience in Big Data Engineering, covering data engineering, data analysis, application design, development, testing, and deployment of software systems from development through production.
● 6 years of experience with Big Data tools in the Hadoop ecosystem, including Spark, Apache Hive, HBase, Cloudera, and DBeaver.
● Firm understanding of Hadoop architecture and its components, including HDFS, YARN, Hive, HBase, Kafka, and Oozie.
● Built a seamless data migration ecosystem using Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Snowflake, SQL, Python, and PySpark.
● Transformed large volumes of data into useful insights for the Machine Learning Operations and Data Analysis teams.
● Used Azure Databricks to automate stored procedures, schedule jobs, and run notebook activities in Azure Data Factory pipelines.
● Used Databricks for cluster management and job monitoring.
● Used Snowflake for data warehousing and created complex stored procedures in Snowflake worksheets.
● Built Azure Data Factory pipelines to perform ETL transformations.
● Wrote PySpark scripts to clean, transform, and migrate data.
● Created external Delta tables in Databricks and Snowflake based on source data.
● Automated the raw and curated layers using stored procedure scripts.
● Implemented data partitioning on external tables in Databricks using source date columns for daily data ingestion (a minimal sketch follows this list).
● Migrated data from on-prem to cloud using Copy Activity, Linked Services, Lookup Activity, control tables, and other Azure Data Factory services.
● Designed and documented operational problems following standards and procedures using the SwiftKanban board.
● Documented process workflows to ensure readability across team members and the organization.
● Implemented data orchestration and workflow automation using Azure services such as Azure Logic Apps and Azure Functions.
● Optimized data pipelines and storage configurations for better performance and cost efficiency.
● Used DBeaver to check and validate source data available in Hive tables.
● Created data workflows consumed by the Data Analyst and ML Ops teams.
● Tuned Azure Data Factory pipelines based on data load to improve overall pipeline and data ingestion performance.
● Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
● Applied global restrictions on customer data tables per business requirements, restricting the data from end users and making it available only to stakeholders.
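A minimal PySpark sketch of the raw-to-curated flow with a date-partitioned external Delta table described above. The storage account, container paths, table, and column names (raw, curated, event_id, load_date, etc.) are illustrative assumptions, not actual Optum resources.

    # Illustrative sketch only -- paths, table names, and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw_to_curated").getOrCreate()

    # Read the day's raw files landed in the ADLS raw layer (hypothetical path).
    raw_df = spark.read.json("abfss://raw@examplestorage.dfs.core.windows.net/events/")

    # Basic cleanup before promoting to the curated layer.
    curated_df = (
        raw_df
        .dropDuplicates(["event_id"])                    # hypothetical key column
        .withColumn("load_date", F.to_date("event_ts"))  # partition column derived from a source timestamp
    )

    # Write as a Delta table partitioned by the date column and register it as an external table.
    (
        curated_df.write
        .format("delta")
        .mode("append")
        .partitionBy("load_date")
        .option("path", "abfss://curated@examplestorage.dfs.core.windows.net/events/")
        .saveAsTable("curated.events")
    )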
Big Data Engineer
06.2020 - 10.2023 |Dish, Englewood, CO
AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, PySpark, Scala, Hive, Kafka, Talend
● Built a centralized data lake on AWS Cloud using primary services such as S3, AWS EMR, Redshift, and Athena.
● Migrated datasets and ETL workloads from on-prem to AWS cloud services.
● Designed and implemented a scalable data lake architecture on AWS S3, reducing data storage costs while improving data accessibility.
● Created Hive external tables on top of datasets loaded into AWS S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis.
● Used Databricks for job scheduling and cluster management to help streamline project workflows.
● Built automated ETL workflows using AWS Glue and Python, resulting in a 40% reduction in data processing time and improved data quality.
● Optimized AWS Redshift performance by fine-tuning query optimization, table distribution strategies, and compression techniques.
● Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and produce messages to a Kafka topic.
● Worked with Apache Flink on AWS to handle large volumes of streaming data.
● Used Amazon Kinesis with Apache Flink to ingest data in parallel with high throughput and low latency.
● Fine-tuned Spark applications extensively and provided production support for various pipelines running in production.
● Migrated existing traditional ETL jobs to PySpark and Hive jobs on the new cloud data lake.
● Used Databricks for storage in data lakes, Delta Lakes, data warehouses, etc.
● Wrote various RDD (Resilient Distributed Dataset) transformations and actions using Scala Spark.
● Used Flink to ingest data reliably even in failure-prone scenarios.
● Created on-demand tables in Lambda functions using Python, PySpark, and S3 files with AWS Glue.
● Used Talend for mapping between schemas and as a user-friendly environment for visualization.
● Used Databricks SQL for computing complex and transformative SQL queries.
● Developed ETL jobs with PySpark to securely transform datasets in curated S3 storage into consumable data views (a minimal sketch follows this list).
● Built a real-time streaming pipeline using Kafka, Spark Streaming, and Redshift.
● Applied data preprocessing techniques such as data cleaning, feature engineering, and normalization to prepare data for machine learning models.
● Transformed multiple sources of data into a large data warehouse using ETL.
● Experience working with EMR clusters in AWS Cloud and with S3.
● Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
● Built a series of Spark applications and Hive scripts to produce analytical datasets needed by digital marketing teams.
● Integrated Snowflake with external data tools and platforms, enabling seamless data analysis and visualization for business stakeholders.
● Expertise in securing AWS resources and data using Identity and Access Management (IAM) policies, encryption, and VPC configurations.
● Used Tableau to visually represent telecom data and how it is segregated across the country using different graphical concepts.
● Ran PySpark jobs to load data into Hive tables for test data validation, using Hue to check the Hive tables loaded by the daily job.
● Integrated data governance and security policies into AWS resources, meeting compliance requirements and ensuring data protection.
● Automated infrastructure setup, including launching and terminating EMR clusters.
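A minimal PySpark sketch of the curated-S3-to-consumable-view ETL pattern described above, as it might run on EMR or Glue. Bucket names, prefixes, and columns (orders, order_ts, amount, etc.) are hypothetical placeholders, not actual Dish datasets.

    # Illustrative sketch only -- bucket names, prefixes, and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("curated_to_views").getOrCreate()

    # Read curated datasets from S3 (EMR/Glue clusters resolve s3:// paths natively).
    orders = spark.read.parquet("s3://example-curated-bucket/orders/")

    # Aggregate into a consumable daily view for downstream analytics.
    daily_view = (
        orders
        .filter(F.col("status") == "COMPLETED")
        .groupBy(F.to_date("order_ts").alias("order_date"), "region")
        .agg(
            F.count("*").alias("order_count"),
            F.sum("amount").alias("total_amount"),
        )
    )

    # Write the view back to S3, partitioned by date, where Athena or Redshift Spectrum can query it.
    (
        daily_view.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://example-views-bucket/daily_orders/")
    )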
Data Engineer
09.2018 - 05.2020 |NIH, Bethesda, MD
Spark, PySpark, Azure Data Factory, ADLS, Hive, Sqoop, Kafka, HBase, Scala, SQL
● Responsible for ingesting large volumes of user behavioral data and customer profile data from different sources into Azure Data Lake Storage containers.
● Built Azure data pipelines from scratch in Azure Data Factory following a standard pattern of driver pipeline, orchestrator pipeline, and data load pipeline.
● Migrated data from on-prem to cloud using Copy Activity, Linked Services, Lookup Activity, control tables, and other Azure Data Factory services.
● Experienced in using Python with PySpark to build data pipelines and writing Python scripts to automate Azure Data Factory pipelines.
● Developed and implemented data cleaning and quality control protocols to ensure the accuracy of claims data used for research in Azure Data Factory pipelines.
● Used Azure DevOps and Git for version control and CI/CD to manage changes to Azure Data Factory pipelines and promote them through environments such as development, staging, and production.
● Implemented partitioning and dynamic partitions for columns in Hive, Databricks, and Snowflake tables.
● Involved in creating a data lake on the Azure cloud platform, allowing business teams to perform data analysis in Azure Synapse SQL.
● Incorporated data governance best practices into data engineering processes to ensure compliance with data quality standards and metadata management, and was involved in data tracking.
● Designed, developed, and maintained ETL (Extract, Transform, Load) packages using SSIS to integrate data from diverse sources into a unified data warehouse.
● Used Azure SQL as a data lake and ensured all processed data was written to ADLS directly from Spark and Hive jobs.
● Involved in creating Hive tables and loading and analyzing data using Hive scripts.
● Designed and developed interactive and static reports using SSRS to provide critical insights to stakeholders.
● Contributed to research publications and presentations on healthcare utilization, cost analysis, and policy evaluation using claims data.
● Developed Scala-based Spark applications for data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed by machine learning and reporting teams.
● Fine-tuned Spark applications to improve overall pipeline processing time.
● Wrote Kafka producers to stream data from an external REST API to Kafka topics.
● Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase (a simplified sketch follows this list).
● Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities.
● Worked extensively with Sqoop for importing data from Oracle.
● Used reporting tools such as Tableau to generate daily reports of data.
● Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
● Designed and documented operational problems following standards and procedures using JIRA.
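A simplified PySpark Structured Streaming sketch of the Kafka consumption pattern described above. The broker address, topic name, and output paths are hypothetical, and the sink here is Parquet on ADLS rather than HBase (which the original Scala jobs targeted) purely to keep the example self-contained.

    # Illustrative sketch only -- broker, topic, and paths are hypothetical; the real jobs
    # were written in Scala and wrote to HBase rather than Parquet.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

    # Consume JSON messages from a Kafka topic as a streaming DataFrame
    # (requires the spark-sql-kafka connector on the classpath).
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "user-behavior-events")
        .option("startingOffsets", "latest")
        .load()
        .select(F.col("value").cast("string").alias("json"))
    )

    # Continuously append the stream to storage, with checkpointing for fault tolerance.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "abfss://curated@examplestorage.dfs.core.windows.net/stream/events/")
        .option("checkpointLocation", "abfss://curated@examplestorage.dfs.core.windows.net/checkpoints/events/")
        .start()
    )
    query.awaitTermination()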
Big Data Engineer
01.2017 - 08.2018 |Soul page IT Solutions
Azure Cloud, Azure Data Lake Storage, Databricks, Synapse SQL Pools, Spark, PySpark, Scala, Java, Kafka, Spark Streaming, Hive, Teradata, Azure SQL, Azure Data Factory
● Wrote Spark applications using Scala to perform data cleansing, validation, transformation, and summarization activities according to requirements.
● Developed data pipelines that ingest raw JSON files, transactional data, and user profile information from on-prem data warehouses, process them using Spark, and load the processed data into Synapse SQL.
● Experience building data pipelines using Azure services such as Azure Data Factory and Azure Databricks, and loading data to Azure Data Lake Storage, Azure SQL Database, Azure SQL Data Warehouse (Synapse Analytics), and Snowflake.
● Automated the launch of Databricks runtimes and cluster autoscaling, and submitted Spark jobs to Databricks clusters.
● Designed and implemented CI/CD pipelines for Kubernetes deployments, ensuring automated testing and continuous integration of data engineering applications.
● Migrated MapReduce programs to Spark transformations using PySpark.
● Used Kubernetes for the automated deployment and scaling of Apache Spark clusters, enabling efficient data processing.
● Wrote Kafka producers to stream real-time JSON messages to Kafka topics, processed them using Spark Streaming, and performed streaming inserts into Synapse SQL.
● Worked extensively on performance tuning of Spark applications to improve job execution times.
● Developed a daily process for incremental imports of data from Teradata into Hive tables using Sqoop.
● Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
● Worked with different file formats such as Text, Avro, Parquet, Delta Lake, JSON, and flat files using Spark.
● Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and optimized Hive queries (a minimal sketch follows this list).
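A minimal sketch of the Hive dynamic-partitioning pattern mentioned above, assuming hypothetical staging and target tables (staging.daily_extract, analytics.daily_extract_partitioned) and columns; it is illustrative only, not the project's actual code.

    # Illustrative sketch only -- database, table, and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive_dynamic_partitions")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow Hive to derive partition values dynamically from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Target table partitioned by load_date (created once if it does not exist).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.daily_extract_partitioned (
            id BIGINT,
            payload STRING,
            updated_ts TIMESTAMP
        )
        PARTITIONED BY (load_date DATE)
        STORED AS PARQUET
    """)

    # Insert the daily staged extract (e.g. landed by the Sqoop import); one partition per load_date.
    spark.sql("""
        INSERT INTO TABLE analytics.daily_extract_partitioned
        PARTITION (load_date)
        SELECT id, payload, updated_ts, to_date(updated_ts) AS load_date
        FROM staging.daily_extract
    """)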
Java Developer
06.2015 - 12.2016 |Aetins
Adobe Flex, Struts, Spring, IMS, IBM MQ, XML, SOAP, JDBC, JavaScript, Oracle 9i, IBM WebSphere 6.0, ClearCase, Log4J, ANT, JUnit, IBM RAD, and Apache Tomcat
● Involved in various SDLC phases such as requirements gathering, design, analysis, and code development.
● Developed applications using Java, Spring Boot, and JDBC.
● Worked on various use cases in development using Struts and tested functionalities.
● Involved in preparing high-level and detail-level designs of the system using J2EE.
● Created Struts form beans, action classes, and JSPs following Struts framework standards.
● Involved in development of model, library, Struts, and form classes (MVC).
● Used display tag libraries for decoration and the display table for reports and grid designs.
● Designed and developed file upload and file download features using JDBC with Oracle BLOB.
● Worked on Core Java, using file operations to read system files (downloads) and present them on JSPs.
● Involved in development of the underwriting process, which involves communication with outside systems using IBM MQ and JMS.
● Used PL/SQL stored procedures for applications that needed to execute as part of a scheduling mechanism.
● Designed and developed the application based on the Struts framework using the MVC design pattern.
● Developed Struts action classes using the Struts controller component.
● Developed SOAP-based XML web services.
● Used the SAX XML API to parse XML and populate values for a bean.
● Used Jasper to generate rich-content reports; developed XML applications using XSLT transformations.
● Created XML documents using the StAX XML API to pass XML structures to web services.
● Used Apache Ant for the entire build process.
● Used Rational ClearCase for version control and JUnit for unit testing.
● Used the Quartz scheduler to process or trigger applications daily.
● Configured the WebSphere application server and deployed web components.
● Provided troubleshooting and error-handling support in multiple projects.
● Deployed applications and patches in all environments and provided production support.

Educational background

Engineering and Technology
Pydah College of Engineering and Technology, Gambheeram

Languages

English - Proficient