Table of Contents

  1. Introduction
  2. Academia and Research
  3. Tutoring and Content Development
  4. Coding and Data Science
  5. My journey at the Office for National Statistics
  6. My journey at Quantexa

Introduction

Hello, I’m Usman Kayani — a seasoned Data Engineer and Scientist with a rich academic foundation in Mathematics, reflected in my MSci and PhD degrees. Throughout my career, I’ve garnered invaluable analytical experience, mastering both theoretical and applied techniques across diverse fields such as mathematics, statistics, and numerical methods.

Armed with over a decade of Python development expertise and a recent proficiency in Scala, my specialty lies in engineering robust analytical data pipelines (ETL, RAP) and skillfully deploying machine learning models at scale. My proficiency in harnessing cloud services like GCP, Cloudera, and AWS, coupled with my adept usage of technologies such as Apache Spark, BigQuery, and SQL, enables me to design dynamic, scalable data solutions.

I identify as neurodivergent, living with ADHD and Dyslexia, and find that this unique perspective has richly informed my life and career. As an advocate for diversity and inclusion, I am passionate about fostering an environment of acceptance and understanding for neurodiversity in the workplace.

Fuelled by my unwavering belief in the transformative potential of data and technology, I continuously strive to apply my skills towards positive societal change. I am always on the lookout for innovative opportunities where data science can be leveraged to improve lives and create a meaningful impact.

Academia and Research

I completed an MSci in Mathematics, with First Class Honours, and a PhD in Applied Mathematics and Theoretical Physics at King’s College London.

During my MSci, I scored highly in my modules, most notably gaining 99% in linear algebra and partial differential equations with complex variables. Other pure and applied mathematics modules that I completed included topics covering advanced calculus, dynamical systems, probability and statistics, discrete mathematics, logic, complex analysis, manifolds, linear algebra, differential geometry and mathematical biology.

In my second year as an undergraduate, I wrote a paper under the guidance of one of my professors, who thought that it was interesting enough to be posted on the university website for the mathematics department. This gave me valuable experience in writing a paper and investigating different mathematical techniques and ideas.

The paper presented a new method to solve linear ordinary differential equations of any order with non-constant coefficients, as well as certain classes of partial differential equations. As a result, on the recommendation of another professor and based on my academic record, I was able to attend the fourth-year module, Fourier analysis, for extra credit in my second year.

In the final two years of my MSci, I also completed modules in physics, covering quantum mechanics, advanced quantum field theory, statistical mechanics, advanced general relativity, string theory, and cosmology.

For my MSci dissertation, I studied integrable quantum spin chains. This introduced me to quantum interaction models, particularly the Heisenberg spin chain, which describes the nearest-neighbour interaction of particles with spin-$\frac{1}{2}$ (e.g. electrons) and naturally arises in the study of ferromagnetism.

These models have acquired growing importance in quantum computing within the field of quantum information processing, mainly as a means of efficiently transferring information. In addition to the understanding of quantum computing that I acquired from the research, it also gave me insights into various mathematical methods to diagonalize large square matrices which can grow exponentially according to the number of particles.
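
To make the exponential growth concrete, here is a minimal sketch (not code from my dissertation) that builds the Hamiltonian of a spin-$\frac{1}{2}$ Heisenberg chain with NumPy and diagonalizes it numerically. The function names `site_operator` and `heisenberg_hamiltonian`, the open boundary conditions and the sign convention for the coupling are all assumptions made for illustration.

```python
import numpy as np

# Spin-1/2 operators S = sigma / 2 (Pauli matrices divided by two).
SX = np.array([[0, 1], [1, 0]], dtype=complex) / 2
SY = np.array([[0, -1j], [1j, 0]], dtype=complex) / 2
SZ = np.array([[1, 0], [0, -1]], dtype=complex) / 2


def site_operator(op, site, n_sites):
    """Embed a single-site operator at `site` into the 2**n_sites-dimensional space."""
    mats = [np.eye(2, dtype=complex)] * n_sites
    mats[site] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out


def heisenberg_hamiltonian(n_sites, coupling=1.0):
    """Open chain: H = coupling * sum_i S_i . S_{i+1} (assumed sign convention)."""
    dim = 2 ** n_sites  # the matrix size doubles with every extra spin
    H = np.zeros((dim, dim), dtype=complex)
    for i in range(n_sites - 1):
        for op in (SX, SY, SZ):
            H += coupling * site_operator(op, i, n_sites) @ site_operator(op, i + 1, n_sites)
    return H


# Two spins: one singlet level and a threefold-degenerate triplet level.
energies = np.linalg.eigvalsh(heisenberg_hamiltonian(2))
```

Brute-force diagonalization like this only works for a handful of spins, since the matrix has $2^N \times 2^N$ entries. That limitation is what motivates the analytic methods used for integrable chains.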

For my academic excellence in my MSci, I was also awarded a scholarship by the Science and Technology Facilities Council (STFC) to undertake a PhD in Applied Mathematics and Theoretical Physics under the mathematics department. My research was in quantum gravity, which is the study of the relationship between quantum mechanics and general relativity.

In particular, I studied string and supergravity theories, specifically type IIA, massive type IIA and 5-dimensional supergravity (both gauged and ungauged).

The main theme of my PhD research was a study of the symmetries of black hole horizons in quantum gravity.

The work has been published in three peer-reviewed papers in leading international journals, with the third publication, on the five-dimensional supergravity theories, being a sole-author paper. My examiners remarked on the great achievement of producing such a sole-author paper during a PhD. I have also continued my research and publications in quantum gravity as an independent researcher, working on black holes in 6-dimensional gauged $N=(1,0)$ supergravity.

The methodology used to investigate these problems involves techniques in Lie algebras, differential geometry, differential equations on compact manifolds, general relativity, supersymmetry and string theory. Algebraic and differential topology were also essential in the analysis.

For my research in quantum gravity, I also performed extensive computations on multi-dimensional arrays for supergravity calculations using Python with Cadabra, particularly for Clifford algebras and spinors in higher dimensions.

I also attended various academic seminars and conferences such as the Winter School on Supergravity, Strings, and Gauge Theory at CERN in Switzerland, and I presented my research at many conferences including the Young Theorists’ Forum at Durham University.

Tutoring and Content Development

I have experience teaching and tutoring Mathematics, Physics and Programming at graduate (BSc/MSc), A-level and GCSE level. I was a graduate teaching assistant at King’s College London for 6 years, and a private tutor for over 12 years. During my PhD, I also volunteered for roles in science communication, such as with the Institute of Physics, to explain my research on black holes to school children.

In the last few years, I have also undertaken freelance work with Witherow Brooke, a private tuition and educational consultancy company which was featured in The Telegraph. I am currently tutoring university students in mathematics topics such as advanced statistics, linear algebra and Python coding.

I also worked as a freelance mathematics content writer and video developer for Nagwa, a leading educational technology company in the Middle East and North Africa. I was responsible for developing educational videos and content for the company’s online platform for topics in mathematics from GCSE to graduate level, such as algebra, trigonometry, calculus and statistics.

Coding and Data Science

I have extensive experience coding in many high-level programming languages (e.g. C++, Java), scripting languages (e.g. Python, R, Bash) and symbolic languages (e.g. Mathematica, MATLAB, Maple) on various operating systems (e.g. Linux, Windows, macOS).

In addition to Cadabra, I used Mathematica and Maple for my research in quantum gravity during my PhD and beyond. I have also used various programming languages and libraries (e.g. NumPy, SciPy, Matplotlib) to perform mathematical, physical, and statistical computations for various analyses and datasets.

During my PhD, I also worked on a personal project to interact with Bluetooth Low Energy (BLE) smart watches in Python. The script read the heart rate (bpm), systolic/diastolic blood pressure (mmHg) and SpO2 blood oxygen (%), and produced a live plot of real-time heart rate readings using gnuplot. I also used Python to analyse the data collected from the smart watch.

After leaving academia, one of the roles I pursued was Scientific Software Engineer at the Meteorological Office. As part of my application, I gave a presentation on the history, applications and good practices of scientific software engineering.

Ultimately, I decided to embark on a career in data science instead. Nevertheless, the panel was very impressed with my application and noted that I had a good understanding of the importance of software quality and an awareness of the challenges of developing scientific software. I also demonstrated good examples of software development, especially in Python, and a sensible approach to debugging code, including when working with others’ code.

I am experienced with data science and machine learning Python libraries (e.g. scikit-learn, Keras, TensorFlow) as well as data visualization software such as Tableau and Power BI.

As a freelance data scientist, I analysed instantaneous power consumption data from a large number of households using supervised machine learning models. The aim was to identify individual devices (e.g. the TV or kettle), classify when they are turned on, and measure the frequency and duration of their usage, in order to identify routines and detect anomalies. The data was provided by a company that aims to apply ML and modelling to domestic electric appliances.

My journey at the Office for National Statistics

Throughout my tenure at the Office for National Statistics (ONS), I have made pivotal contributions to data science and analysis, specializing in the development of advanced statistical and machine learning models, designing analytical data pipelines, and crafting data visualizations to distil complex data insights. Key projects include devising sophisticated multilateral price indices for the treatment of alternative data sources, implementing novel machine learning methodologies in Python, and significantly enhancing the computational efficiency of data processes. Furthermore, my commitment to knowledge sharing has positioned me as a mentor to new team members and a key presenter in various organizational seminars.

Data Scientist in Reproducible Data Science and Analysis

In June 2021, I was employed as a Data Scientist at the Higher Executive Officer grade at the ONS within the Economics Statistics Group (ESG) and the Reproducible Data Science and Analysis (RDSA) team, formerly known as Emerging Platforms Delivery Support (EPDS).

By my second month at the ONS, I was a member of the induction team responsible for onboarding new starters and mentoring new members of the team. My main work was researching and implementing multilateral price index calculations and time series extension methods in Python. This work was part of an ETL Reproducible Analytical Pipeline (RAP) on Cloudera with Apache Spark for the treatment of alternative data sources (scanner and web-scraped data) and new index methods, which will be used to determine the consumer price index (CPI) in the future.

As part of the new index methods, I looked at multilateral methods, which simultaneously make use of all data over a given time period. Their use for calculating temporal price indices is relatively new internationally, but these methods have been shown to have some desirable properties relative to their bilateral counterparts: they account for new and disappearing products (to remain representative of the market) while also reducing the scale of chain drift.

While working on building a data pipeline for the CPI, I made very significant contributions both to methodology and computational efficiency for the integration of alternative data sources. In my first few months, I led an investigation into a particular implicit hedonic multilateral index method known as the Time Product Dummy (TPD) method, which uses a log-linear price model with weighted least squares regression and expenditure shares as weights.

\[\begin{aligned} \ln p_i^{t} &= \alpha + \sum_{r=1}^T \delta^r D_i^{t,r} + \sum_{j=1}^{N-1}\gamma_j K_{i,j} + \epsilon_i^{t} \ , \\ s_i^{t} &= \frac{p_i^t q_i^t}{\sum_{j=1}^N p_j^t q_j^t} \ . \end{aligned}\]
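
As an illustration of the model above, the following is a minimal NumPy/pandas sketch of a TPD calculation. It is not the ONS pipeline code. The function name `tpd_index`, the column names and the toy data are assumptions. The sketch relies on the standard result that the index for period $t$ relative to the first period is $\exp(\delta^t)$.

```python
import numpy as np
import pandas as pd


def tpd_index(df: pd.DataFrame) -> pd.Series:
    """TPD index from assumed columns: period, product, price, quantity."""
    # Expenditure shares s_i^t within each period, used as the WLS weights.
    expenditure = df["price"] * df["quantity"]
    share = expenditure / expenditure.groupby(df["period"]).transform("sum")

    # Design matrix: intercept, time dummies and product dummies,
    # dropping the first level of each to avoid collinearity.
    time_d = pd.get_dummies(df["period"], drop_first=True, dtype=float)
    prod_d = pd.get_dummies(df["product"], drop_first=True, dtype=float)
    X = np.column_stack([np.ones(len(df)), time_d.to_numpy(), prod_d.to_numpy()])
    y = np.log(df["price"].to_numpy(dtype=float))

    # Weighted least squares is ordinary least squares on rows
    # scaled by the square root of the weights.
    w = np.sqrt(share.to_numpy(dtype=float))
    coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

    # The time-dummy coefficients give the index relative to the first period.
    deltas = coef[1 : 1 + time_d.shape[1]]
    periods = sorted(df["period"].unique())
    return pd.Series(np.exp(np.concatenate([[0.0], deltas])), index=periods)


# Toy data: every price grows by 10% per period, so the index should too.
example = pd.DataFrame({
    "period": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "product": list("abcabcabc"),
    "price": [1.0, 2.0, 4.0, 1.1, 2.2, 4.4, 1.21, 2.42, 4.84],
    "quantity": [10, 5, 2, 9, 6, 2, 8, 6, 3],
})
index = tpd_index(example)
```

Because the toy prices fit the log-linear model exactly, the recovered index follows the 10% growth path regardless of the quantities chosen. With real data the expenditure-share weights determine how much each product influences the fit.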

After noticing an error in the formulae and example workbooks produced for these methods and bringing this to the attention of the ONS, I worked closely with people from methodology on making sure we got all the technical details right.

My first task was to implement the TPD method within the CPI pipeline using PySpark. Spark’s native ML library, though powerful, lacks many features and is not suited to fitting models on multiple groups or subsets of the data at once. The usual way to apply custom functions or transformations that are not among the built-in functions in Spark’s standard library is to write a User Defined Function (UDF). The downside is performance: regular UDFs are executed row-at-a-time and thus suffer from high serialization and invocation overhead.

This led me toward discovering Pandas UDFs, which use Apache Arrow to exchange data with Spark and allow vectorized operations on big data, and which can increase performance by up to 100x compared to regular row-at-a-time UDFs. They have since been implemented in various multilateral index methods and are an integral part of the CPI pipeline.
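
The grouped pattern can be sketched as follows. This is an illustrative example, not the pipeline code. The function names `fit_group` and `apply_on_spark`, the column names and the toy model (a straight-line fit of log price against period) are assumptions. The only Spark call used is `GroupedData.applyInPandas`, available in PySpark 3.0 and later.

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Called once per group with the whole group as a pandas DataFrame."""
    # Vectorised fit over the group: slope of log(price) against period.
    slope, _intercept = np.polyfit(
        pdf["period"].to_numpy(dtype=float),
        np.log(pdf["price"].to_numpy(dtype=float)),
        deg=1,
    )
    return pd.DataFrame({"group": [pdf["group"].iloc[0]], "slope": [float(slope)]})


def apply_on_spark(sdf):
    """Hypothetical wrapper: `sdf` is a Spark DataFrame with the same columns."""
    # Spark ships each group to Python via Arrow and calls fit_group on it.
    return sdf.groupBy("group").applyInPandas(
        fit_group, schema="group string, slope double"
    )


# The same function can be checked locally with plain pandas, without a cluster.
local = pd.DataFrame({
    "group": ["x"] * 3 + ["y"] * 3,
    "period": [0, 1, 2, 0, 1, 2],
    "price": [1.0, np.exp(0.1), np.exp(0.2), 5.0, 5.0, 5.0],
})
result = pd.concat(
    [fit_group(g) for _, g in local.groupby("group")], ignore_index=True
)
```

Writing the per-group logic as an ordinary pandas function makes it easy to unit test before it is run at scale.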

I also used the same ideas for the Time Dummy Hedonic (TDH) method, which is an explicit hedonic model similar to TPD, but also uses the item characteristics in the WLS regression model.

\[\begin{aligned} \ln p_i^{t} = \delta^0 + \sum_{r=1}^T \delta^r D_i^{t,r} + \sum_{k=1}^K \beta_k z_{i,k} + \epsilon_i^{t} \ . \end{aligned}\]

After implementing the TPD and TDH methods, I turned my attention to another multilateral method known as Geary-Khamis (GK). The usual approach iteratively calculates the set of quality adjustment factors $b_n$ simultaneously with the price levels $P_t$.

\[\begin{aligned} b_{n}&=\sum_{t=1}^{T}\left[\frac{q_{t n}}{q_{n}}\right]\left[\frac{p_{t n}}{P_{t}}\right] \ , \nonumber \\ P_{t}&=\frac{p^{t} \cdot q^{t}}{ \vec{b} \cdot q^{t}} \ . \end{aligned}\]
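
The iterative scheme described by these two equations can be sketched in NumPy as below. This is a minimal illustration of the standard fixed-point iteration, not the matrix-based implementation used in the pipeline. The function name `geary_khamis`, the array layout (periods in rows, products in columns), the normalisation $P_1 = 1$ and the convergence settings are assumptions.

```python
import numpy as np


def geary_khamis(prices, quantities, tol=1e-12, max_iter=1000):
    """Iterate the GK equations; inputs have shape (T periods, N products)."""
    p = np.asarray(prices, dtype=float)
    q = np.asarray(quantities, dtype=float)
    q_total = q.sum(axis=0)        # q_n: total quantity of product n over all periods
    P = np.ones(p.shape[0])        # initial guess for the price levels P_t
    for _ in range(max_iter):
        # b_n = sum_t (q_tn / q_n) * (p_tn / P_t)
        b = ((q / q_total) * (p / P[:, None])).sum(axis=0)
        # P_t = (p^t . q^t) / (b . q^t)
        P_new = (p * q).sum(axis=1) / (q @ b)
        P_new = P_new / P_new[0]   # fix the scale so the first period equals 1
        converged = np.max(np.abs(P_new - P)) < tol
        P = P_new
        if converged:
            break
    return P


# Sanity check: all prices double while quantities stay fixed.
levels = geary_khamis(
    prices=[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    quantities=[[10.0, 5.0, 2.0], [10.0, 5.0, 2.0]],
)
```

The equations only determine the price levels up to a common scale, which is why the sketch normalises them against the first period on every pass.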

I was able to independently research and implement a method solely based on matrix operations, which makes the method more efficient since it has vectorized operations which act on the entire data. I also refactored my code for TPD and TDH using matrix operations, which turned out to be more efficient and increased performance by up to 7x compared to standard statistical libraries. The Pandas UDFs were also applied to the time series extension methods for TPD, TDH, GK and another multilateral method known as GEKS.

In October 2021, after working closely with methodology on index numbers, I was invited to join the Index Numbers Expert Group (INEG) and the Data Science and High-performance computing (DaSH) expert group.

In November 2021, I delivered a seminar to my team and department introducing the concept of Pandas UDFs. It was a success: I got good engagement and questions after the presentation, and other members of DaSH asked for the recording and slides. I pitched the seminar at people with both little and extensive knowledge of the subject, and accompanied it with a Jupyter Notebook of worked examples. I discussed this material with a computing specialist and, with their feedback, produced a full set of instructions and worked examples that is accessible to a wider audience.

Data scientist in the Data Science Campus

In March 2022, I joined the Data Science Campus at the ONS with a promotion to Senior Executive Officer and a permanent role in the civil service.

My first project was on the least cost index, which was published in May 2022. I played a significant role in researching and implementing the price index and aggregation methods, which were powered by PriceIndexCalc, a Python price index package that I created.

My package and work were used to track the prices over time of the lowest-cost grocery items for 30 products across multiple retailers, using web-scraped data and a data pipeline on the Google Cloud Platform. This analysis was conducted as part of the ONS’s current and future analytical work related to the cost of living.

In April 2022, I also joined the Data Access Platform Capability And Training Support (DAPCATS) as a mentor, where I have been helping other data scientists and analysts with their work and projects.

I also took part in the Spark at the ONS event, hosted by DAPCATS to mark the launch of a new online book. The event covered various topics and resources related to Spark and big data, and I delivered a presentation titled Spark application debugging, tuning and optimization. In this talk, I discussed tips and techniques to increase efficiency, to identify bugs or bottlenecks that can cause Spark applications to be slow or fail, and to tune Spark parameters accordingly. These can help to reduce overall developer and compute time, the cost of resources to run the Spark application, and the environmental impact of using unnecessary extra resources or having significantly longer runtimes.

In August 2022, I received the Recognition Award for outstanding collaboration and contribution to the ONS. I provided important support to help another team publish the Capital Stocks user guide article, and the work has made the UK the only country to introduce such transparency. The process involved sharing their statistical production code in the ONS’s GitHub account. I dedicated my time to helping them set up the initial account and upload the packages to GitHub, as the team had not used this platform before. I also gave them a detailed walkthrough of how the platform works, and shared tips and examples of good practice. My support enabled them to make their capital stocks statistical production system accessible and reproducible by all external users, helping them make the statistics more inclusive and introducing innovative platforms to help their users improve their analysis and budgetary forecasting.

In September 2022, I continued to work on a project to investigate the feasibility of using transparency declarations to improve intelligence on public sector expenditure and increase the quality of ONS public statistics. The declarations refer to expenditure data that local councils and central government bodies must publish to meet their transparency requirements. This work may also offer insights into the spatial distribution of public spending, which could be useful for policy agendas.

In October 2022, I became a founding member of the ONS Data Science Network, a new cross-departmental group that promotes data science events and training across the organization. The network also provides a forum for data scientists and analysts to discuss and share ideas on data science and analysis, and to promote the use of data science and analysis. The network consists of founding members from the Reproducible Data Science and Analysis (RDSA), Methodology and Quality Directorate (MQD), and Data Science Campus (DSC).

In November 2022, I had a one-on-one chat with Professor Sir Ian Diamond, the National Statistician and head of the ONS, about ways to improve transparency for statistics. We discussed the importance of releasing code for scrutiny and learning purposes, as well as the challenges that prevent people from releasing code. We also talked about the potential for the ONS Data Science Network to promote code quality for data science projects across the organization. Additionally, we discussed the importance of data visualization and communication in transparency and the ONS’s efforts to improve in this area.

In December 2022, I received the Recognition Award again for outstanding collaboration and contribution to the ONS. I helped another team in a different department who recently migrated their systems and team of new developers to GCP. They encountered issues getting things set up with the on-prem laptop and GCP, largely due to niche ONS system restrictions which made it difficult to find resources on the internet to solve them. I generously shared my knowledge and expertise, which saved the team a lot of time and helped them gain a deeper understanding of the topic.

In May 2023, despite leaving the ONS in January of the same year, I was honored with another Recognition Award for the impactful contribution I had made in establishing a cross-ONS network of data scientists during my time at the organization. Displaying initiative, I set up a project management tool to handle different aspects of the network and its prospective deliverables. Moreover, I created a dedicated communication channel and a network inbox, both of which were critical for effective communication within the network. My proactive role in laying the foundation for the network was acknowledged as instrumental in creating a thriving environment where data scientists across the ONS could collaborate and learn. This achievement not only acknowledges my contributions to the data science community within the ONS but also underscores the importance of fostering collaboration and community in the ever-evolving field of data science.

My journey at Quantexa

My role as a Data Engineer at Quantexa has been a deep dive into the exciting and challenging world of data engineering. Embarking on this journey has allowed me to push the boundaries of big data technology and harness its transformative potential to augment decision-making capabilities for businesses across a myriad of sectors. My journey at Quantexa is marked by perseverance, ambition, teamwork, and accountability—principles that resonate deeply with the company’s core values. Whether it was enhancing operational efficiency via Python scripts or unveiling system vulnerabilities, I have consistently sought out ways to drive impact. My approach to teamwork has been rooted in inclusivity, fostering an environment that champions collective growth. I have maintained a strong focus on accountability by regularly soliciting feedback and setting high standards for continual self-improvement.

Data Engineer in Research and Development (R&D)

In January 2023, I embarked on a new journey as a Data Engineer with Quantexa, a dynamic fintech company founded in 2016. Renowned for its ground-breaking work in advanced network analytics and dynamic entity resolution, Quantexa operates with a vision that better decisions can be made through a greater understanding of context. It’s inspiring to be part of an organization whose pioneering technologies are applied across diverse sectors such as Finance, Insurance, Energy, and Government, generating valuable insights from data and addressing significant business issues like lead generation, customer insight, fraud, and financial crime.

My role at Quantexa has been dynamic and impactful, contributing to the development, testing, and documentation of a wide array of data engineering tools and best practices. These materials have found applications in Quantexa’s software deployments, strengthening the company’s data engineering function and enhancing the quality and efficiency of project delivery.

Despite being relatively new to Scala, I’ve made significant strides in mastering this powerful language during my tenure at Quantexa. With its fundamental role in data processing tasks within big data ecosystems, Scala quickly became a priority in my learning path. My growing proficiency in Scala has already yielded tangible results, with my contributions starting to influence development sprints and code optimization. Whether it’s grappling with complex data problems or working towards system stability, my burgeoning understanding of Scala has been put to good use.

One of my primary responsibilities involves defining big data best practices across the business. With a solid background in big data technologies such as Spark, Hadoop, and Elasticsearch, I’ve been able to bring valuable expertise to the table. My programming skills in Java, Python, and Scala have further enabled me to contribute effectively to the team’s initiatives.

Stakeholder engagement has been a crucial aspect of my role, allowing me to work closely with delivery teams, clients, and partners to provide high-quality solutions. These solutions span both cloud and on-premise environments and cover a variety of tasks including big data processing/ETL pipelines, cleansing, parsing and standardising global datasets, data classification, and entity extraction/resolution.

Building on my accomplishments at Quantexa, I undertook the Quantexa Academy, where I excelled by scoring an impressive 97% and earning the title of Quantexa Certified Data Engineer. My success in the academy has fortified my understanding and expertise in the field. Furthermore, I’ve frequently extended my support to other academy participants and my team, establishing a collaborative learning environment.

My journey at Quantexa is continuously evolving, offering daily opportunities for learning, development, and contribution to the exciting field of data engineering. As this journey continues, I look forward to sharing more updates in due course.