Table of Contents
- Introduction
- Academia and Research
- Tutoring and Content Development
- Coding and Data Science
- My journey at the Office for National Statistics
- My journey at Quantexa
Introduction
Hello, I’m Dr. Usman Kayani — a Senior Data Engineer and Scientist with a solid academic foundation in Mathematics, showcased through my MSci and PhD degrees. My career is rooted in data engineering, data science, and theoretical physics, where I specialize in creating scalable data solutions, deploying machine learning models, and leveraging big data to solve complex, industry-wide challenges.
With over a decade of experience in Python development, enhanced by recent proficiency in Scala, I design and implement robust analytical data pipelines (ETL, RAP) and deploy machine learning models at scale. My technical expertise includes Apache Spark, Airflow, BigQuery, and SQL, enabling me to build dynamic, high-performance data solutions. Proficient in cloud platforms such as GCP, Cloudera, and AWS, I leverage cloud-based infrastructures to transform vast datasets into actionable insights, empowering businesses to make data-driven decisions with confidence.
As a neurodivergent individual with ADHD and Dyslexia, I bring a unique approach to problem-solving and innovation in data engineering. My neurodivergent perspective has fostered creativity and resilience, driving me to develop inventive solutions. As an advocate for diversity and inclusion, I am committed to fostering an environment where neurodiversity is understood and celebrated within the workplace.
In addition to my technical expertise, I am deeply committed to education and have taught Mathematics and Physics across various levels, from GCSE to advanced university topics, including statistics, linear algebra, and Python programming. My approach to teaching emphasizes accessibility, making complex topics understandable and engaging. I have also worked as a freelance mathematics content writer and video developer, producing educational materials for platforms reaching a wide audience.
Driven by a steadfast belief in the power of data and technology, I aim to use my skills for positive societal impact. I am always seeking innovative ways to apply data science to improve lives and drive meaningful change. Whether it’s mentoring new engineers, enhancing data engineering practices, or advocating for inclusivity, I am dedicated to harnessing the transformative potential of data to make a lasting difference.
Academia and Research
I completed an MSci in Mathematics, with First Class Honours, and a PhD in Applied Mathematics and Theoretical Physics at King’s College London.
During my MSci, I scored highly in my modules, most notably gaining 99% in linear algebra and partial differential equations with complex variables. Other pure and applied mathematics modules that I completed included topics covering advanced calculus, dynamical systems, probability and statistics, discrete mathematics, logic, complex analysis, manifolds, linear algebra, differential geometry and mathematical biology.
In my second year as an undergraduate, I wrote a paper under the advice of one of my professors, who thought that it was interesting enough to be posted on the university website for the mathematics department. This gave me valuable experience in writing a paper and investigating different mathematical techniques and ideas.
The paper I wrote was on a new method to solve linear ordinary differential equations of any order with non-constant coefficients and certain classes of partial differential equations. As a result, on the recommendation of another professor and based on my academic record, I was able to attend the fourth-year module, Fourier analysis, for extra credit in my second year.
In the final two years of my MSci, I also completed modules in physics, covering quantum mechanics, advanced quantum field theory, statistical mechanics, advanced general relativity, string theory, and cosmology.
For my MSci dissertation, I studied integrable quantum spin chains. This introduced me to quantum interaction models, particularly the Heisenberg spin chain, which describes the nearest-neighbour interaction of particles with spin-$\frac{1}{2}$ (e.g. electrons) and naturally arises in the study of ferromagnetism.
These models have acquired growing importance in quantum computing within the field of quantum information processing, mainly as a means of efficiently transferring information. In addition to the understanding of quantum computing that I acquired from the research, it also gave me insights into various mathematical methods to diagonalize large square matrices whose dimension grows exponentially with the number of particles.
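To make the exponential growth concrete, here is a minimal sketch of brute-force diagonalization for a short Heisenberg chain. It is illustrative only and not the dissertation code: the open boundary conditions, the sign convention $H = J \sum_i \vec{S}_i \cdot \vec{S}_{i+1}$ with $\hbar = 1$, and all function names are assumptions made for this example.

```python
import numpy as np

# Spin-1/2 operators S = sigma / 2 (hbar = 1).
SX = np.array([[0, 1], [1, 0]], dtype=complex) / 2
SY = np.array([[0, -1j], [1j, 0]], dtype=complex) / 2
SZ = np.array([[1, 0], [0, -1]], dtype=complex) / 2


def site_operator(op, site, n_sites):
    """Embed a single-site operator at `site` in the 2**n_sites-dimensional space."""
    result = np.array([[1.0 + 0j]])
    for k in range(n_sites):
        result = np.kron(result, op if k == site else np.eye(2))
    return result


def heisenberg_hamiltonian(n_sites, coupling=1.0):
    """Dense open-chain Hamiltonian: coupling * sum_i S_i . S_{i+1} (assumed convention)."""
    dim = 2 ** n_sites
    h = np.zeros((dim, dim), dtype=complex)
    for i in range(n_sites - 1):
        for op in (SX, SY, SZ):
            h += coupling * site_operator(op, i, n_sites) @ site_operator(op, i + 1, n_sites)
    return h


def spectrum(n_sites, coupling=1.0):
    """Eigenvalues in ascending order; H is Hermitian, so eigvalsh applies."""
    return np.linalg.eigvalsh(heisenberg_hamiltonian(n_sites, coupling))


# The matrix dimension doubles with every extra spin.
for n in (2, 3, 4):
    print(n, heisenberg_hamiltonian(n).shape)
```

Dense diagonalization like this becomes impractical after a few dozen sites, which is why structured methods for integrable chains are valuable.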
For my academic excellence in my MSci, I was also awarded a scholarship by the Science and Technology Facilities Council (STFC) to undertake a PhD in Applied Mathematics and Theoretical Physics under the mathematics department. My research was in quantum gravity, which is the study of the relationship between quantum mechanics and general relativity.
In particular, I studied string and supergravity theories, specifically type IIA, massive type IIA, and 5-dimensional (both gauged and ungauged) supergravity.
The main theme of my PhD research was a study of the symmetries of black hole horizons in quantum gravity.
The work has been published in three peer-reviewed papers in leading international journals, with the third publication, on the five-dimensional supergravity theories, being a sole-author paper. My examiners remarked on the great achievement of producing such a sole-author paper during a PhD. I have also continued my research and publications in quantum gravity as an independent researcher, working on black holes in 6-dimensional gauged $N=(1,0)$ supergravity.
The methodology used to investigate these problems involves techniques in Lie algebras, differential geometry, differential equations on compact manifolds, general relativity, supersymmetry and string theory. Algebraic and differential topology were also essential in the analysis.
For my research in quantum gravity, I also performed extensive computations on multi-dimensional arrays for supergravity calculations using Python with Cadabra, particularly for Clifford algebras and spinors in higher dimensions.
I also attended various academic seminars and conferences such as the Winter School on Supergravity, Strings, and Gauge Theory at CERN in Switzerland, and I presented my research at many conferences including the Young Theorists’ Forum at Durham University.
Tutoring and Content Development
I have experience teaching and tutoring Mathematics, Physics and Programming at graduate (BSc/MSc), A-level and GCSE level. I was a graduate teaching assistant at King’s College London for 6 years, and a private tutor for over 12 years. During my PhD, I also volunteered for roles in science communication, such as with the Institute of Physics, to explain my research on black holes to school children.
In the last few years, I have also undertaken freelance work with Witherow Brooke, a private tuition and educational consultancy company which was featured in The Telegraph. I am currently tutoring university students in mathematics topics such as advanced statistics, linear algebra and Python coding.
I also worked as a freelance mathematics content writer and video developer for Nagwa, a leading educational technology company in the Middle East and North Africa. I was responsible for developing educational videos and content for the company’s online platform for topics in mathematics from GCSE to graduate level, such as algebra, trigonometry, calculus and statistics.
Coding and Data Science
I have extensive experience coding in many high-level programming languages (e.g. C++, Java), scripting languages (e.g. Python, R, Bash) and symbolic languages (e.g. Mathematica, MATLAB, Maple) on various operating systems (e.g. Linux, Windows, macOS).
In addition to Cadabra, I used Mathematica and Maple for my research in quantum gravity during my PhD and beyond. I have also used various programming languages and libraries (e.g. NumPy, SciPy, Matplotlib) to perform mathematical, physical, and statistical computations for various analyses and datasets.
During my PhD, I also worked on a personal project to interact with Bluetooth Low Energy (BLE) smart watches in Python. This was used to read the heart rate (bpm), systolic/diastolic blood pressure (mmHg) and SpO2 blood oxygen (%), with a live plot of real-time HR readings using gnuplot. I also used Python to perform data analysis on the data collected from the smart watch.
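As a small illustration of what reading a heart rate over BLE involves, the sketch below decodes the value of the standard GATT Heart Rate Measurement characteristic (UUID 0x2A37). This is an assumption for illustration: many smart watches use vendor-specific characteristics instead, and the function name is hypothetical rather than taken from the project.

```python
def parse_heart_rate_measurement(payload: bytes) -> int:
    """Decode beats per minute from a GATT Heart Rate Measurement value.

    Assumes the standard layout: byte 0 holds flags, and bit 0 of the flags
    selects whether the heart rate that follows is a uint8 or a uint16.
    """
    if len(payload) < 2:
        raise ValueError("payload too short for a heart rate measurement")
    flags = payload[0]
    if flags & 0x01:
        # Bit 0 set: heart rate is a little-endian 16-bit value.
        if len(payload) < 3:
            raise ValueError("payload too short for a 16-bit heart rate")
        return int.from_bytes(payload[1:3], "little")
    # Bit 0 clear: heart rate is a single unsigned byte.
    return payload[1]


# Example notification payload: flags = 0x00, heart rate stored in one byte.
print(parse_heart_rate_measurement(bytes([0x00, 72])))
```

In a live setup, a BLE client library would deliver such payloads through a notification callback, and each decoded value could then be appended to the series being plotted.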
After leaving academia, one of the roles I decided to pursue was a Scientific Software Engineer at the Meteorological Office. For this role, I gave a presentation on the history, applications and good practices of scientific software engineering.
Ultimately, I decided to embark on a career in data science instead. Nevertheless, the panel was very impressed with my application and noted that I had a good understanding of the importance of software quality, and awareness of the challenges of developing scientific software. I also demonstrated good examples of software development, especially in Python, and a sensible approach to debugging code, including when working with others’ code.
I am experienced with data science and machine learning Python libraries (e.g. scikit-learn, Keras, TensorFlow) as well as data visualization software such as Tableau and Power BI.
As a freelance data scientist, I analysed instantaneous power consumption data from a large number of households using supervised machine learning models. The aim was to identify individual devices (e.g. the TV or kettle), classify when they were switched on, measure the frequency and duration of their usage, and use this to identify routines and detect anomalies. The data was provided by a company that aims to apply ML and modelling to domestic electric appliances.
My journey at the Office for National Statistics
Throughout my tenure at the Office for National Statistics (ONS), I have made pivotal contributions to data science and analysis, specializing in the development of advanced statistical and machine learning models, designing analytical data pipelines, and crafting data visualizations to distil complex data insights. Key projects include devising sophisticated multilateral price indices for the treatment of alternative data sources, implementing novel machine learning methodologies in Python, and significantly enhancing the computational efficiency of data processes. Furthermore, my commitment to knowledge sharing has positioned me as a mentor to new team members and a key presenter in various organizational seminars.
Data Scientist in Reproducible Data Science and Analysis
In June 2021, I was employed as a Data Scientist at the Higher Executive Officer grade at the ONS within the Economics Statistics Group (ESG) and the Reproducible Data Science and Analysis (RDSA) team, formerly known as Emerging Platforms Delivery Support (EPDS).
After only my second month at the ONS, I joined the induction team responsible for onboarding new starters and mentoring new members of the team. My main work was researching and implementing multilateral price indices, using calculations and time series extension methods in Python. This work was part of an ETL Reproducible Analytical Pipeline (RAP) on Cloudera with Apache Spark for the treatment of alternative data sources (scanner and web-scraped data) and new index methods, which will be used to determine the consumer price index (CPI) in the future.
As part of the new index methods, I looked at multilateral methods, which simultaneously make use of all data over a given time period. Their use for calculating temporal price indices is relatively new internationally, but these methods have been shown to have some desirable properties relative to their bilateral counterparts: they account for new and disappearing products (to remain representative of the market) while also reducing the scale of chain drift.
While working on building a data pipeline for the CPI, I made very significant contributions both to methodology and computational efficiency for the integration of alternative data sources. In my first few months, I led an investigation into a particular implicit hedonic multilateral index method known as the Time Product Dummy (TPD) method, which uses a log-linear price model with weighted least squares regression and expenditure shares as weights.
\[\begin{aligned} \ln p_i^{t} &= \alpha + \sum_{r=1}^T \delta^r D_i^{t,r} + \sum_{j=1}^{N-1}\gamma_j K_{i,j} + \epsilon_i^{t} \ , \\ s_i^{t} &= \frac{p_i^t q_i^t}{\sum_{j=1}^N p_j^t q_j^t} \ . \end{aligned}\]
After noticing an error in the formulae and example workbooks produced for these methods and bringing this to the attention of the ONS, I worked closely with people from methodology on making sure we got all the technical details right.
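The TPD regression can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ONS pipeline code: the function name and input layout are hypothetical, the first period is taken as the base, and the index for period $t$ is read off as $\exp(\delta^t)$.

```python
import numpy as np


def tpd_index(periods, products, prices, quantities):
    """Time Product Dummy index via weighted least squares.

    Inputs are 1-D arrays with one entry per (product, period) observation.
    Returns (sorted unique periods, index values with the first period = 1).
    """
    periods = np.asarray(periods)
    products = np.asarray(products)
    prices = np.asarray(prices, dtype=float)
    quantities = np.asarray(quantities, dtype=float)

    time_labels, time_idx = np.unique(periods, return_inverse=True)
    prod_labels, prod_idx = np.unique(products, return_inverse=True)

    # Expenditure shares s_i^t within each period, used as the WLS weights.
    expenditure = prices * quantities
    period_totals = np.bincount(time_idx, weights=expenditure)
    shares = expenditure / period_totals[time_idx]

    # Design matrix: intercept, time dummies (first period dropped),
    # product dummies (last product dropped).
    time_dummies = np.eye(time_labels.size)[time_idx][:, 1:]
    prod_dummies = np.eye(prod_labels.size)[prod_idx][:, :-1]
    design = np.hstack([np.ones((prices.size, 1)), time_dummies, prod_dummies])

    # WLS is ordinary least squares on rows scaled by sqrt(weight).
    root_w = np.sqrt(shares)[:, None]
    target = np.log(prices)[:, None]
    coef, *_ = np.linalg.lstsq(design * root_w, target * root_w, rcond=None)

    deltas = coef[1:time_labels.size, 0]
    return time_labels, np.concatenate([[1.0], np.exp(deltas)])
```

A quick sanity check is to feed in prices that all move in proportion between periods; the index should then recover the common price relative exactly.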
My first task was to implement the TPD method within the CPI pipeline using PySpark. Spark’s native ML library, though powerful, lacks many features and is not suited to fitting models on multiple groups or subsets of the data at once. The usual way to apply custom functions or transformations that are not among Spark’s built-in functions is a User Defined Function (UDF). However, UDFs have performance issues: they are executed one row at a time and thus suffer from high serialization and invocation overhead.
This led me to discover Pandas UDFs, which use Apache Arrow to run vectorized operations on big data and can improve performance by up to 100x compared with regular row-at-a-time UDFs. They have since been implemented in various multilateral index methods and are an integral part of the CPI pipeline.
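The grouped pattern can be sketched as a function that receives one group's rows as a pandas DataFrame and returns a small DataFrame of results. The example below is an illustrative assumption rather than the pipeline code: the column names (`group`, `period`, `price`, `weight`) and the simple log-linear trend model are hypothetical, and plain pandas stands in for Spark so that the sketch runs without a cluster.

```python
import numpy as np
import pandas as pd


def fit_log_price_trend(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit ln(price) = a + b * period by weighted least squares for ONE group."""
    design = np.column_stack(
        [np.ones(len(pdf)), pdf["period"].to_numpy(dtype=float)]
    )
    root_w = np.sqrt(pdf["weight"].to_numpy(dtype=float))[:, None]
    target = np.log(pdf["price"].to_numpy(dtype=float))[:, None]
    coef, *_ = np.linalg.lstsq(design * root_w, target * root_w, rcond=None)
    return pd.DataFrame(
        {
            "group": [pdf["group"].iloc[0]],
            "intercept": [coef[0, 0]],
            "slope": [coef[1, 0]],
        }
    )


# With Spark 3.x, the same function could be applied to every group in parallel:
#   sdf.groupBy("group").applyInPandas(
#       fit_log_price_trend,
#       schema="group string, intercept double, slope double",
#   )
# Here plain pandas plays the role of Spark (an assumption for portability).
def apply_per_group(df: pd.DataFrame) -> pd.DataFrame:
    parts = [fit_log_price_trend(part) for _, part in df.groupby("group")]
    return pd.concat(parts, ignore_index=True)
```

Because each group arrives as a whole DataFrame, the model fit uses vectorized NumPy operations instead of being invoked once per row.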
I also used the same ideas for the Time Dummy Hedonic (TDH) method, which is an explicit hedonic model similar to TPD, but also uses the item characteristics in the WLS regression model.
\[\begin{aligned} \ln p_i^{t} = \delta^0 + \sum_{r=1}^T \delta^r D_i^{t,r} + \sum_{k=1}^K \beta_k z_{i,k} + \epsilon_i^{t} \ . \end{aligned}\]
After implementing the TPD and TDH methods, I turned my attention to another multilateral method known as Geary-Khamis (GK). The usual method involves iteratively calculating the set of quality adjustment factors simultaneously with the price levels.
\[\begin{aligned} b_{n}&=\sum_{t=1}^{T}\left[\frac{q_{t n}}{q_{n}}\right]\left[\frac{p_{t n}}{P_{t}}\right] \ , \\ P_{t}&=\frac{p^{t} \cdot q^{t}}{ \vec{b} \cdot q^{t}} \ . \end{aligned}\]
I was able to independently research and implement a method solely based on matrix operations, which makes the method more efficient since it has vectorized operations which act on the entire data. I also refactored my code for TPD and TDH using matrix operations, which turned out to be more efficient and increased performance by up to 7x compared to standard statistical libraries. The Pandas UDFs were also applied to the time series extension methods for TPD, TDH, GK and another multilateral method known as GEKS.
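One known way to replace the iteration with linear algebra is to substitute the second GK equation into the first, which gives a homogeneous system $(I - C)\,\vec{b} = 0$ with $C = \hat{q}^{-1} \sum_t s^t (q^t)^{\mathsf T}$, where $\hat{q}$ is the diagonal matrix of total quantities $q_n$ and $s^t$ is the vector of expenditure shares in period $t$. The sketch below solves it with the normalization $\sum_n b_n = 1$. It is a minimal illustration of this formulation and is not necessarily the implementation described above; the function name and the normalization choice are assumptions.

```python
import numpy as np


def geary_khamis(prices, quantities):
    """Geary-Khamis price levels from (T, N) arrays of prices and quantities.

    Returns (b, index): quality adjustment factors normalised to sum to 1,
    and price levels P_t rebased so that the first period equals 1.
    """
    p = np.asarray(prices, dtype=float)
    q = np.asarray(quantities, dtype=float)
    n_products = p.shape[1]

    # Expenditure shares s_tn and total quantities q_n.
    expenditure = p * q
    shares = expenditure / expenditure.sum(axis=1, keepdims=True)
    q_total = q.sum(axis=0)

    # C[n, m] = (1 / q_n) * sum_t s_tn * q_tm, so the GK system is (I - C) b = 0.
    c_matrix = (shares.T @ q) / q_total[:, None]

    # Fix the scale of b with sum(b) = 1: adding a first row of ones (R)
    # turns the singular system into the solvable (I - C + R) b = e_1.
    r_matrix = np.zeros((n_products, n_products))
    r_matrix[0, :] = 1.0
    e1 = np.zeros(n_products)
    e1[0] = 1.0
    b = np.linalg.solve(np.eye(n_products) - c_matrix + r_matrix, e1)

    levels = expenditure.sum(axis=1) / (q @ b)
    return b, levels / levels[0]
```

Since the whole computation is a handful of matrix products and one linear solve, it avoids convergence loops and fits naturally inside a grouped, vectorized function.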
In October 2021, after working closely with methodology on index numbers, I was invited to join the Index Numbers Expert Group (INEG) and the Data Science and High-performance computing (DaSH) expert group.
In November 2021, I delivered a seminar to my team and department to introduce the concept of Pandas UDFs. It was a success: I got good engagement and questions after the presentation, as well as interest from other parties in DaSH in watching the recording and reading the slides. I also presented a seminar aimed at people with both little and extensive knowledge of the subject, along with a Jupyter Notebook of worked examples. I discussed this material with a computing specialist and, with their feedback, produced a full set of instructions and worked examples that is accessible to a wider audience.
Senior Data Scientist in the Data Science Campus
In March 2022, I joined the Data Science Campus at the ONS with a promotion to Senior Executive Officer and a permanent role in the civil service.
My first project was on the least cost index, which was published in May 2022. I played a significant role in researching and implementing the price index and aggregation methods, which were powered by PriceIndexCalc, a Python price index package that I created.
My package and work were used to track the prices over time of the lowest-cost grocery items for 30 products across multiple retailers, using web-scraped data and a data pipeline on the Google Cloud Platform. This analysis was conducted as part of the ONS’s current and future analytical work related to the cost of living.
In April 2022, I also joined the Data Access Platform Capability And Training Support (DAPCATS) as a mentor, where I have been helping other data scientists and analysts with their work and projects.
I also took part in the Spark at the ONS event, hosted by DAPCATS and created for the launch of a new online book. This event was used to discuss various topics and resources related to Spark and Big Data, and I delivered a presentation titled Spark application debugging, tuning and optimization. In this talk, I discussed various tips and techniques to increase efficiency, identify bugs or bottlenecks that can cause Spark applications to be slow or fail, and tune Spark parameters accordingly. This can help to reduce overall developer and compute time, the cost of resources to run the Spark application, and the environmental impact that comes with using unnecessary extra resources or having significantly longer runtimes.
In August 2022, I received the Recognition Award for outstanding collaboration and contribution to the ONS. I provided very important support to help another team publish the Capital Stocks user guide article, and the work has made the UK the only country to introduce such transparency. The process involved sharing their statistical production code in the ONS’s GitHub account, and I dedicated my time to helping them set up the initial account and upload the packages to GitHub, as the team had not used this platform before. I also took the time to give them a very detailed walkthrough of how the platform works, and helped them by sharing tips and examples of good practice. My support enabled them to make their capital stocks statistical production system accessible and reproducible by all external users, helping them make the statistics more inclusive and introducing innovative platforms to help their users improve their analysis and budgetary forecasting.
In September 2022, I continued to work on a project to investigate the feasibility of using transparency declarations to improve intelligence on public sector expenditure and increase the quality of ONS public statistics. The declarations refer to expenditure data that local councils and central government bodies must publish to meet their transparency requirements. This work may also offer insights into the spatial distribution of public spending, which could be useful for policy agendas.
In October 2022, I became a founding member of the ONS Data Science Network, a new cross-departmental group that promotes data science events and training across the organization. The network also provides a forum for data scientists and analysts to discuss and share ideas on data science and analysis, and to promote the use of data science and analysis. The network consists of founding members from the Reproducible Data Science and Analysis (RDSA), Methodology and Quality Directorate (MQD), and Data Science Campus (DSC).
In November 2022, I had a one-on-one chat with Professor Sir Ian Diamond, the National Statistician and head of the ONS, about ways to improve transparency for statistics. We discussed the importance of releasing code for scrutiny and learning purposes, as well as the challenges that prevent people from releasing code. We also talked about the potential for the ONS Data Science Network to promote code quality for data science projects across the organization. Additionally, we discussed the importance of data visualization and communication in transparency and the ONS’s efforts to improve in this area.
In December 2022, I received the Recognition Award again for outstanding collaboration and contribution to the ONS. I helped another team in a different department who recently migrated their systems and team of new developers to GCP. They encountered issues getting things set up with the on-prem laptop and GCP, largely due to niche ONS system restrictions which made it difficult to find resources on the internet to solve them. I generously shared my knowledge and expertise, which saved the team a lot of time and helped them gain a deeper understanding of the topic.
In May 2023, despite leaving the ONS in January of the same year, I was honored with another Recognition Award for the impactful contribution I had made in establishing a cross-ONS network of data scientists during my time at the organization. Displaying initiative, I set up a project management tool to handle different aspects of the network and its prospective deliverables. Moreover, I created a dedicated communication channel and a network inbox, both of which were critical for effective communication within the network. My proactive role in laying the foundation for the network was acknowledged as instrumental in creating a thriving environment where data scientists across the ONS could collaborate and learn. This achievement not only acknowledges my contributions to the data science community within the ONS but also underscores the importance of fostering collaboration and community in the ever-evolving field of data science.
My journey at Quantexa
My role as a Data Engineer at Quantexa has been a deep dive into the exciting and challenging world of data engineering. This journey has allowed me to push the boundaries of big data technology, harnessing its transformative potential to enhance decision-making for businesses across multiple sectors. My time at Quantexa has been marked by perseverance, ambition, teamwork, and accountability—principles that resonate deeply with the company’s core values. From optimizing processes with Python scripts to identifying system vulnerabilities, I have consistently sought impactful contributions. My approach to teamwork emphasizes inclusivity, fostering an environment that champions collective growth. Accountability remains a priority, maintained through regular feedback and high standards of continual self-improvement.
Data Engineer in Research and Development (R&D)
In January 2023, I joined Quantexa as a Data Engineer, diving into advanced network analytics and dynamic entity resolution in a fast-paced fintech setting. Founded in 2016, Quantexa has established itself as a leader in contextual data insights, applying its pioneering technologies across diverse sectors like Finance, Insurance, Energy, and Government to tackle issues from lead generation and customer insights to fraud detection and financial crime. It’s inspiring to contribute to an organization driven by the belief that better decisions are made through contextual understanding.
My role has been dynamic and impactful, encompassing the development, testing, and documentation of data engineering tools and best practices. These materials support Quantexa’s deployments, strengthening the data engineering function and improving the quality and efficiency of project delivery.
Though relatively new to Scala, I’ve made significant strides in mastering this language, a vital tool for data processing within big data ecosystems. My proficiency in Scala has produced tangible results, with contributions that influence development sprints and code optimization.
My knowledge of big data technologies—particularly Spark, Hadoop, and Elasticsearch—has also been pivotal in defining best practices across the business. With skills in Java, Python, and Scala, I’m well-equipped to support the team’s initiatives and deliver efficient solutions across both cloud and on-premise environments.
Stakeholder engagement is essential to my role, allowing me to work closely with delivery teams, clients, and partners to deliver high-quality solutions. These solutions span diverse tasks, including ETL pipelines, data cleansing, parsing, and standardization, as well as data classification and entity extraction/resolution.
A significant milestone in my journey was achieving a 97% score in the Quantexa Academy, earning the title of Quantexa Certified Data Engineer. This accomplishment deepened my expertise in Quantexa’s technology stack, and I actively support other Academy participants, fostering a collaborative learning environment within the team.
Senior Data Engineer in Research and Development (R&D)
In October 2024, I was promoted to Senior Data Engineer within Quantexa’s Data Engineering Accelerators and Demos (DEAD) team in R&D. This role enables me to lead complex data engineering projects and drive innovation within Quantexa’s technology stack. My responsibilities now include architecting, developing, and optimizing scalable data pipelines that support large-scale data integration, advanced analytics, and entity resolution. I work across both cloud and on-premise infrastructures, leveraging tools like Airflow, Spark, and Google Cloud, with a strong focus on infrastructure optimization, automation, and performance.
A recent highlight was spearheading the implementation of dynamic Directed Acyclic Graphs (DAGs) and autoscaling in Airflow, which has optimized resource usage and boosted scalability. This project aligns with my technical focus on big data and cloud solutions, which allows me to refine Quantexa’s infrastructure management practices for improved accessibility and best practices across the team.
Mentorship has become a rewarding focus of my role. I work closely with junior engineers, guiding them through coding challenges, conducting code reviews, and creating a collaborative learning environment. My expertise in Scala and Python has been instrumental in tackling complex data engineering problems and supporting the professional growth of emerging engineers, especially as we manage large-scale data and complex systems.
Innovation is central to my role, and I’ve recently developed a dynamic DAG generator, allowing ETL pipelines to be created directly from configuration files, which has streamlined Quantexa’s ETL workflow and enabled rapid deployment of custom data pipelines. My vision is to expand this into a web-based application with a drag-and-drop interface, empowering data engineering teams to design, configure, and deploy ETL pipelines more efficiently. This project represents my commitment to creating accessible tools that democratize data engineering workflows and enhance productivity.
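The core idea of generating a DAG from configuration can be sketched without any orchestration framework: read task definitions, validate their dependencies, and derive a valid execution order. The example below is a library-free illustration, not the Quantexa generator or Airflow's API; the configuration schema (`dag_id`, `tasks`, `depends_on`) and the task names are hypothetical. In Airflow, each configured task would instead become an operator and each dependency an edge in the DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline configuration, e.g. as loaded from a YAML or JSON file.
PIPELINE_CONFIG = {
    "dag_id": "example_etl",
    "tasks": {
        "extract": {"depends_on": []},
        "cleanse": {"depends_on": ["extract"]},
        "parse": {"depends_on": ["extract"]},
        "standardise": {"depends_on": ["cleanse", "parse"]},
        "load": {"depends_on": ["standardise"]},
    },
}


def build_execution_order(config):
    """Validate task dependencies and return one valid execution order.

    Raises ValueError for unknown dependencies; cycles raise
    graphlib.CycleError, which is a subclass of ValueError.
    """
    tasks = config["tasks"]
    referenced = {dep for spec in tasks.values() for dep in spec["depends_on"]}
    unknown = referenced - set(tasks)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(spec["depends_on"]) for name, spec in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


print(build_execution_order(PIPELINE_CONFIG))
```

Keeping the pipeline shape in configuration means a new pipeline needs a new config file rather than new code, which is also what would make a drag-and-drop front end feasible.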
Beyond technical work, I’m responsible for setting and implementing best practices in ETL pipelines, big data processing, and data standardization. These initiatives ensure Quantexa’s ability to provide high-quality, standardized data for accurate analytics and decision-making. My experience with Spark, Hadoop, and Elasticsearch supports strategic efforts like data classification and entity extraction, both of which are integral to Quantexa’s data solutions.
I am also a strong advocate for neurodiversity and inclusivity within Quantexa, particularly for ADHD and dyslexic perspectives. I bring this advocacy to my work by emphasizing clear documentation, accessible resources, and continuous learning opportunities, ensuring that all team members can contribute effectively.
As I continue my journey at Quantexa, I look forward to the new challenges, learning opportunities, and contributions to data engineering. I am dedicated to aligning my work with Quantexa’s mission of delivering actionable insights through contextualized data, and I am excited to drive even greater successes with the company in the years to come.