Why a Hedge Fund Built Its Own Database
16 Sep 2024
Man Group and ArcticDB
- James Munro works at Man Group, an alternative asset management firm. (29s)
- Man Group manages over $160 billion in assets and trades $6 trillion or more annually. (2m24s)
ArcticDB Development and Use Cases
- ArcticDB was initially developed around 2011 as a Python-based database that used MongoDB as its backend. (5m27s)
- Man Group later rewrote the core of the system in C++ and removed the separate database layer to improve performance and scale. The system is now used across Man Group and by banks, asset managers, and data providers. (7m7s)
Justification for Building a Database
- The speaker argues that building a database, while seemingly counterintuitive, can be justified by the need for specialized tools, particularly when dealing with high-frequency data. (8m50s)
- Research productivity is crucial for systematic quant hedge funds like Man Group, as it drives the generation of alpha and portfolio optimization. (10m40s)
Challenges in Quantitative Finance
- Research productivity is a key challenge in quantitative finance due to increasing data, competition, and market efficiency. (12m57s)
- Asset managers use high-frequency, low-latency market data alongside alternative data such as weather, consumer, and environmental data. (13m38s)
Man Group's Data Science Transition
- In 2011, Man Group transitioned to Python for data science, despite the popularity of R, MATLAB, and C++ at the time, and supported the growth of the Python data science community. (14m39s)
- Eagle Alpha indexed thousands of datasets. (17m42s)
Man Group's Data Science Operations
- Man Group employs hundreds of quants who work on diverse problems using a variety of statistical methods, deep learning, and tools like ChatGPT. (18m36s)
- Systematic traders at Man Group use a Lambda architecture: streaming pipelines handle high-frequency data, while batch pipelines handle fundamental and alternative data. (19m11s)
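As a rough illustration of the Lambda pattern mentioned above, the sketch below keeps a batch view that is recomputed from stored history and a speed view that is updated per streamed event, then merges the two at query time. Every name here (`build_batch_view`, `SpeedLayer`, `mean_price`, the sample events) is hypothetical and not taken from the talk or from Man Group's systems.

```python
from collections import defaultdict

# Hypothetical events: (symbol, price). Made-up values, not real market data.
HISTORY = [("BOND_A", 99.5), ("BOND_A", 100.5), ("BOND_B", 101.0)]


def build_batch_view(events):
    """Batch layer: recompute (count, total) per symbol from the full history."""
    view = defaultdict(lambda: [0, 0.0])
    for symbol, price in events:
        view[symbol][0] += 1
        view[symbol][1] += price
    return {symbol: tuple(acc) for symbol, acc in view.items()}


class SpeedLayer:
    """Speed layer: incrementally track events that arrived after the last batch run."""

    def __init__(self):
        self.view = defaultdict(lambda: [0, 0.0])

    def on_event(self, symbol, price):
        self.view[symbol][0] += 1
        self.view[symbol][1] += price


def mean_price(symbol, batch_view, speed_layer):
    """Serving layer: merge the batch and real-time views at query time."""
    b_count, b_total = batch_view.get(symbol, (0, 0.0))
    s_count, s_total = speed_layer.view.get(symbol, (0, 0.0))
    count = b_count + s_count
    return (b_total + s_total) / count if count else None


batch_view = build_batch_view(HISTORY)
speed = SpeedLayer()
speed.on_event("BOND_A", 103.0)  # streamed after the batch run

print(mean_price("BOND_A", batch_view, speed))
```

The design point is that the batch view can be slow but complete, while the speed view only has to cover the gap since the last batch run.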
Bond Data Organization and Analysis
- The bond market is larger than the equity market: about three times larger globally, and even larger in the US. (23m55s)
- A normalized way of organizing bond data is a long-format pandas DataFrame with one row per date and bond ID (such as a CUSIP or ISIN) and columns for prices and fundamental data points like duration. (24m56s)
- A more practical layout for analysis pivots this data so that bond IDs become columns and dates become rows, with each table holding the time series of a single measure (like price). This supports efficient time series and cross-sectional analysis. (25m32s)
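The two layouts described above can be sketched with pandas. This is a minimal example with made-up identifiers and values (`BOND_A`, `BOND_B`, and the prices are placeholders, not real CUSIPs, ISINs, or market data).

```python
import pandas as pd

# Normalized (long) layout: one row per date and bond ID.
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]
        ),
        "bond_id": ["BOND_A", "BOND_B", "BOND_A", "BOND_B"],
        "price": [99.5, 101.0, 99.75, 100.5],
        "duration": [4.2, 7.1, 4.2, 7.1],
    }
)

# Pivoted (wide) layout for one measure: dates as rows, bond IDs as columns.
prices = long_df.pivot(index="date", columns="bond_id", values="price")

# Time series analysis: one column is the full history of one bond.
bond_a_changes = prices["BOND_A"].diff()

# Cross-sectional analysis: one row is every bond on a single date.
day = pd.Timestamp("2024-01-03")
spread_on_day = prices.loc[day].max() - prices.loc[day].min()

print(prices)
```

With one wide table per measure, the number of columns grows with the number of bonds, which is why very wide tables come up in the design discussion that follows.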
Database Design Considerations
- Columnar storage suits large tables, especially when users mostly read specific columns. Such tables can reach billions or trillions of rows for tick data or decades of daily data. (28m42s)
- A database for this work must handle not only long tables but also very wide ones, potentially with hundreds of thousands or even millions of columns, to support cross-sectional analysis. (29m3s)
- Prioritizing agility and performance in database design may mean sacrificing some degree of isolation, the property that helps users coordinate transactions from multiple sources. (31m3s)
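To make the columnar point concrete, here is a toy sketch in which every column is stored as its own blob, so reading two columns out of a wide table touches only those two blobs. The `ColumnStore` class and the column names are hypothetical, and this is not how ArcticDB lays out data on disk.

```python
from array import array


class ColumnStore:
    """Toy column store: each column is serialized as its own blob."""

    def __init__(self):
        self._blobs = {}     # column name -> bytes
        self.bytes_read = 0  # total bytes touched by reads

    def write(self, columns):
        for name, values in columns.items():
            self._blobs[name] = array("d", values).tobytes()

    def read(self, names):
        out = {}
        for name in names:
            blob = self._blobs[name]
            self.bytes_read += len(blob)
            col = array("d")
            col.frombytes(blob)
            out[name] = col.tolist()
        return out

    def total_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())


n_rows = 1_000
# A wide table: 500 hypothetical bond columns filled with made-up prices.
table = {f"BOND_{i}": [100.0 + i] * n_rows for i in range(500)}

store = ColumnStore()
store.write(table)

subset = store.read(["BOND_7", "BOND_42"])
print(store.bytes_read, "of", store.total_bytes(), "bytes read")
```

A row-oriented layout would have to scan every row, and therefore every column's bytes, to answer the same query.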
Serverless Database Architecture
- A 2008 paper proposed building a database on Amazon S3, highlighting the potential of running databases without servers. (33m33s)
- The proposed system aimed to provide atomicity and consistency while delegating durability to the storage system. It used an immutable data structure in which new versions are added instead of modifying existing ones. (35m47s)
- Serverless databases may sacrifice some transactional capabilities, but they move indexing and query execution into a heavier-weight client and rely on robust storage systems like S3 for security, capacity, and resiliency. (38m9s)
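The idea of an immutable, versioned structure on top of plain object storage can be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol from the 2008 paper and not ArcticDB's storage format: `ObjectStore` is an in-memory stand-in for an S3-like bucket, `VersionedClient` and the key names are hypothetical, and concurrent writers are not handled (which is exactly where isolation would be given up).

```python
import json


class ObjectStore:
    """Stand-in for an S3-like bucket: a flat mapping of key -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects.get(key)


class VersionedClient:
    """Heavier-weight client: all versioning logic lives here, not on a server."""

    def __init__(self, store):
        self.store = store

    def latest_version(self, symbol):
        ref = self.store.get(f"{symbol}/ref")
        return json.loads(ref)["version"] if ref is not None else None

    def write(self, symbol, rows):
        latest = self.latest_version(symbol)
        version = 0 if latest is None else latest + 1
        # 1. Write the data under a brand-new key; older versions stay untouched.
        self.store.put(f"{symbol}/data/v{version}", json.dumps(rows).encode())
        # 2. Commit by repointing the reference object in a single put.
        self.store.put(f"{symbol}/ref", json.dumps({"version": version}).encode())
        return version

    def read(self, symbol, as_of=None):
        version = self.latest_version(symbol) if as_of is None else as_of
        return json.loads(self.store.get(f"{symbol}/data/v{version}"))


client = VersionedClient(ObjectStore())
v0 = client.write("prices", [["2024-01-02", 99.5]])
v1 = client.write("prices", [["2024-01-02", 99.5], ["2024-01-03", 99.75]])

print(client.read("prices"))           # latest version
print(client.read("prices", as_of=0))  # earlier version is still readable
```

Because data objects are never modified, a reader sees either the old reference or the new one, and durability is whatever the underlying store provides.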
ArcticDB Features and Benefits
- ArcticDB scales with its users: because the database work runs in each user's client, those doing the most work also have the most database resources. (39m18s)
- ArcticDB offers a serverless architecture, allowing users to focus on data storage without server management. (39m58s)
- ArcticDB provides database semantics for reading, writing, updating, and deleting data from data frames. (40m29s)
- ArcticDB allows users to update data in the middle of a data frame, automatically updating indexing and versions. (45m0s)
- ArcticDB can handle large datasets; one example showed a dataset with 100,000 columns and 100,000 rows. (45m59s)
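The "update in the middle" behaviour described above can be illustrated without ArcticDB itself. The sketch below assumes that an update replaces the whole index range spanned by the new rows and that each update produces a new version while earlier versions remain readable. `update_range` and the sample data are hypothetical, and rows are plain `(date, value)` tuples sorted by ISO date string rather than a real data frame.

```python
from bisect import bisect_left, bisect_right


def update_range(rows, new_rows):
    """Return a new sorted list with the date range covered by new_rows replaced."""
    if not new_rows:
        return list(rows)
    dates = [date for date, _ in rows]
    lo = bisect_left(dates, new_rows[0][0])    # first existing row inside the range
    hi = bisect_right(dates, new_rows[-1][0])  # one past the last row inside the range
    return rows[:lo] + list(new_rows) + rows[hi:]


# Version 0: the original series (made-up values).
versions = [
    [
        ("2024-01-01", 100.0),
        ("2024-01-02", 101.0),
        ("2024-01-03", 102.0),
        ("2024-01-04", 103.0),
    ]
]

# A correction for two dates in the middle of the series.
correction = [("2024-01-02", 101.5), ("2024-01-03", 102.5)]

# Version 1 is a new list; version 0 is left as it was.
versions.append(update_range(versions[-1], correction))

print(versions[1])
```

Locating the affected range by binary search on the sorted index is what keeps a mid-series correction cheap compared with rewriting the whole series.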
ArcticDB Adoption and Use Cases
- Bloomberg uses ArcticDB in BQuant, its Python-based quantitative data science tool. (48m37s)
- Some users are dissatisfied with Pandas, finding it slow. (50m38s)
- Data can be stored in a single cluster node or distributed across storage. (50m43s)
- AWS dynamically distributes data to meet demand. (50m57s)