Abhijit is a seasoned technology leader with a rich history of spearheading transformative projects in the financial sector. Since joining Morgan Stanley Advantage Services (MSAS) in 2016, he has overseen back-office, middle-office, and services technology, championing initiatives in Data, Analytics, AI, Machine Learning, Cloud, and Salesforce CRM. His efforts have led to the creation of award-winning products like Next Best Action and Genome. In his talk at the Data Engineering Summit 2024 in Bengaluru, Abhijit shared insights on integrating generative AI into data engineering pipelines to drive innovation and efficiency.
Generative AI: The Synthetic Data Magician
Abhijit humorously kicked off his talk by sharing his interactions with ChatGPT, where he posed three questions related to data engineering. The responses, though amusing, highlighted critical aspects of generative AI’s role in data engineering. Generative AI was playfully dubbed a “synthetic data magician,” hinting at its potential to transform traditional data engineering processes.
Understanding Generative AI and Its Patterns
Generative AI, built on advanced machine learning and neural network models, represents the next stage of automation. Unlike earlier automation technologies such as Robotic Process Automation (RPA) and basic machine learning models, generative AI produces new outputs from the data it is given. The quality of those outputs therefore depends on high-quality, accurate, and detailed input data.
Abhijit outlined six key patterns where generative AI excels:
- Information Extraction: Efficiently extracting data from vast document collections.
- Information Summarization: Summarizing unstructured data sets.
- Language Translation: Translating languages accurately, essential for international operations.
- Q&A Retrieval: Implementing question-and-answer functionalities across various applications.
- Code Generation: Generating code and documentation, enhancing software development efficiency.
- Reasoning and Action: Integrating structured and unstructured data to support decision-making.
Data in Financial Services
In financial services, data is predominantly structured and stored in relational databases, data warehouses, and data lakes. Unstructured data, such as call logs, videos, manuals, and research documents, often remains untapped. Generative AI helps unlock the potential of this unstructured data.
Transforming the Data Engineering Pipeline
Abhijit detailed the typical data engineering pipeline, which includes:
- Requirements Gathering: Involving business owners, product owners, data scientists, and engineers.
- Data Ingestion: Checking data quality, handling sensitive data, and partitioning data appropriately.
- ETL Processes: Extracting, transforming, and loading data, followed by curation and distribution.
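The ingestion concerns listed above (quality, sensitivity, partitioning) can be illustrated with a minimal Python sketch. It is not taken from the talk: the field names, the `SENSITIVE_FIELDS` set, the masking rule, and the monthly partition key are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical field names; a real pipeline would take these from a data catalog.
SENSITIVE_FIELDS = {"account_number", "client_name"}
REQUIRED_FIELDS = ("account_number", "amount", "booking_date")


def quality_issues(record):
    """Return the required fields that are missing or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def mask_sensitive(record):
    """Replace the values of sensitive fields with a fixed mask."""
    return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in record.items()}


def ingest(records):
    """Reject records that fail the quality check, mask the rest,
    and partition them by booking month (YYYY-MM)."""
    partitions = defaultdict(list)
    rejected = []
    for record in records:
        if quality_issues(record):
            rejected.append(record)
            continue
        partition_key = record["booking_date"][:7]  # assumes ISO dates like 2024-03-15
        partitions[partition_key].append(mask_sensitive(record))
    return dict(partitions), rejected
```

Calling `ingest(records)` returns a dictionary of month partitions plus the list of rejected records, which keeps the quality decision visible instead of silently dropping rows.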
Generative AI can significantly enhance these steps. Abhijit highlighted several use cases:
- Code Generation and Documentation: Generative AI can reverse-engineer legacy systems, generating updated code and documentation.
- Lineage Tracking: Ensuring accurate metadata and lineage tracking for regulatory compliance.
- Feature Store and Cataloging: Automating the creation and management of feature stores, facilitating efficient data analysis.
- Synthetic Data Generation: Creating diverse, high-quality data sets for comprehensive testing.
- NLP-Based Search: Implementing natural language processing for efficient data discovery.
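To make the synthetic data use case concrete, here is a small Python sketch of schema-driven test data generation with a validation pass. It is an illustration only: a seeded random generator stands in for the generative model, and the trade schema, the symbol names, and the value ranges are hypothetical rather than anything described in the talk.

```python
import random
from datetime import date, timedelta

# Hypothetical reference values for the illustrative trade schema.
SYMBOLS = ["SYM_A", "SYM_B", "SYM_C"]
SIDES = ["BUY", "SELL"]


def generate_trades(n, seed=0):
    """Generate n synthetic trade rows. The seeded RNG is a stand-in for a
    generative model and makes the output reproducible across test runs."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    rows = []
    for trade_id in range(1, n + 1):
        rows.append({
            "trade_id": trade_id,
            "symbol": rng.choice(SYMBOLS),
            "side": rng.choice(SIDES),
            "quantity": rng.randint(1, 1000),
            "price": round(rng.uniform(10.0, 500.0), 2),
            "trade_date": (start + timedelta(days=rng.randint(0, 89))).isoformat(),
        })
    return rows


def validate(rows):
    """Return a list of human-readable problems found in the generated rows."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["trade_id"] in seen_ids:
            problems.append(f"duplicate trade_id {row['trade_id']}")
        seen_ids.add(row["trade_id"])
        if row["symbol"] not in SYMBOLS:
            problems.append(f"unknown symbol {row['symbol']}")
        if row["side"] not in SIDES:
            problems.append(f"unknown side {row['side']}")
        if row["quantity"] <= 0 or row["price"] <= 0:
            problems.append(f"non-positive quantity or price in trade {row['trade_id']}")
    return problems
```

Whatever produces the rows, a validation step of this kind is worth keeping, because generated data that violates the schema would undermine the tests it is meant to support.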
Generative AI Implementation
Integrating generative AI into the data engineering pipeline involves several building blocks:
- Embedding Creation: Chunking data and creating embeddings.
- Vector Management: Storing and indexing vectors in a suitable database.
- Platform as a Service: Offering generative AI capabilities as a service within the organization.
- Custom Models: Training models on internal data sets to meet specific business needs.
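The first two items above (chunking, embedding, then storing and searching vectors) can be sketched in a few lines of Python. This is a toy under stated assumptions, not the setup described in the talk: `embed` is a hashed bag-of-words stand-in for a real embedding model, and `VectorStore` is a hypothetical in-memory class with brute-force search in place of a real vector database and index.

```python
import hashlib
import math

DIM = 64  # illustrative embedding size


def chunk_text(text, max_words=8):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text, dim=DIM):
    """Toy embedding: hash each token into a bucket, then L2-normalize.
    A real pipeline would call an embedding model here instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,;:!?")
        if not token:
            continue
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorStore:
    """Hypothetical in-memory store; search is a brute-force cosine scan."""

    def __init__(self):
        self._rows = []  # list of (chunk, vector) pairs

    def add(self, chunk):
        self._rows.append((chunk, embed(chunk)))

    def search(self, query, k=1):
        """Return the k best (score, chunk) pairs, highest score first."""
        q = embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

A typical flow is to call `chunk_text` on each document, `add` every chunk to the store, and then `search` with a natural-language query. Swapping in a real embedding model and an indexed vector database changes the internals but not this overall shape.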
Abhijit emphasized the need for a platform-as-a-service approach, enabling various teams to leverage generative AI without extensive customization. This approach ensures consistency and efficiency across the organization.
Regulatory Challenges in Financial Services
One critical aspect Abhijit addressed was the regulatory landscape in financial services. Generative AI models must be transparent and explainable to meet regulatory requirements. Closed models, which operate as black boxes, pose challenges in this highly regulated industry. Abhijit stressed the importance of using open models and developing custom models to ensure compliance.
Enhancing Efficiency and Innovation
Abhijit concluded by reiterating the transformative potential of generative AI in data engineering. By automating code generation, enhancing data lineage tracking, and creating synthetic data for testing, generative AI can significantly boost efficiency and innovation. However, successful implementation requires rethinking the traditional data engineering pipeline and adopting a flexible, platform-based approach.
Conclusion
Abhijit’s talk at the Data Engineering Summit 2024 provided a comprehensive overview of how generative AI can revolutionize data engineering. By leveraging generative AI’s capabilities, organizations can enhance efficiency, drive innovation, and unlock the full potential of their data. The financial services industry, with its complex data needs and regulatory requirements, stands to benefit immensely from these advancements, provided that careful attention is paid to data quality and compliance.