Exploring the New World of Connectors

Not all connectors are created equal: Glean's connector framework for enterprise search

Estimated read time: 1:20

Summary

In this session, James Simonsen, a founding engineer at Glean, discusses the development and benefits of Glean's connector framework for enterprise search. Glean's choice to build a centralized Google-like search engine as opposed to a federated search approach offers a superior user experience by crawling and understanding data directly. The process of building each connector is complex due to the unique nature of various data sources such as Slack and GitHub, and their permissions systems. Glean maintains scalability by continuously learning and adapting to changes across data sources, including through partnerships with companies like Databricks and DataStax. Additionally, Glean's Actions feature allows users to push insights back into applications like Slack and Jira, automating responses and enhancing efficiency.

Highlights

Glean's choice of a Google-like centralized search engine over federated search improves user experience. 📈
Unique complexities of each data source require specific attention in connector development. 🌐
The scalability of Glean's framework relies on adaptable learning from diverse data sources. 💡
Glean partners with companies like Databricks to provide robust connector ecosystems. 🔗
Through Glean Actions, users can automate tasks such as answering Slack queries efficiently. 🤖

Key Takeaways

Glean chose a centralized Google-like search architecture over federated search for a better user experience. 🕵️‍♂️
Each data connector represents a unique world, requiring deep understanding and respect for its permissions system. 🔍
Glean's adaptation to evolving data sources offers scalable search solutions across diverse industries. 🌍
Partner and custom connector ecosystems enhance Glean's integration capabilities. 🤝
Glean Actions facilitate data insights being pushed back into applications, enhancing workflow efficiency. 🚀

Overview

When Glean embarked on its journey to enhance enterprise search, it faced a pivotal choice: Google-like centralized search or federated search. Ultimately, Glean opted for the former, prioritizing a user experience built on crawling, understanding, and processing the data directly. This decision allows Glean to provide fast, comprehensive search results, setting it apart in the enterprise search market.
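The trade-off can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the backend names, the simulated latencies, and the `federated_search` and `CentralIndex` helpers are illustrative assumptions, not Glean's implementation. It shows that a federated query has to wait for its slowest backend, whereas a centralized index is built ahead of time by crawling and is then served from one place.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical backends: name -> (simulated latency in seconds, documents).
BACKENDS = {
    "drive": (0.01, ["Q3 roadmap doc", "Onboarding guide"]),
    "github": (0.05, ["Roadmap tracking issue"]),
    "wiki": (0.20, ["Roadmap FAQ"]),
}

def search_backend(name, query):
    """Simulate calling one data source's own search API."""
    latency, docs = BACKENDS[name]
    time.sleep(latency)
    return [d for d in docs if query.lower() in d.lower()]

def federated_search(query):
    """Fan out to every backend; total time is bounded by the slowest one."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda n: search_backend(n, query), BACKENDS)
    return [doc for hits in results for doc in hits]

class CentralIndex:
    """Crawl once ahead of time, then serve queries from a local inverted index."""

    def __init__(self):
        self.postings = {}  # token -> set of documents containing it

    def crawl(self):
        for _, docs in BACKENDS.values():
            for doc in docs:
                for token in doc.lower().split():
                    self.postings.setdefault(token, set()).add(doc)

    def search(self, query):
        return sorted(self.postings.get(query.lower(), set()))
```

Both paths return the same documents for a query such as "roadmap", but the federated call cannot finish before the slowest backend responds, while the index lookup does no network work at query time.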

Building each connector from the ground up poses a challenge due to the specialized nature of diverse data sources. From handling Slack's varied communication channels to managing GitHub repositories, each connector requires careful consideration of permissions and data structures. This meticulous approach ensures security and accuracy in data retrieval, vital for enterprise functionality.
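One way to picture the permissions problem is a check that walks nested group memberships and honors an explicit deny list, as in the "member of this group, but not that group" case mentioned in the session. The sketch below is a simplified assumption of how such a check could look; the `Acl` type, the group names, and the `expand` and `can_view` helpers are hypothetical and do not reflect any particular data source's model or Glean's actual code.

```python
from dataclasses import dataclass, field

# Hypothetical group graph: group -> direct members (users or other groups).
GROUP_MEMBERS = {
    "finance": {"alice", "payroll"},
    "payroll": {"bob"},
    "contractors": {"bob"},
    "all-staff": {"finance", "carol"},
}

@dataclass
class Acl:
    """Illustrative ACL: principals that may view, and groups that may not."""
    allowed: set = field(default_factory=set)
    denied_groups: set = field(default_factory=set)

def expand(principal, seen=None):
    """Return every user reachable from a user or a (possibly nested) group."""
    seen = set() if seen is None else seen
    if principal in seen:  # guard against cyclic group definitions
        return set()
    seen.add(principal)
    if principal not in GROUP_MEMBERS:
        return {principal}  # a plain user
    users = set()
    for member in GROUP_MEMBERS[principal]:
        users |= expand(member, seen)
    return users

def can_view(user, acl):
    """Deny wins over allow; otherwise any allowed principal that contains the user grants access."""
    if any(user in expand(group) for group in acl.denied_groups):
        return False
    return any(user in expand(principal) for principal in acl.allowed)

# Example: a salary spreadsheet visible to finance, but never to contractors.
salary_sheet = Acl(allowed={"finance"}, denied_groups={"contractors"})
```

Here `can_view("alice", salary_sheet)` is granted through the finance group, while bob is refused because his contractor membership overrides the access he would otherwise inherit through payroll.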

Glean stays scalable by applying what it learns from one data source or customer across all of them, and by updating its connectors as the underlying data sources change. Partnerships with industry players like Databricks and DataStax further extend Glean's reach, allowing integration with a wider range of systems. Additionally, Glean Actions push information back out into applications such as Slack and Jira, so businesses can act on search data within their existing workflows.
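The session describes a shared framework that takes care of common concerns such as error handling and rate limiting, leaving each connector to supply only what is unique to its data source. A minimal sketch of that division of labor, under stated assumptions, might look like the following; the `Connector` base class, the `RateLimited` exception, and the `ChatConnector` example are all hypothetical names chosen for illustration, not Glean's API.

```python
import time
from abc import ABC, abstractmethod

class RateLimited(Exception):
    """Raised by a connector when the data source asks the crawler to slow down."""

class Connector(ABC):
    """Hypothetical base class: shared crawl loop, source-specific fetch."""

    max_retries = 3
    base_delay = 0.01  # seconds; kept tiny for illustration

    @abstractmethod
    def fetch_page(self, cursor):
        """Return (documents, next_cursor); next_cursor is None when done."""

    def crawl(self):
        """Common logic: pagination plus exponential backoff on rate limits."""
        docs, cursor = [], 0
        while cursor is not None:
            for attempt in range(self.max_retries + 1):
                try:
                    page, cursor = self.fetch_page(cursor)
                    break
                except RateLimited:
                    if attempt == self.max_retries:
                        raise  # give up after the final retry
                    time.sleep(self.base_delay * 2 ** attempt)
            docs.extend(page)
        return docs

class ChatConnector(Connector):
    """Illustrative chat source that rate-limits the first request for page 1."""

    def __init__(self):
        self.pages = [["#general: hello"], ["DM: lunch?"]]
        self.throttled_once = False

    def fetch_page(self, cursor):
        if cursor == 1 and not self.throttled_once:
            self.throttled_once = True
            raise RateLimited()
        next_cursor = cursor + 1 if cursor + 1 < len(self.pages) else None
        return self.pages[cursor], next_cursor
```

Calling `ChatConnector().crawl()` retries the throttled page once and still returns both pages. A connector for another category, such as email or source control, would only need its own `fetch_page`.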

Chapters

00:00 - 00:30: Introduction to Glean's Connector Framework In this chapter, James Simonsen, a founding engineer at Glean, introduces himself and his role leading the effort to connect Glean to the data sources it supports. The host frames the session's topic: why not all connectors are created equal, and how Glean built its connector framework for Work AI.
00:30 - 01:00: Decision on Search Model: Federated vs Centralized The chapter discusses the decision-making process Glean went through when choosing between a federated search model and a centralized search model. Initially, federated search was tempting due to existing search APIs in various data sources like Google Drive and GitHub. However, ultimately, Glean decided to adopt the Google-like centralized search model.
01:00 - 01:30: Challenges with Federated Search The chapter explains why federated search leads to a poor user experience: result quality is only as good as the weakest underlying search system, and latency is similarly dictated by the slowest one. It contrasts this with the approach used by search engines like Google, which crawl, understand, process, and serve the data themselves to provide the best possible search experience.
01:30 - 02:00: Building a Connector Architecture The chapter discusses how Glean approached its connector architecture, which now spans over 100 connectors. Each connector is treated as its own unique world, and building it requires understanding what makes that data source special. Slack, for example, has private channels, public channels, DMs, mentions, and attachments.
02:00 - 02:30: Handling Different Data Sources and Unique Connectors The chapter discusses the complexity involved in handling various data sources within a system. Each data source, such as Jira, GitHub, or Figma, presents unique challenges due to its distinct nature and purpose. This necessitates a deep understanding of each platform's unique features and how they can be integrated into a unified search architecture. The discussion also highlights the need to balance leveraging common architectural frameworks with accommodating the distinct characteristics of each data source. Additionally, when developing connectors for these different platforms, the varying permissions models associated with each must be considered, adding another layer of complexity to the integration process.
02:30 - 03:00: Managing Permissions in Connectors The chapter discusses how Glean manages permissions within connectors to prevent unauthorized access, exemplified by ensuring one does not accidentally access a sensitive document like a salary spreadsheet. It highlights the critical nature of respecting permission systems in the enterprise environment. The unique data models and diverse content types within each system require a distinct approach to permissions management.
03:00 - 03:30: Complexity of Permission Systems The chapter covers how permission systems vary across data sources: some use ACLs, some behave like file systems where access depends on the folder or shared drive, some rely on hierarchies of group memberships, and some support link sharing. Keeping track of all of this, and of every change to it, is portrayed as a challenging task.
03:30 - 04:00: Propagating Permission Changes The chapter describes the 'butterfly effect', where one small change can ripple across a thousand or a million documents, and the importance of reflecting such changes quickly. Glean understands each permission scheme and the change taking place, so in most situations it can propagate the change almost immediately. The chapter ends with the question of how designing a unique connector for every data source can scale.
04:00 - 04:30: Scalability Through a Common Framework In this chapter, the focus is on scalability. The speaker describes one of Glean's 'secret sauces': having tackled a diverse range of data sources across many customers, Glean can take what it learns from one and apply it to all. A shared framework handles the common situations, and common elements are factored out for categories such as chat, email, and source control data sources.
04:30 - 05:00: Evolving with Changing Data Sources The chapter notes that the framework also shares error handling, rate limiting, and scaling, and that lessons learned with one customer are applied to everyone. It emphasizes that data sources are not static, illustrated by recent changes in platforms like Google Drive, Slack, and Zendesk, so connectors have to keep evolving to stay effective. The chapter then turns from Glean's native connectors to its partner connectors.
05:00 - 06:00: Glean's Native and Partner Connectors The chapter outlines the wider connector ecosystem: custom data sources that push data from on-prem or internal systems into Glean, partner-built connectors from SaaS vendors, and partnerships such as the Databricks Genie integration and DataStax Langflow, which can use Glean as one of its sources.
06:00 - 07:00: Integrating Insights Back into Applications: Glean Actions The chapter introduces Glean Actions as the way to push information back out from Glean. Examples include summarizing an outage and creating a postmortem document, and adding the Glean Assistant to Slack channels so it can answer questions automatically.
07:00 - 07:30: Conclusion The session closes with a customer example, a leading mobile carrier using Glean to power its support portal so that agents can answer questions in one place, followed by the host's thanks.

Not all connectors are created equal: Glean's connector framework for enterprise search Transcription

00:00 - 00:30 - We're here for our series on Work AI, and I'm joined by James, a member of the Glean Engineering team. James, can you go ahead and introduce yourself? - I'm James Simonsen. I'm one of the founding engineers here at Glean, and I have been leading our efforts when it comes to connecting to all the different data sources that we support. - Thanks for coming today. So today we're talking about connectors and why not all connectors are created equal and how we built Glean's connector framework for Work AI.
00:30 - 01:00 So when Glean was getting started, we faced a choice. We could go down a route of federated search, or we could build Google-like search with a centralized index. Which route did Glean decide to go down? - So ultimately we decided to go down the Google route, but initially it was very tempting to kind of consider the federated search route. All these data sources already support a search API. So if you look at Google Drive or you look at GitHub, they already have a search endpoint. You can call that, tie all those together,
01:00 - 01:30 and then you would be done; that easy. And the problem is, it just doesn't lead to a good user experience for a number of reasons. So one being that the search results are only as good as the least common denominator there. Same thing with the latency. So one might be very fast, and then another might be very slow. And so you're ending up with this not very good experience. And we know, coming from Google, that the way you build like a really good search engine is like you have to crawl the data. You have to understand it. You have to process it. You serve it yourself. And that way you get the best of all possible worlds there.
01:30 - 02:00 So that's the way we ultimately decided to go. - And I know that Glean has, you know, over 100 different connectors. So, you know, how did it approach building this connector architecture from the ground up? - The first thing you have to recognize is just like each one of these connectors is like its own unique world, and you really have to understand what makes each one of those special. So if you look at something like say Slack, you've got private channels, you've got public channels, you've got DMs, you've got mentions, you've got attachments.
02:00 - 02:30 There's just all this complexity, and that's just like one data source, right? And now you have to repeat that for all the other ones. So now how do you handle Jira? How do you handle source control like GitHub? How do you handle something like Figma where it's like mostly like a visual thing, right? So each one of these is kind of like unique, and you have to understand like what makes all those unique and then be able to represent that in a way where you can both take advantage of your common search architecture, but then also, you know, get everything that's like unique about each one of those. - When you start to build out these connectors, what you also realize is they each have their own different permissions model.
02:30 - 03:00 So how does Glean work within these different permissions and ensure that one person doesn't accidentally get access to the salary spreadsheet at the company? - So that's one of the things that's really critical when it comes to the enterprise is like you just have to respect the permission system. Like that's just fundamental to doing things in the enterprise world. And so the same way that each one is kind of unique in the way that they model their data and, you know, have different types of content, it's not just that. Each one will have the same thing
03:00 - 03:30 when it comes to permissions too. So you might have one that uses ACLs. You might have another where it's like a file system, and so your permissions depend on which folder you're in or which shared drive you're in. There can be hierarchies of groups where it's like you need to be a member of this group which is a member of this group, and then, but you can't be a member of this group. And then there's things like link sharing where not only do you have to know that there's all these links that refer to this document, but then who has seen those links so that you can grant them access to it. So it's very complicated to keep track of all that stuff. And then, anytime something changes,
03:30 - 04:00 you have to be able to like reflect that throughout your system too, right? So it's kind of like this butterfly effect where, you know, maybe one little change happens, but then that ripples to a thousand documents or a million documents. And so how do you get that to reflect really quickly? So we have all these different strategies that we use where we understand the permission scheme. We understand the change that's taking place. We're able to propagate that for most situations. We can propagate that almost immediately. - When I think about all of these different connectors, you're building and designing unique ones for every single data source.
04:00 - 04:30 How is that scalable? - So that's actually one of our, I feel, like one of our secret sauces is because we've had to tackle so many different data sources across so many different customers, we're able to take those learnings from one and like apply them to all, right? So we've built this really powerful framework that drives everything, takes care of like all the common situations. So, as I mentioned, there's all these kind of specific categories of data sources. So there'll be like chat data sources, email data sources, source control data sources. And we can pull out kind of the common bits there.
04:30 - 05:00 Same thing with like error-handling and rate-limiting and scaling, all those things. So we learn one thing in one customer, and then we apply that to everybody. And these data sources though, they're not a static thing. It's not like you write one crawl, and then, "Oh right, we're done." They change over time. So just in the last week, we've seen Google Drive introduce tabs and then we've seen Slack introduce canvases, and we've seen Zendesk introduce new permissions models. So you have to keep evolving in order to kind of stay on top of these things. - We've discussed Glean's native connectors,
05:00 - 05:30 but we also have partner connectors as well. Can you discuss more about the larger Glean connector ecosystem? - There's a number of different ways that you can integrate with Glean, starting with just like our custom data sources where if you just have like your own on-prem say ticketing system, or you have your intranet portal, we can push that data into Glean, and then it can show up in the search results just like any other data source that we natively support. We also have partner-built connectors. So those will be things where companies,
05:30 - 06:00 SaaS vendors will already have a pre-built connector for Glean. So we have examples like Guide, Outline who have built those, and we've been expanding into more partnerships as well. So we have a partnership with Databricks to integrate with their Genie. So there you're getting the best of both worlds, their business intelligence software combined with Glean's understanding and corporate knowledge. And then there's DataStax Langflow where they have like a workflow engine, and then they can use Glean as one of the sources. So now you have access to all the different data sources that Glean connects to,
06:00 - 06:30 and you can integrate those into your workflows. - So, so far we've talked about connectors in terms of bringing data into Glean, but there's also sending insights or triggers back. So if I think about, you know, Slack, if there's a question, how can Glean step in and answer that question? Or Jira, how can we help with IT ticket resolution back in the applications where people are already working? So can you share a little bit more about Glean Actions? - Actions is the way that you are able to push information back out from Glean.
06:30 - 07:00 Examples of that would be summarize an outage and then generate a postmortem document and actually create that document using the information that you've collected from Glean itself. Then you're able to not just pull information to Glean, but then push it back out. And then you can build more-sophisticated things. So you can automatically do all this. So we have like the Glean Assistant, which you can add to your Slack channels, and then say you have like a support channel, it can automatically answer questions. And so now you don't need a person there to do that stuff, and you can have Glean just post itself. And then we've seen customers do this too.
07:00 - 07:30 So we have like a leading mobile carrier who is using Glean to power their support portal. Support agents are able to use the single portal. They have access to all their information. Everything's integrated together. So they answer questions all in one spot without having to leave their servers. - Thank you so much James for joining me today and talking about connectors. I appreciate the time.