Not all connectors are created equal: Glean's connector framework for enterprise search
Summary
In this session, James Simonsen, a founding engineer at Glean, discusses the development and benefits of Glean's connector framework for enterprise search. Glean's choice to build a centralized Google-like search engine as opposed to a federated search approach offers a superior user experience by crawling and understanding data directly. The process of building each connector is complex due to the unique nature of various data sources such as Slack and GitHub, and their permissions systems. Glean maintains scalability by continuously learning and adapting to changes across data sources, including through partnerships with companies like Databricks and DataStax. Additionally, Glean's Actions feature allows users to push insights back into applications like Slack and Jira, automating responses and enhancing efficiency.
Highlights
Glean's decision for a Google-like search engine over federated search improves user experience.
Unique complexities of each data source require specific attention in connector development.
The scalability of Glean's framework relies on adaptable learning from diverse data sources.
Glean partners with companies like Databricks to provide robust connector ecosystems.
Through Glean Actions, users can automate tasks such as answering Slack queries efficiently.
Key Takeaways
Glean chose a centralized Google-like search architecture over federated search for a better user experience.
Each data connector represents a unique world, requiring deep understanding and respect for its permissions system.
Glean's adaptation to evolving data sources offers scalable search solutions across diverse industries.
Partner and custom connector ecosystems enhance Glean's integration capabilities.
Glean Actions facilitate data insights being pushed back into applications, enhancing workflow efficiency.
Overview
When Glean embarked on its journey to enhance enterprise search, it faced a pivotal choice: Google-like centralized search or federated search. Ultimately, Glean opted for the former, prioritizing a user experience that draws from deep data understanding and processing. This decision allows Glean to provide fast, comprehensive search results, setting it apart in the enterprise search market.
Building each connector from the ground up poses a challenge due to the specialized nature of diverse data sources. From handling Slack's varied communication channels to managing GitHub repositories, each connector requires careful consideration of permissions and data structures. This meticulous approach ensures security and accuracy in data retrieval, vital for enterprise functionality.
Glean's ecosystem thrives through its continuous learning and adaptability to dynamic data sources, maintaining its relevance and scalability. Partnerships with industry players like Databricks further extend Glean's reach, allowing seamless integration within varied systems. Additionally, Glean Actions enable proactive data interactions, enhancing how businesses utilize search data to optimize their workflow.
Chapters
00:00 - 00:30: Introduction to Glean's Connector Framework In this chapter, James Simonsen, a founding engineer at Glean, introduces the connector framework of Glean's AI. He discusses his role in leading efforts to connect various data sources. The chapter delves into understanding the differences between connectors, emphasizing that not all connectors are created equal, and explains the construction and purpose of Glean's connector framework for Work AI.
00:30 - 01:00: Decision on Search Model: Federated vs Centralized The chapter discusses the decision-making process Glean went through when choosing between a federated search model and a centralized search model. Initially, federated search was tempting due to existing search APIs in various data sources like Google Drive and GitHub. However, ultimately, Glean decided to adopt the Google-like centralized search model.
01:00 - 01:30: Challenges with Federated Search The chapter titled 'Challenges with Federated Search' discusses the issues arising from federated search systems. It highlights the problem with user experience due to inconsistency in search results and latency. The chapter emphasizes that the effectiveness of search results in federated systems is limited by the weakest system component. Additionally, it contrasts this with the approach used by major search engines like Google, which involves directly crawling, understanding, and processing data to provide the best possible search experience.
01:30 - 02:00: Building a Connector Architecture The chapter discusses the development of a connector architecture at Glean, which involves creating over 100 unique connectors. Each connector is considered a unique world, requiring an understanding of what makes each special. The example of Slack's connectors includes handling private channels, public channels, DMs, mentions, and attachments.
02:00 - 02:30: Handling Different Data Sources and Unique Connectors The chapter discusses the complexity involved in handling various data sources within a system. Each data source, such as Jira, GitHub, or Figma, presents unique challenges due to its distinct nature and purpose. This necessitates a deep understanding of each platform's unique features and how they can be integrated into a unified search architecture. The discussion also highlights the need to balance leveraging common architectural frameworks with accommodating the distinct characteristics of each data source. Additionally, when developing connectors for these different platforms, the varying permissions models associated with each must be considered, adding another layer of complexity to the integration process.
02:30 - 03:00: Managing Permissions in Connectors The chapter discusses how Glean manages permissions within connectors to prevent unauthorized access, exemplified by ensuring one does not accidentally access a sensitive document like a salary spreadsheet. It highlights the critical nature of respecting permission systems in the enterprise environment. The unique data models and diverse content types within each system require a distinct approach to permissions management.
03:00 - 04:00: Scalability Through a Common Framework The chapter discusses the complexity and variability of permission systems across different platforms and frameworks. It highlights differences like ACLs, file system permissions based on directory locations, hierarchical group memberships, and link sharing complexities. Managing these varied permission systems and tracking changes is portrayed as a challenging task.
04:00 - 05:00: Glean's Native and Partner Connectors The chapter delves into the complexity of integrating various data connectors, both native and partner, into a system. It discusses the 'butterfly effect' where a minor change can impact a vast number of documents, and the importance of quickly reflecting such changes in the system. The text highlights the strategies used to understand and propagate changes within permission schemes efficiently, often almost immediately. It also touches on the process of designing unique connectors for each data source.
05:00 - 06:30: Integrating Insights Back into Applications: Glean Actions In this chapter, the focus is on the scalability of integrating insights back into applications using a framework called Glean Actions. The speaker reveals that one of their key strengths, referred to as 'secret sauces', is their ability to tackle a diverse range of data sources across various customers. This experience allows them to generalize insights from one dataset and apply them across others, enhancing scalability and efficiency.
06:30 - 07:30: Conclusion The conclusion focuses on the ongoing nature of handling data sources, emphasizing the importance of continuous learning and adaptation. The speaker highlights that lessons learned from one customer are applied to others, signifying a collective improvement process. They mention the dynamic nature of data sources, which constantly change and require updates, illustrated by recent changes in platforms like Google Drive, Slack, and Zendesk. The chapter underlines the necessity of evolving with these changes to maintain effectiveness, and briefly mentions the discussion of Glean's native connectors.
Not all connectors are created equal: Glean's connector framework for enterprise search Transcription
00:00 - 00:30 - We're here for our series on Work AI, and I'm joined by James, a member of the Glean Engineering team. James, can you go ahead
and introduce yourself? - I'm James Simonsen. I'm one of the founding
engineers here at Glean, and I have been leading our efforts when it comes to connecting to all the different data
sources that we support. - Thanks for coming today. So today we're talking about connectors and why not all connectors
are created equal and how we built Glean's
connector framework for Work AI.
00:30 - 01:00 So when Glean was getting
started, we faced a choice. We could go down a route
of federated search, or we could build Google-like search with a centralized index. Which route did Glean decide to go down? - So ultimately we decided
to go down the Google route, but initially it was very tempting to kind of consider the
federated search route. All these data sources
already support a search API. So if you look at Google
Drive or you look at GitHub, they already have a search endpoint. You can call that, tie all those together,
01:00 - 01:30 and then you would be done; that easy. And the problem is, it just doesn't lead
to a good user experience for a number of reasons. So one being that the search results are only as good as the least
common denominator there. Same thing with the latency. So one might be very fast, and then another might be very slow. And so you're ending up with
this not very good experience. And we know, coming from Google, that the way you build like
a really good search engine is like you have to crawl the data. You have to understand it. You have to process it. You serve it yourself. And that way you get the best
of all possible worlds there.
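The latency point above can be illustrated with a minimal sketch. It is not Glean's code: the source names, latency values, and function names are all assumptions chosen for illustration, and `asyncio.sleep` stands in for a real call to each source's search API. It shows that a federated query that waits for every source takes roughly as long as the slowest one.

```python
import asyncio
import time

# Hypothetical per-source search latencies in seconds (illustrative values only).
SOURCE_LATENCY = {"drive": 0.05, "github": 0.10, "wiki": 0.30}


async def search_source(name: str, query: str) -> list[str]:
    """Stand-in for calling one data source's own search endpoint."""
    await asyncio.sleep(SOURCE_LATENCY[name])
    return [f"{name}: result for {query!r}"]


async def federated_search(query: str) -> tuple[list[str], float]:
    """Fan the query out to every source and wait for all of them."""
    start = time.perf_counter()
    per_source = await asyncio.gather(
        *(search_source(name, query) for name in SOURCE_LATENCY)
    )
    elapsed = time.perf_counter() - start
    # Results are simply concatenated: each source ranked its own hits,
    # so there is no shared signal for ordering them against each other.
    results = [hit for hits in per_source for hit in hits]
    return results, elapsed


def run_demo(query: str = "onboarding") -> tuple[list[str], float]:
    return asyncio.run(federated_search(query))


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results)
    print(f"elapsed: {elapsed:.2f}s, slowest source: {max(SOURCE_LATENCY.values()):.2f}s")
```

Even with the calls issued concurrently, the response cannot come back before the slowest source answers, which is the user-experience problem described here.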
01:30 - 02:00 So that's the way we
ultimately decided to go. - And I know that Glean has, you know, over 100 different connectors. So, you know, how did it approach building this connector architecture from the ground up? - The first thing you have to recognize is just like each one of these connectors is like its own unique world, and you really have to understand what makes each one of those special. So if you look at
something like say Slack, you've got private channels, you've got public channels, you've got DMs, you've got
mentions, you've got attachments.
02:00 - 02:30 There's just all this complexity, and that's just like
one data source, right? And now you have to repeat
that for all the other ones. So now how do you handle Jira? How do you handle source
control like GitHub? How do you handle something like Figma where it's like mostly
like a visual thing, right? So each one of these
is kind of like unique, and you have to understand like
what makes all those unique and then be able to
represent that in a way where you can both take advantage of your common search architecture, but then also, you know, get everything that's like unique
about each one of those. - When you start to build
out these connectors, what you also realize is they each have their own
different permissions model.
02:30 - 03:00 So how does Glean work within
these different permissions and assure that one person
doesn't accidentally get access to the salary spreadsheet at the company? - So that's one of the
things that's really critical when it comes to the enterprise is like you just have to
respect the permission system. Like that's just
fundamental to doing things in the enterprise world. And so the same way that
each one is kind of unique in the way that they model their data and, you know, have
different types of content, it's not just that. Each one will have the same thing
03:00 - 03:30 when it comes to permissions too. So you might have one that uses ACLs. You might have another where
it's like a file system, and so your permissions depend
on which folder you're in or which shared drive you're in. There can be hierarchies of groups where it's like you need to
be a member of this group which is a member of this group, and then, but you can't
be a member of this group. And then there's things like link sharing where not only do you have to know that there's all these links
that refer to this document, but then who has seen those links so that you can grant them access to it. So it's very complicated to
keep track of all that stuff. And then, anytime something changes,
03:30 - 04:00 you have to be able to like reflect that throughout your system too, right? So it's kind of like this butterfly effect where, you know, maybe
one little change happens, but then that ripples
to a thousand documents or a million documents. And so how do you get that
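The group hierarchies and the ripple effect described here can be sketched in a few lines. This is an illustrative model only, not Glean's permission system: the group names, the `group:` prefix convention, the ACL layout, and every function name are assumptions. It resolves nested group membership, applies a deny group, and lists which documents a single group change would touch.

```python
# Hypothetical model: a group's members are users or other groups ("group:<name>").
GROUP_MEMBERS = {
    "eng": {"alice", "group:backend"},
    "backend": {"bob", "carol"},
    "contractors": {"carol"},
}

# Hypothetical per-document ACLs with allow and deny groups.
DOC_ACLS = {
    "design-doc": {"allow": {"eng"}, "deny": {"contractors"}},
    "runbook": {"allow": {"backend"}, "deny": set()},
}


def nested_groups(group: str) -> set[str]:
    """Return the group itself plus every group nested inside it."""
    found, stack = set(), [group]
    while stack:
        current = stack.pop()
        if current in found:  # also guards against membership cycles
            continue
        found.add(current)
        for member in GROUP_MEMBERS.get(current, set()):
            if member.startswith("group:"):
                stack.append(member[len("group:"):])
    return found


def expand_users(group: str) -> set[str]:
    """Return every user who belongs to the group, directly or via nesting."""
    return {
        member
        for g in nested_groups(group)
        for member in GROUP_MEMBERS.get(g, set())
        if not member.startswith("group:")
    }


def can_view(user: str, doc: str) -> bool:
    """A user needs an allow-group membership and no deny-group membership."""
    acl = DOC_ACLS[doc]
    allowed = set().union(*(expand_users(g) for g in acl["allow"]))
    denied = set().union(*(expand_users(g) for g in acl["deny"]))
    return user in allowed and user not in denied


def docs_affected_by(changed_group: str) -> set[str]:
    """Documents whose effective access may change when one group changes."""
    return {
        doc
        for doc, acl in DOC_ACLS.items()
        if any(changed_group in nested_groups(g) for g in acl["allow"] | acl["deny"])
    }
```

In this toy data, `docs_affected_by("backend")` returns both documents even though only one ACL names `backend` directly. A real index would hold far more documents per group, which is why a single membership change can fan out so widely.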
to reflect really quickly? So we have all these different
strategies that we use where we understand the permission scheme. We understand the change
that's taking place. We're able to propagate
that for most situations. We can propagate that almost immediately. - When I think about all of
these different connectors, you're building and designing unique ones for every single data source.
04:00 - 04:30 How is that scalable? - So that's actually one of our, I feel, like one of our secret sauces is because we've had to tackle so many different data sources across so many different customers, we're able to take
those learnings from one and like apply them to all, right? So we've built this
really powerful framework that drives everything, takes care of like all
the common situations. So, as I mentioned, there's all these kind
of specific categories of data sources. So there'll be like chat data sources, email data sources, source
control data sources. And we can pull out kind
of the common bits there.
04:30 - 05:00 Same thing with like
error-handling and rate-limiting and scaling, all those things. So we learn one thing in one customer, and then we apply that to everybody. And these data sources though,
they're not a static thing. It's not like you write one crawl, and then, "Oh right, we're done." They change over time. So just in the last week, we've seen Google Drive introduce tabs and then we've seen
Slack introduce canvases, and we've seen Zendesk introduce
new permissions models. So you have to keep evolving in order to kind of stay
on top of these things. - We've discussed Glean's native connectors,
05:00 - 05:30 but we also have partner
connectors as well. Can you discuss more about the larger Glean
connector ecosystem? - There's a number of different ways that you can integrate with Glean, starting with just like
our custom data sources where if you just have like your own on-prem, say, ticketing system, or you have your intranet portal, we can push that data into Glean, and then it can show up
in the search results just like any other data source
that we natively support. We also have partner-built connectors. So those will be things where companies,
05:30 - 06:00 SaaS vendors will already have a pre-built connector for Glean. So we have examples like Guide, Outline who have built those, and we've been expanding into
more partnerships as well. So we have a partnership with Databricks to integrate with their Genie. So there you're getting
the best of both worlds, their business intelligence software combined with Glean's understanding and corporate knowledge. And then there's DataStax Langflow where they have like a workflow engine, and then they can use Glean
as one of the sources. So now you have access to all
the different data sources that Glean connects to,
06:00 - 06:30 and you can integrate
those into your workflows. - So, so far we've talked about connectors in terms of bringing data into Glean, but there's also sending
insights or triggers back. So if I think about, you know, Slack, if there's a question, how can Glean step in
and answer that question? Or Jira, how can we help
with IT ticket resolution back in the applications where people are already working? So can you share a little
bit more about Glean Actions? - Actions is the way that you
are able to push information back out from Glean.
06:30 - 07:00 Examples of that would
be summarize an outage and then generate a postmortem document and actually create that document using the information
that you've collected from Glean itself. Then you're able to not just
pull information to Glean, but then push it back out. And then you can build
more-sophisticated things. So you can automatically do all this. So we have like the Glean Assistant, which you can add to your Slack channels, and then say you have
like a support channel, it can automatically answer questions. And so now you don't need a
person there to do that stuff, and you can have Glean just post itself. And then we've seen customers do this too.
07:00 - 07:30 So we have like a leading mobile carrier who is using Glean to
power their support portal. Support agents are able
to use the single portal. They have access to all their information. Everything's integrated together. So they answer questions all in one spot without having to leave their servers. - Thank you so much James
for joining me today and talking about connectors. I appreciate the time.