-

@ mleku
2025-05-15 21:47:35
so, apart from one missing bounds check, adding the e tag index has not caused any problems with the functioning of the relay. the way the searches were written, the serial is usually the last 8 bytes of the key, and the new index follows that pattern, so it slotted straight in
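a minimal sketch of that "serial is the last 8 bytes of the key" pattern, with the bounds check included. the names `serialFromKey` and `errShortKey`, the `e:` key prefix and the big-endian byte order are all assumptions for illustration, not the relay's actual code:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// serialLen is the width of the event serial at the tail of an index key.
const serialLen = 8

// errShortKey is a hypothetical error for keys too short to hold a serial.
var errShortKey = errors.New("index key shorter than 8 bytes")

// serialFromKey returns the trailing 8 bytes of key as a uint64.
// Big-endian encoding is an assumption made for this sketch.
func serialFromKey(key []byte) (uint64, error) {
	// The bounds check: without it, the slice expression below
	// panics on keys shorter than serialLen.
	if len(key) < serialLen {
		return 0, errShortKey
	}
	return binary.BigEndian.Uint64(key[len(key)-serialLen:]), nil
}

func main() {
	// A made-up key: some prefix followed by an 8-byte serial.
	key := append([]byte("e:"), 0, 0, 0, 0, 0, 0, 0, 42)
	serial, err := serialFromKey(key)
	fmt.Println(serial, err)
}
```

because every index puts the serial in the same place, one helper like this can serve all of them, which is why a new index following the pattern needs no special handling.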
i have now added use of this index (so yes, it is already speeding up searches for e tags in the standard filters, btw). then for tag filtering, if it doesn't find p or e tags it will fall back to the same mechanism already in there (which is needed a lot less often than those two, which matter for threading) and um...
so, what i have got already is this: because of what's in the fulltext indexes, i can very quickly reject events based on elements of the filter: authors, kinds, since and until
then to filter on tags, i have got the p tag index search: if it doesn't find the p tags associated with the serial from the fulltext index, it can skip that event. with the e tag index i can now also filter out events that don't have the e tag in the filter, and whatever remains after that is a match on one of the terms in the query
oh yes, and i have a language tag filter to add too. almost no events use that but they will in #alexandria, so it's necessary. it might even be better to filter on that one before p and e tags. then after p and e tags what remains is "any tag", and that can search for the serial matching the word index candidate and reject it if it doesn't find the tag associated with that event serial
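the rejection steps above could be sketched roughly like this. everything here is hypothetical: `Candidate`, `Filter`, `TagIndex` and the in-memory map standing in for the p/e/other tag indexes are illustration only, and the order in which tag names are checked (language first, then p and e, then the rest) is not modelled:

```go
package main

import "fmt"

// Candidate is a hypothetical view of one fulltext index hit, carrying
// the fields that allow rejection without loading the event.
type Candidate struct {
	Serial    uint64
	Author    string
	Kind      int
	CreatedAt int64
}

// Filter holds the query constraints; zero values mean "no constraint".
type Filter struct {
	Authors []string
	Kinds   []int
	Since   int64
	Until   int64
	Tags    map[string][]string // tag name -> accepted values
}

// TagIndex stands in for the tag indexes: "name:value" -> set of serials.
type TagIndex map[string]map[uint64]bool

// Has reports whether the event with this serial carries the tag.
func (ti TagIndex) Has(name, value string, serial uint64) bool {
	return ti[name+":"+value][serial]
}

func contains[T comparable](xs []T, x T) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// matchesMeta applies the cheap checks: authors, kinds, since, until.
func matchesMeta(c Candidate, f Filter) bool {
	if len(f.Authors) > 0 && !contains(f.Authors, c.Author) {
		return false
	}
	if len(f.Kinds) > 0 && !contains(f.Kinds, c.Kind) {
		return false
	}
	if f.Since != 0 && c.CreatedAt < f.Since {
		return false
	}
	if f.Until != 0 && c.CreatedAt > f.Until {
		return false
	}
	return true
}

// matchesTags requires, for each tag name in the filter, that at least
// one of the accepted values is indexed against the candidate's serial.
func matchesTags(c Candidate, f Filter, ti TagIndex) bool {
	for name, values := range f.Tags {
		found := false
		for _, v := range values {
			if ti.Has(name, v, c.Serial) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// filterCandidates keeps only candidates that pass both stages.
func filterCandidates(cs []Candidate, f Filter, ti TagIndex) []Candidate {
	var out []Candidate
	for _, c := range cs {
		if matchesMeta(c, f) && matchesTags(c, f, ti) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	ti := TagIndex{"e:root": {1: true, 2: true}}
	cs := []Candidate{
		{Serial: 1, Author: "alice", Kind: 1, CreatedAt: 100},
		{Serial: 2, Author: "bob", Kind: 1, CreatedAt: 100},
		{Serial: 3, Author: "alice", Kind: 1, CreatedAt: 100},
	}
	f := Filter{Authors: []string{"alice"}, Tags: map[string][]string{"e": {"root"}}}
	fmt.Println(filterCandidates(cs, f, ti))
}
```

the cheap metadata checks run first so that the tag index is only consulted for candidates that survive them.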
after that, it will sort the index entries it found to group them by serial, and then order the groups by number of words that match, more at the top, fewer and fewer towards the bottom
then within each set of groups that has the same number of matching words, sort by how many of the words appear sequentially in the same event: all of them first, then one less, and one less again, down to the bottom of each set with x matching words.
then check the limit, cut off the serials that extend beyond that number, and flatten the list down to just the serial of each group, et voilà, full text search
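the grouping, ranking and limit steps might look something like this. the `Hit` type and the definition of "sequential" used here (longest run of query terms found in query order at adjacent word positions) are assumptions made for the sketch, as is the tie-break on serial:

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical word index match: query term number Term was
// found at word position Pos in the event with the given Serial.
type Hit struct {
	Serial uint64
	Term   int
	Pos    int
}

type scored struct {
	serial  uint64
	matched int // distinct query terms found in the event
	run     int // longest run of query terms at adjacent positions
}

// longestRun finds the longest chain term, term+1, ... sitting at
// positions pos, pos+1, ... within one event's hits.
func longestRun(hits []Hit) int {
	at := make(map[[2]int]bool, len(hits))
	for _, h := range hits {
		at[[2]int{h.Term, h.Pos}] = true
	}
	best := 0
	for _, h := range hits {
		n := 1
		for at[[2]int{h.Term + n, h.Pos + n}] {
			n++
		}
		if n > best {
			best = n
		}
	}
	return best
}

// rank groups hits by serial, orders the groups by matched word count,
// then by sequential run length, applies the limit and returns serials.
func rank(hits []Hit, limit int) []uint64 {
	groups := make(map[uint64][]Hit)
	for _, h := range hits {
		groups[h.Serial] = append(groups[h.Serial], h)
	}
	results := make([]scored, 0, len(groups))
	for serial, g := range groups {
		terms := make(map[int]bool)
		for _, h := range g {
			terms[h.Term] = true
		}
		results = append(results, scored{serial, len(terms), longestRun(g)})
	}
	sort.Slice(results, func(i, j int) bool {
		a, b := results[i], results[j]
		if a.matched != b.matched {
			return a.matched > b.matched
		}
		if a.run != b.run {
			return a.run > b.run
		}
		return a.serial < b.serial // deterministic tie-break (assumption)
	})
	if limit > 0 && len(results) > limit {
		results = results[:limit]
	}
	out := make([]uint64, len(results))
	for i, r := range results {
		out[i] = r.serial
	}
	return out
}

func main() {
	hits := []Hit{
		{Serial: 2, Term: 0, Pos: 1}, {Serial: 2, Term: 1, Pos: 9}, {Serial: 2, Term: 2, Pos: 4},
		{Serial: 1, Term: 0, Pos: 5}, {Serial: 1, Term: 1, Pos: 6}, {Serial: 1, Term: 2, Pos: 7},
		{Serial: 3, Term: 0, Pos: 2}, {Serial: 3, Term: 1, Pos: 3},
		{Serial: 4, Term: 2, Pos: 0},
	}
	fmt.Println(rank(hits, 3)) // prints [1 2 3]
}
```

in the example, serials 1 and 2 both match all three words, but 1 has them in sequence so it ranks first; serial 4 matches only one word and falls past the limit of 3.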
so probably i will finish implementing this tomorrow or the next day, depending on how much work i have to do tomorrow, probably not much, just finishing up with integrating with my colleague's text similarity analysis engine
after that, it's the weekend anyway, and next week they want a release, so i'll just spend 4 hours a day working on optimizing the performance and refactoring the code (probably using interfaces) so it can be adapted to a different type of data set. the other project i'll be getting integrated with is a linkedin type job search/social site, and there will be a fair bit of similar matching in there, based mostly on people's resumes. that should be interesting. i'll be interrogating whoever is designing the data for that project, and it will probably be a little easier because a lot more of the data will be fairly rigidly predefined. they also have this idea in mind of seeking contracts with other projects that need a social matchmaking system as a value add that distinguishes them from existing competitors.