[FEATURE] Indexed Search: add mySQL fulltext support
[Packages/TYPO3.CMS.git] / typo3 / sysext / indexed_search / pi / considerations.txt
1 - Search is always case insensitive. If you need a case sensitive search, use a binary collation for the index_fulltext and index_words tables.
2
3
4 MAILS about:
5
6
7
8
9
10
11 Message-ID: <200203020316380679.03EE30B9@smtp.worldonline.dk>
12 X-Mailer: Calypso Version 3.30.00.00 (4)
13 Date: Sat, 02 Mar 2002 03:16:38 +0100
14 From: "Kasper Skårhøj" <kasper@typo3.com>
15 To: typo3feature
16 Subject: Dev-help: Any SQL-wizards?
17 Mime-Version: 1.0
18 Content-Type: text/plain; charset="ISO-8859-1"
19
20
21 If you are an SQL wizard, you may be able to help me here.
22
23 In the (coming) index searching thing, I have three main tables.
24
25 - index_words which contains all the words indexed
26 - index_pages which represents a link to a page id or external url
27 - index_rel which links the two tables together.
28
29
30 So searching an OR search for "content" and "management" could be done like this:
31
32
33 SELECT STRAIGHT_JOIN [some fields here...] FROM
34 index_words AS IW,
35 index_rel AS IR,
36 index_phash AS IP
37 WHERE
38 IR.phash = IP.phash AND
39 IW.wid=IR.wid AND
40 (IW.baseword = 'content' OR IW.baseword = 'management')
41 [... and here comes some GROUP BY, ORDER BY and LIMIT]
42
43
44
45 This is not a problem. Actually it works very well apparently. However performing an AND search is much harder. Because you cannot just switch OR with AND (because "baseword" cannot be two things at the same time... :-)
46
47
48
49 Therefore I tought of a little trick to do it:
50
51 SELECT STRAIGHT_JOIN [some fields here...] FROM
52 index_words AS IW,
53 index_rel AS IR,
54 index_words AS IW2,
55 index_rel AS IR2,
56 index_phash AS IP
57 WHERE
58 IW.wid=IR.wid AND
59 IW2.wid=IR2.wid AND
60 IR.phash = IP.phash AND
61 IR2.phash = IP.phash AND
62 (IW.baseword = 'content' and IW2.baseword = 'management')
63 [... and here comes some GROUP BY, ORDER BY and LIMIT]
64
65
66
67 ... and actually I think this works, but it's very slow, probably because the internal result in MySQL becomes extremely large due to the joins.
68
69 Can anyone help me?
70
71
72
73
74 I checked out kwIndex from hotscripts and he does it like this:
75
76 1) Select the word-ids (SQL-query 1)
77 2) If both words were found, make another query for all linking-table entries matching the words and group by the word-id. The count(*) statement shows the number equal to the number of searchwords if they were both found. So select only records which delivers this. Then you have the document ids.... (SQL 2)
78
79
80 However this solution will not let us:
81 - search for parts of a word like "content%" or metaphone values. The word is matched exact!
82 - It uses 2 SQL queries - I hope you do fine with one only...
83
84
85
86
87
88
89 *****************************************************************************************************************
90 *****************************************************************************************************************
91
92
93
94
95
96
97
98 Message-ID: <200203021103320083.000F523D@smtp.worldonline.dk>
99 References: <200203020316380679.03EE30B9@smtp.worldonline.dk>
100 X-Mailer: Calypso Version 3.30.00.00 (4)
101 Date: Sat, 02 Mar 2002 11:03:31 +0100
102 From: "Kasper Skårhøj" <kasper@typo3.com>
103 To: typo3-features@netfielders.de
104 Subject: Re: [Typo3-features] Dev-help: Any SQL-wizards?
105 Mime-Version: 1.0
106 Content-Type: text/plain; charset="ISO-8859-1"
107 Content-Transfer-Encoding: quoted-printable
108
109
110
111 1) To the AND question:
112
113 Maybe it's best to make a search for each word; After getting the total list of page-ids from first search, this is included as a condition in the next search, which generates a new list which is included in the next search, which...
114 However this approach forces us to get a list of ids into PHP and include that in the next search. This will work for small sites (still more than 1000 pages though) but is not very wise in the long run (because this list could be very, very long).
115 Then MySQL has an option of creating a temporary table which one could store the result in and then join with that table upon the next search. This makes more sense I think, but that is only MySQL 3.23+ (I run 3.22).
116 Comments?
117
118 2) Search query syntax
119 Any suggestions to a search query syntax.
120 - Search for "content management" is by default AND search
121 - But should "+content -management" be the same as "content not management" ?
122 - What about nesting? Like "content and (management or production)". And which operator (AND or OR) has precedence anyways?
123
124
125 Please comment.
126
127
128
129
130 *****************************************************************************************************************
131 *****************************************************************************************************************
132
133
134
135
136
137
138
139
140
141 Return-Path: <typo3-features-owner@netfielders.de>
142 Delivered-To: pop3user-typo3-kasper@typo3.com
143 Received: (qmail 18622 invoked from network); 3 Mar 2002 00:19:45 -0000
144 Received: from unknown (HELO netfielders.de) (194.245.114.28) by 192.168.1.4 with SMTP; 3 Mar 2002 00:19:45 -0000
145 Received: from host1.deltaphon.net [209.239.36.16] by mailman.k1net.de (SMTPD32-6.06 EVAL) id AAA2EDE03EC; Sun, 03 Mar 2002 01:13:22 +0100
146 Received: from [10.0.1.2] (pD9EB6EA1.dip.t-dialin.net [217.235.110.161]) by host1.deltaphon.net (8.10.2/8.10.2) with ESMTP id g230F5532210 for <typo3-features@netfielders.de>; Sat, 2 Mar 2002 19:15:05 -0500
147 User-Agent: Microsoft-Entourage/10.0.0.1309
148 Date: Sun, 03 Mar 2002 01:15:06 +0100
149 Subject: Re: [Typo3-features] Dev-help: Any SQL-wizards?
150 From: Daniel Hinderink <hinderink@schweisfurth.de>
151 To: <typo3-features@netfielders.de>
152 Message-ID: <B8A7299A.3455%hinderink@schweisfurth.de>
153 In-Reply-To: <200203021103320083.000F523D@smtp.worldonline.dk>
154 Mime-version: 1.0
155 Content-type: text/plain; charset="ISO-8859-1"
156 Content-transfer-encoding: quoted-printable
157 Precedence: bulk
158 Sender: typo3-features-owner@netfielders.de
159 Reply-To: typo3-features@netfielders.de
160
161 Hi Kasper,
162
163 OK, I am not an sql-wizard. Just some hints and wishes, which I hope are
164 helpful.
165 >
166 > 1) To the AND question:
167 >
168 > Maybe it's best to make a search for each word; After getting the total list
169 > of page-ids from first search, this is included as a condition in the next
170 > search, which generates a new list which is included in the next search,
171 > which...
172 > However this approach forces us to get a list of ids into PHP and include that
173 > in the next search. This will work for small sites (still more than 1000 pages
174 > though) but is not very wise in the long run (because this list could be very,
175 > very long).
176 > Then MySQL has an option of creating a temporary table which one could store
177 > the result in and then join with that table upon the next search. This makes
178 > more sense I think, but that is only MySQL 3.23+ (I run 3.22).
179 > Comments?
180
181 In fact I believe there has to be an auto-indexing spider which builds a
182 search reference table at record creation time to compare against. I thought
183 this is what the current typo3 indexing is doing?
184 Best would be if that would be filtering the text for indexing words against
185 a "filler"-set to extract the noise and keep the search swift.
186 Here is a very short bit on the fulltext in feature in mysql, is that what
187 you are talking about?
188 This seems to be the only way for me to dit inside mysql: It's from 3.23.23
189 on: http://www.mysql.com/doc/F/u/Fulltext_Search.html
190
191 However it would be very nice to have a result ranking mechanism, not only
192 going by the frequency of a match in a given record, but also by relevance,
193 as shown by hits and by the rank in the pagetree, so that a match in a
194 record on rootlevel +1 is shown before a record in rootlevel +2, etc.
195
196 Take a look at atomz.com ->search. The administration surface has quite a
197 few important features I would love to see in a typo3 admin search surface.
198
199 I have to say a really elaborate, fast and multiformat (pdf's !) search
200 engine is really one the single most important things for every 100+
201 website.
202 >
203 > 2) Search query syntax
204 > Any suggestions to a search query syntax.
205 > - Search for "content management" is by default AND search
206 > - But should "+content -management" be the same as "content not management" ?
207 > - What about nesting? Like "content and (management or production)". And which
208 > operator (AND or OR) has precedence anyways?
209
210 As for the syntax, simple boolean is much more widespread in use, but
211 implied boolean (+,-) should be the prevailing statement of them. That's how
212 it was done in harvest and is still done in most search engines.
213
214 Here is a handy little comparison table of search engine syntax:
215 http://lisweb.curtin.edu.au/staff/gwpersonal/compare.html
216
217 Thanks for listening, good night,
218
219 Daniel
220
221
222
223
224
225
226
227
228
229 *****************************************************************************************************************
230 *****************************************************************************************************************
231
232
233
234
235
236
237
238
239
240
241 Return-Path: <typo3-features-owner@netfielders.de>
242 Delivered-To: pop3user-typo3-kasper@typo3.com
243 Received: (qmail 4770 invoked from network); 4 Mar 2002 07:58:03 -0000
244 Received: from unknown (HELO netfielders.de) (194.245.114.28) by 192.168.1.4 with SMTP; 4 Mar 2002 07:58:03 -0000
245 Received: from av001.thyssen.com [149.211.49.30] by mailman.k1net.de (SMTPD32-6.06 EVAL) id A75E110303EC; Mon, 04 Mar 2002 08:50:54 +0100
246 Received: from mail.thyssen.com (unverified) by av001.thyssen.com (Content Technologies SMTPRS 4.2.10) with ESMTP id <T596cf325de95d3311e374@av001.thyssen.com> for <typo3-features@netfielders.de>; Mon, 4 Mar 2002 08:52:43 +0100
247 Received: from srv479.thyssen.com (srv479 [149.206.183.11]) by mail.thyssen.com (8.10.0.Beta6/1.0.3) with ESMTP id g247leJ66914 for <typo3-features@netfielders.de>; Mon, 4 Mar 2002 08:47:41 +0100
248 Received: from SRV533.thyssen.com (SRV533.thyssen.com [149.206.246.196]) by srv479.thyssen.com (8.11.1/8.11.1) with ESMTP id g247sSj20724 for <typo3-features@netfielders.de>; Mon, 4 Mar 2002 08:54:28 +0100
249 Subject: Antwort: [Typo3-features] Dev-help: Any SQL-wizards?
250 To: typo3-features@netfielders.de
251 X-Mailer: Lotus Notes Release 5.0.5 September 22, 2000
252 Message-ID: <OF0A6CE201.863BB902-ONC1256B72.002973BF@thyssen.com>
253 From: Malecki@blohmvoss.thyssen.com
254 Date: Mon, 4 Mar 2002 08:48:27 +0100
255 X-MIMETrack: Serialize by Router on SRV533/Server/BuV(Release 5.0.6a |January 17, 2001) at 04.03.2002 08:52:40
256 MIME-Version: 1.0
257 Content-type: text/plain; charset=iso-8859-1
258 Content-transfer-encoding: quoted-printable
259 Precedence: bulk
260 Sender: typo3-features-owner@netfielders.de
261 Reply-To: typo3-features@netfielders.de
262
263
264 Hi Kasper,
265 I was working some time with Oracle and (less) with mySQL on similar
266 problems.
267 I think, mySQL goes the same base ways, so the same hints apply to mySQL.
268
269 Basic hint is: always try to keep the driving set of the query as small as
270 possible.
271
272 The mentionerd query is like:
273 ....
274 WHERE
275 IR.phash = IP.phash AND
276 IW.wid=IR.wid AND
277 (IW.baseword = 'content' OR IW.baseword = 'management')
278
279 and probably there is a huge amount of rows matching the first line:
280 IR.phash = IP.phash AND
281
282 (this is a HHUUGGEE JJOOIINN i think, right?)
283
284 which is creating the "driving set" (of matching rows) of the selection.
285 All following "AND"'s are applied to cut-down
286 this initial amount of rows.
287 It's then obvious, if this initial amount of rows is small, then the
288 succeding narrowing takes less time and resources.
289
290 Suggestion: try first:
291 (IW.baseword = 'content' OR IW.baseword = 'management')
292 AND ....
293 AND ....
294
295 The first line of the criteria shall provide as small as possible amount
296 of rows matching.
297 (this looks then strange somehow, but is effective).
298
299 Basically, when the tables in the database are well analysed (statistics is
300 actual), then some optimization
301 shall (and most likely will) be done by the query optimizer.
302 But there is no guarantee. What the guaranteed is to don't rely on
303 something else.
304 A good query works good without any artificial help.
305
306 BTW. If You like to see IN ADVANCE how the query will behave: call the
307 EXPLAIN for this query.
308 You will get then more information about how mySQL will go to process the
309 query.
310
311 I hope, this helps.
312
313 Regards and God's Blesses for Your week
314 Piotr
315
316
317
318
319 *****************************************************************************************************************
320 *****************************************************************************************************************
321
322 OK there were some fancy calculations promoted by Graeme Merrall:
323
324 "However, regarding relevance you probably want to look at something like
325 Salton's formula which is a good easy way to measure relevance.
326 Oracle Intermedia uses this and it's pretty simple:
327 Score can be between 0 and 100, but the top-scoring document in the query
328 will not necessarily have a score of 100 -- scoring is relative, not
329 absolute. This means that scores are not comparable across indexes, or even
330 across different queries on the same index. Score for each document is
331 computed using the standard Salton formula:
332
333 3f(1+log(N/n))
334
335 Where f is the frequency of the search term in the document, N is the total
336 number of rows in the table, and n is the number of rows which contain the
337 search term. This is converted into an integer in the range 0 - 100.
338
339 There's a good doc on it at
340 http://ls6-www.informatik.uni-dortmund.de/bib/fulltext/ir/Pfeifer:97/
341 although it may be a little complex for what you require so just pick the
342 relevant parts out.
343 "
344
345 However I chose not to go with this for several reasons.
346 I do not claim that my ways of calculating importance here is the best.
347 ANY (better) suggestion for ranking calculation is accepted! (as long as they are shipped with tested code in exchange for this.)
348
349
350
351
352 *****************************************************************************************************************
353 *****************************************************************************************************************