In a previous post I showed a sample of the source index tool. Today we can talk about a web UI for a source index server. I started with an ASP.NET MVC4 WebAPI project and added a model and views for a search form and results display. The index, built beforehand from a local source enlistment, seems to work just fine.
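To make the end-to-end flow concrete, here is a minimal sketch of what the WebAPI controller behind the search form could look like. It assumes Lucene.Net 3.x, an index folder at C:\SourceIndex, and "path" and "content" field names from the offline indexing step; all of these names are illustrative, not the actual tool's.

    using System.Collections.Generic;
    using System.Web.Http;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    public class SearchController : ApiController
    {
        // GET api/search?q=term : returns the paths of matching source files.
        public IEnumerable<string> Get(string q)
        {
            var results = new List<string>();
            using (var directory = FSDirectory.Open(@"C:\SourceIndex"))
            using (var searcher = new IndexSearcher(directory, true))
            {
                var analyzer = new StandardAnalyzer(Version.LUCENE_30);
                var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
                foreach (var hit in searcher.Search(parser.Parse(q), 50).ScoreDocs)
                {
                    results.Add(searcher.Doc(hit.Doc).Get("path"));
                }
            }
            return results;
        }
    }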
There is a caveat, though. The tokenizer, called an Analyzer in the Lucene.Net object model, is not able to tokenize some symbols out of the box. But this is customizable, and as long as we strip the preceding or succeeding delimiters such as '[', '{', '(', '=' and ':', we should be able to search on the syntax.
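A simple pre-processing step along those lines might look like the following; the delimiter set here is only an illustrative sample.

    // Strip surrounding delimiters so that a term such as "[HttpGet]"
    // still matches the tokenized form of the symbol.
    private static string StripDelimiters(string term)
    {
        return term.Trim('[', ']', '{', '}', '(', ')', '=', ':', ';', ',');
    }

With this, StripDelimiters("[HttpGet]") returns "HttpGet", which the analyzer tokenizes normally.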
With the basic tokenizer, the index comes to a few MB for around 32,000 source files. So even if all the symbols were tokenized, it is not expected to grow beyond a few GB of storage.
Since the indexing happens offline and the index can be rebuilt periodically, or whenever the source has changed sufficiently, rebuilding does not affect users of the source index server.
As an aside, for those following the previous posts on the OAuth token database, maintenance of the source index server is similar to maintenance of the token database. An archival policy keeps the token database at a manageable size because all expired tokens are archived, and the token expiry time is usually one hour. Similarly, the index for the source index server can be periodically rebuilt and swapped with the existing index.
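A minimal sketch of that rebuild-and-swap, assuming the new index is built into a staging folder; the folder names are illustrative, and any open searchers would need to be reopened against the live folder afterwards.

    // Retire the live index and promote the freshly built one.
    public static void SwapIndex(string live, string staging, string retired)
    {
        if (System.IO.Directory.Exists(retired))
            System.IO.Directory.Delete(retired, recursive: true);
        System.IO.Directory.Move(live, retired);
        System.IO.Directory.Move(staging, live);
    }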
With regard to the analyzer, a StandardAnalyzer could be used instead of a simple Analyzer. It takes a parameter for different versions, and a more recent version could be used to better tokenize source code.
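For instance, with Lucene.Net 3.x the two analyzers would be constructed like this:

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Version = Lucene.Net.Util.Version;

    // SimpleAnalyzer splits on non-letter characters and lowercases;
    // StandardAnalyzer takes a version constant and handles more token types.
    Analyzer basic = new SimpleAnalyzer();
    Analyzer standard = new StandardAnalyzer(Version.LUCENE_30);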