Friday, February 7, 2014

I've been reading a book on exploring Splunk to index and search machine data, and I want to continue that discussion and include my take from a training video I've seen. Today, though, I want to take a break to paint a vision of my text processing system that can do both keyword weighting and topic extraction from text. I've attempted different projects with different algorithms and implementations. Most of them have not been satisfactory, except perhaps the more recent ones, and even there some refinements remain to be done. But I've learned some and can now associate an appropriate algorithm with the task at hand.

For the most part, the system will follow conventional wisdom. By that I mean documents are treated as term vectors, and those vectors are reduced from a high number of dimensions before they are clustered together. I've tried thinking about alternative approaches to avoid the curse of dimensionality, and I don't feel I have done enough on that front, but the benefit of following convention is that there is plenty of literature on what has worked before. In many cases, there is a lot of satisfaction if it just works. Take, for instance, the different algorithms to weigh terms and to cluster topics: we chose the common principles from most of the implementation discussions in papers and left out the fancy ones. We know that there are soft memberships to different topics, that the scope of the search changes in different ways, and that there are different tools to rely on, but overall we have experimented with the different pieces of the puzzle so that they can come together.
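To make that conventional pipeline concrete, here is a minimal sketch in Python using scikit-learn: weigh terms with tf-idf, reduce the dimensions with truncated SVD, then cluster. The sample documents and the parameter choices are purely illustrative, not part of the actual system.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "the engine indexes and searches machine data",
    "topic extraction groups documents by theme",
    "keyword weighting ranks terms within a document",
]

# Weigh terms: each document becomes a tf-idf term vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Reduce the high-dimensional term space before clustering.
reduced = TruncatedSVD(n_components=2).fit_transform(vectors)

# Cluster the reduced vectors into topics.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
print(labels)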
I now describe the overall layout and organization of this system. We will have layers for different levels of engagement and functionality, starting with the backend all the way to the front end. The distinguishing feature of this system is that it allows different algorithms to be switched in and out of the execution, with a scorecard maintained for each algorithm so they can be evaluated against the text to choose what works best. Given the nature of the input and the differing emphasis of each algorithm, such a strategy design pattern becomes a salient feature of the core of our system. The engine may have to run several mining techniques and may even work with big data, hence it should have a distributed framework where execution can be forked out to different agents. Below the processing engine layer will be a variety of large data sources and a data access layer. There could also be node initiators and participants from a cluster. The processing engine can sit on top of this heterogeneous system.
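As a rough illustration of what I mean by the strategy pattern with a scorecard, here is a Python sketch; the class names and the scores are hypothetical placeholders for whatever algorithms and evaluation metric the engine actually plugs in.

from abc import ABC, abstractmethod

class WeightingStrategy(ABC):
    """Interface that lets term-weighting algorithms be switched in and out."""
    @abstractmethod
    def weigh(self, terms):
        ...

class TermFrequency(WeightingStrategy):
    def weigh(self, terms):
        counts = {}
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
        return counts

class Scorecard:
    """Keeps an evaluation score per algorithm so the best one can be chosen."""
    def __init__(self):
        self.scores = {}
    def record(self, name, score):
        self.scores[name] = score
    def best(self):
        return max(self.scores, key=self.scores.get)

scorecard = Scorecard()
scorecard.record("tf", 0.72)    # illustrative scores from evaluating
scorecard.record("tfidf", 0.85) # each algorithm against the text
print(scorecard.best())         # -> tfidf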
Above the processing engine comes the management layer, which can handle remote commands and queries. These remote commands could be assumed to come over HTTP and may talk to one or more of the interfaces the customer uses. These could include a command-line interface, a user interface, and an administration panel.
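A minimal sketch of such a management endpoint, assuming the remote commands arrive as JSON over HTTP; the port, the payload shape, and the response format here are my own placeholder choices, not a fixed design.

from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class ManagementHandler(BaseHTTPRequestHandler):
    """Accepts remote commands over HTTP on behalf of the engine."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        command = json.loads(self.rfile.read(length) or b"{}")
        # A real handler would dispatch the command to the processing engine.
        body = json.dumps({"status": "accepted",
                           "command": command.get("name")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ManagementHandler).serve_forever()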
The size of the data, the scalability of the processing, and the distributed tasks may require modular components that communicate with one another, so that they can be independently tested and switched in and out. Also, the system may perform very differently on data that doesn't fit in main memory, whether on a participant or on the initiator machine.
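For the data that doesn't fit in memory, one simple approach is to stream the corpus in fixed-size batches; here is a sketch assuming one document per line, where the file name and the process hook are hypothetical.

def stream_documents(path, batch_size=1000):
    """Yield documents in fixed-size batches so the corpus never has to
    fit in main memory on either the initiator or a participant node."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Each batch could be handed to a separate participant for processing:
# for batch in stream_documents("corpus.txt"):
#     process(batch)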
