I have a fairly difficult task, so bear with me; I will try not to waste words. I am doing some research, and my group is moving to a new database. We used MySQL before, but the data outgrew it (192 million rows in 16G of memory; keeping it in memory was the only way to query the data quickly). There is more to say about the data itself, but more on that in a moment.
The data consists of classifier-score pairs. We prepare a query for the database that basically says, "Give me the top 500 scores for the following classifiers," and the database returns the matching scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row contains a classifier id and a score, e.g. [4, 9100]). The scores themselves are non-uniform (the distribution is skewed towards one end of the range, which runs from -10000 to 10000).
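To make the shape of that query concrete, here is a minimal in-memory sketch in Python. It is not how MySQL or Cassandra would execute it; the function name `top_n_per_classifier` and the sample pairs are made up for illustration, and only the row format ([classifier id, score]) comes from the description above.

```python
import heapq
from collections import defaultdict


def top_n_per_classifier(pairs, classifier_ids, n):
    """Return [classifier_id, score] rows: the n highest scores per requested classifier."""
    wanted = set(classifier_ids)
    by_classifier = defaultdict(list)
    for cid, score in pairs:
        if cid in wanted:
            by_classifier[cid].append(score)

    rows = []
    for cid in classifier_ids:
        # nlargest returns the scores in descending order
        for score in heapq.nlargest(n, by_classifier[cid]):
            rows.append([cid, score])
    return rows


# Hypothetical sample data: (classifier id, score) pairs
pairs = [(4, 9100), (4, -250), (4, 8700), (7, 120), (7, 9900), (9, 5)]
result = top_n_per_classifier(pairs, [4, 7], 2)
```

With two classifiers and n = 2 this yields four rows, in the same way that two classifiers and n = 500 yield 1000 rows.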
As we move to Cassandra, there are several requirements. First of all, we should be able to query the top and bottom N scores on a per-classifier basis. Normally I would expect the order-preserving partitioner to suit this, although I have read that it creates hot spots at the extremes (which would put a lot of load on one node). So my first question is: how can I still distribute the classifier/score pairs evenly while being able to query for the top or bottom N?
A secondary requirement is to be able to find the first scores that come after some other score. So if I am looking at classifier 6 with a score of 400, I can ask, "Tell me the 500 scores that come after it" (within classifier 6). I am totally stumped on this one. I have read that Cassandra supports secondary indexes (yay), but only of the hash type (boo, no ranges). Do we create a separate column family for this use case?

And finally, speed is paramount: the data is being used in an interactive GUI application. Ideally, queries should take only a few seconds. And if all the data ends up on one particular node, things will slow down.

We have tried all kinds of crafty tricks. Our best idea was to put the data into buckets, so that the top 500 scores go to bucket 1, the next 500 go to bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. In addition, all the data would be distributed evenly using the random partitioner. However, since most of our queries are only interested in bucket 1, this would still put a lot of load on a single node (remember, if N classifiers are involved, that is actually 500 * N scores per bucket). The real disadvantage of this plan is that when we need to query based on a score, we are stuck (we would have to do some strange binary search over the buckets to find our starting score).

At this point we are running low on ideas. From everything I have seen about Cassandra, I wonder whether it is even suitable for this task. We chose it primarily for its horizontal scalability, which is important (adding nodes to an RDBMS is more than a chore). I guess my whole question is: how would you approach this? If with Cassandra, please address any of the issues above. Otherwise, any insights or knowledge would be appreciated. Thank you.

Why not store each classifier as a row in a column family, with the scores used as column names?
Because columns are sorted, it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible: when you are looking for scores near s, you can query, for example, the 500 columns before s and the 500 columns after s.
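The idea can be modelled in a few lines of Python. This is only an in-memory sketch of the layout (one sorted "row" of scores per classifier), not Cassandra client code; the class `ScoreStore` and its method names are hypothetical. Note one assumption it glosses over: column names are unique within a row, so equal scores for the same classifier would need some tiebreaker in the real column name, whereas the list below simply keeps duplicates.

```python
import bisect
from collections import defaultdict


class ScoreStore:
    """In-memory model: one row per classifier, scores kept sorted like column names."""

    def __init__(self):
        self._rows = defaultdict(list)

    def add(self, classifier_id, score):
        # insort keeps the row sorted, as a column family keeps its columns sorted
        bisect.insort(self._rows[classifier_id], score)

    def top(self, classifier_id, n):
        row = self._rows.get(classifier_id, [])
        # Highest scores first; guard n == 0 because row[-0:] is the whole row
        return row[-n:][::-1] if n > 0 else []

    def bottom(self, classifier_id, n):
        return self._rows.get(classifier_id, [])[:n]

    def after(self, classifier_id, score, n):
        # The n scores strictly greater than `score`, in ascending order
        row = self._rows.get(classifier_id, [])
        start = bisect.bisect_right(row, score)
        return row[start:start + n]

    def before(self, classifier_id, score, n):
        # The n scores strictly less than `score`, in ascending order
        row = self._rows.get(classifier_id, [])
        end = bisect.bisect_left(row, score)
        return row[max(0, end - n):end]


store = ScoreStore()
for s in [400, 9100, -250, 450, 8700, 120]:
    store.add(6, s)

top_two = store.top(6, 2)              # two highest scores for classifier 6
after_400 = store.after(6, 400, 2)     # two scores that come after 400
before_400 = store.before(6, 400, 2)   # two scores that come before 400
```

Both kinds of query become a slice of one sorted row, so neither needs the order-preserving partitioner: rows can still be spread across nodes by classifier id.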