Randomly extract non-overlapping sets
Generate content for a word game. The input is data about word relations (A-to-B strength); the output is groups of distantly related words.
For example:
GOOD: apple, airplane, dog, house
BAD: banana, cherry, peach, strawberry.
You do NOT need to speak very much English. This is purely data.
I have two source files: a list of ranked relationships between words, and a separate list of words which may or may not appear in the first file. This is real English word data, similar to a thesaurus such as https://www.powerthesaurus.org/
The task is to randomly output sets of lines from the second file that represent NON-overlapping concepts.
An ideal algorithm would create a multidimensional mesh and then randomly extract distant nodes, i.e. output sets of words which are all distant from one another in vector space. I don't know how to do that. See: https://dzone.com/articles/introduction-to-word-vectors
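One possible stand-in for "distant in vector space" is hop distance in the relation graph. This is a swapped-in technique rather than the mesh described above. The sketch below assumes file A has already been loaded into a hypothetical symmetric adjacency dict named graph, and uses breadth-first search to list every word within a few hops of a start word, so those words can be excluded from the same set.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops edges (start excluded)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    del depth[start]
    return set(depth)

# Hypothetical toy graph; the real one would be built from relations.txt.
graph = {
    "aaa": {"aab", "aac"},
    "aab": {"aaa", "aad"},
    "aac": {"aaa"},
    "aad": {"aab"},
    "ccc": {"ccd"},
    "ccd": {"ccc"},
}
near_aaa = within_hops(graph, "aaa", 2)
print(sorted(near_aaa))
```

Words outside the returned set (here "ccc" and "ccd") are candidates for sharing a set with "aaa". The hop limit is a tuning knob and is not part of the brief.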
A non-ideal algorithm would randomly pull a line from file 2 and measure its similarity to the lines already output. If it is dissimilar, keep it and remove it from file 2. If a line from file 2 is similar to too many of the tested lines, set it aside and return to file 2 for another draw. Think of a "bag of coins" where you keep randomly testing and replacing coins until they are all different.
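The "bag of coins" idea could be sketched roughly as below. This is a simplified reading: candidates are shuffled once, each is tested against the lines kept so far, and rejected candidates are skipped rather than retried. The function name pick_dissimilar, the similarity callback, and the toy first-letter similarity are all illustrative assumptions.

```python
import random

def pick_dissimilar(candidates, similarity, n, max_pair=0.3, rng=None):
    """Randomly keep up to n items such that no kept pair scores above max_pair."""
    rng = rng or random.Random()
    pool = list(candidates)
    rng.shuffle(pool)  # random draw order
    kept = []
    for item in pool:
        if len(kept) == n:
            break
        # Keep the item only if it is dissimilar to everything kept so far.
        if all(similarity(item, other) <= max_pair for other in kept):
            kept.append(item)
    return kept

def toy_similarity(a, b):
    # Toy stand-in: words sharing a first letter count as the same concept.
    return 1.0 if a[0] == b[0] else 0.0

words = ["aaa", "aab", "bbb", "bba", "ccc", "cca", "ddd"]
picked = pick_dissimilar(words, toy_similarity, n=3, rng=random.Random(0))
print(picked)
```

The function may return fewer than n items if the pool runs out of dissimilar candidates, so a caller would need to check the length of the result.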
No word pair should have a similarity above 0.3, and the total similarity of all words between sets should be below 0.5.
Preferred programming languages: Ruby, Perl, Python.
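One way to read the two thresholds is as a check on a finished group: every pair must score at most 0.3, and the sum over all pairs must stay below 0.5. That reading is an assumption, as is the idea that scores are already on a 0 to 1 scale (the sample scores in the input file go up to 150.0, and how they map onto these thresholds is not stated). The names below are hypothetical.

```python
from itertools import combinations

def pack_is_valid(words, similarity, max_pair=0.3, max_total=0.5):
    """Check the pairwise cap and the summed-similarity cap for one group of words."""
    total = 0.0
    for a, b in combinations(words, 2):
        score = similarity(a, b)
        if score > max_pair:
            return False  # a single pair is too similar
        total += score
    return total < max_total

# Toy scores on an assumed 0..1 scale; unlisted pairs count as 0.0.
toy_scores = {
    frozenset(("apple", "dog")): 0.1,
    frozenset(("dog", "house")): 0.2,
    frozenset(("apple", "peach")): 0.4,
}

def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)

print(pack_is_valid(["apple", "dog", "house"], toy_similarity))  # True
print(pack_is_valid(["apple", "peach", "dog"], toy_similarity))  # False
```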
Two input files:
A) relations.txt
#aaa [syn]: aab | aac; [syn-score]: 100.0 | 8.0;
#aab [syn]: aaa | aac; [syn-score]: 75.0 | 5.0;
#bbb [syn]: bba | bbc; [syn-score]: 50.0 | 4.3;
#bba [syn]: bbb | bbc; [syn-score]: 150.0 | 1.2;
#ccc [syn]: ccd | ccz; [syn-score]: 150.0 | 0.4;
... etc.
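A minimal parsing sketch for lines shaped like the sample above. It assumes every line follows exactly the "#head [syn]: ... ; [syn-score]: ... ;" layout and skips anything else; the real 300+ MB file may contain other fields that this does not handle. Passing an open file object instead of a list lets it read line by line.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the sample:
#   #head [syn]: w1 | w2; [syn-score]: s1 | s2;
LINE_RE = re.compile(
    r"^#(?P<head>[^\[]+?)\s*\[syn\]:\s*(?P<words>[^;]*);"
    r"\s*\[syn-score\]:\s*(?P<scores>[^;]*);"
)

def parse_relations(lines):
    """Return {head: {neighbor: score}} for every line matching the sample layout."""
    relations = defaultdict(dict)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines in any other shape
        words = [w.strip() for w in match.group("words").split("|") if w.strip()]
        scores = [float(s) for s in match.group("scores").split("|") if s.strip()]
        # zip stops at the shorter list if words and scores are misaligned.
        for word, score in zip(words, scores):
            relations[match.group("head").strip()][word] = score
    return dict(relations)

sample = [
    "#aaa [syn]: aab | aac; [syn-score]: 100.0 | 8.0;",
    "#ccc [syn]: ccd | ccz; [syn-score]: 150.0 | 0.4;",
    "... etc.",
]
relations = parse_relations(sample)
print(relations["aaa"])
```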
B) lists.txt
#aaa = aab | aac
#bbb = bbd | bba
#bba = bbd | bbx
#ccc = cca | ccz
#cca = ccd | cce
#ddd = dda | ddb
... etc.
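The second file looks simpler to read. The sketch below assumes each useful line has the form "#head = a | b" and keeps the raw line alongside the parsed parts, since the output is meant to be whole lines from this file. The function name parse_sets is hypothetical.

```python
def parse_sets(lines):
    """Return (head, members, raw_line) tuples for lines shaped '#head = a | b'."""
    sets = []
    for line in lines:
        raw = line.strip()
        if not raw.startswith("#") or "=" not in raw:
            continue  # skip anything that does not match the sample layout
        head, _, rest = raw[1:].partition("=")
        members = [m.strip() for m in rest.split("|") if m.strip()]
        sets.append((head.strip(), members, raw))
    return sets

sample = ["#aaa = aab | aac", "#bbb = bbd | bba", "... etc."]
parsed = parse_sets(sample)
print(parsed[0])
```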
The real file A is 300+ MB, with 855k lines.
The real file B is ~15k lines.
I want to be able to set N, the number of sets, and Y, the number of packs. N will typically be around 25 sets, and Y will likely be 1000 packs.
Output, with N=2:
#aaa = aab | aac
#cca = ccd | cce

#bbb = bbd | bba
#ddd = dda | ddb

#bba = bbd | bbx
#ccc = cca | ccz

Output, with N=3:
#aaa = aab | aac
#ccc = cca | ccz
#bba = bbd | bbx

#bbb = bbd | bba
#ddd = dda | ddb
#cca = ccd | cce
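A rough sketch of how packs like these could be assembled. It assumes each line is used in at most one pack (as in the sample output) and that packs are separated by a blank line when printed; neither is stated explicitly. Under the no-reuse assumption, about 15k lines can fill at most 600 packs of 25, so 1000 packs would require allowing reuse. The names build_packs and similar are hypothetical, and the first-letter test stands in for a real similarity check.

```python
import random

def build_packs(lines, n_sets, y_packs, similar, rng=None):
    """Greedily build up to y_packs packs of n_sets mutually dissimilar lines, without reuse."""
    rng = rng or random.Random()
    pool = list(lines)
    rng.shuffle(pool)
    packs = []
    while len(packs) < y_packs and len(pool) >= n_sets:
        pack, rest = [], []
        for line in pool:
            if len(pack) < n_sets and not any(similar(line, kept) for kept in pack):
                pack.append(line)
            else:
                rest.append(line)  # stays available for later packs
        if len(pack) < n_sets:
            break  # the remaining lines cannot fill another pack
        packs.append(pack)
        pool = rest
    return packs

lines = [
    "#aaa = aab | aac", "#bbb = bbd | bba", "#bba = bbd | bbx",
    "#ccc = cca | ccz", "#cca = ccd | cce", "#ddd = dda | ddb",
]

def same_first_letter(a, b):
    # Toy stand-in for "overlapping concepts": same letter right after '#'.
    return a[1] == b[1]

packs = build_packs(lines, n_sets=2, y_packs=1000, similar=same_first_letter,
                    rng=random.Random(1))
print("\n\n".join("\n".join(pack) for pack in packs))
```

Because the selection is greedy, it can stop early with fewer packs than requested even when a cleverer arrangement exists.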
-
Hello!
My name is Andrei. I'm the CEO of Kodep, and I would like to apply for this job. We have been on the market for more than 9 years, and as an outsourcing company we work with clients all over the world. We care about our customers and the quality of the code we provide, which is why more than 95% of our clients have been satisfied with the result. We work with both startups developing projects from scratch and large companies with live projects and legacy code.
All the requirements in your offer suit us well. I can send you the CVs of available developers so you can choose the most suitable ones, the full list of technologies we work with, or a list of portfolio projects. Please let me know if you have any questions.
Our rate is $15 per hour.
Regards,
Andrei