As of 2016_10 I do not want to be the idiot who misses the obvious, especially in my occupation, software development. I do not need to be aware of the latest developments in the field, because the latest solutions are not always the technically best ones, but I do need to find the smartest solutions, regardless of whether they are new or old. For that I need a capability to search "the knowledge of humanity", at least the part of it that I am capable of understanding.
I do understand that whatever I do over my home internet connection without anonymization tools like Tor is public. I do not want to give a nice data record of my searches to any search engine. Nor do I trust the mainstream search engines to give me uncensored results, especially after the Google people openly admitted that they offer propaganda services to governments. Another issue is that even if the mainstream search engines did not apply censorship, and even if they were usable privately like the Tor network search engines are, they would still have to offer results that the majority of their users LIKE, and people differ. Therefore, even if Google were not evil, as it once advertised, it would either have to display mediocre results that are not a total failure but get top marks from hardly anybody, or it would have to offer personalized search results to please everybody individually. Personalization, in turn, requires collecting data about every user.
The moment party P_1 has data D_1, some party P_2 might snatch that data from P_1. In 2016 the party P_2 is most likely the NSA or one of its counterparts, or just a blatant police raid in the middle of the day in everybody's view. Excuses vary, usually tax fraud is a nice excuse, but in the current context the excuse does not matter, because at the end of the police raid the data D_1 is held by more parties than P_1. That holds even in a situation where the police do not care about D_1, despite knowing exactly what D_1 is about. There's a reason why they want to store everything in Utah, and the reason is not that they are incapable of figuring out how to funnel tax money to the contractors in some other way. (One might argue that the bulk data collection is a cover for creating a huge data center for auto-detecting military targets from 3D radar images, but if that were the case, then Washington would want its enemies to avoid shooting the data center with rockets, and Washington would make great efforts to make everybody economically dependent on the well-being of the Utah data center, offer free drop-box services from there, etc.)
I'll skip how I reached the idea that what is needed is an RSS-reader analogue for search engines. An RSS-reader sends a query and a blog engine responds in a standardized manner. A single RSS-reader instance contacts multiple blogs and then applies its own processing to the data. None of the blog owners is informed of the whole list of blogs that the RSS-reader instance polls, or of what other processing is applied to the blog engine's response on the RSS-reader side.
The good news is that the archives of freely accessible scientific publications all have their own site-specific search engine instances. As of 2016_10 I have not studied the capabilities of those search engines, but at least the open source Gigablast search engine has its own API, and there is an effort to standardize automated publication indexing among libraries and electronic archives. If a party P_3 that runs an electronic archive has made a scientific publication D_2 publicly accessible to everybody, then P_3 has no motive to censor it in its site-specific search engine. If P_3 outsources the search engine functionality to Google or Microsoft or some other third party with varying interests, then the problem is that the scientific publication may get censored against the will of its publisher, P_3. Luckily Microsoft is not known for providing reliable services, and Google with its propaganda services makes the smarter bunch of the scientific publishing people, at least in the long run, wary enough to avoid relying on Google for the vital parts of their project.
The summary of the proposed architecture is that an RSS-reader analogue, hereafter SEARCH-reader, connects to a variety of search engines by using search-engine-specific APIs. Some of the search engine instances are personal and run on people's personal virtual machines or even on Raspberry Pis. Just like operating systems have device drivers to allow different hardware to be used (ODBC drivers also qualify as an example), the SEARCH-reader has a plugin architecture where "search-engine drivers" are plugged in. Queries written in a user-domain-specific language are first translated to some modern SQL-analogue, and the search-engine drivers are all designed to work with that SQL-analogue. The SQL-analogue might be an internal domain specific language that is implemented as a Ruby library, which can internally use some theorem prover, maybe SPASS.
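The plugin idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Query`, `Result`, `InMemoryDriver`, `SearchReader` and `translate` are hypothetical, the "SQL-analogue" is reduced to a pair of term lists, and Python is used here only for brevity (the post itself suggests a Ruby library).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical intermediate form standing in for the SQL-analogue."""
    all_of: tuple = ()   # terms that must all appear
    none_of: tuple = ()  # terms that must not appear


@dataclass(frozen=True)
class Result:
    """Normalized search result shared by all drivers."""
    engine: str
    title: str
    url: str


class InMemoryDriver:
    """Toy search-engine driver over a local list of (title, url) pairs."""

    def __init__(self, name, documents):
        self.name = name
        self._documents = documents

    def search(self, query):
        hits = []
        for title, url in self._documents:
            text = title.lower()
            wanted = all(term in text for term in query.all_of)
            unwanted = any(term in text for term in query.none_of)
            if wanted and not unwanted:
                hits.append(Result(self.name, title, url))
        return hits


class SearchReader:
    """Holds the plugged-in drivers and sends every query to all of them."""

    def __init__(self):
        self._drivers = {}

    def plug_in(self, driver):
        self._drivers[driver.name] = driver

    def search(self, query):
        results = []
        for driver in self._drivers.values():
            results.extend(driver.search(query))
        return results


def translate(user_query):
    """Tiny stand-in for the user-language translator: '-term' excludes a term."""
    words = user_query.lower().split()
    return Query(
        all_of=tuple(w for w in words if not w.startswith("-")),
        none_of=tuple(w[1:] for w in words if w.startswith("-") and len(w) > 1),
    )
```

A real driver would wrap the API of one particular search engine behind the same `search(query)` method, so the rest of the SEARCH-reader never sees engine-specific details.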
A thing to keep in mind is that a query is a filter that is applied to the data set that contains all available data. The search-engine drivers must normalize the search results so that the rest of the data processing works with all search engines. The responses to the queries are received asynchronously, which allows search engines to go offline, internet connections to break, and Silktorrent based search systems to be used alongside the more traditional, real-time search engines. A search-engine driver can also connect to a local email client like Mozilla Thunderbird or to some other personal data storage that has its own embedded search engine. The data storage might even be an RSS-reader that searches from downloaded blog posts.
Given that the domain specific language for assembling queries is a programming language, people might implement parts of their queries as re-usable libraries. Non-technical users might use some children's robotics programming language like Scratch or Logo. Scratch might even work on touchscreens, if not for serious programming then at least for scratching the screen and enjoying it.
The SEARCH-reader might even create a new market for Raspberry Pi like computers, because some people might want to buy pre-configured search engines and snap them onto their home LAN. If those Raspberry Pis and Raspberry Pi like computers use GenodeOS or some formally verified fork of MINIX3, then they will probably stay free of successful intruders long enough to be considered reliably private data storage.
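The asynchronous, fault-tolerant querying and the normalization step might look roughly like the sketch below. It is only an illustration under assumptions: the three engines are simulated coroutines, the field names `headline`, `link`, `title` and `url` are invented, and late answers are simply dropped after a timeout, whereas a real SEARCH-reader would probably keep accepting slow answers (for example from Silktorrent based systems) and merge them in later.

```python
import asyncio


async def fast_engine(query):
    # Simulated engine that answers quickly, using its own field names.
    await asyncio.sleep(0.01)
    return [{"headline": "Result for " + query, "link": "http://example.org/fast"}]


async def offline_engine(query):
    # Simulated engine that is currently unreachable.
    raise ConnectionError("engine is offline")


async def slow_engine(query):
    # Simulated engine that answers far too late for an interactive search.
    await asyncio.sleep(5)
    return [{"title": "too late", "url": "http://example.org/slow"}]


def normalize(engine_name, raw):
    """Map engine-specific field names onto one common record shape."""
    return {
        "engine": engine_name,
        "title": raw.get("title") or raw.get("headline") or "",
        "url": raw.get("url") or raw.get("link") or "",
    }


async def query_one(name, engine, query, timeout):
    """Query a single engine; an offline or slow engine yields no results."""
    try:
        raw_results = await asyncio.wait_for(engine(query), timeout)
    except (asyncio.TimeoutError, ConnectionError):
        return []
    return [normalize(name, raw) for raw in raw_results]


async def search_all(engines, query, timeout=0.2):
    """Send the query to all engines concurrently and flatten the answers."""
    tasks = [query_one(name, engine, query, timeout) for name, engine in engines.items()]
    per_engine = await asyncio.gather(*tasks)
    return [result for results in per_engine for result in results]


def run_demo():
    engines = {"fast": fast_engine, "offline": offline_engine, "slow": slow_engine}
    return asyncio.run(search_all(engines, "mouse"))
```

In this demo only the fast engine contributes a result; the offline engine and the slow engine are skipped without affecting the others.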
A fine salesperson might even offer "freshly crawled" topic-specific search engine Raspberry Pis, or crawl-index-diff subscriptions that are charged regularly, like the old-fashioned newspapers used to charge for their paper editions. The subscription license might explicitly state that users are ENCOURAGED to share the crawl results in a P2P fashion and that the crawl service is a convenience service with a fee of about 50 euro cents a month, paid all at once as 5€/year. The service would have some clever data download limits that keep the client's search engine up to date, but do not allow the service to be bankrupted by arbitrarily huge provider-to-subscriber upload traffic.
Competitors in that business actually help their clients to surf the web more privately (algorithms can vary), and they also counter censorship: if one competitor misses or censors some sites, it is enough for breaking the censorship that at least one other competitor lists the site in its search results. Provider-to-subscriber "Google bombs" and DoS attacks are also excluded, because people can use e-mail-spam-filter-like evaluation functions on all search engines to give multidimensional marks, points, to every returned search result of every search engine. If one provider turns out to be a bully and all others offer fine results, then that is easily visible to the end user. It works even if most of the providers are bullies, because then the majority of the providers get low marks. That means that supermafia or state based legal censorship, carried out by coercing the search-engine-diff providers or the search engine hosters, is not going to work.
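The spam-filter-like, multidimensional marking of providers could be sketched as follows. The two evaluation functions, the result record shape and the provider names are hypothetical examples; a real setup would let each user plug in their own evaluation functions.

```python
from statistics import mean


def relevance(result, query_terms):
    """Fraction of the query terms that appear in the result title."""
    title = result["title"].lower()
    return sum(term in title for term in query_terms) / len(query_terms)


def not_spammy(result, blocked_hosts):
    """1.0 unless the URL points at a host that the user has flagged."""
    return 0.0 if any(host in result["url"] for host in blocked_hosts) else 1.0


def score_providers(results_by_provider, query_terms, blocked_hosts):
    """Average every mark dimension over the results of each provider."""
    report = {}
    for provider, results in results_by_provider.items():
        if not results:
            # A provider that returns nothing gets the lowest marks.
            report[provider] = {"relevance": 0.0, "not_spammy": 0.0}
            continue
        report[provider] = {
            "relevance": mean(relevance(r, query_terms) for r in results),
            "not_spammy": mean(not_spammy(r, blocked_hosts) for r in results),
        }
    return report
```

Because the marks are computed on the user's side and per provider, a provider that floods its subscribers with junk stands out in the report no matter how many other providers behave the same way.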
Thank You for reading this post. :-)
Update on 2016_10_25.
The "A Dichotomy in the Intensional Expressive Power of Nested Relational Calculi augmented with Aggregate Functions and a Powerset Operator" (source) talks about the expressiveness of query languages.
The "Efficiently Supporting Edit Distance based String Similarity Search Using B+-trees" (source) is essentially about fuzzy string search.
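For readers unfamiliar with the term, fuzzy string search by edit distance can be illustrated with the plain dynamic-programming Levenshtein distance. This sketch does not reproduce the B+-tree indexing method of the paper; the function names are made up for this example and the search is a naive linear scan.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions that turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_search(needle, candidates, max_distance):
    """Return the candidates within max_distance edits of the needle."""
    return [c for c in candidates if edit_distance(needle, c) <= max_distance]
```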
Update on 2016_11_05.
Scientific papers, technical reports, theses, study books, etc. do seem to be essentially stories that describe how to do something. Probably the clearest examples of scientific papers that describe how to _do_something_ are the various computer science algorithm proposals and mathematics papers. The biology and geography related scientific works and study books tend to consist of stories that give an approximate description of something static. The reason why I use the word "static" here is that the statistical conclusions about some set of measurements, for example mouse leg lengths, do not change within the set of conditions that accompany the statistical conclusions. If some third party gets different statistical conclusions while carrying out the experiment and measurements the same way as the original party did, then either at least one of the parties is doing something wrong, is doing shoddy work, or the conclusions do not hold and the set of experiment conditions needs to be revised. From that point of view, from the perspective of software development, the set of scientific works looks a lot like a set of functions and data.
To simplify even further, data might be seen as a fragment of a function description. For example, take a function "kill_a_mouse(mouse)" with the additional constraints that it must be done within a fraction of a second and without a pool of blood, and that a needle with anesthetic may be used. Then the description of the mouse nervous system, the fact that a mouse has a thing called a "nervous system", is a fragment of the kill_a_mouse function description, because if the mouse gets the needle in its stomach, it will not die within a fraction of a second and the time constraint is not adhered to. A different kind of function might be healing: if a person has been submerged in water while swimming and is unconscious, then the function "revive(human)" contains sub-parts that depict the fact that a human has something called "lungs", that those need to be emptied of water and then ventilated artificially, because a human needs something called "oxygen", which is needed by something called a "brain", which is in some weird manner related to consciousness and cannot be fully amputated and thrown into a bio-waste bin regardless of circumstances.
With the further simplification that there exists only a single thread, the task of studying the prior work of others through study books, technical reports and scientific works reduces to the following: there are hashtables that contain data, and functions that each receive any number of hashtables, including zero, and return a single hashtable. The solution is a directed graph of instances of those functions such that the constraints are always adhered to. A parallel from chains of chemical reactions is that the constraints determine the maximum air pressure, the maximum temperature at any given moment, the maximum energy input, the maximum energy output, the absence of explosions in the form of a maximum energy output wattage, etc.
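A toy version of that search task is sketched below, under heavy simplifying assumptions. The hashtables are reduced to sets of available keys, the only constraint is a temperature limit per step, the result is a linear sequence of steps (the simplest case of a directed graph), and the step names and numbers in `STEPS` are invented for illustration, not real chemistry data.

```python
from collections import deque

# Each "function" is described by the keys it needs, the keys it produces
# and the peak temperature it reaches (invented illustration values).
STEPS = {
    "electrolyse_water": {"needs": {"water"}, "gives": {"hydrogen", "oxygen"}, "max_temp": 80},
    "burn_hydrogen": {"needs": {"hydrogen", "oxygen"}, "gives": {"heat"}, "max_temp": 2000},
    "fuel_cell": {"needs": {"hydrogen", "oxygen"}, "gives": {"electricity"}, "max_temp": 90},
    "steam_turbine": {"needs": {"heat", "water"}, "gives": {"electricity"}, "max_temp": 600},
}


def find_plan(initial, goal, steps, temp_limit):
    """Breadth-first search for a shortest sequence of steps that produces
    `goal` from `initial` while every step stays at or below `temp_limit`.
    Returns a list of step names, or None if no such sequence exists."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        known, plan = queue.popleft()
        if goal in known:
            return plan
        for name in sorted(steps):
            step = steps[name]
            if step["max_temp"] > temp_limit:
                continue  # violates the constraint
            if not step["needs"] <= known:
                continue  # inputs are not available yet
            new_known = known | step["gives"]
            if new_known not in seen:
                seen.add(new_known)
                queue.append((new_known, plan + [name]))
    return None
```

With a temperature limit of 100, `find_plan({"water"}, "electricity", STEPS, 100)` has to route around the hot steps and returns `["electrolyse_water", "fuel_cell"]`; with a limit of 50 no step is allowed and the function returns `None`.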
Essentially the task looks like a Haskell code generation project, with the exception that the programming language can be anything that supports functional programming "comfortably enough". (The "comfort" requirement is to counter the argument that any Turing machine supports functional programming, anything can be written in assembler and even assembler is not required, because the hex-editor can be used for editing the binaries directly.)
The main idea of the current update to this blog post is that maybe search engines that specialize in study books, scientific databases, scientific papers, theses and technical reports should not only accept queries about the content of the information artifacts, the functions and function fragments, but should implement the loop in the above image. The "query input" should be the set of constraints, the initial data and the initial set of functions. The result should not consist of a list of information artifacts, but of a set of function graphs that meet the various constraints. A nice toy case is classical chemistry combined with the simplification that instead of raw scientific papers and technical reports, the information artifacts have been encoded as functions of some internal domain specific language. If that works, then the next task is to write a converter that "digitizes", translates, the human language based, messy scientific articles to that formal, domain specific language. The historical sequence is:
- Paper/papyrus is scanned to digital images, files.
- The files are analyzed with optical character recognition (OCR) software.
- The next, future, step in library sciences is the translation of the OCR results to a formal language.
The classical chemistry based toy problem seems to be in line with the pattern that, just as Turing machines are very universal despite being very unpleasant to program, many problems can be described by tediously translating them to some graph representation. The phrase "Graph Minor Project" seems to recur (source) in various computer science related academic works.
Update on 2016_12_11
From privacytools.io I found a link to the searx.me search engine aggregator ("meta-searchengine"), and at its About page (archival copy) there was a reference to the Seeks project. I find that relying on censored or just dumbed-down search engines for search results is a mistake, but the Seeks project can still offer some insight into search engine aggregator related problems and their possible solutions.
Update on 2017_01_11
Given the data sizes, the commoncrawl.org might be a fine additional component of a custom search engine.
Update on 2017_09_23
Update on 2018_07_29
Update on 2018_11_09
Update on 2018_12_23