Information retrieval and filtering

This section focuses on topics closely related to searching for information: how search works and which modern technologies make it more effective. We also take a closer look at Web of Science (WoS) and Scopus, two key databases of scholarly texts.

As already mentioned, the ability to search for information is one of the most fundamental competencies. It is a distinguishing feature of membership in the information and knowledge society. The well-known saying that “knowledge is power” holds here, because that power is built on the ability to find information. This part of the competence framework therefore deserves due attention.

Defining information needs

It might seem that the first step in finding specific information is technically unchanged and stable: users must define their information need and, once it is expressed (for example, as keywords), can proceed with the search. There are several situations to distinguish. In the simplest, we know what we need; the definition is then explicit, and the search itself is a matter of craftsmanship. The more complicated (and probably more common) situation is that we cannot articulate the need, for example because the problem we are solving is too complex or because we cannot name a key concept for the topic. Least clearly defined of all is the case in which we do not even know that we need the information.

How can search engines help in this case? There is currently talk of a distinction between information retrieval and information discovery. Digital information curation makes it possible, for example, to create thematic collections that map a specific area. Such a collection opens up a particular area of human knowledge to users and gives them the basic orientation and insight needed to formulate a search query that ultimately leads to the goal.

The creation of such collections has long been discussed and reflected upon, but there are very few examples that are effectively usable. It is much more common for a teacher or librarian, for example, to create such a collection specifically for the needs of their users, who can then work with it and possibly supplement or develop it further.

Such collections can be part of the organization of information that each user builds for themselves. Notably, there are both theoretical and technical ways to show users what might be interesting and important to them, with personalized search being one of the most widely discussed.

https://en.wikipedia.org/wiki/Filter_bubble#/media/File:Filter_bubble_illustration.png

Search personalization

Search engines have a massive amount of information about users. Google, for example, knows how old we are, what our gender is, and what we enjoy and are interested in. Interestingly, it may know this more accurately than we do ourselves. If a person is logged in to a browser or search engine (in the case of Google Chrome, the line between the two is gradually blurring), the operator can gather information about what they search for, build models of what might interest them, and then project these models into the search results.

The goal of personalizing search results is to let users find what they need as quickly as possible. For example, if the search engine knows that a person is a gardener, a search for the word “copulation” will return the botanical term, not sexual intercourse. Similarly, for a chemist, the word “latex” will return the material, not the markup language.
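The disambiguation described above can be illustrated with a minimal sketch. It assumes a hypothetical setup in which every result carries a set of topic labels and a base relevance score, and the user profile is simply a set of interests; the names `Result`, `personalize`, and `boost` are invented for this illustration and do not describe how any real search engine works.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    topics: set[str]    # hypothetical topic labels attached to a document
    base_score: float   # relevance score before personalization


def personalize(results, interests, boost=0.5):
    """Re-rank results: add `boost` for every topic shared with the user's interests."""
    def score(result):
        return result.base_score + boost * len(result.topics & interests)
    return sorted(results, key=score, reverse=True)


results = [
    Result("LaTeX: a document markup language", {"typesetting", "software"}, 1.0),
    Result("Latex: a natural polymer material", {"chemistry", "materials"}, 0.9),
]

chemist = {"chemistry", "materials"}      # assumed interest profile
typesetter = {"typesetting"}              # assumed interest profile

print([r.title for r in personalize(results, chemist)])
print([r.title for r in personalize(results, typesetter)])
```

With these toy numbers the chemist sees the material first and the typesetter sees the markup language first, although both typed the same query. The sketch also hints at the criticism discussed in this section: the user never sees the profile or the boost that shaped the ranking.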

At the same time, search engines adjust the results based on a large number of parameters. In addition to the interests and demographic information already mentioned, personalization also reflects, for example, geographical location and the history of previous searches. The personalization of search results (and of content in general) is criticized from two basic positions. The first is that the parameters are not precisely known: one can become a victim of data modeling with little chance of doing anything about it. Considering that the information we find on the Internet influences our political decisions or, for example, our economic activity, it is clear that the enormous power in the hands of search engines, which is currently very weakly regulated, is an important security issue.

The second difficulty is that such content filtering leads to so-called information bubbles. Search engines naturally offer information that we want to consume and hide what does not interest us (for example, because it reflects a different political or religious opinion). This creates groups of people who share a specific narrative and become increasingly separated from the rest of society. The result is an increasingly divided and fragmented society that no longer shares common events and stories. Effective content delivery can thus have effects similar to those of propaganda or censorship.

This is one of the reasons why a part of society does not want to use personalized results (however fast and convenient they are) and prefers a less comfortable path. Search engines like DuckDuckGo strive to personalize search results as little as possible.

However, a political and legislative dimension also enters this area. In some countries, some content commonly available through a search engine may be illegal. In the EU, for example, this may be data about a person who does not want it to appear online; in China, it may involve the filtering and monitoring of specific sensitive topics, such as Xi Jinping’s likeness to Winnie the Pooh. Different states may therefore interfere with and change search results.

Modern technology

We would also like to touch on selected technologies that may significantly affect or change how people search in the future. Probably the most talked about is the semantic web, which should bring several fundamental changes. The first is working with natural language. The phrase “search query” implies that it should be a question to which the user wants an answer: not thousands of links to websites, but a structured sentence. He or she wants to know who Ai Weiwei is or when Václav Havel died. This change is technical but, above all, it concerns how we handle information: instead of working critically with sources, the user will often (though certainly not always) rely on the answer provided by the search engine. This affects what we look for, how we work with sources, and the forms and methods of learning.

The second effect, with equally far-reaching consequences, is the possibility of better interconnecting information. Semantic search makes it possible to provide relevant content automatically and to recommend information sources according to the needs of individual users. The result should be a network of linked documents that can be browsed, which makes study more efficient. Study then takes on a less linear character, and both aspects of the semantic web mentioned here bring it closer to discovery than to search.

A major current topic is sentiment analysis, that is, the analysis of the emotions expressed in individual documents or, for example, in posts on social networks. It could make it easier to assess the relevance of messages automatically and to refine search results.
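To make the idea concrete, here is a minimal sketch of the simplest form of sentiment analysis, a lexicon-based scorer. The word list and the function names `sentiment_score` and `label` are assumptions made for this illustration; real systems rely on much larger lexicons or on trained models.

```python
import re

# Tiny hand-made lexicon (an assumption for this sketch, not a real resource).
LEXICON = {
    "good": 1, "great": 2, "useful": 1,
    "bad": -1, "terrible": -2, "misleading": -1,
}


def sentiment_score(text):
    """Sum the lexicon values of all words in the text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("A great and useful article"))
print(label("Terrible, misleading headline"))
```

A search engine could, under these assumptions, use such a label as one more signal when filtering or ordering posts. The sketch ignores negation, irony, and context, which is exactly where practical sentiment analysis becomes difficult.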

However, as indicated in the discussion of personalization, the most important topic will probably be the delivery (clustering) of content. Content recommendations are familiar to all users of social networks: not everything can be displayed in a Twitter or Facebook feed, so a machine-learning algorithm decides what the user will most likely be interested in. In this respect, combining the semantization of the web with machine learning, emotion analysis, or artificial intelligence can be very interesting. Again, this will be a relatively complex issue of ethics and social integration. Still, it is undoubtedly a topic with great economic potential, particularly for finding and delivering content in specialized forms, such as music or video. In the future, this may also include scientific and research articles.

Human knowledge is closely related to how one builds social bonds, which then serve as sources of information. As information bubbles become ever more closed, other people may become an increasingly important source of factual information. An example is the ResearchGate social network, where thematic communities can be built to share scientific data, texts, and other artifacts. Only a user who is a member of the relevant scientific community or social structure can receive vital information. For some pedagogical approaches, such as connectivism, building personal learning networks and developing social capital is an essential part of the learning process. According to connectivism, knowledge arises in the network, and which network a user has access to is key to learning. Such a concept reflects current technical changes well, but at the same time brings possible social (and political) problems.

WoS and Scopus

Rather than weighing the advantages and disadvantages of the databases and how to search them, this section considers them in the same context as described above. Web of Science and Scopus are two large databases that are crucial for measuring the prestige of scientific outputs. Unlike ERIC, for example, which is intended for educational texts, they are multidisciplinary, yet their content is just as strongly filtered, which makes them comparable in some respects. Although both databases keep growing, neither is a random selection from libraries; each is a carefully curated database.

In terms of working with the results, both offer several interesting functions worth mentioning because they are closely related to the topics of this section. First of all, because of how they are constructed (not by indexing robots following ordinary links, but through editorial approval based on clear criteria), they can be used to prepare review studies, analyze topics, or find people or institutions working on a particular problem. They can also be used to map publishing behavior on particular topics or, for example, the spread of interest in a topic between disciplines. This information is essential for scientific work, and every university student should be able to work with it.

Both databases also provide journal-level metrics, i.e. numbers that evaluate the importance or prestige of a particular journal. The best known is the impact factor, which is calculated from Web of Science data; Scopus offers comparable indicators such as CiteScore. These metrics (as well as indexation itself) are linked to the financing of science, the evaluation of workplaces, and the career progression of individual scientists. This logically leads to the emergence of predatory journals and conferences that try to get into these databases in order to extract publication fees from authors.
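As a sketch of how the impact factor mentioned above is usually computed, the commonly cited two-year definition can be written as follows. The symbols are notation introduced only for this illustration: C_y(t) is the number of citations received in year y by items the journal published in year t, and N_t is the number of citable items the journal published in year t.

```latex
\mathrm{IF}_{y} = \frac{C_{y}(y-1) + C_{y}(y-2)}{N_{y-1} + N_{y-2}}
```

For a hypothetical journal that published 100 citable items in 2021 and 2022 combined, and whose items from those two years were cited 200 times during 2023, the 2023 impact factor would be 200 / 100 = 2.0. The formula makes clear why the number describes a journal as a whole and says little about any single article in it.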

Both databases are associated with another critical problem: the availability of scientific articles. People who want to access the full texts of articles indexed in the databases may be lucky in that their institution subscribes to the resources or that access is provided through Open Access. Otherwise, they face the dilemma of whether to pay roughly 10 to 60 US dollars for an article whose content is unknown to them, or simply go without access.

LibGen and Sci-Hub, two “pirate” services that allow many scientific articles to be downloaded for free, are a response to this problem. They argue that science should above all be open and that, as it is mostly publicly funded, it should be free for the public to access. Free access also benefits scientists, who gain a chance at more citations, and above all education. The argument based on the right to education, regardless of the economic possibilities of the institution where one studies, can be powerful. Paid databases create substantial social inequality between richer and poorer universities, and between those who study on their own and those who study at universities.

On the other hand, this harms journals, publishers, and database operators, who lose funding for editorial work. The question of how appropriate it is to use such tools is complex and reveals a significant gap between the ethical and legal aspects of the problem as a whole.