Latest News

Clique Percolation Method in Python
29 December 2020

Clique Percolation Method (CPM) is an algorithm for finding overlapping communities within networks, introduced by Palla et al. (2005, see references). This Python implementation first detects all cliques of size k, then builds a clique graph; each connected component of the clique graph corresponds to a community.

Algorithm
The algorithm performs the following steps:
1. find all cliques of size k in the graph
2. create a clique graph, where nodes are the cliques of size k
3. add an edge between two nodes (cliques) if they share k-1 common nodes
4. each connected component of the clique graph is a community

Example with k=3
The example graph contains the following cliques: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}. Each clique is a node in the clique graph, and two nodes are connected if the corresponding cliques share k-1 (2, in this case) nodes. As a result, the clique graph has two connected components, covering {1, 2, 3, 4} and {4, 5, 6, 7, 8}. Node 4 belongs to both communities; in other words, the graph contains two overlapping communities.

Usage
clique_percolation_method(graph, k = 3): implementation of the Clique Percolation Method

Test
import CliquePercolationMethod as cpm
cpm.test() # or cpm.test_karate()

Arguments
graph: the input graph (igraph object)
k: the size of the cliques (usually 3)

Returns
communities: a list of communities; each element is itself a list containing the nodes of that community.

Package Dependencies
This implementation requires igraph, which can be installed by running:
pip install python-igraph

Project info
GitHub repository: https://github.com/angelosalatino/CliquePercolationMethod-Python

References
Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043), 814-818.
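The four steps above can be sketched in plain Python. This is a brute-force illustration on the example graph, not the repository's implementation, which relies on igraph and scales far better:

```python
from itertools import combinations

def clique_percolation(edges, k=3):
    """Brute-force Clique Percolation Method, suitable for small graphs only."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    # Step 1: find all cliques of size k (every pair of members must be adjacent)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Steps 2-3: build the clique graph - cliques are adjacent iff they share k-1 nodes
    clique_adj = {c: set() for c in cliques}
    for c1, c2 in combinations(cliques, 2):
        if len(c1 & c2) == k - 1:
            clique_adj[c1].add(c2)
            clique_adj[c2].add(c1)
    # Step 4: each connected component of the clique graph is one community
    communities, seen = [], set()
    for c in cliques:
        if c in seen:
            continue
        stack, component = [c], set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            component |= cur
            stack.extend(clique_adj[cur])
        communities.append(sorted(component))
    return communities

# The example graph from the post (its edges are the union of the listed cliques).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8)]
print(clique_percolation(edges, k=3))  # [[1, 2, 3, 4], [4, 5, 6, 7, 8]]
```

Note that node 4 appears in both returned communities, which is exactly the overlapping behaviour CPM is designed to capture.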
ISWC2020 – BEST DEMO OF THE DAY AWARD
24 December 2020

Our demo paper “The AIDA Dashboard: Analysing Conferences with Semantic Technologies”, presented at the Posters and Demos track of the 19th International Semantic Web Conference (ISWC 2020), received the Best Demo of the Day award. At the SKM3 team, we are very honoured to receive such recognition, as it further proves the impact of our work.

Disclaimer: this post is identical to the one published on my research team’s group page.

Relevant publications:
Integrating Knowledge Graphs for Analysing Academia and Industry Dynamics
The AIDA Dashboard: Analysing Conferences with Semantic Technologies

Applying Machine Learning Techniques to Big Data in the Scholarly Domain
12 November 2020

On 12th November 2020, I was invited to give a talk at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020), organised and hosted by RUDN University (Moscow, Russia) jointly with Tampere University (Finland) and Brno University of Technology (Czech Republic). In this lecture, I presented the Computer Science Ontology framework (see Fig. 1) and showed how it has enabled us to perform several experiments in the field of Science of Science. Specifically, I described all five of its layers in detail.
I started from scholarly data and the big scholarly data sources available out there. Then, I presented the Computer Science Ontology (CSO) and how it was created with the Klink-2 algorithm. CSO is a large-scale ontology of research areas in the field of Computer Science and, being an ontology of scientific disciplines, it makes it possible to organise digital libraries (scholarly datasets) according to its constituent research topics.

Afterwards, I covered the several approaches available for topic classification, i.e., for connecting scholarly datasets with taxonomies/ontologies of science: topic models (e.g., LDA), supervised machine learning approaches, approaches based on citation networks and, finally, the CSO Classifier, which is based on Natural Language Processing techniques. The CSO Classifier enriches each research paper in a big scholarly dataset with its relevant topics drawn from the CSO ontology.

On top of these four initial layers, I showed several high-level applications: metadata extraction, with Smart Topic Miner, a tool used by Springer Nature for annotating and extracting metadata from books and conference proceedings; book recommendation, with Smart Book Recommender, a tool developed for Springer Nature to analyse its digital library and select the most appropriate books, journals, and proceedings to market at a scientific event; research trend forecasting, with a machine learning approach able to predict the impact of a topic in industry (i.e., whether it will receive more than 50 patents in the following 10 years); and a conference dashboard, a recent tool that we developed for assessing conferences across several parameters.

This framework has proved very successful and we are eager to explore new innovative solutions and contribute to the further development of the Science of Science field. The lecture was attended by more than 100 students.

Fig. 1: Computer Science Ontology Framework

Useful links
5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020): http://www.aptct.ru

Finalists at DataIQ 2020 Awards
01 October 2020

The Smart Topic Miner, an innovative state-of-the-art AI application for automating editorial processes at Springer Nature and improving access to scientific knowledge, has been shortlisted for the “Most Innovative use of AI” DataIQ 2020 Awards. Smart Topic Miner analyses scientific publications in Computer Science and classifies them with very high accuracy against a catalogue of 15,000 research topics. The automation of this complex task has produced both a 75% cost reduction and a dramatic improvement in metadata quality. As a result of the latter, downloads in Computer Science from the Springer Nature portal have increased by almost 10 million units, as search engines can now identify the relevant scientific content with much higher accuracy, thus benefitting readers of scientific literature all over the world. At the SKM3 team, we are very honoured to receive such great recognition, as it further proves the impact of our work and specifically of the Smart Topic Miner.

Disclaimer: this post is identical to the one published on my research team’s group page.

Media
From the KMi News website: OU wins two categories in DataIQ 2020 Awards

“Congratulations to the following teams for making the 2020 #DataIQAwards shortlist for ‘Most innovative use of AI’: AIScout, Merkle and Spirit Energy, NatWest, Open University, TUI Group. View the shortlist and register for the live reveal here: https://t.co/WbX4YUCowG” — DataIQ (@TheDataIQ), September 8, 2020
The AIDA Dashboard: Analysing Conferences with Semantic Technologies
19 September 2020

“The AIDA Dashboard: Analysing Conferences with Semantic Technologies” is a demo paper submitted to the Posters and Demos track of the 19th International Semantic Web Conference.

Simone Angioni1, Francesco Osborne2, Angelo A. Salatino2, Diego Reforgiato Recupero1, Enrico Motta2
1 University of Cagliari, Via Università 40, 09124 Cagliari
2 Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK

Abstract
Scientific conferences play a crucial role in the field of Computer Science by promoting the cross-pollination of ideas and technologies, fostering new collaborations, shaping scientific communities, and connecting research efforts from academia and industry. However, current systems for analysing research data do not provide a good representation of conferences. Specifically, these solutions do not allow users to track research trends, to compare conferences in similar fields, or to analyse the involvement of industrial sectors. In order to address these limitations, we developed the AIDA Dashboard, a tool for exploring and making sense of scientific conferences which integrates statistical analysis, semantic technologies, and visual analytics.

Figure 1: The Overview of the International Semantic Web Conference according to the AIDA Dashboard.

Link to the Dashboard: http://aida.kmi.open.ac.uk/dashboard/
Download paper from our institutional repository: http://oro.open.ac.uk/72293/

AIDA Dashboard
01 September 2020

The AIDA Dashboard is a web application that allows users to visualise several kinds of analytics about a specific conference (see Figure 1). The backend is developed in Python, while the frontend is in HTML5 and JavaScript.
The AIDA Dashboard builds on the Academia/Industry DynAmics knowledge graph (AIDA), a large knowledge base describing 14M articles and 8M patents in the field of Computer Science according to the research topics drawn from CSO. 4M articles and 5M patents are also classified according to the type of the authors’ affiliations (academy, industry, or collaborative) and 66 industrial sectors drawn from INDUSO, which was specifically designed to support AIDA. AIDA was generated by integrating several knowledge graphs and bibliographic corpora, including Microsoft Academic Graph (MAG), Dimensions, DBpedia, CSO, and the Global Research Identifier Database (GRID).

The AIDA Dashboard is highly scalable and allows users to browse the different facets of a conference across seven tabs: Overview, Citation Analysis, Organizations, Authors, Topics, Similar Conferences, and Industry.

Figure 1: The Overview of the International Semantic Web Conference according to the AIDA Dashboard.

Figure 1 shows the Overview tab. This is the main view of a conference: it provides introductory information about its performance, its main authors and organizations, and the conference rank in its main fields in terms of average citations per paper over the last five years. The Citation Analysis tab reports the evolution in time of several citation-based metrics, such as the impact factor and the average citations per paper. It also shows the evolution of the rank and the percentile of the conference in different fields. For instance, the Conference on Neural Information Processing Systems (NeurIPS) is currently the second conference in terms of average citations in Neural Networks, the third in Machine Learning, and the twelfth in Artificial Intelligence. This visualisation is typically used by Springer Nature editors to assess the performance of conferences within different communities and to identify emerging conferences.
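As an illustration of the kind of ranking behind the Citation Analysis tab, here is a minimal sketch of ordering conferences by average citations per paper. The conference names and figures below are invented for illustration and are not taken from AIDA:

```python
# Toy records: (conference, papers, citations) over a five-year window.
# All figures are made up for this example.
conferences = [
    ("Conf A", 1200, 54000),
    ("Conf B", 300, 21000),
    ("Conf C", 800, 16000),
]

def rank_by_avg_citations(records):
    """Rank conferences by average citations per paper, best first."""
    scored = [(name, citations / papers) for name, papers, citations in records]
    return sorted(scored, key=lambda item: item[1], reverse=True)

for rank, (name, avg) in enumerate(rank_by_avg_citations(conferences), start=1):
    print(rank, name, round(avg, 1))  # Conf B ranks first with 70.0
```

Note that the average favours small, high-impact venues: Conf B publishes far fewer papers than Conf A but still tops the ranking, which is why this metric is useful for spotting emerging conferences.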
The Organizations and Authors tabs show several analytics about the main institutions and researchers active in the conference. Organizations can be filtered according to their type (academia or industry) and are associated with their number of publications, citations, and average citations per paper. The researchers are associated with similar analytics, but also with their H-index and H5-index, in order to quickly identify high-impact researchers. Editors use this information to understand the quality of the researchers and organizations attracted by a conference. This is particularly important for assessing relatively young conferences that may not yet have developed a strong citation record. The Topics tab allows users to analyse topic trends in time. Specifically, it shows two selections of topics: frequent topics and fingerprint topics. The first is the set of topics that appear most frequently in the conference. The second is the set of the most distinctive topics of the conference, obtained by computing the difference between the topic distribution of the conference and that of the full dataset.

Databases and Information Systems in the AI Era: Contributions from ADBIS, TPDL and EDA 2020 Workshops and Doctoral Consortium
16 August 2020

“Databases and Information Systems in the AI Era: Contributions from ADBIS, TPDL and EDA 2020 Workshops and Doctoral Consortium” is the introductory chapter for the proceedings of the ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium, satellite events of the 2020 International Conference on Theory and Practice of Digital Libraries (TPDL 2020).

Ladjel Bellatreche, Fadila Bentayeb, Mária Bieliková, Omar Boussaid, Barbara Catania, Paolo Ceravolo, Elena Demidova, Mirian Halfeld Ferrari, Maria Teresa Gomez Lopez, Carmem S. Hara, Slavica Kordić, Ivan Luković, Andrea Mannocci, Paolo Manghi, Francesco Osborne, Christos Papatheodorou, Sonja Ristić, Dimitris Sacharidis, Oscar Romero, Angelo Antonio Salatino, Guilaine Talens, Maurice van Keulen, Thanasis Vergoulis, Maja Zumer

Abstract
Research on database and information technologies has been rapidly evolving over the last couple of years. This evolution has been led by three major forces: Big Data, AI and the Connected World, which open the door to innovative research directions and challenges, exploiting four main areas: (i) computational and storage resource modelling and organisation; (ii) new programming models; (iii) processing power; and (iv) new applications related to health, environment, education, cultural heritage, banking, etc. The 24th East-European Conference on Advances in Databases and Information Systems (ADBIS 2020), the 24th International Conference on Theory and Practice of Digital Libraries (TPDL 2020) and the 16th Workshop on Business Intelligence and Big Data (EDA 2020), held during August 25–27, 2020, in Lyon, France, together with their associated satellite events, aimed at covering some emerging issues related to database and information system research in these areas. The aim of this paper is to present these events, their motivations and topics of interest, and to briefly outline the papers selected for presentation. The selected papers are included in the remainder of the volume.

Download from DOI: https://doi.org/10.1007/978-3-030-55814-7_1

‘PhD Survival Guide’, Angelo and Akshika, The Open University
11 August 2020

This is a little project that my friend Akshika and I brought forward with the support of The Open University. In this video, we talk about the things we wish we had known before starting our PhD journey. Are you a PhD student, or are you planning to start soon? Give it a look.
ResearchFlow: Understanding the Knowledge Flow between Academia and Industry
28 July 2020

“ResearchFlow: Understanding the Knowledge Flow between Academia and Industry” is a conference paper submitted to Knowledge Engineering and Knowledge Management – 22nd International Conference, EKAW 2020.

Angelo Salatino, Francesco Osborne, Enrico Motta

Abstract
Understanding, monitoring, and predicting the flow of knowledge between academia and industry is of critical importance for a variety of stakeholders, including governments, funding bodies, researchers, investors, and companies. To this purpose, we introduce ResearchFlow, an approach that integrates semantic technologies and machine learning to quantify the diachronic behaviour of research topics across academia and industry. ResearchFlow exploits the novel Academia/Industry DynAmics (AIDA) Knowledge Graph in order to characterize each topic according to the frequency in time of the related i) publications from academia, ii) publications from industry, iii) patents from academia, and iv) patents from industry. This representation is then used to produce several analytics regarding the academia/industry knowledge flow and to forecast the impact of research topics on industry. We applied ResearchFlow to a dataset of 3.5M papers and 2M patents in Computer Science and highlighted several interesting patterns. We found that 89.8% of the topics first emerge in academic publications, which typically precede industrial publications by about 5.6 years and industrial patents by about 6.6 years. However, this does not mean that academia always dictates the research agenda. In fact, our analysis also shows that industrial trends tend to influence academia more than academic trends affect industry.
We evaluated ResearchFlow on the task of forecasting the impact of research topics on the industrial sector and found that its granular characterisation of topics significantly improves performance with respect to alternative solutions.

Paper download
Download from ORO: http://oro.open.ac.uk/71665/
Download from SKM3: https://skm.kmi.open.ac.uk/skm-data/uploads/2020/08/19-Salatino-camera-ready.pdf

Additional files
ResearchFlow integrates data from publications, patents, and organisations in order to characterise each topic according to the frequency in time of its i) publications from academia, ii) publications from industry, iii) patents from academia, and iv) patents from industry. This representation is then used to produce several analytics regarding the academia/industry knowledge flow and to forecast the impact of research topics on industry.

The dataset that allowed us to develop ResearchFlow and perform this analysis is also available; you can download it from the button below. The archive contains a folder of JSON files. Each file represents a topic and contains a JSON dictionary with four main keys: ‘papers-education’, ‘papers-company’, ‘patents-education’, ‘patents-company’. Each key is associated with a list of 29 values: the number of documents of that signal per year, from 1990 to 2018 (both included). The evaluation files and the dataset are available here: http://doi.org/10.21954/ou.rd.12805307

Angelo and Edu Frias on the PhD Defence
09 June 2020

Going through the final stages of your PhD? No worries! We’ve got you covered. In this episode, I invited Dr Eduardo Frias-Anaya, who defended his thesis two weeks ago at The Open University in the UK, to talk about the final stages of a PhD.
We talk about writing up the dissertation, preparing for the defence, the viva, and all the challenges behind the whole process, including the impact of COVID-19.

Where to listen? https://open.spotify.com/episode/4MzRF5AgnGphHiwSpafczS?si=ywlxTOf_RdKyGJkXF3y_Rg
This podcast is also available on many other platforms: Anchor.fm, Breaker, Google Podcasts, Pocket Casts, RadioPublic.

Integrating Knowledge Graphs for Analysing Academia and Industry Dynamics
04 June 2020

“Integrating Knowledge Graphs for Analysing Academia and Industry Dynamics” is a workshop paper submitted to the 1st Workshop on Scientific Knowledge Graphs, held in conjunction with the 24th International Conference on Theory and Practice of Digital Libraries (TPDL 2020).

Simone Angioni1, Francesco Osborne2, Angelo A. Salatino2, Diego Reforgiato Recupero1, Enrico Motta2
1 University of Cagliari, Via Università 40, 09124 Cagliari
2 Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK

Abstract
Academia and industry are constantly engaged in a joint effort to produce the scientific knowledge that will shape the society of the future. Analysing the knowledge flow between them and understanding how they influence each other is a critical task for researchers, governments, funding bodies, investors, and companies. However, current corpora are unfit to support large-scale analysis of the knowledge flow between academia and industry, since they lack a good characterization of research topics and industrial sectors. In this short paper, we introduce the Academia/Industry DynAmics (AIDA) Knowledge Graph, which characterizes 14M papers and 8M patents according to the research topics drawn from the Computer Science Ontology. 4M papers and 5M patents are also classified according to the type of the authors’ affiliations (academy, industry, or collaborative) and 66 industrial sectors (e.g., automotive, financial, energy, electronics) obtained from DBpedia.
AIDA was generated by an automatic pipeline that integrates several knowledge graphs and bibliographic corpora, including Microsoft Academic Graph, Dimensions, English DBpedia, the Computer Science Ontology, and the Global Research Identifier Database.

Paper download
Download from ORO: http://oro.open.ac.uk/70715/

Angelo and Tina on the PhD journey
01 June 2020

Have you ever thought about doing a PhD? Are you currently doing one, or have you finished? I have. It’s painful and rewarding. In this episode, I talk with Dr Tina Papathoma (@aktinaki), a post-doctoral researcher in Technology Enhanced Learning, about the PhD journey and the impact it has on us. We talk about the PhD process, submission, and viva in an honest discussion of a four-year journey.

Where to listen? https://open.spotify.com/episode/4L1PJsmqaMMzmBsJTdsAuc?si=lnECGB86Sce8KZ57pPaBpA
This podcast is also available on many other platforms: Anchor.fm, Breaker, Google Podcasts, Pocket Casts, RadioPublic.

Academia/Industry DynAmics (AIDA) Knowledge Graph
22 May 2020

Academia and industry share a complex, multifaceted, and symbiotic relationship. Analysing the knowledge flow between them, understanding which directions have the biggest potential, and discovering the best strategies to harmonise their efforts is a critical task for several stakeholders. While research publications and patents are an ideal medium through which to analyse this space, current datasets of scholarly data cannot be used for such a purpose, since they lack a high-quality characterisation of the relevant research topics and industrial sectors. We introduce the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 14M publications and 8M patents according to the research topics drawn from the Computer Science Ontology.
4M publications and 5M patents are further characterised according to the type of the authors’ affiliations (academy, industry, or collaborative) and 66 industrial sectors (e.g., automotive, financial, energy, electronics) organised in a two-level taxonomy. AIDA was generated by an automatic pipeline that integrates data from Microsoft Academic Graph, Dimensions, DBpedia, the Computer Science Ontology, and the Global Research Identifier Database. It is publicly available under CC BY 4.0 and can be downloaded as a dump or queried via a triplestore. The AIDA triplestore currently counts 1,025,978,149 (~1B) triples.

Links:
AIDA website: http://aida.kmi.open.ac.uk
AIDA SPARQL endpoint: http://aida.kmi.open.ac.uk/sparql/

Just Chilling with Angelo Salatino
08 May 2020

I like to tell stories. I like to explore people’s stories. I feel there is always something to learn from other people’s experiences. I will use this space to share with you, in the most honest and genuine way, all the stories worth listening to. I hope you’ll like it. Stay tuned.

All my podcasts are available on several platforms:
Anchor – https://anchor.fm/angelosalatino
Breaker – https://www.breaker.audio/just-chilling-with-angelo-salatino
Google Podcasts – https://www.google.com/podcasts?feed=aHR0cHM6Ly9hbmNob3IuZm0vcy8xZjg2YWI2OC9wb2RjYXN0L3Jzcw==
Pocket Casts – https://pca.st/s0yr11ui
RadioPublic – https://radiopublic.com/just-chilling-with-angelo-salatin-8jyMn5
Spotify – https://open.spotify.com/show/3ZeNeFZlaHau1SNX2nKFBE

I launched an online course on Instagram, here are my lessons learned
01 May 2020

THIS IS A DRAFT – WORK IN PROGRESS

During the COVID-19 outbreak, many people relied more and more on web technologies, such as video calls and social networks, to fulfil their social needs. I decided to design and release an online course on Instagram, so that users could be engaged in learning activities while consuming content.
In this blog post, I will describe how I designed the course and the challenges I faced in producing the learning material, as well as my considerations about the whole idea.

1 Introducing the idea
At the beginning of 2020, the whole world witnessed the pandemic outbreak of COVID-19, an infectious disease that affects the human respiratory system and can lead to serious consequences. In about two months, this virus spread like wildfire across the globe and, in order to cope with healthcare system capacity, many governments enforced lockdowns and social distancing, forbidding people to leave their houses unless strictly necessary. The natural reaction for people was to be physically distant but socially connected. Indeed, confined in their households and facing loneliness and boredom, people relied more and more on web technologies, such as video calls and social media, to fulfil their social needs. They moved their face-to-face activities online, not only their working duties but also their social activities, such as aperitifs, birthday celebrations, training sessions, karaoke and others. They gathered in large chat rooms, sometimes with more than 10 invitees, in video conferences using software like WhatsApp, Houseparty, Zoom, Jitsi, Webex, Skype, Microsoft Teams and others. This activity was paired with seeking entertainment on social media, such as Instagram, Facebook, TikTok, Snapchat and Twitter. Social media belong to the category of Web 2.0 technologies, which give users the power to be simultaneously consumers and producers, often defined as prosumers. It is up to the users to decide how much they want to produce and consume on such platforms. To this end, considering the amount of time we end up spending on these platforms, it comes naturally to ask: can we use them in a more constructive way and for social good?
Or, in other words, can we use them to exchange knowledge from an educator to a learner? Urged by curiosity, I started designing and creating a video course for Instagram: Fun with Barese. In this video course, I teach the figurative expressions, or idioms, that make Barese, the language from the place I come from, very colourful and full of emotion. In general, this experiment would i) help me understand how easy it is to use a platform like Instagram for developing an online course that enables students to learn, and ii) help me disseminate the Barese culture.

The whole idea behind the course is that students navigate through a set of multiple stories, in a slideshow format. Each story consists of a mix of media: video, pictures, animations, and sounds. Among the several features: i) stories on Instagram scroll automatically after 15 seconds, although users are allowed to move backward and forward by tapping on the left and right side of the screen respectively, ii) stories can be paused by tapping and holding on the screen, and iii) stories are in portrait mode, typically 1920×1080 pixels. This combination of features defines the current advantages and constraints, and thus the challenges, set by the platform. For instance, as each story is displayed for a limited amount of time, we need to correctly dose the amount of information in each story so that it can be digested in up to 15 seconds. However, there are several other considerations that a course designer needs to take into account when producing new learning material for online courses. In the next section, I will describe in more detail some principles that allowed me to design this course. In section 3, I will describe the facilities that Instagram itself offers to create content. In section 4, I will describe how I evaluated this course in terms of student retention. Finally, in section 5, I will outline my lessons learned.
2 Design
When designing an online course, educators should take into account certain guidelines to make pedagogically informed decisions. Indeed, during the design of this course I was heavily influenced by Sharples’ (2015) guidelines. One of these guidelines suggests having an idea of the intended learners. In this case, my intended learners are those who speak English and are planning to go on holiday to Bari, for whom knowing some idioms can make their stay even more pleasurable. Among my intended learners there are also people from Bari who certainly know the language but are unaware of certain idioms. This extends also to second-generation immigrants living in other countries who learned Barese from their parents and want to improve it. It follows that the learning objective of the course is to teach the idioms available in this language. Another suggested guideline is the support of a narrative. In this course, each idiom is presented through at least two stories. The first story shows the idiom in both Barese and English, with a narrator’s voice that reads the idiom in both languages. This is then followed by another story which, in the form of storytelling, provides more context: it explains how the idiom originated, when it applies, and so on. Following these guidelines, I had a clear idea of how to structure the course. Now it was a matter of understanding whether Instagram provides enough support for the creation of such learning material and, if not, what kind of technologies are available out there.

3 Which technologies did I use to prepare this learning material?
To better understand what kind of technologies are required to prepare the learning material, we will start by analysing the tools offered by Instagram itself to produce stories, which are the venue of this mini course. However, even if Instagram provides a quite versatile tool to produce content, it is still limited to a set of functionalities that hinders the full power of human creativity.
Indeed, to take full advantage of one’s own creativity, one needs to complement Instagram with other technologies, which can be used for free, with subscriptions giving access to more advanced functionalities. As a social network, Instagram can be used either on the phone, with an app for both iOS and Android, or via its web interface. However, users can create stories only from the official phone app.

Figure 1: Screenshot of the Instagram Stories editor from a phone screen.

In general, Instagram offers quite a few tools, as shown in Figure 1, for creating stories and, in this particular case, producing learning material. Stories can be created starting from either pictures or videos, to which a user can add more content such as text, GIFs (preloaded from Giphy), quizzes, polls, music (provided by Instagram), and open questions; it is also possible to draw using a pencil or other markers. In addition, a story can be created starting from just a piece of text, poll, question, or quiz, without the need for a picture or video as background. In this case the user can choose among some preloaded backgrounds, which can be either solid-coloured (black, red, green, etc.) or patterned. Another way to create Instagram stories is through a live broadcast. This feature is an interactive way to share what one is doing while doing it. Live videos also allow you to engage with your followers who, while watching, can interact by texting you or asking questions. In addition, you can invite up to one person among your current viewers to join your live; this splits the screen in two parts to show the two simultaneous live streams. Although the facilities offered by Instagram make it quite a versatile publishing tool, it poses a number of challenges when you need to produce rather complex and elaborate content, like online mini courses.
For instance, in the current design of this mini course, the possibility of sharing pictures describing an idiom together with an audio message that reads the idiom would have come in particularly handy, both to help the learner pick up the sounds of a new language and to increase the accessibility of the course itself. Another challenge for developing learning material is posed by the limited size of the mobile screen, which makes this goal quite impractical, even with today's larger-screen phones. In the complexity vs. usability trade-off, app developers tend to prioritise usability at the expense of complexity. It is rather hard to create something complex on mobile phones. This is worsened by the single-window (or single-app) constraint, whereas on our laptops we tend to work with a number of different apps simultaneously. Inevitably, to create the learning material I had to resort to additional technologies on my laptop and use Instagram merely as a platform for publishing, not creating. 3.1 Additional technologies This section briefly introduces some of the technologies that supported the creation of the learning material. As each story was built as a video with one image and an audio track, I mainly used Adobe Spark to create the images and iMovie on macOS for recording and finalising the videos. I used the Adobe Spark web app to create images as it does not require any design experience. It works in any browser and makes it quite straightforward to create images using text, load backgrounds, add icons, use templates, and organise the content in layouts; if you are using it through the mobile app, you also have the possibility to add animations. All of this is free of charge, whereas a premium subscription gives access to more advanced features, which I did not think worth paying for to create this course. Once all the pictures were created, I imported them into iMovie for video editing.
In this process, I recorded the audio of each idiom in both languages (Barese and English), as well as their English explanation, and wherever needed I also added some copyright-free music. For publishing stories on Instagram there are web tools that can be used without necessarily going through the app on the phone. For instance, one can use Later.com, which allows users to plan, schedule and analyse posts. However, the free version is very limited and only allows posting feed pictures. Stories can be posted with a subscription that starts from £7.5 per month. Eventually, I decided to transfer all the created videos to my phone and publish them one after the other. 4 Evaluation Unfortunately, Instagram is not designed to be a learning platform and it does not provide tools for assessing the quality of a course. Nonetheless, there are a few analytics that can be extracted for this purpose, and here I show the retainment curve extracted four days after the launch of the course. These values report the number of viewers each story obtained during the 24h of its life. Even if all the stories are gathered within one of my highlights and therefore can be explored after the 24h from their release, Instagram does not let its users see who has been viewing them after this time frame. On Instagram you can have different profile levels: personal, creator, and business. Currently my account is set to creator, which provides some additional analytics compared to the personal profile, allowing creators to get more insights on how their followers browsed through their stories, such as whether they went i) backwards, ii) forward, or iii) exited. As a preliminary analysis, and also to have a glimpse of the number of learners retained during the single episodes, I report the following line charts, where we can observe the number of unique viewers (y-axis) across the different stories (x-axis) published in the three episodes released thus far.
Specifically, Figure 2 shows the number of unique viewers for the 25 stories contained in the first episode: 224 for the first story and 108 for the last one (-51.7%). Figure 3 instead shows the number of viewers for the 24 stories within the second episode: 138 viewers for the first story and 68 for the last one (-50.7%). Finally, Figure 4 shows the retainment curve of the third episode, which is reduced to 13 stories: 125 for the first one and 72 for the last one (-42.4%). Figure 2. Retainment curve of the first episode. Figure 3. Retainment curve of the second episode. Figure 4. Retainment curve of the third episode. One of the reasons why there is a substantial drop (~50%) from the first to the last story of each episode may lie in the attitude of the learner. Typically, when enrolling in a higher-education course, learners have an active role that can be motivated by their interest in learning new subjects. On Instagram, users are definitely engaged in watching all the stories from the accounts they follow, but they are unaware of the content that will be displayed next. In this case, when they reach the course I prepared, they might watch the first stories out of curiosity, but they might drop out because they are not actually interested. Perhaps a dedicated account for the course, with a selected pool of interested learners, could be an interesting solution to adopt. To have a better understanding of the kind of learners who were most interested, I organised them in three main groups: i) people who are either from Bari or know Barese, ii) the rest of Italy, who know Italian and can perhaps understand a bit of Barese, and iii) the rest of the world, whom I assume know at least English in order to be able to follow the course. In Table 1 and Table 2 we can see the different users organised in those three groups, respectively for the second and third episode.
I did not manage to collect the data about the first episode as Instagram does not allow us to see the viewers after 48h. From these tables, we can observe that people from Bari, where Barese is spoken, were the most interested viewers and were also the viewers that I lost the least (-20% in Table 2). Technically, from Table 1, we can see that people in the second group (rest of Italy) were those whom I lost the least, but with only 15 samples we cannot consider this result statistically significant.

Table 1. Learners from the second episode (24 stories).
        First Story   Last Story   Diff
Bari    75            37           -50.67%
Italy   15            9            -40.00%
Rest    35            11           -68.57%

Table 2. Learners from the third episode (13 stories).
        First Story   Last Story   Diff
Bari    75            60           -20.00%
Italy   15            5            -66.67%
Rest    29            10           -65.52%

5 Conclusion I created "Fun with Barese" as a way to have some fun and with the mission of exporting and spreading my culture and my language to the world. I embarked on the challenge of using Instagram as it is one of the most widely used platforms for entertainment. My idea was to support the process of learning while we enjoy our time on social media. This has been challenging given the several constraints posed by a platform that was not designed for learning in the first place. Nonetheless, it made me spend time designing the most appropriate way to organise the content and identifying the technologies that would support me through the creation of such learning material. I believe there are interesting insights that can be gathered from this experience, which can support both my future self when designing higher education courses and those who want to embark on a similar experience. 6 Acknowledgements 7 References...
Ontology Extraction and Usage in the Scholarly Knowledge Domain 06 April 2020 "Ontology Extraction and Usage in the Scholarly Knowledge Domain" is our contribution to the IOS Press edited book "Applications and Practices in Ontology Design, Extraction, and Reasoning" by Giuseppe Cota, Marilena Daquino and Gianluca Pozzato. Angelo Salatino, Francesco Osborne, Enrico Motta Abstract Ontologies of research areas have proven to be useful in many applications for analysing and making sense of scholarly data. In this chapter, we present the Computer Science Ontology (CSO), which is the largest ontology of research areas in the field of Computer Science, and discuss a number of applications that build on CSO to support high-level tasks, such as topic classification, metadata extraction, and recommendation of books. Paper Download Download paper from arXiv: https://arxiv.org/abs/2003.12611... 1st Workshop on Scientific Knowledge Graphs (SKG2020) 17 February 2020 web: https://skg.kmi.open.ac.uk, twitter: @skgworkshop Held in conjunction with TPDL2020, 25th-28th August 2020, Lyon, France Overview In the last decade, we have experienced an urgent need for a flexible, context-sensitive, fine-grained, and machine-actionable representation of scholarly knowledge, and for corresponding infrastructures for knowledge curation, publishing and processing. Such technical infrastructures are becoming increasingly popular in representing scholarly knowledge as structured, interlinked, and semantically rich Scientific Knowledge Graphs (SKGs). Knowledge graphs are large networks of entities and relationships, usually expressed in W3C standards such as OWL and RDF. SKGs focus on the scholarly domain and describe the actors (e.g., authors, organizations), the documents (e.g., publications, patents), and the research knowledge (e.g., research topics, tasks, technologies) in this space, as well as their reciprocal relationships.
These resources provide substantial benefits to researchers, companies, and policymakers by powering several data-driven services for navigating, analysing, and making sense of research dynamics. Some examples include Microsoft Academic Graph (MAG), Open Academic Graph (combining MAG and AMiner), ScholarlyData, PID Graph, Open Research Knowledge Graph, OpenCitations, and the OpenAIRE research graph. Current challenges in this area include: i) the design of ontologies able to conceptualise scholarly knowledge, ii) the (semi-)automatic extraction of entities and concepts, the integration of information from heterogeneous sources, the identification of duplicates, and the discovery of connections between entities, and iii) the development of new services using these data, which allow users to explore this information, measure research impact, and accelerate science. This workshop aims at bringing together researchers and practitioners from different fields (including, but not limited to, Digital Libraries, Information Extraction, Machine Learning, Semantic Web, Knowledge Engineering, Natural Language Processing, Scholarly Communication, and Bibliometrics) in order to explore innovative solutions and ideas for the production and consumption of Scientific Knowledge Graphs (SKGs). Topics We encourage the submission of papers covering, but not limited to, one or more of the following topics:
- Methods for extracting entities (methods, research topics, technologies, tasks, materials, metrics, research contributions) and relationships from research publications
- Methods for extracting metadata about authors, documents, datasets, grants, affiliations and others
- Data models (e.g., ontologies, vocabularies, schemas) for the description of scholarly data and the linking between scholarly data/software and the academic papers that report or cite them
- Description of citations for scholarly articles, data and software and their interrelationships
- Applications for the (semi-)automatic annotation of scholarly papers
- Theoretical models describing the rhetorical and argumentative structure of scholarly papers and their application in practice
- Methods for quality assessment of scientific knowledge graphs
- Description and use of provenance information of scholarly data
- Methods for the exploration, retrieval and visualization of scientific knowledge graphs
- Pattern discovery of scholarly data
- Identification of scientific claims from textual content
- Automatic or semi-automatic approaches to making sense of research dynamics
- Content- and data-based analysis of scholarly papers
- Automatic semantic enhancement of existing scholarly libraries and papers
- Reconstruction, forecasting and monitoring of scholarly data
- Novel user interfaces for interaction with papers, metadata, content, software and data
- Visualisation of related papers or data according to multiple dimensions (semantic similarity of abstracts, keywords, etc.)
- Applications for making sense of scholarly data
Submission Submissions are welcome in the following categories:
- Full papers presenting original work (12 pages incl. references, LNCS format)
- Short papers presenting original work (6 pages incl. references, LNCS format)
Papers can be submitted via EasyChair: https://easychair.org/conferences/?conf=skg2020 Submissions will be evaluated based on originality, significance, technical soundness and clarity. Accepted papers (after blind review by at least 3 experts) will be published in the Springer CCIS series.
The best paper (according to the reviewers' scores) will be invited to a special issue of the journal Quantitative Science Studies (MIT Press). At least one of the authors of each accepted paper must register for the workshop for the paper to be included in the workshop proceedings. Important Dates
- Paper submission: April 30, 2020
- Notification of acceptance: May 22, 2020
- Camera-ready due: June 5, 2020
- Workshop day: August 25, 2020
Author instructions All paper submissions have to be in English and submitted as a PDF file. Authors should consult Springer's authors' guidelines and use their proceedings templates, either for LaTeX or Word, for the preparation of their papers. Springer encourages authors to include their ORCIDs in their papers. In addition, the corresponding author of each paper, acting on behalf of all the authors of that paper, must complete and sign a Consent-to-Publish form. The corresponding author signing the copyright form should match the corresponding author marked on the paper. Once the files have been sent to Springer, changes relating to the authorship of the papers cannot be made. Note that the paper size limit must be respected. Camera-ready papers that do not comply with the page limit when formatted using the LNCS style may be rejected. Organisation
- Andrea Mannocci, Italian Research Council (CNR), Pisa (IT)
- Francesco Osborne, The Open University, Milton Keynes (UK)
- Angelo Salatino, The Open University, Milton Keynes (UK)
More information about SKG2020 is available at https://skg.kmi.open.ac.uk Contact: [email protected] Computing Automorphic Numbers 03 February 2020 In our lab, we like to tease each other with fancy riddles. In our kitchen, we have a large wooden box, filled with some chocolates and locked by a 4-digit lock (Fig. 1). Those who crave some sugar just need to solve the riddle and unlock the box.
The last few riddles involved a particular family of numbers, called automorphic numbers, and the complexity of such riddles was increasing with the size of those numbers in terms of the number of digits. For instance, in the last riddle, we were asked to compute a number with 44444 digits, requiring enormous computational power. Here, I will show some properties of automorphic numbers as well as some insights that allowed me to develop a rather efficient algorithm for cutting down the computational complexity of the problem. In particular, I will initially describe three recent riddles and how the first two were solved. Then, I will walk you through the definition of automorphic numbers and some of their properties. Lastly, I will show how I developed the algorithm to solve the third riddle. Figure 1. The locked box Riddles It all started a few weeks ago, with a simple riddle (Riddle #1): "there is one four-digit whole number n, where the last four digits of n² = n". Before I could even put my hands on this riddle, the solution was already spreading in the lab. It was 9376, because it is the only such number whose square contains itself: 9376² = 87909376. To find this solution, some people relied on their maths knowledge: only numbers ending with 0, 1, 5 and 6 end with the same digit when squared. This allowed them to cut down the solution space. Some others relied on the brute-force approach: it is pretty quick to square 10000 numbers (actually 9000, as the first 1000 have only 3 digits) and see if they are contained in their squares. From there, it escalated quickly (see Fig 2). Figure 2. Riddles There was a new riddle (Riddle #2): "there is one 20-digit whole number n, where the last 20 digits of n² = n. The first 4 digits open the box". Brute force would have solved the riddle, as there is a finite number of solutions, but it was claimed it would have required "2852 years (assuming testing one 20-digit number takes 1 nanosecond).
If you want the answer within an hour, you'd need 25 million PCs and a 1.21 Jiggawatts power supply." If brute force is not the way, there must be something else that allows retrieving this number. This challenge captivated me. However, it required going back to maths books, as there certainly was a property of this kind of number that I was not aware of. I started googling. I soon discovered that such numbers are called automorphic. I found that, for a given number of digits, there are only two automorphic numbers, one ending in 5 and one in 6 (except for the 1-digit automorphic numbers, which also include 0 and 1). In the next section, I will describe this kind of number in more detail. Then, I also found a list of pre-computed automorphic numbers (archived version). This list contained all automorphic numbers ending with both 5 and 6 up to 100 digits. For 20 digits, there is only one number, which ends with 5: 92256259918212890625. 92256259918212890625² = 8511217494096854352392256259918212890625 Then, the first 4 digits (9225) opened the lock. Eventually, I did not have to waste the power of 25 million PCs, when I could simply rely on a larger network of PCs with superior knowledge: the Internet. This time it escalated exponentially. There was again a new riddle (Riddle #3): "How many k ∈ {1,2,…,44444} are there such that there is exactly one k-digit whole number n, where the last k digits of n² = n?" Allow me to rephrase it and make it clearer. Given a finite set of digits, from 1 to 44444, for how many values of k is there exactly one k-digit automorphic number? In the two previous riddles, we saw that for 4 and 20 digits there is only one automorphic number. However, 4 and 20 are not the only such values. In the table below, we can see that for 4, 5, 12, 13 and 20 digits there is exactly one k-digit automorphic number (see Table 1). From 1 to 100 digits, 33 values of k have only one automorphic number; from 1 to 1000, we can find 216. Then, how many from 1 to 44444? Table 1.
List of automorphic numbers for a given number of digits.

Digits   Automorphic numbers     Quantity
1        0, 1, 5, 6              4
2        25, 76                  2
3        376, 625                2
4        9376                    1
5        90625                   1
6        109376, 890625          2
7        2890625, 7109376        2
…        …                       …
12       918212890625            1
13       9918212890625           1
…        …                       …
20       92256259918212890625    1

This time, there was no list: those numbers needed to be computed. In the next section, I will show some insights and strategies that allowed me to cut down the computational complexity of the problem and therefore solve this third riddle. What is an Automorphic Number? An n-digit number is said to be automorphic if the last n digits of its square are the same as the original number. For instance, 5² = 25, 6² = 36, 25² = 625, 76² = 5776 and many others. There are infinitely many automorphic numbers and we can find them also with a large number of digits, such as 18212890625² = 331709384918212890625 or 81787109376² = 6689131260081787109376. In particular, we can find four 1-digit automorphic numbers: two trivial ones, 0² = 0 and 1² = 1, and two nontrivial ones, 5² = 25 and 6² = 36. However, for numbers with more than one digit (n > 1), there are at most two automorphic numbers, meaning that from 10 to 99 there are only two automorphic numbers: 25 and 76. Also from 100 to 999, there are only two automorphic numbers: 376 and 625. However, from 1000 to 9999 we can find only 9376. Properties of Automorphic Numbers In this section, I will show some characteristics of automorphic numbers and then I will make a few considerations on how those particular properties can be exploited to cut down the complexity of the problem. Nine's complement The first property of these automorphic numbers that catches the eye is their nine's complement, for all digits except the last. The nine's complement of a given number is formed by replacing each digit with nine minus that digit. In Fig. 3, below, you can observe the two 30-digit automorphic numbers.
For each digit, except the last (in the red rectangle), they complement each other (see green rectangles): 3->6, 5->4, 2->7, 1->8, and 9->0. Figure 3. Example of nine's complement on both 30-digit automorphic numbers. Considering this complementarity, we just need to compute the sequence of numbers that end with 5. The sequence of 6 can be generated by subtracting each digit from 9 and replacing the last digit with 6. Formula There is a formula that allows us to compute the two numbers: \begin{aligned} aut_{5}^{n} &= 5^{2^{n}} \bmod 10^{n} \\ aut_{6}^{n} &= 10^{n} - aut_{5}^{n} + 1 \end{aligned} In particular, the automorphic number ending with 5 and having n digits (aut_{5}^{n}) is the remainder of 5 to the power of 2^n divided by 10^n. Because these numbers complement each other, the automorphic number ending with 6 and having n digits (aut_{6}^{n}) is computed by subtracting the first (aut_{5}^{n}) from 10^n + 1. In Python, we can use these formulas within a for-loop to retrieve the list of automorphic numbers up to 20 digits:

limit = 20
for i in range(2, limit + 1):
    a = 5**(2**i) % 10**i
    b = 10**i - a + 1
    print("5:", a)
    print("6:", b)

However, this Python code only works up to a certain number of digits: 5^{2^{i}} grows doubly exponentially and quickly becomes too large to compute before the modulo is applied. Nonetheless, this tells us that to identify automorphic numbers we can simply apply such formulas without the need for brute force. You can read more info from here. Extension Another interesting property of automorphic numbers is that they extend themselves, including the numbers with fewer digits. Specifically, each n-digit automorphic number contains the previous number (n-1 digits) with a digit prepended. As you can see from Fig. 4, the automorphic number of 9 digits contains the automorphic number of 8 digits and prepends the digit 2: 212890625.
Similarly for the automorphic number of 10 digits, which contains the one of 9 digits and prepends an 8: 8212890625. This pattern is valid for any n-digit automorphic number, as shown in Fig. 4. Figure 4. Representation of the extension property This characteristic of automorphic numbers tells us that we can potentially start from a seed number (let's say 12890625 as in the example below) and iteratively prepend one digit from 0 to 9, finding which of them produces a square that includes itself, as shown in Table 2.

Table 2. Examples when prepending a digit to the current automorphic number.

Prepending to 12890625    Squared               Automorphic?
012890625²                166168212890625       –
112890625²                12744293212890625     –
212890625²                45322418212890625     OK
312890625²                97900543212890625     –
412890625²                170478668212890625    –
512890625²                263056793212890625    –
612890625²                375634918212890625    –
712890625²                508213043212890625    –
812890625²                660791168212890625    –
912890625²                833369293212890625    –

Autogeneration Automorphic numbers of the sequence of 5 generate themselves automatically. In particular, the sequence of 5 grows by iteratively squaring the automorphic number having n digits and keeping the digit at position n+1. In the picture below, we can observe that to generate the automorphic sequence of 5, we can start from squaring 5 (1 digit), and then we keep the value at the second digit, generating 25. Then, we can process 25, which has 2 digits, and we keep the 3rd digit of its square, generating 625. Again, we process 625 and we keep the fourth digit of its square, generating 0625. This is a special case: the value 0 appears because there is no 4-digit automorphic number in the sequence of 5. Then again, we re-square 625, and we take the 5th digit of its square, generating 90625, and so forth. This process is exemplified in Fig. 5. Figure 5. Autogeneration property of the sequence of 5.
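The autogeneration property can be sketched in a few lines of Python. This is a minimal illustration that leans on Python's built-in big integers for the squaring (the array-based arithmetic discussed later in this post avoids that reliance); the function name is my own:

```python
def autogenerate_5_sequence(limit):
    """Grow the 5-ending automorphic sequence by autogeneration:
    square the current n-digit number and keep the digit at
    position n+1 of the square as the newly prepended digit."""
    digits = [5]            # digits stored least significant first
    sequence = ["5"]
    for n in range(1, limit):
        value = int("".join(map(str, reversed(digits))))
        squared = str(value * value)[::-1]   # least significant first
        # digit at position n+1 (index n); 0 if the square is too short
        next_digit = int(squared[n]) if n < len(squared) else 0
        digits.append(next_digit)
        sequence.append("".join(map(str, reversed(digits))))
    return sequence

print(autogenerate_5_sequence(7))
# ['5', '25', '625', '0625', '90625', '890625', '2890625']
```

Note the '0625' entry at four digits, matching the special case discussed above.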
In this case, one can save computational time by squaring the n-digit number and stopping as soon as the digit n+1 is computed, as it yields the next automorphic number in the sequence. Unfortunately, this property does not hold for the sequence of 6, as 6² = 36 rather than 76. You can read more info from here. Solving the Riddle At this stage, we know the riddle and we have gone through some characteristics of automorphic numbers. The question of how to solve the riddle remains. Because of the extension property, the first consideration to make is that we potentially do not need to compute all numbers for the whole set of digits {1,2,…,44444}. The 44444-digit number will contain the nested sequence of its previous 44443 solution numbers. And whenever the sequence does not have a solution for a given number of digits, it will contain a 0, as for instance at the fourth digit of 890625. In this case, if we manage to compute this large number with 44444 digits and we count the number of zeros, we will know how many times the sequence of 5 did not produce a solution and therefore there was exactly one solution (the one from the sequence of 6). However, this is not enough, because we also need to know the number of times the sequence of 6 contains a 0. Here, the nine's complement comes in handy. Thanks to this property we know that all digits (except the last) complement each other. Therefore, if we know the 5-sequence of 44444 digits, we just need to count both zeros and nines, since the nines correspond to zeros in the 44444-digit automorphic number of the sequence of 6. So now, the question comes down to: how do we compute the 44444-digit automorphic number of the sequence of 5? We could definitely use the brute-force approach: try out all numbers from 10^44443 to 10^44444, square them and find the one which is automorphic. I am not sure about the number of machines one would need to compute this number in a reasonable time.
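For modest sizes, the zeros-and-nines counting described above can be sketched directly with Python's three-argument pow, which computes 5^(2^n) mod 10^n by modular exponentiation (a shortcut that sidesteps the array arithmetic developed later; the function name is my own):

```python
def count_single_solutions(max_digits):
    """Count how many k in 2..max_digits admit exactly one k-digit
    automorphic number. Position k of the 5-sequence is 0 when the
    5-ending number has a leading zero (only the 6-ending one survives)
    and 9 when, by nine's complement, the 6-ending number has a leading
    zero (only the 5-ending one survives)."""
    aut5 = pow(5, 2 ** max_digits, 10 ** max_digits)
    # pad with leading zeros, then index digits from the right:
    # digits[k - 1] is the digit at position k
    digits = str(aut5).zfill(max_digits)[::-1]
    return sum(1 for k in range(2, max_digits + 1) if digits[k - 1] in "09")

print(count_single_solutions(20))  # 5, i.e. k = 4, 5, 12, 13, 20
```

Running it with max_digits=20 matches Table 1, and with larger arguments it should reproduce the counts reported above for 100 and 1000 digits.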
To cut down the number of tests, we can rely on the formula: aut_{5}^{44444} = 5^{2^{44444}} \bmod 10^{44444} However, computing 5^{2^{44444}} is extremely challenging too. At this stage, we might take advantage of the extension property. We can start from 5, and iteratively test all possible solutions when prepending one digit at a time, e.g.: 05, 15, 25, 35, 45, 55, 65, 75, 85, 95. After we find which of these 10 numbers is automorphic, we proceed with the 3rd digit and so on until 44444 digits. This strategy reduces our domain to at most 444440 tests. However, the autogeneration property allows us to further cut down the number of tests. We just need to start from 5, as a seed number, and then repeatedly square the current number to identify the (n+1)-th digit. This strategy allows us to test only 44444 numbers. Algorithms It is clear from the discussion above that the algorithm to solve the riddle needs to perform many squarings or, more generally, multiplications. Considering that we are treating very long numbers, we cannot expect to use normal data types and operators. The algorithm needs to be customised to the problem: our numbers will be represented by arrays (whose length is limited only by the size of the memory) and we need to implement the multiplication of such arrays from scratch. In this section, I will walk you through the different versions of the algorithms that allowed me to cut down the computational complexity of the problem from days to seconds. In order to have a fair comparison of the performance of those algorithms, I will show the amount of time each of them took to compute the first 1000 digits on the same machine. Multiplication The initial version of the algorithm was a very basic multiplication between digit sequences. This algorithm implements simple grade-school multiplication, as in the example in Fig. 6. Figure 6. Grade-school multiplication.
In this particular case, the algorithm is computing the square of the 6-digit automorphic number 890625, with the aim of generating the 7th digit, which is 2 (green box). In brief, this algorithm finds a new digit at each iteration. It squares the input number (of length n), takes the (n+1)-th digit, and prepends this digit to the current number for the next iteration. The computation of the square is performed by multiplying the number by itself. Specifically, this consists of two for-loops implementing the grade-school multiplication. Here is a small snippet in Python, where the input number is a list of digits (most significant first):

def square(number):
    num_length = len(number)
    # variable hosting the result (least significant digit first)
    result = [0] * (num_length * 2)
    # two indexes used to find positions in result
    index1 = 0
    index2 = 0
    # iterating in reverse order
    for i in range(num_length - 1, -1, -1):
        carry = 0
        n1 = number[i]
        # shift position at each iteration
        index2 = 0
        # iterating in reverse order
        for j in range(num_length - 1, -1, -1):
            n2 = number[j]
            # multiply the two digits and sum their result
            # with the previous carry and the current value in result
            summ = n1 * n2 + result[index1 + index2] + carry
            # computing the next carry
            carry = summ // 10
            # store its modulo in result
            result[index1 + index2] = summ % 10
            index2 += 1
        # hold carry for next iteration
        if carry > 0:
            result[index1 + index2] += carry
        index1 += 1
    # returning the inverted array (least significant digit first)
    return result

The full code of this algorithm is available at https://github.com/angelosalatino/computing-automorphic-numbers/blob/master/square.py The computational complexity of this algorithm is O(N²) and computing the first 1000 digits took 71 seconds. Since the complexity is strongly dependent on the square of the number of digits (N²), every new iteration adds a new digit and therefore requires more time to be computed. I estimated that computing the 44444-digit automorphic number would have required at least one week.
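As a quick sanity check of the grade-school scheme, here is a condensed, self-contained variant (an illustrative sketch of the same idea, not the repository code) that squares a digit array and returns the digits most significant first, so its output can be compared against ordinary integer arithmetic:

```python
def square_digits(number):
    """Grade-school squaring of a digit list (most significant first).
    Returns the digits of the square, most significant first."""
    n = len(number)
    result = [0] * (2 * n)          # least significant digit first
    for i in range(n):
        carry = 0
        for j in range(n):
            # number[-1 - i] is the digit at position i from the right
            summ = number[-1 - i] * number[-1 - j] + result[i + j] + carry
            carry = summ // 10
            result[i + j] = summ % 10
        result[i + n] += carry      # leftover carry of this row
    while len(result) > 1 and result[-1] == 0:
        result.pop()                # strip leading zeros
    return result[::-1]

print(square_digits([8, 9, 0, 6, 2, 5]))
# [7, 9, 3, 2, 1, 2, 8, 9, 0, 6, 2, 5]
```

The square of 890625 is 793212890625, whose 7th digit from the right is indeed the 2 highlighted in Fig. 6.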
Partial multiplication We might argue that, since we are looking for the (n+1)-th digit, we can simply stop the computation of the square once this particular value has been identified. Specifically, we should avoid computing all n+k digits where k > 1 (red box in Fig. 7), as they serve no purpose. And perhaps we might also avoid performing some of the multiplications of single digits (orange box). Figure 7. Partial multiplication by stopping at digit n+1. This can be achieved by not multiplying pairs of digits whose added-up positions exceed the position of the digit we are interested in: n+1. For instance, looking at Fig. 8, if we multiply the value at position 4 (first row, orange box) with the value at position 3 (second row, blue box), their added-up positions make 7, which is greater than 6. Figure 8. A generalisation of the grade-school multiplication. In brief, this algorithm too generates a new digit at each iteration. It partially squares the input number (of length n) up until the (n+1)-th digit, which is then prepended to the current number for the next iteration. The implementation of the partial square is similar to the grade-school multiplication, but it has a stop criterion. The full code of this algorithm is available at https://github.com/angelosalatino/computing-automorphic-numbers/blob/master/partial-square.py This algorithm performs roughly half the digit multiplications of the full square (O(N²/2)), and computing the first 1000 digits took 43 seconds. Ultimate partial multiplication At this stage, we had such a good understanding of automorphic numbers that we could not help noticing another way to reduce the complexity of the problem.
If we pay particular attention to the multiplication process in the figure above, then, because of the principal characteristic of automorphic numbers (the last n digits of the square are the same as the original number), we can state that: z_{i}=x_{i} \quad \forall i \leq n Can we compute the value z_{n+1} without re-computing z_{i}, ~\forall i \leq n? It is indeed possible: since we are generating those numbers iteratively, we can "carry" enough information from one iteration to the next so that we compute only the z_{n+1} digit. To better understand the process, allow me to take advantage of the picture below (Fig. 9). Figure 9. A version of the grade-school multiplication for computing the (n+1)-th digit. Let's assume we just found x_5 at the previous iteration, and at this iteration we want to identify the digit x_6. To achieve this, we first need to compute z_6, and then we take its modulo 10 (i.e. the last digit of z_6, e.g. 22 -> 2). The equation to compute z_6 is: \begin{aligned} z_{6} =\color{blue} {\left(t_{15}+t_{24}+t_{33}+t_{42}+t_{51}\right) \bmod 10}+\color{orange} {\left\lfloor\frac{t_{05}+t_{50}}{10}\right\rfloor}+\color{green} {\left\lfloor\frac{t_{14}+t_{23}+t_{23}+t_{14}}{10}\right\rfloor} \\ +\color{red} {\left\lfloor\frac{t_{14} \bmod 10+t_{23} \bmod 10+t_{32} \bmod 10+t_{41} \bmod 10}{10}\right\rfloor} \end{aligned} x_6 = z_6 \bmod 10 \\ z_5^{'} = z_5 - x_5 \\ \text {where } t_{i j}=x_{i} \cdot x_{j}, \text { and } t_{i j}=t_{j i} This equation consists of four components, whose colours match the boxes in the figure above. The first component \left(t_{15}+t_{24}+t_{33}+t_{42}+t_{51}\right) \bmod 10 (in blue) computes the column that produces the z_6 value (the x_6 digit). However, this is not enough, as we need to consider a few other components, coming from the column of the previous digit (x_5).
These values affect the column of z_6 through their "carry": this is why each of these components is first divided by 10 and then its integral part (i.e. the floor value, denoted \lfloor~\rfloor) is taken.

In particular, the second component \left\lfloor\left(t_{05}+t_{50}\right) / 10\right\rfloor (in orange) accounts for the effect of the newly added digit (x_5) on the prior column. Since x_5 has only just been discovered, it was not part of the square computed at the previous iteration. Considering that we are computing z_6, the value of x_5 will not change at this stage; the new value of the column is \left(t_{05}+t_{50}\right) \bmod 10+x_{5}, and t_{05} \bmod 10 can only be either 0 or 5.

The third component \left\lfloor\left(t_{14}+t_{23}+t_{32}+t_{41}\right) / 10\right\rfloor (in green) is the carry from the previous iteration. This is the sum of the carries of each single component t_{ij}. All these multiplications were already computed at the previous iteration, so we just need to transfer this carry value from one iteration to the next.

The last component, also known as z_{5}' (in red), is the carry of z_5. z_5 (and, in general, any z_i) is split into two parts: i) its value modulo 10, which becomes the x_i digit, and ii) the remainder, which is carried over to the next iteration. This carry is obtained by summing the last digits of all the t_{ij} (modulo 10) in column x_5, dividing by ten and taking the integral part.
The whole equation for computing z_6 (and, in general, any z_i) can be generalised as follows:

\begin{aligned} &z_{n+1}=\color{blue}{\left(\sum_{i=1}^{n} x_{i} \cdot x_{n+1-i}\right) \bmod 10}+\color{orange}{\left\lfloor\left(x_{0} \cdot x_{n}\right) / 5\right\rfloor}+\operatorname{carry}_{n}\\ &\operatorname{carry}_{n+1}=\color{green}{\left\lfloor\left(\sum_{i=1}^{n} x_{i} \cdot x_{n+1-i}\right) / 10\right\rfloor}+\color{red}{\left\lfloor z_{n+1} / 10\right\rfloor}\\ &x_{n+1}=z_{n+1} \bmod 10 \end{aligned}

(note that the column producing z_{n+1} contains the products x_i \cdot x_j with i+j=n+1, hence the index n+1-i).

There is only one exception to this equation, which is when computing the value x_1. From Fig. 9 we can see that t_{00} appears only once, therefore the second component (in orange) is just \left\lfloor t_{00} / 10\right\rfloor. In this case carry_0 = 0.

The full code of this algorithm is available at https://github.com/angelosalatino/computing-automorphic-numbers/blob/master/ultimate-partial-square.py

The computational complexity of this algorithm is O(N) per iteration, and computing the first 1000 digits took 0.06 seconds. The whole 44444-digit automorphic number was computed in 162 seconds. You can download the 44444-digit automorphic numbers from here:
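As a minimal sketch of this recurrence (assuming the 5-sequence, x_0 = 5, with digits stored least-significant first; the repository's ultimate-partial-square.py is the authoritative version), each new digit now costs a single pass over the digits found so far:

```python
def automorphic_5seq(num_digits):
    """Compute the first `num_digits` digits (num_digits >= 2) of the
    5-sequence automorphic number with the generalised recurrence."""
    x = [5]                                  # x_0 = 5
    z = (x[0] * x[0]) // 10                  # exception: t_00 appears once
    x.append(z % 10)                         # x_1
    carry = z // 10                          # carry_1
    for n in range(1, num_digits - 1):
        # blue: the column i + j = n + 1 of the square
        s = sum(x[i] * x[n + 1 - i] for i in range(1, n + 1))
        z = s % 10 + (x[0] * x[n]) // 5 + carry      # + orange + carry_n
        x.append(z % 10)                             # x_{n+1}
        carry = s // 10 + z // 10                    # green + red
    return ''.join(map(str, reversed(x)))

print(automorphic_5seq(6))   # 890625
```

The state transferred between iterations is just the digit list and a single carry value, exactly as described above.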
After two weeks it had managed to compute only 15 million digits. This is because, even though the complexity is down to O(N) per iteration, every new iteration adds a new digit, which progressively increases the computational time.

Later, in a meeting with Stefan, he showed me his notes on how to tackle the problem from a different angle: algebra. I will attempt here to explain the maths behind his notes, using a simple running example, and then generalise it to any number of digits. In this particular case, I will try to identify the automorphic number with 4 digits. The exercise is then equivalent to finding a number n in {1000, 1001, …, 9999} such that:

n^{2} = n~(mod~10000)

We can factorise this equation as:

n(n-1) = 0~(mod~10000)

The trivial solutions are 0 and 1, but they are not in the range set {1000, 1001, …, 9999}. To find the solution n, we are going to take advantage of certain properties of modular arithmetic, including the Chinese Remainder Theorem. First, the 10000 we are using as modulus can be factored into prime powers: 2^{4} \cdot 5^{4}= 16 \cdot 625. According to the Chinese Remainder Theorem:

For any a, b and coprime m, n, there exists a unique x (mod mn) such that x ≡ a (mod m) and x ≡ b (mod n)

we can rewrite the above equation as two congruences that need to hold simultaneously:

n^2 = n~(mod~16)~~~and~~~n^2 = n~(mod~625)

which can be rewritten as:

n(n-1) = 0~(mod~16)~~~and~~~n(n-1) = 0~(mod~625)

The value of n that simultaneously satisfies these congruences falls into one of four cases. Either:

\textbf{(1a)}~~ n = 0~(mod~16)~~~and~~~\textbf{(1b)} ~~ n = 1~(mod~625)

or:

\textbf{(2a)} ~~ n = 1~(mod~16)~~~and~~~\textbf{(2b)} ~~ n = 0~(mod~625)

or:

\textbf{(3a)} ~~ n = 0~(mod~16)~~~and~~~\textbf{(3b)} ~~ n = 0~(mod~625)

or:

\textbf{(4a)} ~~ n = 1~(mod~16)~~~and~~~\textbf{(4b)} ~~ n = 1~(mod~625)

We can observe that Eqs. 3a and 3b correspond to the case a=b=0, and Eqs. 4a and 4b to the case a=b=1, which lead to the trivial solutions n=0 and n=1, respectively.
Instead, let's concentrate on Eqs. 1a and 1b for now; I will explain later what Eqs. 2a and 2b are useful for. Writing a = b~(mod~m) means, by the definition of modular congruence, that there exists an integer i such that a = b + mi. To this end, we can rewrite Eq. 1b as:

\textbf{(1c)}~~n = 1 + 625i, ~\exists i \in \mathbb{Z}

We can then use Eq. 1a to state that:

0 = 1 + 625i~(mod~16)

and, because of the multiplication compatibility property of modular arithmetic, and because 625~(mod~16) = 1, we can rewrite the previous equation as:

0 = 1 + i~(mod~16)

Solving this for i gives:

i = 15~(mod~16)

which can be rewritten as:

i = 15 + 16j, ~\exists j \in \mathbb{Z}

Joining it with Eq. 1c, we obtain:

n = 1 + 625(15+16j) = 9376 + 10000j

In this case we obtained the 4-digit automorphic number (9376) of the 6-sequence. Solving Eqs. 2a and 2b, we can obtain the 5-sequence automorphic number. Indeed, Eq. 2b can be rewritten as:

\textbf{(2c)}~~n = 0 + 625i, ~\exists i \in \mathbb{Z}

Joining it with Eq. 2a, we obtain:

1 = 0 + 625i~(mod~16)

As before, since 625~(mod~16) = 1, we can rewrite it as:

1 = i~(mod~16)

which can be rewritten as:

i = 1 + 16j, ~\exists j \in \mathbb{Z}

Joining it with Eq. 2c, we obtain:

n = 625(1+16j) = 625 + 10000j

The solution obtained here is 625, which is not in the range set {1000, 1001, …, 9999}: the 5-sequence simply has no 4-digit member (its 4-digit tail is 0625). Note also that, since n and n-1 are coprime, n(n-1) = 0~(mod~p^k) forces either n = 0 or n = 1~(mod~p^k), so the four cases above exhaust all solutions modulo 10000.

In general, if we want to compute the automorphic number with a given number of digits k, the modulus is 10^k, which can always be factored as 2^{k} \cdot 5^{k}. Therefore, in all previous equations, each 16 needs to be swapped with 2^k, and each 625 with 5^k.
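The worked example can be sanity-checked with a brute-force search (illustrative only; this is obviously infeasible for large k):

```python
# Exhaustively find all non-trivial n < 10^4 whose square ends in n
k = 4
mod = 10 ** k
hits = [n for n in range(2, mod) if n * n % mod == n]
print(hits)  # [625, 9376]
```

625 is the 5-sequence tail (no 4-digit member, since it reads 0625), while 9376 is the 4-digit automorphic number of the 6-sequence, matching the derivation above.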
All these equations can be written in a more compact form:

\begin{array}{l} {x_{6seq}=1+5^{k}\left(2^{k}-\left(5^{k}\right)^{-1} \bmod 2^{k}\right)\left(\bmod 10^{k}\right)} \\ {x_{5seq}=1+2^{k}\left(5^{k}-\left(2^{k}\right)^{-1} \bmod 5^{k}\right)\left(\bmod 10^{k}\right)} \end{array}

(for k = 4, the first expression yields 9376, which ends in 6, and the second yields 625, which ends in 5, matching the derivation above).

The advantage of this algebraic solution over the previously designed algorithmic approaches is that we do not need to compute every single digit iteratively. We just need to replace k with the number of digits of our desired automorphic number (e.g. 4, 20, 44444, 10^10) and, with a limited number of operations, it generates the number. We still need to perform multiplications of large numbers (e.g. 10^10-digit numbers); however, we can take advantage of libraries available in the state of the art, such as GMP (The GNU Multiple Precision Arithmetic Library), which can perform large multiplications in O(N logN). Stefan provided me with his implementation of this algorithm using GMP, and the computation of the 10^10-digit automorphic number took 13138.4 seconds (just over three and a half hours) on the same machine on which I performed the previous tests (specifications below). Here you can find a more detailed version of Stefan's notes.

Limitations

In this article, we kept pushing the boundaries of our knowledge to cope with the limitations of the methodologies we were using for computing large automorphic numbers. We reached a very good stage, in which we can compute 10 billion digits in a matter of a few hours. However, the question remains: to what extent can we actually push the computation of automorphic numbers? Is there a limit to the number of digits we can compute? We foresee two major limitations. The first is certainly due to the library for computing multiplications of large numbers: GMP currently supports numbers that fit in 2^37 bits of memory (ca. 16 GB).
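The compact formulas translate almost directly into code. Here is a minimal sketch (not Stefan's GMP implementation) using Python's built-in modular inverse, pow(a, -1, m), available from Python 3.8:

```python
def automorphic_pair(k):
    """Return the k-digit tails (as integers) of the 5- and 6-sequence
    automorphic numbers via the closed-form CRT solution."""
    p2, p5, mod = 2 ** k, 5 ** k, 10 ** k
    x6 = (1 + p5 * (p2 - pow(p5, -1, p2))) % mod   # case (1): ends in 6
    x5 = (1 + p2 * (p5 - pow(p2, -1, p5))) % mod   # case (2): ends in 5
    return x5, x6

print(automorphic_pair(4))   # (625, 9376), matching the worked example
```

Python's arbitrary-precision integers make this exact for any k, though for very large k a library with sub-quadratic multiplication (such as GMP via gmpy2) becomes essential.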
One might need to rewrite the library to push this further. The second limitation is the amount of RAM available on one's machine: these very long numbers need to be kept in memory, and one does not want to incur memory swapping, which would further increase the computational time.

Conclusions

Solving this riddle turned out to be exciting and challenging at the same time. Many people in the lab got the chance to get acquainted with this kind of number and to learn its properties. The initial versions of my algorithms would have needed several days to complete the computation of such numbers. The obstinacy in wanting to improve the algorithm, so that it could run in less time, led me to analyse the problem more deeply and to design ever more efficient algorithms to solve the riddle.

Acknowledgements

I would like to thank Prof. Stefan Rueger for setting up the riddles and for the insightful conversations.

Material

Code

I published all the code in a Github repo: https://github.com/angelosalatino/computing-automorphic-numbers

Numbers

I uploaded the two 44444-digit automorphic numbers, of both the 5- and 6-sequences, on Zenodo. You can download them from here:

Machine Specifications

Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 3.5 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB