The CorTexT platform started in 2008 as a project supported by the IFRIS institute and then by the LABEX SITES. It has been carried and sustained by the INRA SenS Unit (2010-2014) and then by the LISIS laboratory (since 2015). Research at LISIS currently focuses on four thematic axes, one of which involves the CorTexT Platform as both a capacity and a team dedicated to empowering research activities that rely on the analysis of digital traces. The CorTexT Platform is a well-known research facility that offers online applications for open science in the analysis of textual corpora.
A techno-epistemic Challenge
Research in Science, Technology and Innovation in Society is facing a proliferation of spaces where actors interact to produce, exchange and dispute knowledge. Increasingly, these spaces leave digital traces, either because they are natively digital or because they can be digitized. While the multiplication of social arenas seems like a daunting challenge for the analysis of political and social processes, it also comes with an abundance and variety of traces. Historical and social studies must therefore take stock of the massive digitization of knowledge that accompanies contemporary transformations of the science-innovation-society nexus.
But these traces are only worthwhile if new methods are developed in line with the specific approach of the Social Sciences and Humanities. These new methods must therefore aim to improve the articulation between qualitative and quantitative approaches, by enabling large quantities of data to be analyzed without undermining their interpretation.
Digital data are too massive to be analyzed with traditional methods, and their heterogeneity is probably the hardest challenge when it comes to grasping the ways of being of new actors, new social spaces and new forms of communication. To answer this challenge, the CorTexT Platform aims at combining data science, applied research, training and entertainment. It offers an experimental online platform for researchers who want to mine, analyze and visualize knowledge in more or less calibrated textual databases of many sorts: from classical scientific and patent databases and media datasets to more recent digital traces from the web and social media.
Requirements
This techno-epistemic challenge corresponds to a methodological turn for scholars in the humanities and social sciences. It requires the use of methods and tools developed in different fields of science and technology to understand the mechanisms involved in data infrastructures and knowledge circulation.
The requirement to ground a nexus of disciplines in new methods means articulating and assembling many methods and technologies: natural language processing, information retrieval, knowledge engineering, linguistic statistics, metrics and algorithm implementation, co-word analysis, social network analysis, word embedding, geocoding, webometrics and knowledge visualization. In addition, it requires steering, on the computational side, the extraction, storage and networking of data, together with tools for users.
Identity: a digital lab for humanities
The CorTexT Platform tries to fulfill these requirements. It thus stands as a particular kind of digital lab, focused on the exploitation and analysis of heterogeneous textual data generated by new information and communication technologies. The CorTexT platform is both a physical space at Marne-La-Vallée and a host of digital spaces comprising tools, methodologies and skills to handle large textual corpora, thanks to technical support, a facilitation team and the coordination of technical and design activities.
This facility supports research and experimental methodology for the Humanities and Social Sciences conducted by the IFRIS communities and in external projects or networks.
The CorTexT Platform fosters collaborative research, training workshops, and exploration conducted by individuals or groups. For projects in which the platform is engaged, the goal is to create, implement and capitalize on instrumental approaches to research problems and to deliver online applications for research communities of various disciplines, first of all for the IFRIS community but also for researchers attached to other initiatives such as Institut des Systèmes Complexes (ISC-PIF), MediaLab at Sciences Po Paris and RISIS European Infrastructure. This research-oriented effort also contributes to knowledge transfer through training and teaching activities, particularly for students of the Master 2 Data Science et Société Numérique (D2SN) of Paris-Est University in Marne-La-Vallée.
Developing IT Solutions to Empower Researchers
At the heart of the CorTexT Platform are the collaborations between engineers, who develop and sustain the technological capacity, and lead-user researchers, who aim to empower their research activities. Members of the CorTexT platform are engaged in various creative and supportive activities, which take the form of several types of contribution to research projects of the LISIS Unit or the IFRIS communities:
- upstream of a research project, with CorTexT participating in the set-up of the scientific problem by taking into account methodological and technical constraints;
- throughout the whole project, the platform then being a full partner with a methodological role, providing IT resources and interfaces in calls for tender;
- alongside individual or collective ongoing projects in which methodological or technical support is occasionally needed.
The CorTexT platform also carries out research and development directly, by creating tools and methods, and by incorporating work carried out by developer and computer science communities concerned with the sciences of complexity, artificial intelligence and natural language processing.
CorTexT Manager: a flagship project
After several years of experience and experimentation across various research projects, datasets, training sessions, etc., CorTexT has assembled and developed specific analytical tools that are today grouped together within a single online platform called CorTexT Manager. This application aims to answer a class of questions and is suited to certain types of data (textual, categorical, time series, etc.) with a specific angle (network mapping, structural approach, multi-dimensional analysis, word embedding, etc.). These analyses can be composed in many ways. Users are encouraged to compose their own analytical paths according to their question and sensibility. This online application enables two types of approaches, which interoperate in research or in the production of indicators:
– The numerical analysis of data. The tools provided consist of indicators that position the individual within the scientific community and characterize the collective, in the context of current thinking on scientific assessment that goes beyond standard bibliometric approaches. In this context, the main sources of structured data are the databases of scientific output (articles, citations, patents …), which provide calibrated datasets.
– Distributional, relational and geocoded analysis. Starting from textual data that are often heterogeneous, more or less easily available on the Internet, and qualified as non-calibrated, the aim is to show all the relationships between different concepts or actors in order to describe a particular space (a theme, a region, a debate, a controversy, a discipline, an area up …). A classic example would be to analyze records of public discussion (blogs, newspapers, …) to make explicit the relationships between actors and arguments in a controversy.
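The relational approach described above can be illustrated with a minimal co-occurrence sketch in Python. This is not how CorTexT Manager is implemented: the toy corpus, the field names `actors` and `arguments`, and the function `cooccurrence_edges` are all assumptions made up for illustration. The idea is simply that an actor and an argument are linked each time they appear in the same document, and the link weight counts those co-occurrences.

```python
from collections import Counter

# Toy corpus (invented for illustration): each document is already annotated
# with the actors and the arguments it mentions.
documents = [
    {"actors": ["NGO", "Ministry"], "arguments": ["health risk", "precaution"]},
    {"actors": ["Industry"], "arguments": ["innovation", "health risk"]},
    {"actors": ["NGO", "Industry"], "arguments": ["health risk"]},
]


def cooccurrence_edges(docs):
    """Count how often each (actor, argument) pair appears in the same document."""
    edges = Counter()
    for doc in docs:
        # Sets avoid counting a pair twice within a single document.
        for actor in set(doc["actors"]):
            for argument in set(doc["arguments"]):
                edges[(actor, argument)] += 1
    return edges


edges = cooccurrence_edges(documents)

# List the weighted links, strongest first, then alphabetically.
for (actor, argument), weight in sorted(edges.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{actor} -- {argument}: {weight}")
```

The resulting weighted pairs form the edge list of a bipartite actor-argument network, which could then be passed to a graph layout or clustering tool. In practice, the annotation step (identifying actors and arguments in raw text) is the hard part and is not covered by this sketch.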