The Map Is Not The Territory

A blog by Christian Willmes.

The daily kindergardening of OSGeo wiki spammers

categories: osgeo, semantic mediawiki

I regret having to bother you with this topic, but I need to write about my frustration with the increasing spam activity in the OSGeo wiki. It is really unbelievable how much human time these spammers invest just to place a few links and upload a few documents to the wiki.

For some time now I have been doing volunteer work to help maintain the OSGeo wiki. I do this because I have some MediaWiki and Semantic MediaWiki knowledge from my other research and work projects, which I am happy to share with the OSGeo community.

Originally the OSGeo wiki was linked to the central OSGeo LDAP directory for identity and account management, so at that time user accounts were managed not through the wiki but through that LDAP directory. For about two years now, the LDAP integration with the wiki has been broken, because the extension we used was not updated to work with newer versions of MediaWiki.

Meanwhile, because I did not feel knowledgeable enough about LDAP, and Martin Spott tried but did not succeed in getting another LDAP extension to work, we had to manage the user accounts from within the wiki. The standard account request/creation procedure of MediaWiki is not well protected against abuse; it is actually really simple to let bots create huge numbers of spam accounts. So we first disabled account registration and had new users request accounts via email to the OSGeo SAC mailing list. After this proved to be unwieldy, we decided to install the ConfirmAccount extension to handle account requests.

In addition to a valid and confirmed email address, this extension requires new users to provide a short biography about themselves. This biography is then reviewed by SAC volunteers to check that the requester is not a spammer. The SAC volunteer has the options to Accept, Reject or Hold the request, or to mark it as Spam. On Reject, the requester is informed with a standard note that the request was denied. On Hold, the volunteer can ask the requester for additional information to decide whether the request is valid. On Spam, the request is denied but the requester is not informed; furthermore, the email address is blocked from further requests. On Accept, the user account is created with a random password and the user is notified by email.

So far so good, but from here it gets messy, because we receive about 10 account requests a day, of which about 99% are fraudulent and/or spam. And the spammers are actual humans from SEO companies, I guess. They make up all kinds of things, which makes me certain that human agents are pasting this text into the requests. Here are some nice example biographies I got to read:

User:Maleshwar: Born as a princess into a royal family of Kingdom of Dagbon, in the Northern Region of Ghana, Gunu has been interested in dancing and music since she was young. She competed in regional and national dance competitions, winning the dance championship for the northern Region and second place in the 1998 National Dance Championship. She took second place in the Hiplife dance championship in 2003, where she met King Ayisoba and Terry Bonchaka, who subsequently become collaborators.

Or:

User:Marshrobin088: Hi my name is Robin Marsh and I've been in the digital design industry for 3 years. As a kid, art and technology always interested me. I could lose track of time doing art or messing around with computers.The way I approach web development is keeping in mind scalability, organisation, and clean syntax. As for the message or purpose is the nucleus,Self learner,highly interested in Geospatial development activities using open source tools. Having knowledge of GIS,vector graphics programming and data bases. Involved in teaching Geology, web and geospatial development. I am proficient in HTML/HTML5, CSS/CSS3, LESS, SASS, XML, JavaScript, jQuery, AJAX, and SQL/MySQL/PostgreSQL, to name a few. I am also proficient in many non-web-based languages, including but not limited to Java, Scheme/Racket, C, ACL2 (LISP), and MIPS Assembly. I have also worked on some smaller Python projects, and have used the language to create one-time use tools for data processing and similar purposes.

For the two requests above, for example, I replied to the requesters with a standard phrase like “Can you please elaborate about your relation/interest in OSGeo?”, and never heard back. Some requests are easy to identify as spam, like the following:

User:Baarishi: baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi is a good boy baarishi

Or:

User:Mekee4444: im a person that need this web id to produce my business in whole world

And here are two example bios of spammers that got through, because I thought that these were valid requests:

User:Ehalu2016: Hi my name is Eahul and I've been in the digital design industry for 5 years. As a kid, art and technology always interested me. I could lose track of time doing art or messing around with computers.The way I approach web development is keeping in mind scalability, organisation, and clean syntax. As for the message or purpose is the nucleus,Self learner,highly interested in Geospatial development activities using open source tools. Having knowledge of GIS,vector graphics programming and data bases. Involved in teaching Geology, web and geospatial development

Or:

User:Mayerjohntec: A web developer and software engineer by profession, An open source enthusiast and a maker by heart. Honored to be sharing space among the Leaders we look up to and admire. I love contributing my best to take the Open Source Mission and OPEN WEB forward. I hold a Masters in Computers degree and have been working and contributing towards the open source community in all ways I can. Love Code, Privacy and Advocacy, learning, teaching and Community Building. I support Open data and Open Knowledge. As am a social person, and love interacting with new people,traveling, reading books, history, museums and listening to all kinds of music. Thanks John Mayer

As you can see from the examples above, I have to read a lot of BS on a daily basis while fighting spam requests and cleaning up after the spammers that got through. And in some cases it is really not easy to decide whether a request is spam or not. Right now I tend to accept requests where I am not sure, because it is really easy to block a user and delete/revert all of his/her edits ever made to the wiki as soon as I see them spamming.

But in the end, it is already more than half an hour of work per day, and it does not look like it will get any less...


 


New Semantic MediaWiki based OSGeo Member Map

categories: webdev, semantic web, geospatial, osgeo, semantic mediawiki

In this post, I give some background on the new Semantic MediaWiki based OSGeo Members map, which replaced the userMap. I start with the MediaWiki update and the introduction of Semantic MediaWiki, say a few words about the history of the userMap and, most importantly, give an overview of the new implementation and of possible additional applications of Semantic MediaWiki in the OSGeo Wiki.

The introduction of Semantic Mediawiki into the OSGeo Wiki

Recently, thanks to an effort by OSGeo SAC (namely by Martin Spott), the MediaWiki software underlying the OSGeo Wiki was upgraded from an ancient version (I think it was 1.12) to the current 1.25.3. Additionally, the Semantic MediaWiki (SMW) extension, including Semantic Maps, was installed to enhance the OSGeo Wiki with its features.

SMW is a MediaWiki extension that allows wiki content to be structured (as data) and provides tools for querying, exporting and visualizing this structured data. The Semantic Maps extension adds the capability to visualize SMW content containing data of the special type "Geographic Coordinate" on maps. SMW even offers an API that allows external applications to query the structured data stored in the wiki and to export data based on queries. SMW is a mature project running on many large MediaWiki installations of well-known organizations like NASA, OLPC, The Free Software Directory and semanticweb.org, to name just a few.

The OSGeo Wiki userMap

The original OSGeo Wiki userMap, implemented by me in 2008 during an internship at WhereGroup, is now broken because it depends on the no longer supported MediaWiki extension Simple Forms. The extension implemented a parser hook that allowed the spatial locations of users to be stored in a PostGIS database. This first version of the userMap also implemented parser hooks for including OpenLayers based maps in wiki pages, displaying a single user's location as well as a map of all users. The now deprecated documentation is still available in the wiki for now, to get an overview.

SMW based OSGeo Members map

The SMW data model was developed using a tool called mobo. Using mobo makes it possible to develop and maintain an SMW data model from a central point in a consistent manner, which improves maintainability, helps coordinate collaboration and allows the schema to grow to additional applications and scopes over time. Mobo is a command line toolset that helps to build Semantic MediaWiki structure in an agile, model driven engineering (MDE) way. The schema is formulated according to the JSON Schema specification, in JSON or YAML notation, in a defined folder structure that follows file naming conventions. This is a bit similar to how some MVC frameworks define the domain of a web application. The documentation of the mobo toolkit, including a tutorial and examples, can be found here.

The development code files of the mobo model are stored and published in a GitHub repository for community review, so that anyone can send pull requests to help improve the SMW based capabilities of the OSGeo Wiki.

It was even possible to rescue the locations that had been entered through the previous userMap implementation and stored in the mentioned PostGIS table. This was done by exporting the data from the PostGIS table as CSV, applying some Python foo to the CSV (especially to the WKB notation of the geometries, using Shapely) and importing the resulting CSV into the wiki using the MediaWiki Data Transfer extension.
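The conversion step might look roughly like the following sketch. It is not the original script: the post used Shapely, while this version swaps in a small standard-library decoder for hex-encoded (E)WKB points so that it runs without extra packages. The column names `username` and `the_geom`, the output header and the "lat, lon" output form are assumptions for illustration; the header row that Data Transfer expects should be taken from its documentation.

```python
import csv
import io
import struct


def decode_point(wkb_hex):
    """Decode a hex-encoded WKB or EWKB point into an (x, y) tuple."""
    raw = bytes.fromhex(wkb_hex)
    order = "<" if raw[0] == 1 else ">"  # first byte: 1 = little endian, 0 = big endian
    (geom_type,) = struct.unpack_from(order + "I", raw, 1)
    offset = 5
    if geom_type & 0x20000000:  # EWKB flag: a 4-byte SRID follows the type
        offset += 4
    if (geom_type & 0xFF) != 1:  # 1 = Point
        raise ValueError("not a point geometry")
    return struct.unpack_from(order + "dd", raw, offset)


def convert(csv_in, name_col="username", geom_col="the_geom"):
    """Rewrite rows with a WKB geometry column as rows with a 'lat, lon' string.

    The column names and the output header are hypothetical placeholders.
    """
    reader = csv.DictReader(io.StringIO(csv_in))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Title", "Coordinates"])
    for row in reader:
        lon, lat = decode_point(row[geom_col])  # x is longitude, y is latitude
        writer.writerow([row[name_col], f"{lat}, {lon}"])
    return out.getvalue()
```

With Shapely installed, `decode_point` could be replaced by `shapely.wkb.loads(wkb_hex, hex=True)` and reading the `x` and `y` attributes of the resulting point.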

Conclusion and Outlook

The application of SMW technology in the OSGeo wiki has, with the introduction of the OSGeo Members model, created a valuable directory that gives a nice overview of the OSGeo community. It is possible to extend the model in the future to a directory of Charter Members or OSGeo Advocates. This would yield sortable tables and of course maps of these contacts.

It is even possible to develop models for the Service Providers, to replace the current Service Provider directory, which is sometimes hard to maintain, or, for example, a model of the Geo4All laboratories to generate a directory and a corresponding map. But one of my favorite possible models would be a model for an Open Geo Data directory in the OSGeo wiki.

All these models and the emerging directories would be collaboratively created and maintained by the OSGeo community by just editing the wiki. And that is not even to mention what is possible with the MediaWiki API for querying the structured data and getting the results nicely in JSON format, let alone enabling the SPARQL endpoint which comes with Semantic MediaWiki.

So, the OSGeo Wiki has a bright future if we want it. I will do my best to reach this goal.
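To give an idea of what such an external query could look like, here is a minimal sketch that builds a request URL for SMW's `ask` API module and reads the printouts from a decoded JSON response. The wiki URL, the category name and the property name in the usage lines are hypothetical, and the response layout (`query` / `results` / `printouts`) is how I understand the module's JSON output, so treat it as an assumption to verify against a real response.

```python
from urllib.parse import urlencode


def build_ask_url(api_url, conditions, printouts, limit=50):
    """Build a URL for the SMW 'ask' API module that returns JSON."""
    parts = [conditions] + ["?" + p for p in printouts] + [f"limit={limit}"]
    params = {"action": "ask", "query": "|".join(parts), "format": "json"}
    return api_url + "?" + urlencode(params)


def extract_printouts(response):
    """Map each result page name to its printouts in a decoded 'ask' response."""
    results = response.get("query", {}).get("results", {})
    if not isinstance(results, dict):  # an empty result set may arrive as a list
        return {}
    return {page: data.get("printouts", {}) for page, data in results.items()}


# Hypothetical usage: the names below are placeholders, not the real OSGeo model.
url = build_ask_url(
    "https://wiki.example.org/api.php",
    "[[Category:OSGeo Member]]",
    ["Has coordinates"],
    limit=10,
)
```

The URL can then be fetched with any HTTP client and the body decoded with `json.loads` before passing it to `extract_printouts`.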

Have fun!


 


Modelling bibliographic records in Semantic MediaWiki using BibTeX schema and result format

categories: webdev, semantic web, semantic mediawiki

Disclaimer: This is a rather long post about modelling bibliographic information in Semantic MediaWiki.

Semantic MediaWiki (SMW) supports managing bibliographic records and delivering them in BibTeX format by using the BibTeX Semantic Result Format. This is a useful feature if you understand how to implement it in your SMW instance, which is not trivial if you are not already an SMW expert. In this post I try to describe this modelling and implementation process.

OK, the last paragraph and the headline of this post contain a lot of terms that may be new to non SMW experts and need to be clarified first:

  • Bibliographic Record
  • BibTeX Format
  • SMW Semantic Result Formats
  • BibTeX Semantic Result Format

Bibliographic Record

A bibliographic record is an entity that references a specific content item, which in most cases is an academic publication, for example a journal paper. Bibliographic records usually follow a schema or formalism that applies in a given context; for example, references in an academic publication mostly follow a citation formalism defined by the publisher.

BibTeX Format

The BibTeX format is a tool to model such citation formalisms. It originates from the LaTeX community, where it is used to handle bibliographic records in LaTeX. There is no official specification of the BibTeX schema (aside from the BibTeX implementation in the LaTeX code base), so in the following we refer to the Wikipedia entry, which defines the schema sufficiently well.
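For readers who have never seen a BibTeX entry, the following small sketch renders one record in the usual `@type{key, field = {value}, ...}` layout. It only illustrates the target format and is not how the BibTeX SRF is implemented; the record in the usage lines is a made-up placeholder, not a real publication.

```python
def to_bibtex(entry_type, cite_key, fields):
    """Render one bibliographic record as a BibTeX entry string."""
    lines = [f"@{entry_type}{{{cite_key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Placeholder record for illustration only.
print(to_bibtex("article", "doe2015", {
    "author": "Jane Doe and John Roe",
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "year": "2015",
}))
```

Note how several authors are joined with "and" inside a single author field, as described for the author item in the list of bibliographic items.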

Semantic Result Format

Semantic Result Formats (SRF) is an SMW extension that allows the results of an SMW #ask query or inline query to be rendered in a defined format.

BibTeX SRF

The BibTeX SRF allows bibliographic information stored in an SMW instance to be rendered in BibTeX format. Here are some demos of the BibTeX SRF.

Modelling BibTeX schema in SMW

In the mentioned Wikipedia article, the BibTeX schema is defined in terms of bibliographic items, which are the basic attributes or properties of the bibliographic entry types or classes.

Bibliographic Items

The bibliographic items are modelled as SMW properties. The BibTeX Wikipedia page defines 26 items, to which we add three more: (1) keyword, to handle the keywords defined for the content of the publication as semantic properties, which has the advantage that you can browse and filter by keyword in the resulting bibliographic database; and (2) DOI and (3) ISBN, two well-accepted unique identifier schemes for publications. This gives us the following list of bibliographic items:

  • address: Publisher's address (usually just the city, but can be the full address for lesser-known publishers)
  • annote: An annotation for annotated bibliography styles (not typical)
  • author: The name(s) of the author(s) (in the case of more than one author, separated by and)
  • booktitle: The title of the book, if only part of it is being cited
  • chapter: The chapter number
  • crossref: The key of the cross-referenced entry
  • DOI: Digital Object Identifier (www.doi.org)
  • edition: The edition of a book, long form (such as "First" or "Second")
  • editor: The name(s) of the editor(s)
  • eprint: A specification of an electronic publication, often a preprint or a technical report
  • howpublished: How it was published, if the publishing method is nonstandard
  • institution: The institution that was involved in the publishing, but not necessarily the publisher
  • ISBN: International Standard Book Number
  • journal: The journal or magazine the work was published in
  • key: A hidden field used for specifying or overriding the alphabetical order of entries (when the "author" and "editor" fields are missing). Note that this is very different from the key (mentioned just after this list) that is used to cite or cross-reference the entry.
  • keyword: Keyword(s) to tag/categorize the content of the publication
  • month: The month of publication (or, if unpublished, the month of creation)
  • note: Miscellaneous extra information
  • number: The "(issue) number" of a journal, magazine, or tech-report, if applicable. (Most publications have a "volume", but no "number" field.)
  • organization: The conference sponsor/host
  • pages: Page numbers, separated either by commas or double-hyphens.
  • publisher: The publisher's name
  • school: The school where the thesis was written
  • series: The series of books the book was published in (e.g. "The Hardy Boys" or "Lecture Notes in Computer Science")
  • title: The title of the work
  • type: The field overriding the default type of publication (e.g. "Research Note" for techreport, "{PhD} dissertation" for phdthesis, "Section" for inbook/incollection)
  • url: The WWW address
  • volume: The volume of a journal or multi-volume book
  • year: The year of publication (or, if unpublished, the year of creation)

You are free to extend this list with any item you want or think would be useful, for example a citation item in which you store the complete citation as you would add it to the reference list at the end of a publication. I use the note item for this purpose, but...

Entry Types

The entry types are modelled as classes holding the corresponding properties (bibliographic items) in SMW. According to the Wikipedia BibTeX schema there are 14 entry types, of which I show the five most used here:

| Entry Type | Description | Required Items | Optional Items |
|---|---|---|---|
| article | An article from a journal or magazine. | author, title, journal, year | keywords, volume, number, pages, month, DOI, URL, note, key |
| book | A book with an explicit publisher. | author/editor, title, publisher, year | keywords, volume, number, series, address, edition, month, ISBN, URL, note, key |
| inbook | A part of a book, usually untitled. May be a chapter (or section or whatever) and/or a range of pages. | author/editor, title, chapter/pages, publisher, year | keywords, volume/number, series, type, address, edition, month, ISBN, URL, DOI, note, key |
| inproceedings | An article in a conference proceedings. | author, title, booktitle, year | keywords, editor, volume/number, series, pages, address, month, organization, publisher, DOI, URL, ISBN, note, key |
| techreport | A report published by a school or other institution, usually numbered within a series. | author, title, institution, year | keywords, type, number, address, month, DOI, URL, note, key |
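The required items of these five entry types can also be written down as data and used to check a record before it is entered. The following is only a sketch of that idea outside of SMW, under the assumption that a slash in the table (such as author/editor) means that at least one of the alternatives must be present; the function and variable names are made up for this example.

```python
# Required items per entry type; each tuple lists acceptable alternatives.
REQUIRED = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inbook": [("author", "editor"), ("title",), ("chapter", "pages"),
               ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "techreport": [("author",), ("title",), ("institution",), ("year",)],
}


def missing_items(entry_type, record):
    """Return the required items (or alternatives) that the record lacks."""
    missing = []
    for alternatives in REQUIRED[entry_type]:
        # An item counts as present only if it has a non-empty value.
        if not any(record.get(item) for item in alternatives):
            missing.append("/".join(alternatives))
    return missing
```

For example, a book record with only an editor, a title and a year would be reported as missing its publisher.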

Implementation

To implement the data structure in SMW, we use the Semantic Forms extension. Semantic Forms provides GUIs for creating and editing structured data in SMW. Basically, it allows users to add, edit and query data in SMW using forms.

The easiest way to implement the bibliographic data model is to use the Semantic Forms "Create a Class" form. This creates all properties, forms, and templates automatically by filling out a form.

Screenshot of the "Create a Class" form, defining the BibBook class.

After filling out the form and clicking on "create", you need to go to Special:SMWAdmin and run "Start updating data". This triggers SMW to create all needed links, so that you can find and work with the forms and templates in your wiki.

You can repeat this class creation process for each entry type you want to implement. You need to enter all the bibliographic item properties again, so that the forms and templates will contain them. The bibliographic item properties will not be duplicated if they already exist in SMW, though.

Semantic Forms

Using the "Create a Class" form, SMW automatically created templates and forms to display and edit the data of the corresponding class. The automatically created forms are fine, but with two minor edits you no longer have to specify the entry type for each new item, which would be redundant because we already defined the entry type through the class definition. As an example, we now edit the template Template:BibArticle and the form Form:BibArticle of the BibArticle class to set the entry type automatically.

From the form, we remove the following part:

! BibType:
| {{{field|BibType}}}
|-

It would let the user enter a value for the BibType property, which we do not want in our model. The resulting form definition looks as follows:

Form:BibArticle

<noinclude>
This is the "BibArticle" form.
To create a page with this form, enter the page name below;
if a page with that name already exists, you will be sent to a form to edit that page.

{{#forminput:form=BibArticle}}
</noinclude><includeonly>
<div id="wikiPreview" style="display: none; padding-bottom: 25px; margin-bottom: 25px; border-bottom: 1px solid #AAAAAA;"></div>
{{{for template|BibArticle}}}
{| class="formtable"
! Author(s):
| {{{field|Author(s)}}}
|-
! Title:
| {{{field|Title}}}
|-
! Journal:
| {{{field|Journal}}}
|-
! Year:
| {{{field|Year}}}
|-
! Volume:
| {{{field|Volume}}}
|-
! Number:
| {{{field|Number}}}
|-
! Pages:
| {{{field|Pages}}}
|-
! Date:
| {{{field|Date}}}
|-
! DOI:
| {{{field|DOI}}}
|-
! URL:
| {{{field|URL}}}
|-
! Keyword(s):
| {{{field|Keyword(s)}}}
|-
! Key:
| {{{field|Key}}}
|-
! Note:
| {{{field|Note}}}
|}
{{{end template}}}

'''Free text:'''

{{{standard input|free text|rows=10}}}

{{{standard input|summary}}}

{{{standard input|minor edit}}} {{{standard input|watch}}}

{{{standard input|save}}} {{{standard input|preview}}} {{{standard input|changes}}} {{{standard input|cancel}}}
</includeonly>

In the template we set the BibType property statically, so that every BibArticle is of BibType Article: we set [[BibType::Article]] as the first entry. Additionally, we set the category "Bibliographic Record" for the entry (last line), because every BibArticle is a bibliographic record. This way you can later query, for example, for all bibliographic records, yielding different entry types. See the following template definition code:

Template:BibArticle

<noinclude>
This is the "BibArticle" template.
It should be called in the following format:
<pre>
{{BibArticle
|BibType=
|Author(s)=
|Title=
|Journal=
|Year=
|Volume=
|Number=
|Pages=
|Date=
|DOI=
|URL=
|Keyword(s)=
|Note=
|Key=
}}
</pre>
Edit the page to see the template text.
</noinclude><includeonly>{| class="wikitable"
! BibType
| [[BibType::Article]]
|-
! Author(s)
| {{#arraymap:{{{Author(s)|}}}|,|x|[[BibAuthor::x]]}}
|-
! Title
| [[BibTitle::{{{Title|}}}]]
|-
! Journal
| [[BibJournal::{{{Journal|}}}]]
|-
! Year
| [[BibYear::{{{Year|}}}]]
|-
! Volume
| [[BibVolume::{{{Volume|}}}]]
|-
! Number
| [[BibNumber::{{{Number|}}}]]
|-
! Pages
| [[BibPages::{{{Pages|}}}]]
|-
! Date
| [[BibDate::{{{Date|}}}]]
|-
! DOI
| [[BibDOI::{{{DOI|}}}]]
|-
! URL
| [[BibURL::{{{URL|}}}]]
|-
! Keyword(s)
| {{#arraymap:{{{Keyword(s)|}}}|,|x|[[BibKeyword::x]]}}
|-
! Note
| [[BibNote::{{{Note|}}}]]
|-
! Key
| [[BibKey::{{{Key|}}}]]
|}

[[Category:BibArticle]]
[[Category:Bibliographic Record]]
</includeonly>
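The two #arraymap calls in the template above turn a comma-separated field value into one property annotation per value. As a rough mental model, and not the actual parser function code, the behaviour can be emulated like this; the default output delimiter of a comma followed by a space is my understanding of the parser function and should be treated as an assumption.

```python
def arraymap(value, delimiter, var, formula, new_delimiter=", "):
    """Emulate #arraymap: split the value, trim each item, apply the formula."""
    items = [item.strip() for item in value.split(delimiter)]
    # Every occurrence of var in the formula is replaced by the item.
    return new_delimiter.join(formula.replace(var, item) for item in items if item)


print(arraymap("Jane Doe, John Roe", ",", "x", "[[BibAuthor::x]]"))
```

Two consequences follow from this model. Because the comma is the delimiter, an author entered as "Doe, Jane" would be split into two separate BibAuthor values, so names are better entered as "Jane Doe". And because the variable is replaced wherever it occurs in the formula, a property name that itself contains the letter x would be corrupted; choosing a different variable name avoids that.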

Here you can find further examples and the sources of more entry type definitions.

Authoring and editing bibliographic data

All authoring and editing is facilitated by the forms we have created for the entry type classes. You can create new entries as well as edit existing entries using those forms.

Screenshot of the form for editing BibArticle entries.

Conclusion

In this post, the implementation of a bibliographic model in SMW was described in detail. You can find this implementation in my SMW instance, where you can look at the details I may have forgotten to mention here.

The actual use of the SMW based bibliographic database will be described in an upcoming blog post. There I will dig into the powerful browsing, filtering and data rendering capabilities of SMW.

I hope this post helps some people get their heads around the SMW concept, which can be kind of complex... When I first heard of SMW, it was immediately clear to me that this is a very powerful technology that makes a lot of sense. But I had to chew a bit on all of the concepts before it worked for me (after much trial and error)...

Have fun!


 
